RAGE-QUANT: 3x Faster LLM Inference on CPU with Pure Rust Quantized GEMV


Skip dequantization. Save 57% RAM. Get 3x faster decode. No GPU required.


Every LLM framework (llama.cpp, candle, burn) does this:

```
GGUF quantized weights → dequantize to f32 → f32 GEMV → result
                         ^ 4x DRAM bandwidth wasted      ^ 3.2 GB RAM for dense f32 cache
```

RAGE-QUANT does this instead:

```
GGUF quantized weights → quantized GEMV → result
                         ^ reads 1.06 bytes/element instead of 4 = 3.76x less DRAM traffic
```

No dequantization step. No f32 cache. 57% less RAM. 3x faster decode.


Real Benchmarks (not theoretical)

Tested on Qwen3-0.6B-Q8_0.gguf | CPU-only | AMD Ryzen 9 9900X | 12 threads

| What we measured | Before | After | Improvement |
|---|---|---|---|
| Decode latency per token | 42 ms | 14 ms | 3.0x faster |
| From naive Rust | 120,000 ms | 466 ms | 257x faster |
| From sgemm baseline | 74,758 ms | 466 ms | 160x faster |
| Peak RAM usage | 3.2 GB | 1.38 GB | 57% less |
| Throughput | ~24 tok/s | 67-71 tok/s | ~3x more |

These numbers are real, measured, reproducible. See the full methodology.


Why is it faster?

On modern CPUs, LLM decode (batch=1) is DRAM bandwidth-limited, not compute-limited. By reading 1 byte (quantized) instead of 4 bytes (f32), you move 3.76x less data through the memory bus. The speedup follows directly.
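The 3.76x figure falls straight out of the Q8_0 block layout (32 i8 quants plus a 2-byte f16 scale per block, per the GGUF format); a few lines of arithmetic confirm it:

```rust
fn main() {
    // Q8_0 block in GGUF: 32 quantized values (1 byte each) + one f16 scale (2 bytes).
    let block_bytes = 32 * 1 + 2; // 34 bytes per block
    let bytes_per_element = block_bytes as f64 / 32.0; // 1.0625 bytes/element
    let f32_bytes_per_element = 4.0;
    let traffic_ratio = f32_bytes_per_element / bytes_per_element;

    println!("Q8_0 reads {bytes_per_element} bytes/element");
    println!("vs f32: {traffic_ratio:.2}x less DRAM traffic"); // ≈ 3.76x
}
```

In a bandwidth-bound regime, that traffic ratio is an upper bound on the achievable speedup, which is why the measured 3.0x sits just below it.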

Additionally: LLVM cannot auto-vectorize the i8-to-f32 widening path. It tries i8→i16→i32→f32, wasting registers. Manual vpmovsxbd (i8→i32 direct) via _mm256_cvtepi8_epi32 is required. This is why hand-written AVX2 intrinsics beat the compiler here.
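A minimal sketch of that widening path, with a scalar fallback for runtime dispatch. The helper names are ours for illustration, not the crate's API:

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Sign-extend 8 i8 values straight to f32 lanes via vpmovsxbd,
/// skipping the i8→i16→i32 ladder LLVM emits on its own.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn widen_avx2(q: &[i8; 8]) -> [f32; 8] {
    let bytes = _mm_loadl_epi64(q.as_ptr() as *const __m128i); // load low 64 bits
    let ints = _mm256_cvtepi8_epi32(bytes); // vpmovsxbd: i8 → i32 directly
    let floats = _mm256_cvtepi32_ps(ints); // i32 → f32
    let mut out = [0f32; 8];
    _mm256_storeu_ps(out.as_mut_ptr(), floats);
    out
}

/// Portable entry point: AVX2 when detected at runtime, scalar otherwise.
fn widen(q: &[i8; 8]) -> [f32; 8] {
    #[cfg(target_arch = "x86_64")]
    if is_x86_feature_detected!("avx2") {
        return unsafe { widen_avx2(q) };
    }
    let mut out = [0f32; 8];
    for (o, &v) in out.iter_mut().zip(q) {
        *o = v as f32;
    }
    out
}

fn main() {
    let q: [i8; 8] = [-128, -3, -1, 0, 1, 7, 42, 127];
    println!("{:?}", widen(&q));
}
```

Both paths produce identical results; the SIMD path simply widens eight elements per instruction instead of one.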


Quick Start

```toml
[dependencies]
rage-quant = "0.1"
```

```rust
use rage_quant::dot_q8_0_f32;

let result = dot_q8_0_f32(&quantized_weights, &input_vector, num_elements);
// Auto-detects AVX2+FMA at runtime; falls back to scalar on older CPUs
```

Supported formats: Q8_0, Q6_K, Q4_K (GGUF-native blocks).
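To see what a Q8_0 dot product does without the intrinsics, here is a scalar reference sketch. The struct is illustrative, not the crate's internal layout, and GGUF's f16 scale is simplified to f32:

```rust
/// Illustrative Q8_0 block: 32 signed-byte quants sharing one scale.
/// (GGUF stores the scale as f16; f32 here for simplicity.)
struct BlockQ8_0 {
    scale: f32,
    quants: [i8; 32],
}

/// Scalar reference: dot(quantized weights, f32 activations),
/// never materializing a dequantized f32 weight row.
fn dot_q8_0_scalar(blocks: &[BlockQ8_0], x: &[f32]) -> f32 {
    assert_eq!(x.len(), blocks.len() * 32);
    let mut acc = 0.0f32;
    for (b, xs) in blocks.iter().zip(x.chunks_exact(32)) {
        // Accumulate q * x within the block; apply the per-block scale once.
        let mut block_acc = 0.0f32;
        for (&q, &xv) in b.quants.iter().zip(xs) {
            block_acc += q as f32 * xv;
        }
        acc += b.scale * block_acc;
    }
    acc
}

fn main() {
    let block = BlockQ8_0 { scale: 0.5, quants: [2; 32] };
    let x = [1.0f32; 32];
    // 0.5 * (2 * 1.0 * 32) = 32.0
    println!("{}", dot_q8_0_scalar(&[block], &x));
}
```

Note that the only f32 values ever touched are the activations and the accumulator; the weights stay in their 1-byte form end to end.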


Why not just use llama.cpp?

llama.cpp is excellent, but:

  • It is C/C++ — integrating into a Rust project requires unsafe FFI bindings
  • It is monolithic — you cannot extract just the quantized dot product without pulling the entire engine
  • rage-quant is a standalone Rust crate: `cargo add rage-quant` and you have the kernels

CPU Optimization Findings (T1-T9)

This crate embodies a series of CPU inference optimizations (numbered T1-T9) validated during development:

| ID | What was optimized | Measured result |
|---|---|---|
| T1 | GEMV on quantized data (skip f32) | decode 42 ms → 18 ms = 2.3x |
| T2 | Eliminate dense f32 weight caches | RSS 3.2 GB → 1.38 GB = -57% RAM |
| T3 | AVX2 widening i8→f32 intrinsics | +18.8% on top of T1 |
| T4 | Memory-bound diagnosis | Proved DRAM is the bottleneck |
| T7 | GEMV vs sgemm for m=1 decode | sgemm 180 ms vs GEMV 18 ms = 10x |
| T8 | QKV fusion (decode-only path) | 1.8x per-layer QKV compute |
| T9 | Column-tiling for GEMM prefill | 5091 ms → 3057 ms = 1.67x |
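T9's column-tiling can be sketched in a short loop nest. This is an illustrative row-major GEMM (the function name and tile width are ours, not the crate's prefill kernel): tiling the N dimension keeps a narrow panel of B and C hot in cache while each row of A streams through once per tile.

```rust
/// Illustrative column-tiled GEMM: C[m×n] += A[m×k] · B[k×n], row-major.
fn gemm_col_tiled(a: &[f32], b: &[f32], c: &mut [f32], m: usize, k: usize, n: usize) {
    const TILE_N: usize = 64; // hypothetical tile width, not the crate's tuning
    for n0 in (0..n).step_by(TILE_N) {
        let n1 = (n0 + TILE_N).min(n);
        for i in 0..m {
            for p in 0..k {
                let aip = a[i * k + p];
                // Inner loop touches only the current TILE_N-wide panel,
                // so B's rows are reused from cache instead of re-fetched.
                for j in n0..n1 {
                    c[i * n + j] += aip * b[p * n + j];
                }
            }
        }
    }
}

fn main() {
    // 2x2 sanity check: [[1,2],[3,4]] · [[5,6],[7,8]] = [[19,22],[43,50]]
    let a = [1.0, 2.0, 3.0, 4.0];
    let b = [5.0, 6.0, 7.0, 8.0];
    let mut c = [0.0f32; 4];
    gemm_col_tiled(&a, &b, &mut c, 2, 2, 2);
    println!("{:?}", c);
}
```

The tiled version computes exactly the same result as the naive triple loop; only the traversal order, and hence the cache behavior at prefill sizes, changes.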

Hardware Requirements

  • Minimum: Any x86_64 CPU (scalar fallback works everywhere)
  • Recommended: AVX2+FMA support (Intel Haswell 2013+ / AMD Zen 2017+)
  • Tested on: AMD Ryzen 9 9900X (Zen 5), DDR5, 12 threads

ARM NEON and AVX-512 support are planned.


License

Dual-licensed:

  • AGPL-3.0 — free for open-source, personal, and academic use
  • Commercial — for proprietary/closed-source use (contact: the@angriestboy.com)

Published from RAGE-QUANT v0.1.0 — pure Rust, zero dependencies, 3x faster.

Source: dev.to
