<aside>
Quantization maps high-precision floating-point parameters (e.g., FP32 or FP16) to lower-precision representations (e.g., INT8 or INT4). In the context of Large Language Models (LLMs), this primarily targets the memory-bandwidth bottleneck during autoregressive decoding, where inference is often constrained by the rate at which weights and KV-cache data can be fetched from HBM rather than by raw compute throughput.
By compressing model weights, quantization reduces the number of bytes transferred per memory access, thereby lowering memory traffic per token generation step.
The process provides several practical advantages for inference deployment:
Memory Bandwidth Reduction:
Lower-bit representations reduce memory transfer volume approximately proportional to bit-width (e.g., INT4 vs FP16 yields ~4× reduction in weight bandwidth). This can improve tokens-per-second in memory-bound decoding regimes by reducing HBM pressure, rather than simply “preventing compute idling,” which is workload-dependent.
VRAM Efficiency:
Reducing precision decreases model storage footprint. For example, a 70B parameter model requires ~140GB in FP16. Under ideal INT4 weight-only quantization, this can be reduced to ~35GB. However, end-to-end memory usage is higher in practice due to KV cache (typically FP16/BF16), activation buffers, and runtime metadata.
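The footprint figures above follow directly from parameter count times bits per weight. A back-of-envelope sketch (weight storage only; real deployments add KV cache, activation buffers, and runtime overhead on top):

```python
def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight-only storage in GB, ignoring KV cache and buffers."""
    return n_params * bits_per_weight / 8 / 1e9

# 70B parameters: FP16 (16 bits) vs. ideal INT4 weight-only (4 bits)
fp16 = weight_footprint_gb(70e9, 16)   # ~140 GB
int4 = weight_footprint_gb(70e9, 4)    # ~35 GB
```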
Compute Acceleration:
Many modern accelerators (e.g., Tensor Cores) support lower-precision matrix multiplication (INT8/FP16/BF16). While lower precision can increase theoretical throughput, real-world speedups depend on kernel fusion, memory movement overhead, and whether computation or memory is the dominant bottleneck.
⚠️ Limitations & Caveats
Information loss is unavoidable:
Quantization introduces rounding error due to finite representational capacity. This error can accumulate across layers, potentially affecting calibration-sensitive behaviors such as long-horizon reasoning and instruction-following robustness. However, degradation is highly method-dependent (e.g., naive PTQ vs GPTQ/AWQ differ substantially).
Dequantization and mixed-precision overhead:
Even in INT8/INT4 kernels, many operations (especially nonlinearities and accumulations) are executed in higher precision (FP16/FP32). This introduces partial dequantization overhead and reduces the idealized “pure integer pipeline” assumption.
$$
\begin{aligned}
& \text{[Affine Mapping]: } X_{int} = \text{clip}\left(\text{round}\left(\frac{X_{float}}{s}\right) + z, \, q_{min}, \, q_{max}\right) \\
& \text{[Dequantization]: } \tilde{X}_{float} = s \cdot (X_{int} - z) \\
& \text{[Symmetric Scaling Factor]: } s = \frac{\max(|X_{float}|)}{2^{b-1} - 1}, \quad z = 0 \\
& \text{[Quantization Error]: } \mathcal{E} = \|X_{float} - \tilde{X}_{float}\|_2^2 \\
& \text{[Straight-Through Estimator]: } \frac{\partial \mathcal{L}}{\partial X_{float}} \approx \frac{\partial \mathcal{L}}{\partial X_{int}} \\
& \text{[SmoothQuant Migration]: } \tilde{W}_{ij} = W_{ij} \cdot \alpha_j, \quad \tilde{X}_{jk} = X_{jk} / \alpha_j
\end{aligned}
$$
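The affine mapping, symmetric scaling, dequantization, and error formulas above can be sketched directly in NumPy. This is a minimal illustration of the math, not a production kernel:

```python
import numpy as np

def quantize_symmetric(x: np.ndarray, bits: int = 8):
    """Symmetric affine quantization: z = 0, scale s from the max |value|."""
    qmax = 2 ** (bits - 1) - 1                    # e.g. 127 for INT8
    s = np.max(np.abs(x)) / qmax                  # [Symmetric Scaling Factor]
    x_int = np.clip(np.round(x / s), -qmax, qmax).astype(np.int32)  # [Affine Mapping]
    return x_int, s

def dequantize(x_int: np.ndarray, s: float) -> np.ndarray:
    return s * x_int                              # [Dequantization], z = 0

rng = np.random.default_rng(0)
x = rng.normal(size=1000).astype(np.float32)
x_int, s = quantize_symmetric(x, bits=8)
err = np.sum((x - dequantize(x_int, s)) ** 2)     # [Quantization Error]
```

The squared error stays small here because standard-normal values have no extreme outliers; the outlier discussion below shows where this breaks down.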
Post-Training Quantization (PTQ) applies quantization after full-precision training is completed. It typically relies on a calibration dataset to estimate activation statistics (e.g., min/max, or histogram-based estimators) in order to determine scaling factors s and zero-points z. Importantly, PTQ does not modify model weights via gradient-based optimization.
Modern PTQ methods for LLMs (e.g., GPTQ, AWQ) go beyond naive min–max scaling by incorporating second-order or activation-aware approximations, significantly improving low-bit robustness compared to earlier PTQ approaches.
Quantization-Aware Training (QAT) incorporates quantization effects during training. Since rounding operations are non-differentiable, QAT typically employs a Straight-Through Estimator (STE) in the backward pass to approximate gradients through discrete operations. This allows the optimizer to adapt weights to compensate for quantization-induced error.
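A toy sketch of the STE idea on a tiny least-squares problem: the forward pass uses fake-quantized weights, while the backward pass treats the round-and-clip operation as the identity, so gradients flow straight through to the underlying full-precision weights. The 4-bit setting and all shapes here are illustrative:

```python
import numpy as np

def fake_quant(w: np.ndarray, bits: int = 4) -> np.ndarray:
    """Quantize-dequantize in the forward pass (simulated quantization noise)."""
    qmax = 2 ** (bits - 1) - 1
    s = np.max(np.abs(w)) / qmax
    return s * np.clip(np.round(w / s), -qmax, qmax)

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8))       # full-precision "latent" weights
x = rng.normal(size=8)
t = rng.normal(size=4)            # regression target

lr = 0.01
loss_before = np.sum((fake_quant(w) @ x - t) ** 2)
for _ in range(200):
    y = fake_quant(w) @ x                      # forward uses quantized weights
    grad_y = 2 * (y - t)
    grad_w = np.outer(grad_y, x)               # STE: d(fake_quant)/dw ~ identity
    w = w - lr * grad_w                        # update the full-precision copy
loss_after = np.sum((fake_quant(w) @ x - t) ** 2)
```

The loss under the *quantized* forward pass drops because the optimizer learns weights whose rounded values fit the target, which is exactly the compensation effect QAT relies on.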
⚠️ Limitations & Caveats
Weight-only quantization (e.g., W4A16) reduces memory footprint but does not fully unlock compute acceleration. Full hardware acceleration requires both weights and activations to be quantized (e.g., W8A8). However, LLM activations exhibit heavy-tailed distributions in which a small subset of channels dominates the magnitude range (activation outliers), making uniform quantization inefficient.
When applying naive W8A8 quantization, a single large outlier can inflate the scaling factor s, compressing the dynamic range of non-outlier values and causing significant information loss due to effective bin collapse.
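A small numerical illustration of this bin-collapse effect (synthetic data; the 60.0 outlier and INT8 range are arbitrary choices):

```python
import numpy as np

# 99 well-behaved activations plus one outlier channel
acts = np.concatenate([np.random.uniform(-1, 1, 99), [60.0]])

qmax = 127
s = np.max(np.abs(acts)) / qmax          # outlier alone sets s ~ 60/127 ~ 0.47
quantized = np.round(acts / s)

# Non-outlier values land in only a handful of the 255 integer bins
normal_bins = np.unique(quantized[:99])
```

Because every value in [-1, 1] maps to an integer in {-2, ..., 2}, a single outlier has destroyed nearly all of the resolution available to the other 99 channels.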
SmoothQuant addresses this issue via a static reparameterization between activations and weights. It introduces a per-channel scaling factor $\alpha_j$ that redistributes magnitude: $\tilde{W}_{ij} = W_{ij} \cdot \alpha_j$, $\tilde{X}_{jk} = X_{jk} / \alpha_j$.
This transformation preserves functional equivalence: $W X = \tilde{W} \tilde{X}$
and shifts quantization difficulty from activations (which are dynamic and heavy-tailed) to weights (which are more stationary).
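The migration can be checked numerically. The square-root-of-magnitude-ratio choice of $\alpha_j$ below corresponds to a migration strength of 0.5 in SmoothQuant's formulation; the matrix shapes and the injected outlier are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))              # weight rows i, shared channels j
X = rng.normal(size=(8, 5))              # shared channels j, token columns k
X[3, :] *= 50                            # inject an outlier activation channel

# Per-channel migration factor: sqrt(max|X_j| / max|W_j|)
alpha = np.sqrt(np.max(np.abs(X), axis=1) / np.max(np.abs(W), axis=0))

W_s = W * alpha                          # weights absorb the scale
X_s = X / alpha[:, None]                 # activations become easier to quantize
```

The product is unchanged (`W @ X == W_s @ X_s` up to floating-point error), while the activation dynamic range shrinks, which is the whole point of the migration.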
| Dimension | Standard PTQ (W8A8) | SmoothQuant (W8A8) | Weight-Only (W4A16) |
|---|---|---|---|
| Compute Efficiency | High (integer-friendly kernels) | High | Moderate (FP16 compute remains) |
| Calibration Requirement | Small dataset | Small dataset | Minimal |
| Outlier Sensitivity | High | Reduced via rebalancing | Low (activations remain FP16) |
| Failure Modes | Accuracy degradation under heavy-tailed activations | Sensitivity to distribution shift in activation statistics | Compute bottleneck remains in FP16 ops |
PagedAttention addresses a critical inefficiency in auto-regressive LLM serving: the dynamic and unpredictable memory footprint of the Key-Value (KV) cache. In many naive inference systems, memory for the KV cache is either pre-allocated contiguously based on the maximum possible sequence length or managed via inefficient dynamic allocation. Because most requests finish generating long before reaching this maximum, a significant portion of VRAM can be wasted due to internal fragmentation (reserved but unused memory) and external fragmentation (small, unusable gaps between allocations).
Drawing inspiration from operating system virtual memory, PagedAttention (Kwon et al., 2023, arXiv:2309.06180) divides the KV cache into fixed-size physical blocks (e.g., chunks of 16 or 32 tokens). It uses a Block Table to map the continuous logical sequence of tokens to these non-contiguous physical blocks in GPU memory. Conceptually, this virtualizes KV cache memory, decoupling logical sequence layout from physical memory allocation.
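The block-table bookkeeping can be sketched in a few lines of Python. This is a toy model of the mechanism; names like `KVCacheAllocator` are invented for illustration:

```python
BLOCK_SIZE = 16                           # tokens per physical block

class KVCacheAllocator:
    """Toy allocator: hands out fixed-size physical blocks on demand."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def alloc(self) -> int:
        return self.free.pop()

class Sequence:
    """Logical token stream backed by a block table of physical block ids."""
    def __init__(self, allocator: KVCacheAllocator):
        self.allocator = allocator
        self.block_table = []             # the "Block Table" (BT)
        self.length = 0

    def append_token(self):
        if self.length % BLOCK_SIZE == 0:  # current block full: grab a new one
            self.block_table.append(self.allocator.alloc())
        self.length += 1

    def physical_slot(self, i: int) -> int:
        """PhysAddr_i = BT[i // BLOCK_SIZE] * BLOCK_SIZE + i % BLOCK_SIZE"""
        return self.block_table[i // BLOCK_SIZE] * BLOCK_SIZE + i % BLOCK_SIZE

alloc = KVCacheAllocator(num_blocks=64)
seq = Sequence(alloc)
for _ in range(40):                       # 40 tokens -> ceil(40/16) = 3 blocks
    seq.append_token()
```

Note that memory is allocated one block at a time as the sequence grows, so the only over-allocation is the unused tail of the final block.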
$$
\begin{aligned}
& \text{[Physical Block Size]: } S_{block} = B_{size} \times n_{heads} \times d_{head} \times p, \quad p = \text{bytes per parameter} \\
& \text{[Logical to Physical Mapping]: } \text{PhysAddr}_i = \text{BT}\!\left[\lfloor i / B_{size} \rfloor\right] \times B_{size} + (i \bmod B_{size}) \\
& \text{[Memory Waste Naive]: } \mathcal{W}_{naive} = \sum_{j=1}^{N} (\text{MaxLen} - \text{ActualLen}_j) \times S_{token} \\
& \text{[Memory Waste Paged]: } \mathcal{W}_{paged} = \sum_{j=1}^{N} \left((B_{size} - \text{ActualLen}_j \bmod B_{size}) \bmod B_{size}\right) \times S_{token} \leq N \times S_{block} \\
& \text{[Paged Attention]: } \text{Attn}(Q_t, K) = \sum_{b=0}^{\lfloor t/B_{size} \rfloor} Q_t \cdot K_{b}^{T}, \quad K_b = \text{block } b \text{ from BT}
\end{aligned}
$$
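Plugging hypothetical request lengths into the two waste formulas shows the scale of the difference. The max context of 2048, block size of 16, and the lengths themselves are made-up numbers; the outer `% BLOCK` handles sequences that end exactly on a block boundary (which waste nothing):

```python
MAX_LEN, BLOCK, S_TOKEN = 2048, 16, 1     # 1 "unit" of memory per token
actual_lens = [100, 700, 1500, 30]        # hypothetical request lengths

# Naive: each request reserves MAX_LEN slots up front
waste_naive = sum((MAX_LEN - L) * S_TOKEN for L in actual_lens)

# Paged: waste is at most the unused tail of each request's last block
waste_paged = sum(((BLOCK - L % BLOCK) % BLOCK) * S_TOKEN for L in actual_lens)
```

Here the naive scheme wastes 5862 token-slots against 22 for the paged scheme, and the paged waste is always bounded by one block per sequence.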
During decoding, the model generates a new Query vector for the current token and must compute attention scores against all historical Keys. In PagedAttention, the CUDA kernel does not iterate over a single contiguous memory region. Instead, it consults the Block Table to determine the physical locations of the blocks corresponding to the sequence.
The kernel then iterates over these blocks, loading Key and Value chunks into on-chip SRAM. It computes partial attention contributions for each block and aggregates them using numerically stable softmax accumulation. This ensures that the final attention output is mathematically identical to standard attention, while the underlying memory layout remains non-contiguous.
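The block-iterating kernel logic reduces to an online (numerically stable) softmax over KV blocks. A NumPy sketch of the accumulation scheme, checked against ordinary attention; real kernels do this in fused CUDA, not Python:

```python
import numpy as np

def standard_attention(q, K, V):
    """Reference: softmax(q K^T / sqrt(d)) V over a contiguous K, V."""
    scores = q @ K.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

def blockwise_attention(q, K_blocks, V_blocks):
    """Online softmax over non-contiguous KV blocks (same math, streamed)."""
    d = q.shape[-1]
    m = -np.inf                            # running max of scores
    denom = 0.0                            # running softmax denominator
    acc = np.zeros_like(q)                 # running weighted sum of values
    for Kb, Vb in zip(K_blocks, V_blocks):
        s = Kb @ q / np.sqrt(d)            # scores for this block
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)          # rescale previous partial sums
        denom = denom * scale + np.exp(s - m_new).sum()
        acc = acc * scale + np.exp(s - m_new) @ Vb
        m = m_new
    return acc / denom

rng = np.random.default_rng(0)
q = rng.normal(size=16)
K = rng.normal(size=(40, 16))
V = rng.normal(size=(40, 16))
K_blocks = [K[i:i + 16] for i in range(0, 40, 16)]  # blocks of <= 16 tokens
V_blocks = [V[i:i + 16] for i in range(0, 40, 16)]

out_std = standard_attention(q, K, V)
out_blk = blockwise_attention(q, K_blocks, V_blocks)
```

The running-max rescaling is what keeps the accumulation stable, and it is why the blockwise result matches standard attention exactly up to floating-point error.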
While PagedAttention enables memory sharing for sequences generated simultaneously (e.g., Beam Search), it does not natively reuse KV cache across independent requests. RadixAttention, introduced in SGLang (Zheng et al., 2023, arXiv:2312.07104), extends this idea by organizing KV cache blocks into a global Radix Tree.
Completed requests are not immediately evicted. Instead, their KV cache blocks are retained in GPU memory and indexed by their token prefixes. When a new request arrives, the system traverses the Radix Tree to identify shared prefixes and reuses existing KV cache blocks, computing only the new suffix tokens. An eviction policy (e.g., Least Recently Used) is applied when memory pressure increases.
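The matching logic can be sketched with a simple per-token trie. SGLang's actual radix tree path-compresses runs of tokens and attaches KV block handles to edges; this toy version does neither:

```python
class RadixNode:
    def __init__(self):
        self.children = {}                # token id -> RadixNode
        self.kv_block = None              # handle to a cached KV block (unused here)

class RadixCache:
    """Toy per-token prefix tree: match a cached prefix, insert new paths."""
    def __init__(self):
        self.root = RadixNode()

    def match_prefix(self, tokens):
        """Return how many leading tokens already have cached KV entries."""
        node, n = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            n += 1
        return n

    def insert(self, tokens):
        """Retain a completed request's token path in the tree."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, RadixNode())

cache = RadixCache()
cache.insert([1, 2, 3, 4, 5])             # completed request stays cached
hit = cache.match_prefix([1, 2, 3, 9, 9]) # new request shares a 3-token prefix
```

With a 3-token hit, only the suffix `[9, 9]` needs fresh prefill computation; the shared prefix's KV blocks are reused directly.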
</aside>