<aside> 🎯

Master AI & ML with Educatum, Your AI University

Curated resources from leading universities and industry experts to help you master artificial intelligence.

Build your knowledge base and prepare for interviews.

Join a study group and learn together.

Discover top AI tools and companies.

Connect with like-minded professionals.

No ads, no noise.

</aside>

<aside>

[Daily AI Interview Questions] 12. How does PagedAttention resolve KV Cache memory fragmentation, and what are its serving trade-offs?

PagedAttention addresses a critical inefficiency in auto-regressive LLM serving: the dynamic and unpredictable memory footprint of the Key-Value (KV) cache. In many naive inference systems, memory for the KV cache is either pre-allocated contiguously based on the maximum possible sequence length or managed via inefficient dynamic allocation. Because most requests finish generating long before reaching this maximum, a significant portion of VRAM can be wasted due to internal fragmentation (reserved but unused memory) and external fragmentation (small, unusable gaps between allocations).

Drawing inspiration from operating system virtual memory, PagedAttention (Kwon et al., 2023, arXiv:2309.06180) divides the KV cache into fixed-size physical blocks (e.g., chunks of 16 or 32 tokens). It uses a Block Table to map the continuous logical sequence of tokens to these non-contiguous physical blocks in GPU memory. Conceptually, this virtualizes KV cache memory, decoupling logical sequence layout from physical memory allocation.
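The block-table indirection can be sketched in a few lines of Python. This is a toy illustration of the address translation only, not vLLM's actual data structures; the function and variable names are hypothetical:

```python
# Illustrative sketch (not vLLM's real API): translating a logical token
# index into a physical KV-cache slot through a block table.
BLOCK_SIZE = 16  # tokens per physical block

def phys_slot(block_table, token_idx, block_size=BLOCK_SIZE):
    """Map logical token index i to its physical cache slot:
    BT[floor(i / B_size)] * B_size + (i mod B_size)."""
    logical_block = token_idx // block_size
    offset = token_idx % block_size
    phys_block = block_table[logical_block]  # block-table lookup
    return phys_block * block_size + offset

# A 40-token sequence occupying 3 non-contiguous physical blocks:
block_table = [7, 2, 9]  # logical block 0 lives in physical block 7, etc.
print(phys_slot(block_table, 0))   # -> 112 (physical block 7, offset 0)
print(phys_slot(block_table, 17))  # -> 33  (physical block 2, offset 1)
```

Because the mapping is per-block rather than per-sequence, blocks can be allocated on demand as a sequence grows, and freed (or shared) independently.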


🧪 Core Insights & Mathematical Foundations

$$

\begin{aligned}
& \text{[Physical Block Size]: } S_{block} = B_{size} \times n_{heads} \times d_{head} \times p, \quad p = \text{bytes per parameter} \\
& \text{[Logical to Physical Mapping]: } \text{PhysAddr}_i = \text{BT}\!\left[\lfloor i / B_{size} \rfloor\right] \times B_{size} + (i \bmod B_{size}) \\
& \text{[Memory Waste, Naive]: } \mathcal{W}_{naive} = \sum_{j=1}^{N} (\text{MaxLen} - \text{ActualLen}_j) \times S_{token} \\
& \text{[Memory Waste, Paged]: } \mathcal{W}_{paged} = \sum_{j=1}^{N} (B_{size} - \text{ActualLen}_j \bmod B_{size}) \times S_{token} \leq N \times S_{block} \\
& \text{[Paged Attention]: } \text{Attn}(Q_t, K) = \sum_{b=0}^{\lfloor t/B_{size} \rfloor} Q_t \cdot K_{b}^{T}, \quad K_b = \text{block } b \text{ from BT}
\end{aligned}

$$
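The two waste formulas above can be made concrete with toy numbers. Assumed inputs (not from the source): 4 requests, MaxLen = 2048 tokens, a 16-token block size, and a per-token KV size of 1 unit, so results are in "token slots":

```python
# Toy arithmetic for W_naive vs W_paged with assumed numbers.
MAX_LEN, B_SIZE = 2048, 16
actual_lens = [100, 310, 57, 1500]  # hypothetical generated lengths

# Naive: each request reserves MaxLen slots up front.
waste_naive = sum(MAX_LEN - n for n in actual_lens)

# Paged: waste is at most the unused tail of each request's last block.
waste_paged = sum((B_SIZE - n % B_SIZE) % B_SIZE for n in actual_lens)

print(waste_naive)  # -> 6225 token slots reserved but never used
print(waste_paged)  # -> 33 token slots (bounded by N * B_SIZE = 64)
```

The roughly 200x gap here is an artifact of the chosen numbers, but it illustrates why the paper reports near-zero KV-cache waste for paged allocation.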


Follow-up 1: Explain the implementation mechanics of the Block Table during the attention computation.

During decoding, the model generates a new Query vector for the current token and must compute attention scores against all historical Keys. In PagedAttention, the CUDA kernel does not iterate over a single contiguous memory region. Instead, it consults the Block Table to determine the physical locations of the blocks corresponding to the sequence.

The kernel then iterates over these blocks, loading Key and Value chunks into on-chip SRAM. It computes partial attention contributions for each block and aggregates them using numerically stable softmax accumulation. This ensures that the final attention output is mathematically identical to standard attention, while the underlying memory layout remains non-contiguous.
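The gather-then-accumulate loop described above can be sketched in pure Python (a conceptual model of the kernel, not CUDA; a tiny block size of 2 and hypothetical names are used, and key/value dimensions are assumed equal):

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def paged_attention(q, block_table, key_blocks, value_blocks, seq_len,
                    block_size=2):
    """Attention over non-contiguous KV blocks via a block table,
    using numerically stable streaming softmax accumulation."""
    m, l = float("-inf"), 0.0      # running max and running exp-sum
    out = [0.0] * len(q)           # running (unnormalized) output
    n_blocks = (seq_len + block_size - 1) // block_size
    for lb in range(n_blocks):
        phys = block_table[lb]     # block-table indirection
        keys, vals = key_blocks[phys], value_blocks[phys]
        for t in range(min(block_size, seq_len - lb * block_size)):
            s = dot(q, keys[t])
            m_new = max(m, s)
            scale = math.exp(m - m_new)  # exp(-inf) == 0.0 on first token
            e = math.exp(s - m_new)
            l = l * scale + e
            out = [o * scale + e * v for o, v in zip(out, vals[t])]
            m = m_new
    return [o / l for o in out]    # normalize once at the end

# 3 logical tokens stored out of order: physical block 2 holds tokens 0-1,
# physical block 0 holds token 2 (its second slot is unused padding).
key_blocks = [[[0.1, 0.2], [9.9, 9.9]],
              [[0.0, 0.0], [0.0, 0.0]],
              [[0.5, 0.6], [0.7, 0.8]]]
value_blocks = [[[1.0, 1.0], [0.0, 0.0]],
                [[0.0, 0.0], [0.0, 0.0]],
                [[1.0, 0.0], [0.0, 1.0]]]
out = paged_attention([1.0, 0.5], [2, 0], key_blocks, value_blocks, 3)
print(out)
```

The result matches standard attention over the logically ordered keys, which is the "mathematically identical" property the kernel preserves.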


Follow-up 2: How do prefix-aware architectures like RadixAttention (SGLang) advance beyond standard PagedAttention? (Optional)

While PagedAttention enables memory sharing for sequences generated simultaneously (e.g., Beam Search), it does not natively reuse KV cache across independent requests. RadixAttention, introduced in SGLang (Zheng et al., 2023, arXiv:2312.07104), extends this idea by organizing KV cache blocks into a global Radix Tree.

Completed requests are not immediately evicted. Instead, their KV cache blocks are retained in GPU memory and indexed by their token prefixes. When a new request arrives, the system traverses the Radix Tree to identify shared prefixes and reuses existing KV cache blocks, computing only the new suffix tokens. An eviction policy (e.g., Least Recently Used) is applied when memory pressure increases.
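The prefix-matching step can be sketched with a simple token trie (a simplification of a true radix tree, which compresses runs of tokens into single edges; the class and method names here are hypothetical, not SGLang's API):

```python
# Simplified sketch of inter-request prefix reuse in the spirit of
# RadixAttention: a token-level trie instead of a compressed radix tree.
class PrefixCache:
    def __init__(self):
        self.root = {}  # token id -> child node (plain nested dicts)

    def insert(self, tokens):
        """Index a finished request's tokens so its KV blocks stay findable."""
        node = self.root
        for t in tokens:
            node = node.setdefault(t, {})

    def match_prefix(self, tokens):
        """Return how many leading tokens already have cached KV blocks."""
        node, matched = self.root, 0
        for t in tokens:
            if t not in node:
                break
            node, matched = node[t], matched + 1
        return matched

cache = PrefixCache()
cache.insert([1, 2, 3, 4])               # token ids of a completed request
reused = cache.match_prefix([1, 2, 3, 9])
print(reused)  # -> 3: only the suffix [9] needs fresh KV computation
```

In a real system each trie node would also reference the physical KV blocks for its tokens, plus reference counts and recency metadata to drive LRU eviction.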

| Dimension | vLLM (Standard PagedAttention) | SGLang (RadixAttention) |
| --- | --- | --- |
| Computational Cost | Moderate (block table lookups) | Higher (radix tree traversal & maintenance) |
| Data Requirements | Agnostic (processes sequences independently) | Thrives on high-overlap prompt distributions |
| KV Cache Sharing | Intra-request (e.g., Beam Search) | Inter-request (global prefix reuse) |
| Known Failure Modes | Redundant computation on repeated prompts | Tree maintenance overhead under low reuse |

</aside>

<aside>

[Daily AI Interview Questions] 11. How does FlashAttention optimize self-attention, and what are its hardware trade-offs?

Standard self-attention is expensive because its time and space complexities scale quadratically with sequence length N. It materializes the full N×N attention matrix in High Bandwidth Memory (HBM), which is a slow read/write operation for GPUs. FlashAttention (Dao et al., 2022, arXiv:2205.14135) restructures self-attention to be IO-aware, meaning it minimizes memory transfers between slow HBM and fast, on-chip SRAM.

FlashAttention achieves speedups not by reducing FLOPs (it performs comparable FLOPs in the forward pass, and more in the backward pass due to recomputation), but by reducing memory access overhead:

- Tiling: Q, K, and V are split into blocks sized to fit in SRAM, and attention is computed block-by-block without ever materializing the full N×N score matrix in HBM.
- Kernel fusion: matmul, softmax, masking, and dropout run in a single fused kernel, avoiding repeated HBM round-trips between operations.
- Recomputation: the backward pass recomputes attention blocks from saved softmax statistics (m, l) instead of storing the N×N matrix, trading extra FLOPs for far less memory traffic.



🧪 Core Insights & Mathematical Foundations

$$

\begin{aligned}
& \text{[Standard Attention IO Complexity]: dominated by } \mathcal{O}(N^2) \text{ HBM accesses} \\
& \text{[FlashAttention IO Complexity]: } \mathcal{O}\!\left(\frac{N^2 d^2}{M}\right) \text{ HBM accesses, where } M \text{ is SRAM size and } d \text{ the head dimension} \\
& \text{[Online Softmax (Maximum)]: } m_{new} = \max(m_{old}, m_{block}) \\
& \text{[Online Softmax (Exp Sum)]: } l_{new} = l_{old} \cdot e^{m_{old} - m_{new}} + l_{block} \cdot e^{m_{block} - m_{new}} \\
& \text{[Rescaled Output]: } O_{new} = \frac{l_{old} \cdot e^{m_{old} - m_{new}} \cdot O_{old} + l_{block} \cdot e^{m_{block} - m_{new}} \cdot O_{block}}{l_{new}} \\
& \text{[FlashAttention-3 FP8 Scaling]: uses per-tensor scaling factors } (s_Q, s_K, s_V) \text{ for low-precision computation}
\end{aligned}

$$


Follow-up 1: Explain the online softmax technique and how tiling enables FlashAttention.

Standard softmax requires knowing the global maximum (m) and the sum of exponentials (l) of a row before computing probabilities. This creates a global barrier, forcing the model to read the entire row of Query-Key dot products before normalizing.

FlashAttention circumvents this using online softmax (Milakov & Gimelshein, 2018), which tracks running local statistics as it processes blocks. This enables numerically stable streaming computation without materializing the full attention row in HBM.

When a new block is processed, FlashAttention compares the new local maximum with the running maximum and rescales previously accumulated outputs using the exponential difference. This correction ensures mathematical equivalence to standard softmax while enabling block-by-block computation.
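The running-statistics update described above is small enough to show in full. This sketch processes scores one at a time for clarity (FlashAttention applies the same rescaling per block, and fuses the final normalization into the output accumulation rather than doing a second pass):

```python
import math

# Milakov & Gimelshein-style online softmax: a single streaming pass
# maintaining a running maximum m and running exp-sum l, rescaling the
# accumulated sum whenever a new maximum is found.
def online_softmax(xs):
    m, l = float("-inf"), 0.0
    for x in xs:
        m_new = max(m, x)
        l = l * math.exp(m - m_new) + math.exp(x - m_new)  # rescale + add
        m = m_new
    # Final probabilities; exactly equal to two-pass softmax.
    return [math.exp(x - m) / l for x in xs]

probs = online_softmax([1.0, 3.0, 2.0, 10.0])
print(probs)  # sums to 1.0, matches standard max-subtracted softmax
```

Because every intermediate stays max-subtracted, no exponential ever overflows, which is what makes block-by-block accumulation numerically safe.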



Follow-up 2: How do FlashAttention-2 and FlashAttention-3 address hardware underutilization? (Optional)

The original FlashAttention suffered from hardware underutilization because it did not fully parallelize across sequence dimensions and spent cycles on non-matmul operations (e.g., softmax updates) on standard CUDA cores. FlashAttention-2 addressed these issues primarily on Ampere-class GPUs (e.g., A100) by parallelizing over the sequence length, reducing non-matmul FLOPs, and improving work partitioning between warps. It did not, however, exploit the new hardware primitives introduced in Hopper-class GPUs (e.g., H100), motivating FlashAttention-3, which uses asynchronous Tensor Core and TMA instructions, warp specialization to overlap softmax with matrix multiplies, and FP8 low-precision support.

</aside>


<aside>

Interview Prep

</aside>

Coming Soon