<aside> 📢

<aside> 🎯

Master AI & ML with Educatum, Your AI University

Curated resources from leading universities and industry experts to help you master artificial intelligence.

Build your knowledge base, and prepare for interviews.

Join a study group and learn together.

Discover top AI tools and companies.

Connect with like-minded professionals.

No ads, no noise.

</aside>

<aside>

[Daily AI Interview Questions] 10. What is Speculative Decoding, and how does it improve LLM inference throughput without changing model outputs?

Speculative Decoding addresses the memory-bandwidth bottleneck of autoregressive inference by converting sequential generation into parallel verification (Leviathan et al., 2023, arXiv:2211.17192). In standard decoding, the accelerator must stream the target model's entire weights from High Bandwidth Memory (HBM) to generate each single token, leaving compute units underutilized.

Speculative Decoding mitigates this by introducing a smaller, faster "draft" model to predict a sequence of γ (gamma) future tokens. The large "target" model then processes this entire drafted sequence in a single forward pass. Because modern accelerators are compute-rich but memory-starved, the latency of passing γ tokens through the target model simultaneously is nearly identical to passing just one token.
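To make the draft-then-verify loop concrete, here is a minimal sketch of the simpler *greedy-acceptance* variant (a drafted token is accepted only if it matches the target model's argmax); the stochastic acceptance rule is treated in Follow-up 1. Both model callables are hypothetical stand-ins, not a real API:

```python
import numpy as np

def greedy_speculative_step(draft_next, target_logits_batch, prefix, gamma=4):
    """One draft-then-verify step (greedy variant, illustrative only).

    draft_next(tokens) -> next token id from the cheap draft model.
    target_logits_batch(tokens) -> logits for every position in ONE pass.
    """
    # 1. Draft gamma tokens autoregressively with the cheap model.
    drafted, ctx = [], list(prefix)
    for _ in range(gamma):
        t = draft_next(ctx)
        drafted.append(t)
        ctx.append(t)

    # 2. Single target forward pass over prefix + drafted tokens.
    logits = target_logits_batch(list(prefix) + drafted)  # [seq_len, vocab]
    # Row (len(prefix)-1+i) is the target's prediction for drafted[i].
    target_choice = logits[len(prefix) - 1:].argmax(axis=-1)

    # 3. Accept the longest prefix where draft and target agree, then
    #    append the target's own next token (always >= 1 token per step).
    accepted = []
    for i, tok in enumerate(drafted):
        if tok == target_choice[i]:
            accepted.append(int(tok))
        else:
            break
    accepted.append(int(target_choice[len(accepted)]))
    return accepted
```

With a well-aligned draft model, most steps return several tokens for the cost of one target forward pass, which is where the throughput gain comes from.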


🧪 Core Insights & Mathematical Foundations

$$
\begin{aligned}
& \text{[Acceptance Probability]: } P_{accept}(x) = \min\left(1, \frac{P_{target}(x)}{P_{draft}(x)}\right) \\
& \text{[Resampling Distribution (if rejected)]: } P_{resample}(x) = \frac{\max(0, P_{target}(x) - P_{draft}(x))}{\sum_{x'} \max(0, P_{target}(x') - P_{draft}(x'))} \\
& \text{[Expected Tokens per Step]: } \mathbb{E}[N] = 1 + \sum_{i=1}^{\gamma} \prod_{j=1}^{i} P_{accept}(x_j) \\
& \text{[Wall-Clock Speedup Ratio]: } S = \frac{\gamma \cdot t_{target}}{t_{target} + \gamma \cdot t_{draft}} \times \frac{\mathbb{E}[N]}{\gamma} \\
& \text{[Medusa Head Objective]: } \mathcal{L}_{Medusa} = \sum_{k=1}^{K} \lambda_k \cdot \text{CrossEntropy}\left(P_{target}^{(t+k)}, P_{head_k}^{(t)}\right)
\end{aligned}
$$

Follow-up 1: Explain the rejection sampling mechanism that guarantees the target distribution is preserved.

To ensure the output distribution is mathematically identical to what the target model would have produced alone, Speculative Decoding uses a specialized rejection sampling scheme. For a given drafted token x, the system compares the probability assigned by the target model (P_target) to the probability assigned by the draft model (P_draft).

If P_target(x) ≥ P_draft(x), the token is accepted with 100% probability, meaning the draft model correctly anticipated a highly likely token. If P_target(x) < P_draft(x), the token is accepted with a probability equal to the ratio P_target(x) / P_draft(x); this handles cases where the draft model was overly confident about a token that the target model considers less likely.

If a token is rejected through this stochastic process, the sequence breaks. The system must then sample a new token from a modified probability distribution (the "Resampling Distribution" in the math block) that subtracts the draft model's probability mass from the target model's distribution. This correction ensures the final marginal probability exactly matches P_target.
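The accept/resample rule can be sketched in a few lines; this assumes access to the full probability vectors over the vocabulary at the drafted position (function and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def verify_token(x, p_target, p_draft):
    """Accept or reject one drafted token x; return (accepted, final_token).

    p_target, p_draft: probability vectors over the vocabulary at this
    position. Implements the acceptance and residual-resampling rule of
    Leviathan et al. (2023), which preserves p_target exactly.
    """
    # Accept with probability min(1, p_target[x] / p_draft[x]).
    if rng.random() < min(1.0, p_target[x] / p_draft[x]):
        return True, int(x)
    # Rejected: sample from the normalized positive residual
    # max(0, p_target - p_draft). This puts back exactly the probability
    # mass the acceptance step removed, so the marginal equals p_target.
    residual = np.maximum(p_target - p_draft, 0.0)
    residual /= residual.sum()
    return False, int(rng.choice(len(residual), p=residual))
```

A quick Monte Carlo check (drafting tokens from `p_draft` and verifying them) confirms the output frequencies converge to `p_target`, which is the lossless-output guarantee the question refers to.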


Follow-up 2: How do draft-free architectures like Medusa address the complexities of two-model speculative decoding? (Optional)

Standard Speculative Decoding requires finding a smaller draft model that perfectly matches the tokenizer and adequately approximates the reasoning of the target model, which is often difficult in practice. Architectures like Medusa (Cai et al., 2024, arXiv:2401.10774) eliminate the secondary model entirely by grafting multiple independent "decoding heads" onto the final hidden layer of the target model itself.

Each Medusa head is trained to predict a different future token offset (e.g., Head 1 predicts token t+1, Head 2 predicts token t+2). During inference, the model generates a tree of possible future token sequences in a single forward pass. The system then uses a hardware-efficient "Tree Attention" mechanism to verify all candidate paths simultaneously, accepting the longest valid sequence.
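A minimal structural sketch of the Medusa idea is shown below. The weights here are random stand-ins (real Medusa heads are fine-tuned with the λ-weighted cross-entropy objective in the math block above, while the base model stays frozen), and the real system verifies the candidate tree with custom Tree Attention kernels rather than plain Python:

```python
import numpy as np

class MedusaHeads:
    """Illustrative sketch: K extra linear heads over the final hidden
    state, where head k predicts the token at offset t+k."""

    def __init__(self, d_model, vocab_size, k=3, seed=0):
        rng = np.random.default_rng(seed)
        # One linear projection per future offset (random stand-ins here).
        self.W = [rng.standard_normal((d_model, vocab_size)) for _ in range(k)]

    def propose(self, hidden, top_k=2):
        """From one hidden state, return the top-k candidate tokens per
        offset. The Cartesian product of these per-offset candidates is
        the tree that Tree Attention verifies in a single forward pass."""
        candidates = []
        for W in self.W:
            logits = hidden @ W                       # [vocab_size]
            top = np.argsort(logits)[-top_k:][::-1]   # best-first order
            candidates.append([int(t) for t in top])
        return candidates
```

The key contrast with two-model speculation is visible in the constructor: the extra state is just K projection matrices, not a second set of transformer weights.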

| Dimension | Standard Speculative Decoding | Medusa (Draft-Free) |
| --- | --- | --- |
| Computational Cost | High (requires running an entire separate LLM) | Low (only adds a few linear layers to the base model) |
| VRAM Overhead | High (must load weights of a second model) | Minimal (base model + small projection heads) |
| Data Requirements | None (uses off-the-shelf pre-trained models) | Requires fine-tuning the base model to train the Medusa heads |
| Known Failure Modes | Frequent rejections if draft model is poorly aligned | Tree Attention adds complex custom CUDA kernel dependencies |


<aside>

[Daily AI Interview Questions] 9. What are the primary bottlenecks in auto-regressive LLM inference, and how do standard optimizations mitigate them?

In auto-regressive Large Language Model (LLM) generation, execution is fundamentally split into two phases: a compute-heavy prefill phase and a memory-bandwidth-bound decoding phase. During the prefill phase, the entire prompt is processed in parallel with self-attention operating at O(L²) complexity, resulting in high arithmetic intensity and making this stage primarily compute-bound.

During decoding, the model generates one token at a time, requiring the inference engine to repeatedly read massive weight matrices from High Bandwidth Memory (HBM). Because the number of FLOPs performed per byte transferred is exceedingly low, the GPU cannot fully utilize its Tensor Cores, making the decoding stage strictly memory-bandwidth-bound. To mitigate this memory wall and improve hardware utilization, standard inference engines employ techniques that either reduce memory traffic or increase the useful work done per memory fetch, such as KV caching, weight quantization, and continuous batching.
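The memory wall is easy to see with a back-of-envelope roofline estimate. All hardware numbers below are illustrative assumptions (a hypothetical 7B-parameter FP16 model on a GPU with ~1 TB/s of HBM bandwidth and ~300 TFLOP/s of FP16 compute), not measurements:

```python
# Roofline sketch for a single-token decode step.
params = 7e9                      # hypothetical 7B-parameter model
bytes_per_param = 2               # FP16 weights
flops = 2 * params                # ~2 FLOPs per parameter per token (matmul)
bytes_moved = params * bytes_per_param   # every weight is read once per token

# Arithmetic intensity: FLOPs per byte transferred (formula above).
intensity = flops / bytes_moved          # ~1 FLOP/byte

# Machine balance: compute throughput per unit of memory bandwidth.
machine_balance = 300e12 / 1e12          # ~300 FLOPs/byte

# intensity << machine_balance, so the decode step is bandwidth-bound:
# the Tensor Cores starve while weights stream from HBM.
```

Speculative decoding and batching both attack this ratio from the numerator side: more useful FLOPs per byte of weights fetched.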


🧪 Core Insights & Mathematical Foundations

$$
\begin{aligned}
& \text{[Arithmetic Intensity]: } I = \frac{\text{Total FLOPs}}{\text{Memory Bytes Transferred}} \\
& \text{[KV Cache Size per Token]: } S_{token} = 2 \times n_{layers} \times n_{heads} \times d_{head} \times \text{precision bytes} \\
& \text{[Total KV Cache Size]: } S_{total} = B \times L \times S_{token} \quad (B = \text{Batch}, L = \text{Seq Len}) \\
& \text{[Symmetric Quantization]: } X_{int} = \text{round}\left(\frac{X_{float}}{s}\right), \quad s = \frac{\max(|X_{float}|)}{2^{b-1}-1} \\
& \text{[Dequantization]: } \tilde{X}_{float} = X_{int} \times s
\end{aligned}
$$
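Plugging illustrative Llama-2-7B-scale numbers (32 layers, 32 KV heads, head dimension 128, FP16) into the KV-cache size formulas gives a sense of scale; exact figures depend on the actual model config:

```python
# KV cache sizing with the formulas above (numbers are illustrative).
n_layers, n_heads, d_head, precision_bytes = 32, 32, 128, 2  # FP16

# S_token = 2 (K and V) * layers * heads * head_dim * bytes
s_token = 2 * n_layers * n_heads * d_head * precision_bytes

B, L = 16, 2048                    # batch size, sequence length
s_total = B * L * s_token          # S_total = B * L * S_token

print(s_token)             # 524288 bytes -> 0.5 MiB of cache per token
print(s_total / 2**30)     # 16.0 GiB for the whole batch
```

Half a megabyte per token means a modest batch of long sequences consumes more VRAM than the model weights themselves, which is why KV-cache management dominates serving-engine design.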

Follow-up 1: Explain the mechanics of the KV Cache and why it introduces memory fragmentation.

The KV Cache stores the Key and Value tensors produced by each self-attention layer for previously generated tokens. When the model predicts the token at position t, it computes the Query, Key, and Value only for that specific position. The new Query is compared against the cached Keys from positions 1 to t−1 to compute attention scores, which then weight the cached Values; this avoids recomputing past Keys and Values and reduces the per-step attention cost from quadratic to linear in the sequence length.

In early inference implementations, memory for the KV cache was allocated as large contiguous blocks sized for the maximum possible sequence length. Because real user requests have highly variable and unpredictable lengths, this static allocation strategy leads to significant memory fragmentation.

This fragmentation manifests in two forms: Internal fragmentation occurs when a sequence terminates long before reaching the reserved maximum length, leaving unused memory trapped within the allocated block. External fragmentation occurs when free VRAM becomes divided into many small non-contiguous regions that cannot satisfy a large contiguous allocation request for a newly arriving sequence.


Follow-up 2: How does Continuous Batching with PagedAttention solve the inefficiencies of standard static batching? (Optional)

Standard static batching processes multiple sequences synchronously. Because all sequences must advance in lockstep, shorter requests idle while waiting for the longest sequence in the batch to finish, creating GPU utilization bubbles. Continuous batching (iteration-level scheduling) improves utilization by dynamically inserting new requests into the batch the moment another request finishes generation.

To handle the highly dynamic memory allocation requirements of continuous batching, frameworks like vLLM (Kwon et al., 2023, arXiv:2309.06180) address the issue using PagedAttention. PagedAttention divides the KV cache into fixed-size memory blocks (pages) that do not need to be physically contiguous in VRAM. Each sequence maintains a logical mapping to its blocks via a block table, analogous to how operating systems manage virtual memory.
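The block-table idea can be sketched as a toy allocator. This is purely illustrative of the bookkeeping, not vLLM's API: the real system manages fixed-size GPU memory blocks with custom kernels, whereas this sketch uses a Python free-list:

```python
class PagedKVCache:
    """Toy block-table allocator in the spirit of PagedAttention."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # free-list of physical block ids
        self.tables = {}                      # seq_id -> list of block ids

    def append_token(self, seq_id, num_tokens_so_far):
        """Record one more token for a sequence, allocating a new physical
        block only when the logical length crosses a block boundary. The
        blocks in a table need not be contiguous in memory."""
        table = self.tables.setdefault(seq_id, [])
        if num_tokens_so_far % self.block_size == 0:
            table.append(self.free.pop())     # grab any free block
        return table

    def release(self, seq_id):
        # Finished sequence: its blocks immediately become reusable by any
        # other sequence, so no VRAM is stranded (no external fragmentation).
        self.free.extend(self.tables.pop(seq_id, []))
```

Because allocation happens one small block at a time and freed blocks return to a shared pool, internal waste is bounded by less than one block per sequence, which is the source of the near-zero fragmentation claim below.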

| Dimension | Standard Static Batching | Continuous Batching + PagedAttention |
| --- | --- | --- |
| Computational Cost | Low scheduling overhead | Higher scheduling overhead |
| Data Requirements | Efficient only when prompt lengths are uniform | Efficient across highly diverse prompt lengths |
| Memory Efficiency | High waste due to contiguous pre-allocation | Near-zero waste (<4% internal fragmentation) |
| Known Failure Modes | GPU idle bubbles, frequent OOM crashes | Block table lookup overhead, hardware lock-in |



<aside> <img src="/icons/reorder_gray.svg" alt="/icons/reorder_gray.svg" width="40px" />

Interview Prep

</aside>

Coming Soon