<aside>

[Daily AI Interview Questions] 10.What is Speculative Decoding, and how does it improve LLM inference throughput without changing model outputs?

Speculative Decoding addresses the memory-bandwidth bottleneck of autoregressive inference by converting sequential generation into parallel verification (Leviathan et al., 2023, arXiv:2211.17192). In standard decoding, the processor loads the entire massive weight matrix of the target model from High Bandwidth Memory (HBM) to generate a single token, leaving compute units underutilized.

Speculative Decoding mitigates this by introducing a smaller, faster "draft" model to predict a sequence of γ (gamma) future tokens. The large "target" model then processes this entire drafted sequence in a single forward pass. Because modern accelerators are compute-rich but memory-starved, the latency of passing γ tokens through the target model simultaneously is nearly identical to passing just one token.

Drafting Phase: A lightweight model (e.g., a 7B parameter model drafting for a 70B parameter model) autoregressively generates a short sequence of γ tokens.
Verification Phase: The target model computes the logits for all γ drafted tokens in parallel.
Lossless Acceptance: A modified rejection sampling algorithm accepts or rejects the drafted tokens. If a token is rejected, the target model uses its computed logits to sample the correct replacement token, discarding the rest of the draft.

⚠️ Limitations & Caveats

The actual speedup is highly dependent on the "acceptance rate" (the alignment between the draft and target models); if the draft model frequently guesses incorrectly, the system incurs the overhead of the draft model without reaping the parallelization benefits.
Deploying Speculative Decoding requires hosting two separate models simultaneously, which increases the total VRAM footprint and complicates orchestration logic.

🧪 Core Insights & Mathematical Foundations

\begin{aligned} & \text{[Acceptance Probability]: } P_{accept}(x) = \min\left(1, \frac{P_{target}(x)}{P_{draft}(x)}\right) \\ & \text{[Resampling Distribution (if rejected)]: } P_{resample}(x) = \frac{\max(0, P_{target}(x) - P_{draft}(x))}{\sum_{x'} \max(0, P_{target}(x') - P_{draft}(x'))} \\ & \text{[Expected Tokens per Step]: } \mathbb{E}[N] = 1 + \sum_{i=1}^{\gamma} \prod_{j=1}^{i} P_{accept}(x_j) \\ & \text{[Wall-Clock Speedup Ratio]: } S = \frac{\gamma \cdot t_{target}}{t_{target} + \gamma \cdot t_{draft}} \times \frac{\mathbb{E}[N]}{\gamma} \\ & \text{[Medusa Head Objective]: } \mathcal{L}{Medusa} = \sum{k=1}^{K} \lambda_k \cdot \text{CrossEntropy}(P_{target}^{(t+k)}, P_{head_k}^{(t)}) \end{aligned} $$

Follow-up 1: Explain the rejection sampling mechanism that guarantees the target distribution is preserved.

To ensure the output distribution is mathematically identical to what the target model would have produced alone, Speculative Decoding uses a specialized rejection sampling scheme. For a given drafted token x, the system compares the probability assigned by the target model (P_target) to the probability assigned by the draft model (P_draft).

If P_target(x) ≥ P_draft(x), the token is accepted with 100% probability, meaning the draft model correctly anticipated a highly likely token. If P_target(x) < P_draft(x), the token is accepted with a probability equal to the ratio P_target(x) / P_draft(x); this handles cases where the draft model was overly confident about a token that the target model considers less likely.

If a token is rejected through this stochastic process, the sequence breaks. The system must then sample a new token from a modified probability distribution (the "Resampling Distribution" in the math block) that subtracts the draft model's probability mass from the target model's distribution. This correction ensures the final marginal probability exactly matches P_target.

⚠️ Limitations & Caveats

Because the sampling relies on comparing full probability distributions, both models must use the exact same tokenizer.
When using temperature scaling (T > 0), the acceptance rate typically drops because the distributions become flatter, making exact matches between the draft and target models less frequent.

Follow-up 2: How do draft-free architectures like Medusa address the complexities of two-model speculative decoding? (Optional)

Standard Speculative Decoding requires finding a smaller draft model that perfectly matches the tokenizer and adequately approximates the reasoning of the target model, which is often difficult in practice. Architectures like Medusa (Cai et al., 2024, arXiv:2401.10774) eliminate the secondary model entirely by grafting multiple independent "decoding heads" onto the final hidden layer of the target model itself.

Each Medusa head is trained to predict a different future token offset (e.g., Head 1 predicts token t+1, Head 2 predicts token t+2). During inference, the model generates a tree of possible future token sequences in a single forward pass. The system then uses a hardware-efficient "Tree Attention" mechanism to verify all candidate paths simultaneously, accepting the longest valid sequence.

Dimension	Standard Speculative Decoding	Medusa (Draft-Free)
Computational Cost	High (Requires running an entire separate LLM)	Low (Only adds a few linear layers to the base model)
VRAM Overhead	High (Must load weights of a second model)	Minimal (Base model + small projection heads)
Data Requirements	None (Uses off-the-shelf pre-trained models)	Requires fine-tuning the base model to train the Medusa heads
Known Failure Modes	Frequent rejections if draft model is poorly aligned	Tree Attention adds complex custom CUDA kernel dependencies

⚠️ Limitations & Caveats

Medusa requires an additional training phase to optimize the specialized heads, meaning it cannot be applied "zero-shot" to a newly released base model without access to a training pipeline and compute.
The length of the candidate tree grows exponentially with the number of heads, which can quickly saturate the KV cache and memory bandwidth if the tree depth is set too high. </aside>