====== Gemma 4 Multi-Token Prediction Drafters ======

**Gemma 4 Multi-Token Prediction Drafters** are specialized neural network checkpoints released by [[google|Google]], designed to accelerate inference through speculative decoding. These drafter models enable significantly faster token generation in large language models by predicting multiple tokens simultaneously, achieving up to a 3× decoding speedup while maintaining output quality (([[https://arxiv.org/abs/2211.17192|Leviathan et al. - Fast Inference from Transformers via Speculative Decoding (2022)]])). The drafter checkpoints are available in multiple sizes, including 31B and 26B parameter variants, with immediate integration support across major inference frameworks.

===== Technical Architecture and Speculative Decoding =====

Speculative decoding accelerates language model inference by decoupling token generation into two phases: draft generation and verification. Multi-token prediction drafters are auxiliary models that generate multiple candidate tokens in parallel, which the larger base model then validates in a single forward pass (([[https://arxiv.org/abs/2302.01318|Chen et al. - Accelerating Large Language Model Decoding with Speculative Sampling (2023)]])).

The Gemma 4 drafter architecture employs **multi-token prediction**: rather than predicting only the next token, the drafter predicts several future tokens within a single inference call. This reduces the number of sequential forward passes required to generate a complete sequence (([[https://tldr.tech/ai/2026-05-06|TLDR AI (2026)]])). By pairing parallel drafting with parallel verification, the architecture removes the one-token-at-a-time latency bottleneck of traditional autoregressive generation (([[https://tldr.tech/ai/2026-05-06|TLDR AI (2026)]])).

The verification stage determines which drafted tokens are correct according to the base model's probability distribution, and only verified tokens are committed to the output sequence. Because the base model validates all drafted tokens simultaneously in one forward pass, verification adds little latency of its own; this parallel check is the speed mechanism central to speculative decoding (([[https://tldr.tech/ai/2026-05-06|TLDR AI (2026)]])). Rejected tokens are discarded, and generation resumes from the first point of disagreement; the accept/reject rule is illustrated in the first sketch below.

The theoretical speedup approaches the number of tokens predicted per draft phase, though the practical speedup depends on draft accuracy and rejection rates. With proper configuration, the drafter achieves approximately 3× wall-clock speedup with no quality degradation: the output distribution remains mathematically identical to that of standard autoregressive decoding (([[https://arxiv.org/abs/2211.17192|Leviathan et al. - Fast Inference from Transformers via Speculative Decoding (2022)]])).

===== Model Sizes and Framework Integration =====

[[google|Google]] released Gemma 4 drafter checkpoints across multiple size configurations to balance draft quality against computational overhead. The **31B parameter variant** serves as the full-size drafter for maximum draft accuracy, while the **26B variant** provides a lighter-weight alternative with only marginal quality differences. How a drafter's acceptance rate trades off against its per-call cost determines the realized speedup, as the sketches below illustrate.
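The accept/reject loop described above is the generic speculative sampling procedure of Leviathan et al. and Chen et al.; a short toy sketch follows. This is not Gemma 4's actual implementation: ''speculative_step'', its arguments, and the random demo distributions are hypothetical stand-ins for real model outputs.

<code python>
import numpy as np

def speculative_step(base_probs, draft_probs, draft_tokens, rng):
    """One accept/reject pass of speculative sampling.

    base_probs   -- base-model distributions at gamma+1 positions
    draft_probs  -- drafter distributions at the gamma drafted positions
    draft_tokens -- the gamma tokens proposed by the drafter
    Returns the tokens committed in this step.
    """
    committed = []
    for i, tok in enumerate(draft_tokens):
        p, q = base_probs[i][tok], draft_probs[i][tok]
        # Accept with probability min(1, p/q): this rule keeps the output
        # distribution identical to sampling from the base model alone.
        if rng.random() < min(1.0, p / q):
            committed.append(int(tok))
            continue
        # On rejection, resample from the renormalized residual max(0, p - q)
        # and stop at this first point of disagreement.
        residual = np.maximum(base_probs[i] - draft_probs[i], 0.0)
        residual /= residual.sum()
        committed.append(int(rng.choice(len(residual), p=residual)))
        return committed
    # All gamma drafts accepted: the extra base-model position yields a
    # "bonus" token for free.
    committed.append(int(rng.choice(len(base_probs[-1]), p=base_probs[-1])))
    return committed

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    vocab, gamma = 8, 3
    base = rng.dirichlet(np.ones(vocab), size=gamma + 1)  # stand-in base model
    draft = rng.dirichlet(np.ones(vocab), size=gamma)     # stand-in drafter
    tokens = [int(rng.choice(vocab, p=draft[i])) for i in range(gamma)]
    print("committed this step:", speculative_step(base, draft, tokens, rng))
</code>

The essential property is that acceptance with probability min(1, p/q) plus residual resampling reproduces the base model's distribution exactly, which is why speculative decoding is lossless.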
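How drafter size translates into wall-clock speedup can be estimated with the expected-acceptance analysis of Leviathan et al.: if each drafted token is accepted independently with probability α and the drafter proposes γ tokens per step, the expected number of tokens committed per base-model pass is (1 − α^(γ+1)) / (1 − α). In the back-of-envelope sketch below, the acceptance rates and drafter-to-base cost ratios ''c'' are illustrative assumptions, not published Gemma 4 measurements.

<code python>
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens committed per verification pass (Leviathan et al.),
    assuming each drafted token is accepted i.i.d. with probability alpha."""
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

def wall_clock_speedup(alpha: float, gamma: int, c: float) -> float:
    """Speedup over plain autoregressive decoding when one drafter call
    costs c times one base-model call (alpha and c are assumptions)."""
    return expected_tokens_per_step(alpha, gamma) / (gamma * c + 1)

# Illustrative only: a larger drafter raises alpha but also its cost ratio c.
for alpha, c in [(0.80, 0.05), (0.85, 0.10), (0.90, 0.20)]:
    print(f"alpha={alpha:.2f}, c={c:.2f}: "
          f"{wall_clock_speedup(alpha, gamma=5, c=c):.2f}x speedup")
</code>

Under these toy numbers a well-matched drafter lands near the reported 3× figure, and a larger drafter whose extra acceptance does not cover its extra cost can actually lower the realized speedup, which is one reason multiple checkpoint sizes are offered.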
Smaller drafter sizes are also available for deployment scenarios with constrained computational resources or where smaller base models are served, including ultra-lightweight variants such as the **78M draft model** for memory-constrained deployments (([[https://www.latent.space/p/ainews-silicon-valley-gets-serious|Latent Space (2026)]])).

The drafter checkpoints achieved immediate integration into major inference frameworks, including **[[vllm|vLLM]]**, **TensorRT-LLM**, **Ray Serve**, and other production-grade serving systems (([[https://news.smol.ai/issues/26-05-05-not-much/|AI News - Gemma 4 Multi-Token Prediction Drafters Release (2026)]])). Support extends across the broader inference ecosystem, with implementations available for **Transformers**, **MLX**, **SGLang**, **Ollama**, and **AI Edge** (([[https://www.latent.space/p/ainews-silicon-valley-gets-serious|Latent Space (2026)]])). This breadth of framework support enables rapid adoption on existing inference infrastructure without custom implementation work, and keeps the drafters accessible to organizations running varied serving stacks.

===== Performance Characteristics and Practical Implications =====

The 3× decoding speedup is a substantial reduction in inference latency at no cost in quality, directly addressing one of the primary bottlenecks in large language model deployment. Because autoregressive decoding is typically memory-bandwidth-bound rather than compute-bound, the base model can verify several drafted tokens in one forward pass for roughly the cost of generating a single token, so reducing the number of base-model forward passes yields near-proportional latency improvements (([[https://arxiv.org/abs/2302.01318|Chen et al. - Accelerating Large Language Model Decoding with Speculative Sampling (2023)]])).

Zero quality degradation means the output probability distributions remain identical to those of standard autoregressive sampling, preserving semantic correctness and consistency with non-optimized inference. This property is critical for applications that require deterministic behavior or consistent outputs across different serving configurations. Gemma 4's implementation maintains [[reasoning_capabilities|reasoning capabilities]] alongside the inference speedup, so task performance is preserved even as latency drops (([[https://tldr.tech/ai/2026-05-06|TLDR AI (2026)]])). The practical impact spans lower inference cost, lower per-token latency for streaming applications, and reduced computational resource utilization.

===== Integration and Deployment Considerations =====

Deploying Gemma 4 Multi-Token Prediction Drafters requires pairing the drafter checkpoint with its corresponding base Gemma 4 model. Framework integration abstracts much of this complexity, automatically handling token prediction, verification, and sequence continuation; an illustrative configuration appears below. The drafter remains stateless across requests, which simplifies distributed serving configurations and enables straightforward horizontal scaling.

Memory overhead is a further consideration: the GPU must hold both the drafter and the base model in VRAM simultaneously. The 31B and 26B drafter variants are designed to fit within typical GPU memory constraints when paired with Gemma 4 base models, though exact requirements depend on quantization choices and batch size configurations.
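For a rough sense of that overhead, weight memory can be budgeted with simple arithmetic. The parameter counts below are assumptions for illustration (no base-model size is stated in the release notes), and KV cache and activations are ignored:

<code python>
def weight_vram_gib(params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GiB (weights only; KV cache,
    activations, and framework overhead are excluded)."""
    return params_billion * 1e9 * bytes_per_param / 2**30

# Illustrative pairing only: a hypothetical Gemma 4 base-model size
# alongside the 26B drafter checkpoint.
base_b, drafter_b = 70.0, 26.0
for label, bytes_pp in [("bf16", 2.0), ("int8", 1.0), ("int4", 0.5)]:
    total = weight_vram_gib(base_b, bytes_pp) + weight_vram_gib(drafter_b, bytes_pp)
    print(f"{label:>4}: ~{total:.0f} GiB for base + drafter weights")
</code>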
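As an illustration of how the framework abstraction looks in practice, the sketch below uses vLLM's speculative decoding configuration. The checkpoint names are hypothetical placeholders, and the option names follow recent vLLM releases (older versions exposed separate ''speculative_model''-style arguments), so treat this as a sketch rather than a pinned recipe:

<code python>
from vllm import LLM, SamplingParams

# Hypothetical checkpoint identifiers -- substitute the actual Gemma 4
# base and drafter names published on the model hub.
llm = LLM(
    model="google/gemma-4",                     # placeholder base model
    speculative_config={
        "model": "google/gemma-4-drafter-26b",  # placeholder drafter
        "num_speculative_tokens": 5,            # draft length per step
    },
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain speculative decoding briefly."], params)
print(outputs[0].outputs[0].text)
</code>

In the Transformers ecosystem the analogous mechanism is assisted generation, where the drafter checkpoint is passed as the ''assistant_model'' argument to ''generate()''.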
The computational cost of running the drafter itself is outweighed by the latency reduction achieved through fewer base-model forward passes (([[https://arxiv.org/abs/2211.17192|Leviathan et al. - Fast Inference from Transformers via Speculative Decoding (2022)]])).

===== See Also =====

  * [[multi_token_prediction|Multi-Token Prediction (MTP)]]
  * [[ngram_speculative|N-gram Speculative Decoding]]
  * [[dtree_speculative|DTree]]
  * [[dflash_speculative|DFlash]]
  * [[qwen3_models|Qwen3 Models]]

===== References =====