====== Meta Llama 4 Scout ======

**Meta Llama 4 Scout** is a multimodal Mixture-of-Experts (MoE) language model released by Meta on April 5, 2025. Its headline feature is a **10-million-token context window**, the longest of any production model at launch. With 109 billion total parameters but only 17 billion active per token, Scout delivers frontier-class performance while running on a single NVIDIA H100 GPU. ((Source: [[https://ai.meta.com/blog/llama-4-multimodal-intelligence/|Meta AI — Llama 4: Multimodal Intelligence]]))

===== Architecture =====

Scout uses a sparse MoE design optimized for efficiency:

  * **Total parameters:** 109 billion
  * **Active parameters per token:** 17 billion
  * **Experts:** 16 total, 2 active per token
  * **Layers:** 80
  * **Hidden dimension:** 8,192
  * **Attention heads:** 64 (8 KV heads, Grouped-Query Attention)
  * **Position embeddings:** iRoPE (interleaved Rotary Position Embeddings)
  * **Context window:** 10 million tokens (extrapolated from a 256K training context)

The interleaved attention layers allow the model to generalize to sequences far longer than the training context. ((Source: [[https://ai.meta.com/blog/llama-4-multimodal-intelligence/|Meta AI — Llama 4]]))

===== Training =====

Scout was trained on **40 trillion tokens** spanning both text and images across pre-training and post-training. ((Source: [[https://gpt-trainer.com/blog/llama+4+evolution+features+comparison|GPT-Trainer — Llama 4 Evolution]])) The model uses early fusion for multimodal processing, integrating image understanding directly into the core architecture rather than bolting on separate vision modules.

===== 10-Million-Token Context =====

The 10-million-token context window is Scout's most distinctive capability. The model is trained with a 256K-token context and relies on iRoPE position embeddings to extrapolate to 10 million tokens at inference time.
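As a rough illustration of what a 10-million-token window implies for serving, the sketch below estimates KV-cache size from the architecture figures listed above. The head dimension is an assumption (hidden size divided by attention heads, 8192 / 64 = 128), and an fp16 cache is assumed; real deployments often quantize or page the cache.

```python
# Back-of-the-envelope KV-cache estimate using the figures from this article.
# head_dim = hidden_size / attention_heads = 8192 / 64 = 128 (assumed);
# dtype_bytes = 2 assumes an fp16/bf16 cache.

def kv_cache_bytes(seq_len, n_layers=80, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    """Bytes needed to cache keys and values for seq_len tokens."""
    # Factor of 2 covers keys plus values, stored at every layer
    # for each of the 8 KV heads (Grouped-Query Attention).
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

# At the 256K training context, the cache is sizable but tractable:
print(f"{kv_cache_bytes(256_000) / 1e9:.1f} GB")      # ~83.9 GB

# A full 10M-token window is roughly 40x larger, which is why
# long-context serving leans on cache quantization and paging:
print(f"{kv_cache_bytes(10_000_000) / 1e12:.2f} TB")  # ~3.28 TB
```

Note how Grouped-Query Attention already helps here: caching only 8 KV heads instead of all 64 query heads shrinks the cache eightfold compared to standard multi-head attention.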
This enables:

  * Processing entire codebases in a single prompt
  * Multi-document summarization across hundreds of documents
  * Extended user-activity analysis and personalization
  * "Needle-in-a-haystack" retrieval across massive corpora

Scout maintains low negative log-likelihood over 10 million code tokens, demonstrating robust long-context comprehension. ((Source: [[https://ai.meta.com/blog/llama-4-multimodal-intelligence/|Meta AI — Llama 4]]))

===== Benchmark Performance =====

  * **MMMU (0-shot):** 69.4%
  * **MMMU Pro:** 52.2%
  * **MathVista:** 70.7%
  * Best-in-class image grounding, visual QA, and document QA ((Source: [[https://build.nvidia.com/meta/llama-4-scout-17b-16e-instruct/modelcard|NVIDIA — Llama 4 Scout Model Card]]))

===== Deployment =====

Scout is deployable on a **single H100 GPU** using Int4 quantization (approximately 54.5 GB of weights plus KV cache), making it one of the most accessible frontier models. Full-precision weights require 4x H100s. Inference costs approximately **$0.09 per million tokens**. ((Source: [[https://apxml.com/models/llama-4-scout|APXML — Llama 4 Scout]]))

The model is released under the **Llama 4 Community License**, which permits most uses but imposes restrictions on organizations exceeding 700 million monthly active users. It supports fine-tuning in 12 languages.

===== Comparison with Llama 4 Maverick =====

Scout's sibling model, **Llama 4 Maverick**, uses 400 billion total parameters (17B active) and posts higher benchmark scores (MMMU 73.4%, MathVista 73.7%), but it requires 3x H100s and has a 1M-token default context window. Scout prioritizes efficiency and accessibility for single-GPU deployment. ((Source: [[https://www.runpod.io/blog/llama4-scout-maverick|RunPod — Llama 4 Scout vs Maverick]]))

===== See Also =====

  * [[meta_ai|Meta AI]]
  * [[mixture_of_experts|Mixture of Experts]]
  * [[context_window|Context Window]]

===== References =====