====== Meta Llama 4 Scout ======

**Meta Llama 4 Scout** is a multimodal Mixture-of-Experts (MoE) language model released by Meta on April 5, 2025. Its headline feature is a **10-million-token context window**, the longest of any production model at launch. With 109 billion total parameters but only 17 billion active per token, Scout delivers frontier-class performance while running on a single NVIDIA H100 GPU. ((Source: [[https://ai.meta.com/blog/llama-4-multimodal-intelligence/|Meta AI — Llama 4: Multimodal Intelligence]]))

===== Architecture =====

Scout uses a sparse MoE design optimized for efficiency:

  * **Total parameters:** 109 billion
  * **Active parameters per token:** 17 billion
  * **Experts:** 16 total, 2 active per token
  * **Layers:** 80
  * **Hidden dimension:** 8,192
  * **Attention heads:** 64 (8 KV heads, Grouped-Query Attention)
  * **Position embeddings:** iRoPE (interleaved Rotary Position Embeddings)
  * **Context window:** 10 million tokens (extrapolated from a 256K training context)

The interleaved attention layers allow the model to generalize to sequences far longer than the training context. ((Source: [[https://ai.meta.com/blog/llama-4-multimodal-intelligence/|Meta AI — Llama 4]]))

===== Training =====

Scout was trained on **40 trillion tokens** spanning both text and images across pre-training and post-training. ((Source: [[https://gpt-trainer.com/blog/llama+4+evolution+features+comparison|GPT-Trainer — Llama 4 Evolution]])) The model uses early fusion for multimodal processing, integrating image understanding directly into the core architecture rather than bolting on separate vision modules.

===== 10-Million-Token Context =====

The 10-million-token context window is Scout's most distinctive capability. The model is trained with a 256K-token context and relies on iRoPE position embeddings to extrapolate to 10 million tokens at inference time.
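As a rough illustration of what a 10-million-token window implies for serving, the sketch below estimates KV-cache size from the architecture figures listed above. The head dimension is an assumption (hidden size divided by attention heads, 8192 / 64 = 128), and an fp16 cache is assumed; real deployments often quantize or page the cache.

```python
# Back-of-the-envelope KV-cache estimate using the figures from this article.
# head_dim = hidden_size / attention_heads = 8192 / 64 = 128 (assumed);
# dtype_bytes = 2 assumes an fp16/bf16 cache.

def kv_cache_bytes(seq_len, n_layers=80, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    """Bytes needed to cache keys and values for seq_len tokens."""
    # Factor of 2 covers keys plus values, stored at every layer
    # for each of the 8 KV heads (Grouped-Query Attention).
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

# At the 256K training context, the cache is sizable but tractable:
print(f"{kv_cache_bytes(256_000) / 1e9:.1f} GB")      # ~83.9 GB

# A full 10M-token window is roughly 40x larger, which is why
# long-context serving leans on cache quantization and paging:
print(f"{kv_cache_bytes(10_000_000) / 1e12:.2f} TB")  # ~3.28 TB
```

Note how Grouped-Query Attention already helps here: caching only 8 KV heads instead of all 64 query heads shrinks the cache eightfold compared to standard multi-head attention.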
This enables:

  * Processing entire codebases in a single prompt
  * Multi-document summarization across hundreds of documents
  * Extended user-activity analysis and personalization
  * "Needle-in-a-haystack" retrieval across massive corpora

Scout maintains low negative log-likelihood over 10 million code tokens, demonstrating robust long-context comprehension. ((Source: [[https://ai.meta.com/blog/llama-4-multimodal-intelligence/|Meta AI — Llama 4]]))

===== Benchmark Performance =====

  * **MMMU (0-shot):** 69.4%
  * **MMMU Pro:** 52.2%
  * **MathVista:** 70.7%
  * Best-in-class image grounding, visual QA, and document QA ((Source: [[https://build.nvidia.com/meta/llama-4-scout-17b-16e-instruct/modelcard|NVIDIA — Llama 4 Scout Model Card]]))

===== Deployment =====

Scout is deployable on a **single H100 GPU** using Int4 quantization (approximately 54.5 GB of weights plus KV cache), making it one of the most accessible frontier models. Full-precision weights require 4x H100s. Inference costs approximately **$0.09 per million tokens**. ((Source: [[https://apxml.com/models/llama-4-scout|APXML — Llama 4 Scout]]))

The model is released under the **Llama 4 Community License**, which permits most uses but imposes restrictions on organizations exceeding 700 million monthly active users. It supports fine-tuning in 12 languages.

===== Comparison with Llama 4 Maverick =====

Scout's sibling model, **Llama 4 Maverick**, uses 400 billion total parameters (17B active) and posts higher benchmark scores (MMMU 73.4%, MathVista 73.7%), but it requires 3x H100s and has a 1M-token default context window. Scout prioritizes efficiency and accessibility for single-GPU deployment. ((Source: [[https://www.runpod.io/blog/llama4-scout-maverick|RunPod — Llama 4 Scout vs Maverick]]))

===== See Also =====

  * [[meta_ai|Meta AI]]
  * [[mixture_of_experts|Mixture of Experts]]
  * [[context_window|Context Window]]

===== References =====