Meta Llama 4 Scout

Meta Llama 4 Scout is a multimodal Mixture-of-Experts (MoE) language model released by Meta on April 5, 2025, featuring a groundbreaking 10-million-token context window — the longest of any production model at launch. With 109 billion total parameters but only 17 billion active per token, Scout delivers frontier-class performance while running on a single NVIDIA H100 GPU. ¹⁾

Architecture

Scout uses a sparse MoE design optimized for efficiency:

Total parameters: 109 billion
Active parameters per token: 17 billion
Experts: 16 total, 2 active per token
Layers: 80
Hidden dimension: 8,192
Attention heads: 64 (8 KV heads, Grouped-Query Attention)
Position embeddings: iRoPE (interleaved Rotary Position Embeddings)
Context window: 10 million tokens (extrapolated from 256K training context)

The interleaved attention layers enable generalization to extremely long sequences far beyond the training context length. ²⁾

Training

Scout was pre-trained and post-trained on 40 trillion tokens spanning both text and images. ³⁾ The model uses early fusion for multimodal processing, integrating image understanding directly into the core architecture rather than bolting on separate vision modules.

10-Million-Token Context

The 10-million-token context window is Scout's most distinctive capability. Trained at 256K tokens, the model uses iRoPE position embeddings to extrapolate to 10M tokens at inference time. This enables:

Processing entire codebases in a single prompt
Multi-document summarization across hundreds of documents
Extended user activity analysis and personalization
“Needle-in-a-haystack” retrieval across massive corpora

Scout maintains low negative log-likelihood over 10 million code tokens, demonstrating robust long-context comprehension. ⁴⁾

Benchmark Performance

MMMU (0-shot): 69.4%
MMMU Pro: 52.2%
MathVista: 70.7%
Best-in-class image grounding, visual QA, and document QA ⁵⁾

Deployment

Scout is deployable on a single H100 GPU using Int4 quantization (approximately 54.5 GB plus KV cache), making it one of the most accessible frontier models. Full-precision weights require 4x H100s. Inference cost is approximately $0.09 per million tokens. ⁶⁾

The model is released under the Llama 4 Community License, which is open for most uses with restrictions for organizations exceeding 700 million monthly active users. It supports fine-tuning in 12 languages.

Comparison with Llama 4 Maverick

Scout's sibling model, Llama 4 Maverick, uses 400 billion total parameters (17B active) with higher benchmark scores (MMMU 73.4%, MathVista 73.7%) but requires 3x H100s and has a 1M default context window. Scout prioritizes efficiency and accessibility for single-GPU deployment. ⁷⁾

References

¹⁾

Source: Meta AI — Llama 4: Multimodal Intelligence

²⁾ , ⁴⁾

Source: Meta AI — Llama 4

³⁾

Source: GPT-Trainer — Llama 4 Evolution

⁵⁾

Source: NVIDIA — Llama 4 Scout Model Card

⁶⁾

Source: APXML — Llama 4 Scout

⁷⁾

Source: RunPod — Llama 4 Scout vs Maverick

AI Agent Knowledge Base

Sidebar

Table of Contents

Meta Llama 4 Scout

Architecture

Training

10-Million-Token Context

Benchmark Performance

Deployment

Comparison with Llama 4 Maverick

See Also

References

AI Agent Knowledge Base

User Tools

Site Tools

Sidebar

Table of Contents

Meta Llama 4 Scout

Architecture

Training

10-Million-Token Context

Benchmark Performance

Deployment

Comparison with Llama 4 Maverick

See Also

References

Page Tools