Distillation is a machine learning technique in which a smaller, more computationally efficient model (the student) is trained to approximate the behavior of a larger, more capable model (the teacher). The student model learns to reproduce the teacher's output distributions rather than learning directly from labeled data, enabling efficient deployment of high-performance models in resource-constrained environments.
Knowledge distillation emerged as a response to the computational constraints of deploying large neural networks. Rather than training smaller models from scratch on the original datasets, distillation leverages the learned representations of larger models to guide training. The teacher model has typically learned complex patterns and generalizations that a student can absorb more efficiently than it could discover independently.1)
The primary motivations for distillation include: reducing model size for edge deployment, decreasing inference latency, lowering computational costs during inference, and maintaining performance across resource-constrained hardware. Distillation has enabled open-weight model developers to create competitive alternatives to proprietary systems by leveraging outputs from frontier closed models. OpenAI and Anthropic are actively implementing measures to prevent Chinese organizations and other competitors from using distillation methods to replicate their proprietary model capabilities.2)
The distillation process centers on a teacher model—typically a large, well-trained neural network—generating soft targets (probability distributions over outputs) and reasoning signals that guide the training of a student model. The student learns to mimic not just the correct answers but the decision-making process of the teacher, capturing implicit patterns and heuristics that are difficult to extract from raw training data alone.
The standard distillation approach involves three components: a teacher model producing soft targets, a student model generating predictions, and a loss function combining two objectives. The teacher outputs probability distributions (soft targets) rather than hard labels; these distributions encode the model's relative confidence across all classes. The student learns to match them while also maintaining reasonable performance on the original task objective.
The combined loss function typically takes the form:
$$L_{total} = \alpha L_{CE}(y, \sigma(z_s)) + (1-\alpha) L_{KL}(P_t, P_s)$$
where $L_{CE}$ is cross-entropy loss between hard labels and student predictions, $L_{KL}$ is the Kullback-Leibler divergence between teacher and student distributions, $\sigma$ represents the softmax function, and $\alpha$ is a hyperparameter balancing both objectives. The temperature parameter $T$ in the softmax function controls the softness of the output distributions—higher temperatures produce softer probability distributions that facilitate knowledge transfer.
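The combined objective above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the function names are ours, $T = 2$ and $\alpha = 0.5$ are arbitrary defaults, and real implementations usually also scale the KL term by $T^2$ to keep gradient magnitudes comparable across temperatures.

```python
import math

def softmax(logits, T=1.0):
    # Temperature-scaled softmax; higher T produces softer distributions.
    scaled = [z / T for z in logits]
    m = max(scaled)                                 # subtract max for stability
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    return [e / s for e in exps]

def distillation_loss(z_student, z_teacher, y, alpha=0.5, T=2.0):
    """alpha * CE(hard label, student) + (1 - alpha) * KL(P_t || P_s)."""
    ce = -math.log(softmax(z_student)[y])           # cross-entropy at T = 1
    p_t = softmax(z_teacher, T)                     # teacher soft targets
    p_s = softmax(z_student, T)                     # student distribution at T
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s))
    return alpha * ce + (1 - alpha) * kl

# Toy 3-class example: student and teacher logits, correct class 0.
loss = distillation_loss([2.0, 1.0, 0.1], [3.0, 1.5, 0.2], y=0)
```

Note that the KL term vanishes when the student exactly matches the teacher, leaving only the standard supervised loss weighted by $\alpha$.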
Agent distillation is a specialized application of distillation tailored specifically for multi-step reasoning and tool-use systems. It represents the process of transferring the full task-solving behavior of large LLM agent systems — including reasoning chains, tool usage patterns, and multi-step decision trajectories — into smaller, deployable models.
Unlike general model distillation which uses flat token-level supervision to mimic outputs, agent distillation explicitly handles the compositional structure of agent trajectories, segmenting them into reasoning and action components for fine-grained alignment. This enables models as small as 0.5B-3.8B parameters to achieve performance competitive with models 4-10x larger.3)4)
| Aspect | Agent Distillation | General Model Distillation |
|---|---|---|
| Supervision level | Span-level ([REASON] vs [ACT] masks) | Token-level (flat KL divergence) |
| Focus | Structured trajectories, reasoning-action fidelity | Overall probability distributions |
| Data structure | Segmented reasoning chains + tool calls | Input-output pairs |
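The span-level supervision in the table above can be sketched as follows. The `[REASON]`/`[ACT]` tag format and the `build_span_masks` helper are illustrative assumptions, not any specific framework's API; real agent-distillation pipelines segment full teacher trajectories and may weight the two span types differently in the fine-tuning loss.

```python
def build_span_masks(trajectory_tokens):
    """Label each token of an agent trajectory as 'reason' or 'act'.

    Segments are delimited by hypothetical [REASON] and [ACT] marker
    tokens; the returned mask enables span-level (rather than flat
    token-level) loss weighting during student fine-tuning.
    """
    masks, current = [], None
    for tok in trajectory_tokens:
        if tok == "[REASON]":
            current = "reason"
        elif tok == "[ACT]":
            current = "act"
        masks.append(current)
    return masks

# Toy trajectory: a reasoning span followed by a tool-call span.
traj = ["[REASON]", "the", "capital", "is", "unknown",
        "[ACT]", "search(", "\"capital", "of", "X\"", ")"]
masks = build_span_masks(traj)
```

With masks in hand, a trainer could, for example, apply full loss to `act` tokens (exact tool-call fidelity) while softening supervision on `reason` tokens.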
Agent distillation represents a critical phase in the compression of intelligence — the transition of AI capabilities from expensive and resource-intensive research systems into smaller, faster, and more modular systems suitable for practical deployment.5) Once complex agent capabilities are successfully distilled, AI stops being a theatrical demonstration of power and starts becoming practical infrastructure.