Distillation is a machine learning technique in which a smaller, more computationally efficient model (the student) is trained to approximate the behavior of a larger, more capable model (the teacher). The student model learns to reproduce the teacher's output distributions rather than learning directly from labeled data, enabling efficient deployment of high-performance models in resource-constrained environments.
Knowledge distillation emerged as a response to the computational constraints of deploying large neural networks. Rather than training smaller models from scratch on the original datasets, distillation leverages the learned representations of larger models to guide training. The teacher model has typically learned complex patterns and generalizations that a student can absorb more efficiently than it could discover independently.1)
The primary motivations for distillation include: reducing model size for edge deployment, decreasing inference latency, lowering computational costs during inference, and maintaining performance across resource-constrained hardware. Distillation has enabled open-weight model developers to create competitive alternatives to proprietary systems by leveraging outputs from frontier closed models. OpenAI and Anthropic are actively implementing measures to prevent Chinese organizations and other competitors from using distillation methods to replicate their proprietary model capabilities.2)
The distillation process centers on a teacher model—typically a large, well-trained neural network—generating soft targets (probability distributions over outputs) and reasoning signals that guide the training of a student model. The student learns to mimic not just the correct answers but the decision-making process of the teacher, capturing implicit patterns and heuristics that are difficult to extract from raw training data alone.
The standard distillation approach involves three components: a teacher model producing soft targets, a student model generating predictions, and a loss function combining two objectives. The teacher outputs probability distributions (soft targets) rather than hard labels; these distributions encode the model's relative confidence across all classes. The student learns to match them while also maintaining reasonable performance on the original task objective.
The combined loss function typically takes the form:
$$L_{total} = \alpha L_{CE}(y, \sigma(z_s)) + (1-\alpha) L_{KL}(P_t, P_s)$$
where $L_{CE}$ is cross-entropy loss between hard labels and student predictions, $L_{KL}$ is the Kullback-Leibler divergence between teacher and student distributions, $\sigma$ represents the softmax function, and $\alpha$ is a hyperparameter balancing both objectives. The temperature parameter $T$ in the softmax function controls the softness of the output distributions—higher temperatures produce softer probability distributions that facilitate knowledge transfer.
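The combined objective above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the function names are ours, $T = 2$ and $\alpha = 0.5$ are arbitrary defaults, and real implementations usually also scale the KL term by $T^2$ to keep gradient magnitudes comparable across temperatures.

```python
import math

def softmax(logits, T=1.0):
    # Temperature-scaled softmax; higher T produces softer distributions.
    scaled = [z / T for z in logits]
    m = max(scaled)                                 # subtract max for stability
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    return [e / s for e in exps]

def distillation_loss(z_student, z_teacher, y, alpha=0.5, T=2.0):
    """alpha * CE(hard label, student) + (1 - alpha) * KL(P_t || P_s)."""
    ce = -math.log(softmax(z_student)[y])           # cross-entropy at T = 1
    p_t = softmax(z_teacher, T)                     # teacher soft targets
    p_s = softmax(z_student, T)                     # student distribution at T
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s))
    return alpha * ce + (1 - alpha) * kl

# Toy 3-class example: student and teacher logits, correct class 0.
loss = distillation_loss([2.0, 1.0, 0.1], [3.0, 1.5, 0.2], y=0)
```

Note that the KL term vanishes when the student exactly matches the teacher, leaving only the standard supervised loss weighted by $\alpha$.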
Agent distillation is a specialized application of distillation tailored specifically for multi-step reasoning and tool-use systems. It represents the process of transferring the full task-solving behavior of large LLM agent systems — including reasoning chains, tool usage patterns, and multi-step decision trajectories — into smaller, deployable models.
Unlike general model distillation which uses flat token-level supervision to mimic outputs, agent distillation explicitly handles the compositional structure of agent trajectories, segmenting them into reasoning and action components for fine-grained alignment. This enables models as small as 0.5B-3.8B parameters to achieve performance competitive with models 4-10x larger.3)4)
| Aspect | Agent Distillation | General Model Distillation |
|---|---|---|
| Supervision level | Span-level ([REASON] vs [ACT] masks) | Token-level (flat KL divergence) |
| Focus | Structured trajectories, reasoning-action fidelity | Overall probability distributions |
| Data structure | Segmented reasoning chains + tool calls | Input-output pairs |
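The span-level supervision in the table above can be sketched as follows. The `[REASON]`/`[ACT]` tag format and the `build_span_masks` helper are illustrative assumptions, not any specific framework's API; real agent-distillation pipelines segment full teacher trajectories and may weight the two span types differently in the fine-tuning loss.

```python
def build_span_masks(trajectory_tokens):
    """Label each token of an agent trajectory as 'reason' or 'act'.

    Segments are delimited by hypothetical [REASON] and [ACT] marker
    tokens; the returned mask enables span-level (rather than flat
    token-level) loss weighting during student fine-tuning.
    """
    masks, current = [], None
    for tok in trajectory_tokens:
        if tok == "[REASON]":
            current = "reason"
        elif tok == "[ACT]":
            current = "act"
        masks.append(current)
    return masks

# Toy trajectory: a reasoning span followed by a tool-call span.
traj = ["[REASON]", "the", "capital", "is", "unknown",
        "[ACT]", "search(", "\"capital", "of", "X\"", ")"]
masks = build_span_masks(traj)
```

With masks in hand, a trainer could, for example, apply full loss to `act` tokens (exact tool-call fidelity) while softening supervision on `reason` tokens.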
Agent distillation represents a critical phase in the compression of intelligence — the transition of AI capabilities from expensive and resource-intensive research systems into smaller, faster, and more modular systems suitable for practical deployment.5) Once complex agent capabilities are successfully distilled, AI stops being a theatrical demonstration of power and starts becoming practical infrastructure.