====== Small Language Model Agents ======

Small Language Model (SLM) agents use models with 1B-7B parameters as autonomous agent cores, offering dramatic cost and latency advantages over frontier models while achieving competitive performance on targeted tasks. The rise of efficient architectures like Phi-4, Gemma 3, and Qwen 3 has made it practical to deploy agentic systems on consumer hardware, edge devices, and mobile phones.

===== Why Small Models for Agents? =====

The core insight driving SLM agents is that most agentic subtasks --- tool calling, routing, extraction, classification --- do not require frontier-model reasoning. Fine-tuned small models handle 80-90% of agent subtasks with lower latency, lower cost, and more deterministic behavior than GPT-4o or Claude Opus.

| **Metric** | **SLM (1B-7B)** | **Frontier (GPT-4o)** |
| Inference cost | $0.13/M tokens | $3.75/M tokens (blended) |
| Latency | Sub-100 ms on edge | 500 ms-2 s cloud |
| VRAM (Q4) | 2-6 GB | Cloud-only |
| Tool-call reliability | High (post fine-tune) | Variable (schema drift) |

A Phi-3-mini (3.8B) fine-tuned on financial NLP outscored GPT-4o on 6 of 7 benchmarks at 29x lower cost. Gemma 3 4B achieves 89.2% on GSM8K and 71.3% on HumanEval from just 4 billion parameters.

===== Key Models =====

**Microsoft Phi-4 (14B) / Phi-4-mini (3.8B):** Trained on synthetic data with an emphasis on STEM reasoning. Phi-4-mini fits in 3 GB of VRAM at Q4 quantization and scores 67.3% on MMLU. Optimized for instruction following and function calling.

**Google Gemma 3 (4B):** The first sub-10B model to break 1300 on LMArena. Gemma 3n E4B runs in 3 GB of memory and is designed for on-device deployment. Strong at code generation and mathematical reasoning.

**Alibaba Qwen 3 (4B-9B):** Qwen3.5-9B leads the small-model leaderboard with an MMLU-Pro score of 82.5 and a GPQA Diamond score of 81.7, beating models 3x its size. Excels in multilingual agent tasks and regulated environments.
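The VRAM figures above follow directly from quantized weight size. A back-of-the-envelope sketch (the ~4.5 bits/weight average for Q4_K_M, which includes quantization scale factors, is an assumption; runtime use adds KV cache and activations on top of the weights):

```python
def q4_weights_gb(params_billions: float, bits_per_weight: float = 4.5) -> float:
    """Approximate size of Q4_K_M-quantized weights in gigabytes.

    Assumes ~4.5 bits/weight on average (4-bit values plus scales).
    """
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

# Models from the section above (parameter counts in billions)
for name, params in [("Phi-4-mini (3.8B)", 3.8),
                     ("Gemma 3 (4B)", 4.0),
                     ("Phi-4 (14B)", 14.0)]:
    print(f"{name}: ~{q4_weights_gb(params):.1f} GB of weights at Q4")
```

A 3.8B model needs roughly 2 GB for weights alone, which is why it fits the 3 GB envelope quoted above once KV cache and activations are added, and why the table lists 2-6 GB for the 1B-7B class.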
===== Quantization for Deployment =====

Quantization reduces model size by 60-90% while preserving accuracy, enabling edge deployment.

**GGUF (llama.cpp format):** The standard for CPU and mixed CPU/GPU inference. Supports 2-bit through 8-bit quantization. Used by Ollama, LM Studio, and llama.cpp directly.

**AWQ (Activation-Aware Weight Quantization):** Preserves performance on critical weights identified by activation patterns. Ideal for GPU inference with tool-use reliability. Used by vLLM and TensorRT-LLM.

<code python>
# Deploying a quantized SLM agent with llama-cpp-python
from llama_cpp import Llama

# Load GGUF-quantized Phi-4-mini (Q4_K_M, ~3 GB)
llm = Llama(
    model_path="phi-4-mini-Q4_K_M.gguf",
    n_ctx=4096,
    n_gpu_layers=-1,  # offload all layers to GPU
    chat_format="chatml-function-calling",
)

# Agent tool-calling loop: declare the tool schema the model may call
tools = [{"type": "function", "function": {
    "name": "search_database",
    "parameters": {"type": "object", "properties": {
        "query": {"type": "string"}
    }}
}}]

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are an agent. Use tools to answer questions."},
        {"role": "user", "content": "Find revenue for Q3 2025"},
    ],
    tools=tools,
    tool_choice="auto",
)
print(response["choices"][0]["message"])
</code>

===== Fine-Tuning for Tool Use =====

SLMs require fine-tuning to reliably perform function calling, structured output, and schema adherence. The standard approach uses LoRA or QLoRA, training only 0.1-1% of parameters.

**Process:**
  - Generate synthetic tool-calling datasets using a frontier model
  - Fine-tune with LoRA (rank 16-64) using LlamaFactory or Axolotl
  - Train on structured output formats (JSON function calls)
  - Validate against schema-compliance benchmarks

**Results:** Fine-tuned 3B models achieve >95% schema compliance on repetitive workflows, compared to 80-85% for zero-shot frontier models. Training takes 2-4 GPU-hours on a single A100.
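The "0.1-1% of parameters" figure can be sanity-checked with simple arithmetic. The configuration below is illustrative, not taken from any specific recipe (rank 16, hidden dimension 3072, adapters on seven projection matrices in each of 32 layers, against a 3.8B-parameter base model):

```python
def lora_trainable(rank: int, d: int, n_layers: int, n_matrices: int) -> int:
    """LoRA adds two low-rank factors per adapted matrix: A (d x r) and B (r x d)."""
    return 2 * rank * d * n_matrices * n_layers

# Illustrative configuration: rank 16, hidden dim 3072,
# 7 projection matrices per layer, 32 layers (assumed values)
adapter = lora_trainable(rank=16, d=3072, n_layers=32, n_matrices=7)
frac = adapter / 3.8e9  # fraction of a 3.8B-parameter base model
print(f"{adapter:,} adapter params = {frac:.2%} of the base model")
```

Roughly 22M adapter parameters, about 0.6% of the base model, which lands inside the 0.1-1% range quoted above.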
$$C_{\text{finetune}} = r \times d \times h \approx 16 \times 3072 \times 4 = 196{,}608 \text{ trainable params (LoRA)}$$

Where $r$ is the LoRA rank, $d$ is the hidden dimension, and $h$ is the number of adapted layers. (Strictly, LoRA trains two low-rank factors per adapted matrix, $A$ and $B$, so the full count is twice this per target matrix.)

===== Hybrid Agent Architecture =====

The production pattern combines SLMs and frontier models in a routing architecture:

  * **Router (SLM):** Classifies incoming requests by complexity
  * **Task experts (SLMs):** Handle 80-90% of subtasks (extraction, formatting, tool calls)
  * **Reasoning backbone (frontier):** Handles multi-step planning and novel situations
  * **Orchestrator:** Manages state and delegates between models

This yields a 10-30x cost reduction versus using frontier models for all tasks, while maintaining equivalent end-to-end accuracy.

===== References =====

  * [[https://arxiv.org/abs/2404.14219|Phi-3 Technical Report (Microsoft, 2024)]]
  * [[https://arxiv.org/abs/2503.01743|Gemma 3 Technical Report (Google DeepMind, 2025)]]
  * [[https://arxiv.org/abs/2412.15115|Qwen 2.5 Technical Report (Alibaba, 2024)]]
  * [[https://developer.nvidia.com/blog/how-small-language-models-are-key-to-scalable-agentic-ai/|How Small Language Models Are Key to Scalable Agentic AI (NVIDIA, 2025)]]
  * [[https://github.com/hiyouga/LlamaFactory|LlamaFactory: Unified Fine-Tuning of 100+ LLMs (ACL 2024)]]

===== See Also =====

  * [[agent_cost_optimization]]
  * [[multimodal_agent_architectures]]
  * [[agentic_rpa]]