====== Small Language Model Agents ======

Small Language Model (SLM) agents use models with 1B-7B parameters as autonomous agent cores, offering dramatic cost and latency advantages over frontier models while achieving competitive performance on targeted tasks. The rise of efficient architectures like Phi-4, Gemma 3, and Qwen 3 has made it practical to deploy agentic systems on consumer hardware, edge devices, and mobile phones.

===== Why Small Models for Agents? =====

The core insight driving SLM agents is that most agentic subtasks (tool calling, routing, extraction, classification) do not require frontier-model reasoning. Fine-tuned small models handle 80-90% of agent subtasks with lower latency, lower cost, and more deterministic behavior than GPT-4o or [[claude|Claude]] Opus.(([[https://developer.nvidia.com/blog/how-small-language-models-are-key-to-scalable-agentic-ai/|How Small Language Models Are Key to Scalable Agentic AI. NVIDIA, 2025.]]))

^ Metric ^ SLM (1B-7B) ^ Frontier (GPT-4o) ^
| Inference cost | $0.13/M tokens | $3.75/M tokens (blended) |
| Latency | Sub-100ms on edge | 500ms-2s cloud |
| VRAM (Q4) | 2-6 GB | Cloud-only |
| Tool call reliability | High (post fine-tune) | Variable (schema drift) |

A Phi-3-mini (3.8B) fine-tuned on financial NLP outscored GPT-4o on 6 of 7 benchmarks at 29x lower cost. Gemma 3 4B achieves 89.2% on GSM8K and 71.3% on HumanEval from just 4 billion parameters.

===== Key Models =====

**Microsoft Phi-4 (14B) / Phi-4-mini (3.8B):** Trained on synthetic data with an emphasis on STEM reasoning.(([[https://arxiv.org/abs/2404.14219|Phi-3 Technical Report. Microsoft, 2024.]])) Phi-4-mini fits in 3GB VRAM at Q4 quantization and scores 67.3% on MMLU. Optimized for instruction following and [[function_calling|function calling]].

**Google Gemma 3 (4B):** First sub-10B model to break 1300 on LMArena.(([[https://arxiv.org/abs/2503.01743|Gemma 3 Technical Report. Google DeepMind, 2025.]])) Gemma 3n E4B runs in 3GB of memory and is designed for on-device deployment. Strong at code generation and mathematical reasoning.

**[[alibaba_qwen|Alibaba Qwen]] 3 (4B-9B):** Qwen3.5-9B leads the small-model leaderboard(([[https://arxiv.org/abs/2412.15115|Qwen 2.5 Technical Report. Alibaba, 2024.]])) with MMLU-Pro 82.5 and GPQA Diamond 81.7, beating models 3x its size. Excels in multilingual agent tasks and regulated environments.

===== Quantization for Deployment =====

Quantization reduces model size by 60-90% while preserving accuracy, enabling edge deployment.

**GGUF ([[llama_cpp|llama.cpp]] format):** The standard for CPU and mixed CPU/GPU inference. Supports 2-bit through 8-bit quantization. Used by Ollama, [[lm_studio|LM Studio]], and llama.cpp directly.

**AWQ (Activation-Aware Weight Quantization):** Preserves performance on critical weights identified by activation patterns. Ideal for GPU inference with tool-use reliability. Used by [[vllm|vLLM]] and TensorRT-LLM.

<code python>
# Deploying a quantized SLM agent with llama-cpp-python
from llama_cpp import Llama

# Load GGUF-quantized Phi-4-mini (Q4_K_M, ~3 GB)
llm = Llama(
    model_path="phi-4-mini-Q4_K_M.gguf",
    n_ctx=4096,
    n_gpu_layers=-1,  # offload all layers to GPU
    chat_format="chatml-function-calling",
)

# Expose one tool to the agent via a JSON Schema definition
tools = [{"type": "function", "function": {
    "name": "search_database",
    "parameters": {"type": "object", "properties": {
        "query": {"type": "string"}
    }}
}}]

# Agent tool-calling loop
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are an agent. Use tools to answer questions."},
        {"role": "user", "content": "Find revenue for Q3 2025"},
    ],
    tools=tools,
    tool_choice="auto",
)
print(response["choices"][0]["message"])
</code>

===== Fine-Tuning for Tool Use =====

SLMs require fine-tuning to reliably perform [[function_calling|function calling]], structured output, and schema adherence. The standard approach uses LoRA or QLoRA, training only 0.1-1% of parameters.

**Process:**
  - Generate synthetic tool-calling datasets using a frontier model
  - Fine-tune with LoRA (rank 16-64) using LlamaFactory(([[https://github.com/hiyouga/LlamaFactory|LlamaFactory: Unified Fine-Tuning of 100+ LLMs. GitHub.]])) or Axolotl
  - Train on structured output formats (JSON function calls)
  - Validate against schema compliance benchmarks

**Results:** Fine-tuned 3B models achieve >95% schema compliance on repetitive workflows, compared to 80-85% for zero-shot frontier models. Training takes 2-4 GPU-hours on a single A100.

$$C_{\text{finetune}} = r \times d \times h \approx 16 \times 3072 \times 4 = 196{,}608 \text{ trainable params (LoRA)}$$

Where $r$ is the LoRA rank, $d$ is the hidden dimension, and $h$ is the number of adapted layers.

===== Open-Source Customization and Fine-Tunability =====

Fine-tuning smaller, [[open_weight_models|open-weight models]] has emerged as a sustainable strategy for the open-source ecosystem, even when these models cannot match frontier closed-source systems in raw capability. The practice of customizing small models to complement proprietary agents offers a viable path forward for open development.(([[https://www.interconnects.ai/p/the-inevitable-need-for-an-open-model|The Inevitable Need for an Open Model. Interconnects, 2025.]]))

Research increasingly demonstrates that [[open_weight_models|open-weight models]] can be efficiently adapted to domain-specific and task-specific requirements through fine-tuning, making them practical complements to frontier systems rather than direct competitors. The result is a diverse ecosystem: a lively niche of smaller, custom-tuned models is emerging as the primary focus of open-source developers and practitioners.

Rather than attempting to replicate the general capabilities of GPT-4o or [[claude|Claude]] Opus, the open community is optimizing for fine-tunable architectures that can be specialized for particular agent workflows, business domains, and regulatory contexts. This specialization-focused strategy leverages the core strength of small models: rapid adaptation through efficient fine-tuning on limited data and compute budgets.

===== Hybrid Agent Architecture =====

The production pattern combines SLMs and frontier models in a routing architecture (sketched below):

  * **Router (SLM):** Classifies incoming requests by complexity
  * **Task experts (SLMs):** Handle 80-90% of subtasks (extraction, formatting, tool calls)
  * **Reasoning backbone (frontier):** Handles multi-step planning and novel situations
  * **Orchestrator:** Manages state and delegates between models

This yields a 10-30x cost reduction versus using frontier models for all tasks, while maintaining equivalent end-to-end accuracy.
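A minimal sketch of the routing layer, reusing the llama-cpp-python setup from the quantization section. The complexity labels, the router prompt, and the ''run_task_expert'' / ''run_frontier_backbone'' helpers are illustrative assumptions, not a reference implementation:

<code python>
# Hypothetical router sketch: an SLM classifies each request, handles the
# simple ones locally, and escalates the rest to a frontier model.
from llama_cpp import Llama

ROUTER_PROMPT = (
    "Classify the user request as SIMPLE (extraction, formatting, a single "
    "tool call) or COMPLEX (multi-step planning, novel situation). "
    "Reply with exactly one word: SIMPLE or COMPLEX."
)

# One local SLM serves as both router and task expert
slm = Llama(model_path="phi-4-mini-Q4_K_M.gguf", n_ctx=4096, n_gpu_layers=-1)

def classify_complexity(request: str) -> str:
    """Label a request; ambiguous outputs default to COMPLEX for safety."""
    out = slm.create_chat_completion(
        messages=[
            {"role": "system", "content": ROUTER_PROMPT},
            {"role": "user", "content": request},
        ],
        max_tokens=4,
        temperature=0.0,  # deterministic routing decisions
    )
    label = out["choices"][0]["message"]["content"].strip().upper()
    return "SIMPLE" if label.startswith("SIMPLE") else "COMPLEX"

def run_task_expert(request: str) -> str:
    # Local path: low latency, fractions of a cent per request
    out = slm.create_chat_completion(messages=[{"role": "user", "content": request}])
    return out["choices"][0]["message"]["content"]

def run_frontier_backbone(request: str) -> str:
    # Remote path: plug in an OpenAI-compatible client for the frontier model
    raise NotImplementedError("wire up your frontier-model client here")

def route_request(request: str) -> str:
    if classify_complexity(request) == "SIMPLE":
        return run_task_expert(request)
    return run_frontier_backbone(request)
</code>

Defaulting ambiguous labels to the frontier path trades a little cost for reliability; the router itself is a natural candidate for the LoRA recipe above, since a fine-tuned classifier routes more consistently than a prompted one.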
===== See Also =====

  * [[rise_potential_llm_agents_survey|The Rise and Potential of Large Language Model Based Agents: A Survey]]
  * [[pioneer_agent|Pioneer Agent]]
  * [[agent_memory_architecture|Agent Memory Architecture]]
  * [[fast_cheap_models_vs_powerful_models|Fast/Cheap Models vs Powerful Models]]
  * [[qwen3_6_35b_vs_glm_4_7|Qwen3.6-35B vs GLM 4.7 358B]]

===== References =====