====== Fine-Tuning Agents ======

Fine-tuning LLMs for agent tasks involves training models on domain-specific data to improve their reliability at tool calling, instruction following, and structured reasoning. While prompt engineering and RAG handle many use cases, fine-tuning becomes essential when agents need consistent behavior on specialized tasks, structured output compliance, or optimized performance at reduced model sizes and costs.

===== When to Fine-Tune vs. Prompt Engineer =====

^ Scenario ^ Recommended Approach ^ Rationale ^
| Rapid prototyping | Prompt engineering | Fast iteration, no training infrastructure needed |
| General-purpose agent | Prompt engineering + RAG | Flexible, leverages base model capabilities |
| Consistent structured outputs | Fine-tuning | Strongly improves format compliance at inference time |
| Domain-specific tool calling | Fine-tuning | Improves reliability of function signatures and arguments |
| Reducing model size/cost | Fine-tuning smaller model | Distill capabilities from large model to small model |
| Improving instruction following | Fine-tuning | Aligns model behavior with specific operational rules |
| Adapting to proprietary data | Fine-tuning + RAG | Combines learned patterns with retrieved context |

**Rule of thumb:** Start with prompt engineering. If evaluation shows consistent failures on specific behaviors even after prompt optimization, fine-tune.

===== Fine-Tuning Techniques =====

==== Supervised Fine-Tuning (SFT) ====

Train on curated (prompt, completion) pairs that demonstrate desired agent behavior. For tool-use agents, this includes examples of correct function calls, argument formatting, and multi-step reasoning chains.

==== LoRA and QLoRA ====

**LoRA** (Low-Rank Adaptation) inserts small trainable matrices into frozen model layers, shrinking the number of trainable parameters by orders of magnitude while closely matching full fine-tuning performance. **QLoRA** (Quantized LoRA) adds 4-bit quantization, enabling fine-tuning of billion-parameter models on consumer GPUs.
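The (prompt, tool_call) pairs consumed by the trainer below can be sketched as follows. This is an illustrative assumption — the record schema and the ''to_training_text'' helper are hypothetical, loosely modeled on OpenAI-style function calling rather than any particular library's required format:

```python
import json

# One illustrative SFT record for a tool-use agent (hypothetical schema).
example = {
    "prompt": (
        "You can call get_weather(city: str, unit: str). "
        "User: What's the temperature in Oslo in celsius?"
    ),
    "completion": json.dumps({
        "tool": "get_weather",
        "arguments": {"city": "Oslo", "unit": "celsius"},
    }),
}

def to_training_text(record):
    """Join prompt and completion into the single string most SFT trainers expect."""
    return record["prompt"] + "\n" + record["completion"]

text = to_training_text(example)
```

Keeping the completion as strict JSON is what lets the fine-tuned model learn exact argument formatting rather than free-form descriptions of the call.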
<code python>
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

# Load the base model with 4-bit quantization (QLoRA)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    load_in_4bit=True,
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# Configure LoRA adapters on the attention projection layers
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,            # rank of the low-rank update matrices
    lora_alpha=32,   # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_config)

# Train on a tool-calling dataset of (prompt, tool_call) pairs
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=tool_calling_dataset,
    args=TrainingArguments(
        output_dir="./agent-lora",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        learning_rate=2e-4,
        warmup_steps=100,
    ),
)
trainer.train()
</code>

==== RLHF (Reinforcement Learning from Human Feedback) ====

Aligns agent behavior with human preferences through three phases:

  - **Collect comparisons** — humans rank agent outputs for the same input
  - **Train a reward model** — a model learns to score outputs based on the human preferences
  - **Optimize with PPO** — the agent is trained via reinforcement learning to maximize the reward model's score

RLHF produces safer, more helpful agents but requires significant human annotation effort.

==== DPO (Direct Preference Optimization) ====

Simplifies RLHF by optimizing directly on preference pairs, without training a separate reward model. DPO is more stable and computationally cheaper, making it practical for smaller teams fine-tuning agent behavior.
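The DPO objective can be made concrete with a small numeric sketch. Given per-sequence log-probabilities of the chosen and rejected completions under the policy and under a frozen reference model, the per-pair loss is −log σ(β·Δ), where Δ is the difference in implicit reward margins. This standalone illustration is not the ''trl'' implementation:

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, from per-sequence log-probabilities.

    pi_*  : log-probs under the policy being fine-tuned
    ref_* : log-probs under the frozen reference model
    """
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(margin)): small when the policy favors the chosen output
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Policy has shifted probability toward the chosen completion -> lower loss
better = dpo_loss(pi_chosen=-10.0, pi_rejected=-14.0,
                  ref_chosen=-12.0, ref_rejected=-12.0)
# No shift relative to the reference -> loss equals -log(0.5)
neutral = dpo_loss(pi_chosen=-12.0, pi_rejected=-12.0,
                   ref_chosen=-12.0, ref_rejected=-12.0)
```

Because the reference log-probs appear only inside the margin, no reward model is ever trained — the policy's own likelihood ratios act as the implicit reward.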
===== Datasets for Tool-Use Fine-Tuning =====

Effective fine-tuning for function calling requires curated datasets:

  * **Function call pairs** — (user_query, correct_tool_call_with_arguments) examples demonstrating proper invocation
  * **Multi-step traces** — complete agent trajectories showing planning, tool calls, and synthesis
  * **Error recovery examples** — demonstrations of handling failed tool calls gracefully
  * **Negative examples** — cases where no tool should be called, teaching the model restraint

Public datasets include [[https://gorilla.cs.berkeley.edu/|Gorilla APIBench]] for API calling and [[https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k|xLAM Function Calling]] for structured tool use.

===== Evaluation =====

  * **Loss convergence** — monitor training and validation loss for overfitting
  * **Function calling accuracy** — percentage of correct tool selections and argument formatting
  * **BFCL benchmark** — Berkeley Function Calling Leaderboard scores before and after fine-tuning
  * **Task completion rate** — end-to-end success on representative agent tasks
  * **Regression testing** — ensure fine-tuning doesn't degrade general capabilities

===== References =====

  * [[https://simplismart.ai/blog/fine-tuning-llms-in-2025-when-it-makes-sense-and-how-to-do-it-efficiently|SimpliSmart - Fine-Tuning LLMs in 2025]]
  * [[https://towardsai.net/p/data-science/fine-tuning-llms-in-2025-techniques-trade-offs-and-use-cases|Towards AI - Fine-Tuning Techniques and Trade-offs]]
  * [[https://gorilla.cs.berkeley.edu/|Gorilla - LLM API Calling Benchmark]]

===== See Also =====

  * [[function_calling]] — the tool-calling capability that fine-tuning improves
  * [[embeddings]] — fine-tuning embedding models for better retrieval
  * [[prompt_engineering]] — prompt optimization as an alternative to fine-tuning
  * [[agent_debugging]] — evaluating fine-tuned agent performance
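The "function calling accuracy" metric from the evaluation checklist above can be sketched as an exact-match scorer. The call schema and the matching rule (tool name plus all arguments must match; ''None'' means the model correctly declined to call a tool) are illustrative assumptions, not a standard benchmark harness:

```python
def call_accuracy(predictions, references):
    """Fraction of predicted tool calls that exactly match the gold call.

    Each call is a dict like {"tool": name, "arguments": {...}} or None
    for the no-call case (hypothetical schema).
    """
    correct = 0
    for pred, ref in zip(predictions, references):
        if pred == ref:  # dict equality covers tool name and every argument
            correct += 1
    return correct / len(references)

preds = [
    {"tool": "get_weather", "arguments": {"city": "Oslo", "unit": "celsius"}},
    {"tool": "get_weather", "arguments": {"city": "Oslo", "unit": "kelvin"}},
    None,  # model correctly declined to call a tool
]
golds = [
    {"tool": "get_weather", "arguments": {"city": "Oslo", "unit": "celsius"}},
    {"tool": "get_weather", "arguments": {"city": "Oslo", "unit": "celsius"}},
    None,
]
score = call_accuracy(preds, golds)  # 2 of 3 calls match exactly
```

Exact matching is deliberately strict — near-miss arguments (here, the wrong unit) count as failures, which is usually what matters for downstream tool execution.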