====== Fine-Tuning Agents ======

Fine-tuning LLMs for agent tasks means training models on domain-specific data to improve their reliability at tool calling, instruction following, and structured reasoning. While [[prompt_engineering|prompt engineering]] and RAG handle many use cases, fine-tuning becomes essential when agents need consistent behavior on specialized tasks, strict structured-output compliance, or comparable performance at smaller model sizes and lower cost.(([[https://simplismart.ai/blog/fine-tuning-llms-in-2025-when-it-makes-sense-and-how-to-do-it-efficiently|SimpliSmart - Fine-Tuning LLMs in 2025]]))(([[https://towardsai.net/p/data-science/fine-tuning-llms-in-2025-techniques-trade-offs-and-use-cases|Towards AI - Fine-Tuning Techniques and Trade-offs]]))

===== When to Fine-Tune vs. Prompt Engineer =====

^ Scenario ^ Recommended Approach ^ Rationale ^
| Rapid prototyping | [[prompt_engineering|Prompt engineering]] | Fast iteration, no training infrastructure needed |
| General-purpose agent | [[prompt_engineering|Prompt engineering]] + RAG | Flexible, leverages base model capabilities |
| Consistent [[structured_outputs|structured outputs]] | Fine-tuning | Strongly improves format compliance at inference time |
| Domain-specific tool calling | Fine-tuning | Improves reliability of function signatures and arguments |
| Reducing model size/cost | Fine-tuning smaller model | Distills capabilities from a large model into a small one |
| Improving instruction following | Fine-tuning | Aligns model behavior with specific operational rules |
| Adapting to proprietary data | Fine-tuning + RAG | Combines learned patterns with retrieved context |

**Rule of thumb:** Start with [[prompt_engineering|prompt engineering]]. If evaluation shows consistent failures on specific behaviors even after prompt optimization, fine-tune.
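The rule of thumb above implies an evaluation gate: measure the prompted agent's tool-calling accuracy on a held-out set, and only invest in fine-tuning when accuracy stays below an acceptance threshold. A minimal sketch of such a gate is below; the eval cases, the ''stub_agent'' standing in for a prompted model, and the 95% threshold are all illustrative assumptions, not part of any standard harness.

<code python>
def tool_call_accuracy(eval_cases, predict):
    """Fraction of cases where the agent picks the expected tool
    with exactly the expected arguments."""
    correct = 0
    for case in eval_cases:
        pred = predict(case["query"])
        if (pred["name"] == case["expected"]["name"]
                and pred["arguments"] == case["expected"]["arguments"]):
            correct += 1
    return correct / len(eval_cases)

def should_fine_tune(accuracy, threshold=0.95):
    # Below threshold after prompt optimization -> consider fine-tuning
    return accuracy < threshold

# Toy eval set (hypothetical tools: get_weather, convert_units)
eval_cases = [
    {"query": "weather in Paris",
     "expected": {"name": "get_weather", "arguments": {"city": "Paris"}}},
    {"query": "convert 5 km to miles",
     "expected": {"name": "convert_units",
                  "arguments": {"value": 5, "from": "km", "to": "mi"}}},
]

def stub_agent(query):
    # Stands in for a prompt-only agent that always reaches for
    # get_weather and fails on unit conversion
    return {"name": "get_weather", "arguments": {"city": "Paris"}}

acc = tool_call_accuracy(eval_cases, stub_agent)
decision = should_fine_tune(acc)
</code>

Here the stub scores 50% accuracy, so the gate recommends fine-tuning; in practice ''predict'' would wrap a real model call and the eval set would cover every tool in the agent's schema.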
===== Fine-Tuning Techniques =====

==== Supervised Fine-Tuning (SFT) ====

Train on curated (prompt, completion) pairs that demonstrate the desired agent behavior. For tool-use agents, this includes examples of correct function calls, argument formatting, and multi-step reasoning chains.

==== LoRA and QLoRA ====

**LoRA** (Low-Rank Adaptation) inserts small trainable low-rank matrices into frozen model layers, reducing training compute by 10-100x while largely preserving quality. **QLoRA** (Quantized LoRA) adds 4-bit quantization of the frozen base weights, enabling fine-tuning of billion-parameter models on consumer GPUs.

<code python>
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

# Load the base model with 4-bit quantization (QLoRA)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3-8B-Instruct",
    load_in_4bit=True  # QLoRA quantization
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3-8B-Instruct")

# Configure LoRA adapters on the attention projections
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,             # rank of the low-rank matrices
    lora_alpha=32,    # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"]
)
model = get_peft_model(model, lora_config)

# Train on a tool-calling dataset of (prompt, tool_call) pairs
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=tool_calling_dataset,
    args=TrainingArguments(
        output_dir="./agent-lora",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        learning_rate=2e-4,
        warmup_steps=100
    )
)
trainer.train()
</code>

==== RLHF (Reinforcement Learning from Human Feedback) ====

Aligns agent behavior with human preferences through three phases:

  - **Collect comparisons** — humans rank agent outputs for the same input
  - **Train a reward model** — a model learns to score outputs based on the human preferences
  - **Optimize with PPO** — the agent is trained via [[reinforcement_learning|reinforcement learning]] to maximize the reward model's score

RLHF produces safer, more
helpful agents but requires significant human annotation effort.

==== DPO (Direct Preference Optimization) ====

Simplifies RLHF by directly optimizing on preference pairs without training a separate reward model. DPO is more stable and computationally efficient, making it practical for smaller teams fine-tuning agent behavior.

===== Datasets for Tool-Use Fine-Tuning =====

Effective fine-tuning for [[function_calling|function calling]] requires curated datasets:

  * **Function call pairs** — (user_query, correct_tool_call_with_arguments) examples demonstrating proper invocation
  * **Multi-step traces** — complete agent trajectories showing planning, tool calls, and synthesis
  * **Error recovery examples** — demonstrations of handling failed tool calls gracefully
  * **Negative examples** — cases where no tool should be called, teaching the model restraint

Public datasets include [[https://gorilla.cs.berkeley.edu/|Gorilla APIBench]] for API calling and [[https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k|xLAM Function Calling]] for structured tool use.(([[https://gorilla.cs.berkeley.edu/|Gorilla - LLM API Calling Benchmark]]))

===== Evaluation =====

  * **Loss convergence** — monitor training and validation loss for overfitting
  * **[[function_calling|Function calling]] accuracy** — percentage of correct tool selections and argument formatting
  * **BFCL benchmark** — Berkeley [[function_calling|Function Calling]] Leaderboard scores before and after fine-tuning
  * **Task completion rate** — end-to-end success on representative agent tasks
  * **Regression testing** — ensure fine-tuning doesn't degrade general capabilities

===== See Also =====

  * [[agenttuning|AgentTuning: Enabling Generalized Agent Capabilities in LLMs]]
  * [[tool_use|Tool Use for LLM Agents]]
  * [[how_to_fine_tune_an_llm|How to Fine-Tune an LLM]]
  * [[agentic_skills|Agentic Skills]]
  * [[agentbench|AgentBench]]

===== References =====