Fine-Tuning Agents

Fine-tuning LLMs for agent tasks involves training models on domain-specific data to improve their reliability at tool calling, instruction following, and structured reasoning. While prompt engineering and RAG handle many use cases, fine-tuning becomes essential when agents need consistent behavior on specialized tasks, structured output compliance, or optimized performance at reduced model sizes and costs.

When to Fine-Tune vs. Prompt Engineer

| Scenario | Recommended Approach | Rationale |
|---|---|---|
| Rapid prototyping | Prompt engineering | Fast iteration, no training infrastructure needed |
| General-purpose agent | Prompt engineering + RAG | Flexible, leverages base model capabilities |
| Consistent structured outputs | Fine-tuning | Guarantees format compliance at inference time |
| Domain-specific tool calling | Fine-tuning | Improves reliability of function signatures and arguments |
| Reducing model size/cost | Fine-tuning smaller model | Distill capabilities from large model to small model |
| Improving instruction following | Fine-tuning | Aligns model behavior with specific operational rules |
| Adapting to proprietary data | Fine-tuning + RAG | Combines learned patterns with retrieved context |

Rule of thumb: Start with prompt engineering. If evaluation shows consistent failures on specific behaviors after prompt optimization, fine-tune.

Fine-Tuning Techniques

Supervised Fine-Tuning (SFT)

Train on curated (prompt, completion) pairs that demonstrate desired agent behavior. For tool-use agents, this includes examples of correct function calls, argument formatting, and multi-step reasoning chains.
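A single SFT record for a tool-use agent might look like the following sketch. The `get_weather` tool and the message layout are illustrative assumptions, not any specific vendor's schema:

```python
import json

# Hypothetical tool-calling SFT record: the prompt shows the user request plus
# the available tools; the completion is the exact function call the model
# should learn to emit.
record = {
    "prompt": (
        "You can call tools by emitting JSON.\n"
        "Available tools: get_weather(city: str, unit: str)\n"
        "User: What's the weather in Paris in celsius?"
    ),
    "completion": json.dumps(
        {"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}
    ),
}

# During training, prompt and completion are concatenated; loss is typically
# computed only on the completion tokens so the model learns the call format.
print(record["completion"])
```

Thousands of such records, covering each tool's argument variations and edge cases, form the training set.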

LoRA and QLoRA

LoRA (Low-Rank Adaptation) inserts small trainable matrices into frozen model layers, cutting the number of trainable parameters by several orders of magnitude while maintaining performance close to full fine-tuning. QLoRA (Quantized LoRA) adds 4-bit quantization of the frozen base weights, enabling fine-tuning of billion-parameter models on consumer GPUs.
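A quick parameter count shows where the savings come from: a rank-r adapter on a d×d projection trains 2·d·r parameters instead of d². The d=4096, r=16 figures below are illustrative, roughly matching an 8B model's hidden size:

```python
d, r = 4096, 16                  # hidden size and LoRA rank (illustrative values)

full_params = d * d              # fully fine-tuning one projection matrix
lora_params = 2 * d * r          # A (d x r) and B (r x d) low-rank factors

# ~16.8M full vs ~131K LoRA: a 128x reduction per adapted matrix at this rank
print(full_params, lora_params, full_params // lora_params)
```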

from peft import LoraConfig, get_peft_model, TaskType
from transformers import (
    AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
)
from trl import SFTTrainer
 
# Load base model with 4-bit quantization (QLoRA)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
 
# Configure LoRA adapters on the attention projection matrices
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                # rank of the low-rank matrices
    lora_alpha=32,       # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_config)
 
# Train on a tool-calling dataset of (prompt, tool_call) pairs,
# assumed to have been prepared beforehand
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=tool_calling_dataset,
    args=TrainingArguments(
        output_dir="./agent-lora",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        learning_rate=2e-4,
        warmup_steps=100,
    ),
)
trainer.train()

RLHF (Reinforcement Learning from Human Feedback)

Aligns agent behavior with human preferences through three phases:

  1. Collect comparisons — Humans rank agent outputs for the same input
  2. Train reward model — A model learns to score outputs based on human preferences
  3. Optimize with PPO — The agent is trained via reinforcement learning to maximize the reward model's score

RLHF produces safer, more helpful agents but requires significant human annotation effort.
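The reward-model phase (step 2) typically minimizes a pairwise ranking loss: for a human-preferred output scored r_chosen and a rejected one scored r_rejected, the loss is -log σ(r_chosen - r_rejected). A minimal numeric sketch with made-up scores:

```python
import math

def reward_ranking_loss(r_chosen: float, r_rejected: float) -> float:
    # Bradley-Terry style pairwise loss: -log sigmoid(r_chosen - r_rejected).
    # Small when the reward model ranks the preferred output higher.
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

correct = reward_ranking_loss(2.0, 0.0)    # preferred output scored higher
inverted = reward_ranking_loss(0.0, 2.0)   # preferred output scored lower
print(round(correct, 4), round(inverted, 4))
```

Training pushes the reward model toward the low-loss regime, after which PPO optimizes the agent against its scores.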

DPO (Direct Preference Optimization)

Simplifies RLHF by directly optimizing on preference pairs without training a separate reward model. DPO is more stable and computationally efficient, making it practical for smaller teams fine-tuning agent behavior.
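The DPO objective is written directly in terms of the policy's and a frozen reference model's log-probabilities of the chosen and rejected responses. A numeric sketch (β and the log-prob values below are illustrative):

```python
import math

def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float = 0.1) -> float:
    # -log sigmoid(beta * [(log pi_c - log ref_c) - (log pi_r - log ref_r)]):
    # low when the policy upweights the chosen response relative to the
    # reference more than it upweights the rejected one.
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Policy favors the chosen response (raised from -12 to -10 log-prob)
# and disfavors the rejected one (lowered from -12 to -14)
loss = dpo_loss(pi_chosen=-10.0, pi_rejected=-14.0,
                ref_chosen=-12.0, ref_rejected=-12.0)
print(round(loss, 4))
```

No reward model or RL loop is involved: this loss is minimized with ordinary gradient descent over the preference dataset.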

Datasets for Tool-Use Fine-Tuning

Effective fine-tuning for function calling requires curated datasets. Public options include Gorilla APIBench for API calling and xLAM Function Calling for structured tool use.
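Regardless of source, records should be validated before training, since a malformed function call in the dataset teaches the model to emit malformed calls. A minimal validator sketch, where the record shape and tool schema are illustrative assumptions:

```python
import json

# Hypothetical schema: tool name -> required argument names
TOOL_SCHEMAS = {
    "get_weather": {"city", "unit"},
}

def is_valid_record(record: dict) -> bool:
    """Check that a training record's completion is a well-formed tool call."""
    try:
        call = json.loads(record["completion"])
    except (KeyError, TypeError, json.JSONDecodeError):
        return False
    required = TOOL_SCHEMAS.get(call.get("name"))
    if required is None:
        return False  # unknown tool
    # Arguments must match the schema exactly (no missing or extra keys)
    return set(call.get("arguments", {})) == required

good = {"completion": '{"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}'}
bad = {"completion": '{"name": "get_weather", "arguments": {"city": "Paris"}}'}
print(is_valid_record(good), is_valid_record(bad))
```

Filtering out invalid records (and deduplicating near-identical ones) before training usually matters more than raw dataset size.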

Evaluation
