Fine-Tuning Agents

Fine-tuning LLMs for agent tasks involves training models on domain-specific data to improve their reliability at tool calling, instruction following, and structured reasoning. While prompt engineering and RAG handle many use cases, fine-tuning becomes essential when agents need consistent behavior on specialized tasks, structured output compliance, or optimized performance at reduced model sizes and costs.

When to Fine-Tune vs. Prompt Engineer

| Scenario | Recommended Approach | Rationale |
|---|---|---|
| Rapid prototyping | Prompt engineering | Fast iteration, no training infrastructure needed |
| General-purpose agent | Prompt engineering + RAG | Flexible, leverages base model capabilities |
| Consistent structured outputs | Fine-tuning | Guarantees format compliance at inference time |
| Domain-specific tool calling | Fine-tuning | Improves reliability of function signatures and arguments |
| Reducing model size/cost | Fine-tuning smaller model | Distill capabilities from large model to small model |
| Improving instruction following | Fine-tuning | Aligns model behavior with specific operational rules |
| Adapting to proprietary data | Fine-tuning + RAG | Combines learned patterns with retrieved context |

Rule of thumb: Start with prompt engineering. If evaluation shows consistent failures on specific behaviors after prompt optimization, fine-tune.

Fine-Tuning Techniques

Supervised Fine-Tuning (SFT)

Train on curated (prompt, completion) pairs that demonstrate desired agent behavior. For tool-use agents, this includes examples of correct function calls, argument formatting, and multi-step reasoning chains.
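A single SFT record for a tool-use agent might look like the following sketch. The `get_weather` tool and the message layout are illustrative assumptions, not any specific vendor's schema:

```python
import json

# Hypothetical tool-calling SFT record: the prompt shows the user request plus
# the available tools; the completion is the exact function call the model
# should learn to emit.
record = {
    "prompt": (
        "You can call tools by emitting JSON.\n"
        "Available tools: get_weather(city: str, unit: str)\n"
        "User: What's the weather in Paris in celsius?"
    ),
    "completion": json.dumps(
        {"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}
    ),
}

# During training, prompt and completion are concatenated; loss is typically
# computed only on the completion tokens so the model learns the call format.
print(record["completion"])
```

Thousands of such records, covering each tool's argument variations and edge cases, form the training set.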

LoRA and QLoRA

LoRA (Low-Rank Adaptation) inserts small trainable matrices into frozen model layers, cutting the number of trainable parameters by several orders of magnitude while maintaining performance close to full fine-tuning. QLoRA (Quantized LoRA) adds 4-bit quantization of the frozen base weights, enabling fine-tuning of billion-parameter models on consumer GPUs.
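A quick parameter count shows where the savings come from: a rank-r adapter on a d×d projection trains 2·d·r parameters instead of d². The d=4096, r=16 figures below are illustrative, roughly matching an 8B model's hidden size:

```python
d, r = 4096, 16                  # hidden size and LoRA rank (illustrative values)

full_params = d * d              # fully fine-tuning one projection matrix
lora_params = 2 * d * r          # A (d x r) and B (r x d) low-rank factors

# ~16.8M full vs ~131K LoRA: a 128x reduction per adapted matrix at this rank
print(full_params, lora_params, full_params // lora_params)
```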

from peft import LoraConfig, get_peft_model, TaskType
from transformers import (
    AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
)
from trl import SFTTrainer
 
# Load base model with 4-bit quantization (QLoRA)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
 
# Configure LoRA adapters on the attention projection matrices
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                # rank of the low-rank matrices
    lora_alpha=32,       # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_config)
 
# Train on a tool-calling dataset of (prompt, tool_call) pairs,
# assumed to have been prepared beforehand
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=tool_calling_dataset,
    args=TrainingArguments(
        output_dir="./agent-lora",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        learning_rate=2e-4,
        warmup_steps=100,
    ),
)
trainer.train()

RLHF (Reinforcement Learning from Human Feedback)

Aligns agent behavior with human preferences through three phases:

  1. Collect comparisons — Humans rank agent outputs for the same input
  2. Train reward model — A model learns to score outputs based on human preferences
  3. Optimize with PPO — The agent is trained via reinforcement learning to maximize the reward model's score

RLHF produces safer, more helpful agents but requires significant human annotation effort.
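The reward-model phase (step 2) typically minimizes a pairwise ranking loss: for a human-preferred output scored r_chosen and a rejected one scored r_rejected, the loss is -log σ(r_chosen - r_rejected). A minimal numeric sketch with made-up scores:

```python
import math

def reward_ranking_loss(r_chosen: float, r_rejected: float) -> float:
    # Bradley-Terry style pairwise loss: -log sigmoid(r_chosen - r_rejected).
    # Small when the reward model ranks the preferred output higher.
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

correct = reward_ranking_loss(2.0, 0.0)    # preferred output scored higher
inverted = reward_ranking_loss(0.0, 2.0)   # preferred output scored lower
print(round(correct, 4), round(inverted, 4))
```

Training pushes the reward model toward the low-loss regime, after which PPO optimizes the agent against its scores.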

DPO (Direct Preference Optimization)

Simplifies RLHF by directly optimizing on preference pairs without training a separate reward model. DPO is more stable and computationally efficient, making it practical for smaller teams fine-tuning agent behavior.
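The DPO objective is written directly in terms of the policy's and a frozen reference model's log-probabilities of the chosen and rejected responses. A numeric sketch (β and the log-prob values below are illustrative):

```python
import math

def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float = 0.1) -> float:
    # -log sigmoid(beta * [(log pi_c - log ref_c) - (log pi_r - log ref_r)]):
    # low when the policy upweights the chosen response relative to the
    # reference more than it upweights the rejected one.
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Policy favors the chosen response (raised from -12 to -10 log-prob)
# and disfavors the rejected one (lowered from -12 to -14)
loss = dpo_loss(pi_chosen=-10.0, pi_rejected=-14.0,
                ref_chosen=-12.0, ref_rejected=-12.0)
print(round(loss, 4))
```

No reward model or RL loop is involved: this loss is minimized with ordinary gradient descent over the preference dataset.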

Datasets for Tool-Use Fine-Tuning

Effective fine-tuning for function calling requires curated datasets. Public options include Gorilla APIBench for API calling and xLAM Function Calling for structured tool use.
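Regardless of source, records should be validated before training, since a malformed function call in the dataset teaches the model to emit malformed calls. A minimal validator sketch, where the record shape and tool schema are illustrative assumptions:

```python
import json

# Hypothetical schema: tool name -> required argument names
TOOL_SCHEMAS = {
    "get_weather": {"city", "unit"},
}

def is_valid_record(record: dict) -> bool:
    """Check that a training record's completion is a well-formed tool call."""
    try:
        call = json.loads(record["completion"])
    except (KeyError, TypeError, json.JSONDecodeError):
        return False
    required = TOOL_SCHEMAS.get(call.get("name"))
    if required is None:
        return False  # unknown tool
    # Arguments must match the schema exactly (no missing or extra keys)
    return set(call.get("arguments", {})) == required

good = {"completion": '{"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}'}
bad = {"completion": '{"name": "get_weather", "arguments": {"city": "Paris"}}'}
print(is_valid_record(good), is_valid_record(bad))
```

Filtering out invalid records (and deduplicating near-identical ones) before training usually matters more than raw dataset size.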

Evaluation
