AgentTuning: Enabling Generalized Agent Capabilities in LLMs

AgentTuning is an instruction-tuning method developed at Tsinghua University that enhances LLMs with agent capabilities while preserving their general language abilities. Introduced by Zeng et al. (2023), it produces AgentLM models where the 70B variant achieves performance comparable to GPT-3.5-turbo on unseen agent tasks.¹⁾²⁾³⁾

Overview

Fine-tuning an LLM exclusively on agent-specific data risks catastrophic forgetting of its general capabilities. AgentTuning addresses this with a hybrid instruction-tuning strategy: agent interaction trajectories (AgentInstruct) are mixed with general-domain instructions during training. The authors describe it as the first systematic attempt at instruction-tuning LLMs across multiple agent task types.

Methodology

graph TD
    subgraph AgentInstruct Creation
        A1[Instruction Generation] --> A2[Trajectory Interaction]
        A2 --> A3[Trajectory Filtering]
        A3 --> A4[1866 Verified Trajectories]
    end
    subgraph Hybrid Training
        A4 --> B1[Agent Trajectories]
        C1[General Instructions] --> B2[Mixed Training Data]
        B1 --> B2
        B2 --> D[Supervised Fine-tuning]
        D --> E[AgentLM]
    end

The process has two main components:

1. AgentInstruct Dataset

A curated dataset of 1,866 verified interaction trajectories created in three stages:

Instruction Generation: Diverse task instructions across agent domains
Trajectory Interaction: LLMs execute tasks, producing thought-action-observation chains
Trajectory Filtering: Only high-quality, successful trajectories with valid chain-of-thought (CoT) reasoning are retained
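The filtering stage can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `Step` and `Trajectory` classes, the `reward` field, and the `min_reward` threshold are hypothetical names introduced here, and "valid CoT reasoning" is approximated by requiring a non-empty thought and action at every step.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    thought: str      # chain-of-thought text written before acting
    action: str       # command or tool call issued by the agent
    observation: str  # feedback returned by the environment


@dataclass
class Trajectory:
    instruction: str
    steps: List[Step] = field(default_factory=list)
    reward: float = 0.0  # hypothetical task-level score in [0, 1]


def keep_trajectory(traj: Trajectory, min_reward: float = 1.0) -> bool:
    """Keep a trajectory only if it succeeded and every step has a thought and an action."""
    if traj.reward < min_reward:
        return False
    if not traj.steps:
        return False
    return all(bool(s.thought.strip()) and bool(s.action.strip()) for s in traj.steps)


def filter_trajectories(trajs: List[Trajectory], min_reward: float = 1.0) -> List[Trajectory]:
    """Return the trajectories that pass the success and format checks, in order."""
    return [t for t in trajs if keep_trajectory(t, min_reward)]
```

A real filter would likely add task-specific checks (for example, validating the action syntax of each environment); the sketch only shows the overall shape of the step.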

2. Hybrid Instruction-Tuning

Agent and general-domain data are mixed at a controlled ratio <latex>\alpha</latex>, the share of training examples drawn from the agent data:

<latex>\mathcal{D}_{\text{train}} = \alpha \cdot \mathcal{D}_{\text{agent}} + (1 - \alpha) \cdot \mathcal{D}_{\text{general}}</latex>
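One way to read the mixture is as a sampling rule: each training example comes from the agent data with probability α and from the general data otherwise. The sketch below illustrates that reading under stated assumptions; `sample_mixture` and its parameters are hypothetical names, and sampling with replacement is a simplification chosen for brevity.

```python
import random
from typing import List, Sequence


def sample_mixture(agent: Sequence[str], general: Sequence[str],
                   alpha: float, n: int, seed: int = 0) -> List[str]:
    """Draw n examples, each from `agent` with probability alpha, else from `general`."""
    rng = random.Random(seed)
    batch = []
    for _ in range(n):
        # rng.random() lies in [0, 1), so alpha=0 never picks agent data
        # and alpha=1 always does.
        source = agent if rng.random() < alpha else general
        batch.append(rng.choice(source))
    return batch
```

Sampling per example keeps the agent share close to α regardless of how different the two dataset sizes are, which matters when the agent set is far smaller than the general one.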

The training loss is standard supervised fine-tuning:

<latex>\mathcal{L} = -\sum_{(x,y) \in \mathcal{D}_{\text{train}}} \sum_{t=1}^{|y|} \log P_\theta(y_t | x, y_{<t})</latex>
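As a worked illustration of this loss, the sketch below sums the negative log-probabilities of response tokens only, leaving out prompt tokens (the x in the formula). The function name `sft_loss` and the per-token `is_response` mask are assumptions made for this example; in practice the log-probabilities would come from the model's output distribution.

```python
import math
from typing import Iterable, Sequence, Tuple


def sft_loss(examples: Iterable[Tuple[Sequence[float], Sequence[bool]]]) -> float:
    """Sum of -log P(y_t | x, y_<t) over response tokens of every example.

    Each example is (log_probs, is_response): the log-probability the model
    assigned to each token, and a mask that is True for response tokens.
    """
    total = 0.0
    for log_probs, is_response in examples:
        total -= sum(lp for lp, keep in zip(log_probs, is_response) if keep)
    return total


# Two examples: the first has one prompt token (masked out) and two response tokens.
examples = [
    ([math.log(0.9), math.log(0.5), math.log(0.25)], [False, True, True]),
    ([math.log(0.5)], [True]),
]
loss = sft_loss(examples)  # -(ln 0.5 + ln 0.25 + ln 0.5) = 4 ln 2
```

The masked prompt token contributes nothing, so the total is 4 ln 2, roughly 2.77.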

This hybrid approach prevents overfitting to agent patterns while building robust planning, reasoning, and tool-use capabilities.

AgentInstruct Task Coverage

The dataset spans multiple agent task domains:

Web browsing: Navigating and interacting with web pages
Database operations: Querying and manipulating structured data
Tool use: Invoking APIs and external tools
Operating system tasks: File manipulation, command execution
Knowledge-grounded QA: Multi-step reasoning with retrieval

Key Results

AgentTuning was applied to the Llama 2 series, producing AgentLM models:

Model	Key Achievement
AgentLM-7B	Agent capabilities with minimal general ability loss
AgentLM-13B	Consistent improvement over base Llama 2 on agent benchmarks
AgentLM-70B	Performance comparable to GPT-3.5-turbo on unseen agent tasks

Generalization: Strong performance on both held-in and held-out (unseen) agent tasks
Preserved general abilities: No significant degradation on standard NLP benchmarks
Error reduction: Significant decrease in formatting errors, duplicated generation, and refusal to answer
Open-source gap: Narrows the gap between open-source and commercial LLMs for agent applications
Data efficiency: Only 1,866 agent trajectories were needed

Code Example

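# AgentTuning-style hybrid instruction tuning (illustrative sketch, not the authors' training script)
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments
from datasets import concatenate_datasets, load_dataset

# Load agent and general instruction datasets.
# Without `split`, load_dataset returns a DatasetDict; merge all of its splits.
agent_splits = load_dataset('THUDM/AgentInstruct')
agent_data = concatenate_datasets(list(agent_splits.values()))
# 'general_instructions' is a placeholder name; substitute a real general-domain dataset.
general_data = load_dataset('general_instructions', split='train')

# Hybrid mixing: alpha is the target fraction of agent examples in the mix.
alpha = 0.3  # 30% agent data, 70% general
# Solve n_agent / (n_agent + n_general) = alpha for n_agent.
n_agent = int(alpha / (1 - alpha) * len(general_data))
# The agent set is small, so the cap below often applies; the realized agent
# share is then lower than alpha unless general_data is subsampled.
n_agent = min(n_agent, len(agent_data))
agent_subset = agent_data.shuffle(seed=42).select(range(n_agent))
# concatenate_datasets requires identical columns: tokenize both datasets to a
# common schema (input_ids, attention_mask, labels) before this step.
mixed_data = concatenate_datasets([agent_subset, general_data])
mixed_data = mixed_data.shuffle(seed=42)

# Fine-tune Llama 2 (the 70B checkpoint is gated and needs multi-GPU sharding in practice)
model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-2-70b-hf')
trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir='./agentlm-70b',
        num_train_epochs=3,
        per_device_train_batch_size=4,
        learning_rate=2e-5,
    ),
    train_dataset=mixed_data,
)
trainer.train()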

Zeng et al. “AgentTuning: Enabling Generalized Agent Capabilities in LLMs.” arXiv:2310.12823, ACL 2024.