====== AgentTuning: Enabling Generalized Agent Capabilities in LLMs ======

AgentTuning is an instruction-tuning method developed at Tsinghua University that **enhances LLMs with agent capabilities while preserving their general language abilities**. Introduced by Zeng et al. (2023), it produces the AgentLM models, of which the 70B variant achieves performance comparable to GPT-3.5-turbo on unseen agent tasks.((Zeng et al. "AgentTuning: Enabling Generalized Agent Capabilities in LLMs." [[https://arxiv.org/abs/2310.12823|arXiv:2310.12823]], ACL 2024.))(([[https://thudm.github.io/AgentTuning/|AgentTuning Project Page (Tsinghua)]]))(([[https://github.com/thudm/agenttuning|AgentTuning GitHub Repository]]))

===== Overview =====

Fine-tuning LLMs exclusively on agent-specific data risks catastrophic forgetting of general capabilities. AgentTuning addresses this with a **hybrid instruction-tuning strategy**: agent interaction trajectories (the AgentInstruct dataset) are mixed with general-domain instructions during training. The authors present it as the first systematic attempt at instruction-tuning LLMs across multiple agent task types.

===== Methodology =====

<code>
graph TD
    subgraph AgentInstruct Creation
        A1[Instruction Generation] --> A2[Trajectory Interaction]
        A2 --> A3[Trajectory Filtering]
        A3 --> A4[1866 Verified Trajectories]
    end
    subgraph Hybrid Training
        A4 --> B1[Agent Trajectories]
        C1[General Instructions] --> B2[Mixed Training Data]
        B1 --> B2
        B2 --> D[Supervised Fine-tuning]
        D --> E[AgentLM]
    end
</code>

The process has two main components:

**1. AgentInstruct Dataset**

A curated dataset of 1,866 verified interaction trajectories, created in three stages:

  * **Instruction Generation**: Diverse task instructions are collected across agent domains
  * **Trajectory Interaction**: LLMs execute the tasks, producing thought-action-observation chains
  * **Trajectory Filtering**: Only high-quality, successful trajectories with valid chain-of-thought reasoning are retained

**2. 
Hybrid Instruction-Tuning**

Agent and general-domain data are mixed at a controlled ratio \alpha:

\mathcal{D}_{\text{train}} = \alpha \cdot \mathcal{D}_{\text{agent}} + (1 - \alpha) \cdot \mathcal{D}_{\text{general}}

The training objective is standard supervised fine-tuning, i.e. next-token negative log-likelihood:

\mathcal{L} = -\sum_{(x,y) \in \mathcal{D}_{\text{train}}} \sum_{t=1}^{|y|} \log P_\theta(y_t \mid x, y_{<t})

This hybrid approach prevents overfitting to agent-specific patterns while building robust planning, reasoning, and tool-use capabilities.

===== AgentInstruct Task Coverage =====

The dataset spans multiple agent task domains:

  * **Web browsing**: Navigating and interacting with web pages
  * **Database operations**: Querying and manipulating structured data
  * **Tool use**: Invoking APIs and external tools
  * **Operating system tasks**: File manipulation and command execution
  * **Knowledge-grounded QA**: Multi-step reasoning with retrieval

===== Key Results =====

AgentTuning was applied to the **Llama 2 series**, producing the AgentLM models:

^ Model ^ Key Achievement ^
| AgentLM-7B | Agent capabilities with minimal loss of general ability |
| AgentLM-13B | Consistent improvement over base Llama 2 on agent benchmarks |
| AgentLM-70B | Performance comparable to GPT-3.5-turbo on unseen agent tasks |

  * **Generalization**: Strong performance on both held-in and held-out (unseen) agent tasks
  * **Preserved general abilities**: No significant degradation on standard NLP benchmarks
  * **Error reduction**: Markedly fewer formatting errors, duplicated generations, and refusals to answer
  * **Bridges the gap** between open-source and commercial LLMs for agent applications
  * Only **1,866 trajectories** are needed, demonstrating data efficiency

===== Code Example =====

A sketch of AgentTuning-style hybrid tuning with the Hugging Face stack; the general-instruction dataset name, split names, and hyperparameters are illustrative placeholders.(([[https://arxiv.org/abs/2310.12823|Zeng et al. (2023) - AgentTuning: Enabling Generalized Agent Capabilities in LLMs]])) (([[https://aclanthology.org/2024.findings-acl.181.pdf|ACL 2024 Findings Paper]]))

<code python>
# AgentTuning-style hybrid instruction tuning
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments
from datasets import concatenate_datasets, load_dataset

# Load agent and general instruction datasets
# (split names may differ; check the THUDM/AgentInstruct dataset card)
agent_data = load_dataset('THUDM/AgentInstruct')
general_data = load_dataset('general_instructions', split='train')  # placeholder name

# Hybrid mixing: alpha is the fraction of agent data in the final mix,
# so n_agent / (n_agent + n_general) = alpha
alpha = 0.3  # 30% agent data, 70% general
n_agent = int(alpha / (1 - alpha) * len(general_data))
agent_subset = agent_data['train'].select(range(min(n_agent, len(agent_data['train']))))
mixed_data = concatenate_datasets([agent_subset, general_data]).shuffle(seed=42)

# Supervised fine-tuning of Llama 2 on the mixed data
model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-2-70b-hf')
trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir='./agentlm-70b',
        num_train_epochs=3,
        per_device_train_batch_size=4,
        learning_rate=2e-5,
    ),
    train_dataset=mixed_data,
)
trainer.train()
</code>

===== See Also =====

  * [[fireact_agent_finetuning|FireAct: Agent Fine-tuning]]
  * [[rise_potential_llm_agents_survey|LLM Agent Survey: Brain/Perception/Action Framework]]
  * [[retroformer|Retroformer: Policy Gradient Agent Optimization]]
  * [[agent_benchmarks|Agent Benchmarks and Evaluation]]

===== References =====
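
The supervised fine-tuning objective described above is plain token-level negative log-likelihood summed over the mixed dataset. A minimal, self-contained sketch, using hypothetical per-token probabilities standing in for a model's P_\theta(y_t | x, y_{<t}) (no real model is involved):

<code python>
import math

def sft_loss(batch):
    """Negative log-likelihood, summed over every target token.

    `batch` is a list of (instruction, target, probs) tuples, where probs[t]
    stands in for P_theta(y_t | x, y_<t). The probabilities below are toy
    numbers chosen for illustration, not outputs of a trained model.
    """
    return -sum(math.log(p) for _, _, probs in batch for p in probs)

# One "agent" trajectory and one "general" instruction, mixed as in training
batch = [
    ("task: list the files", "Thought: use ls. Action: ls", [0.9, 0.8, 0.7]),
    ("translate: bonjour", "hello", [0.95]),
]
print(round(sft_loss(batch), 4))  # -> 0.7365
</code>

Because both kinds of examples contribute through the same loss, the ratio \alpha alone controls how strongly agent trajectories shape the model relative to general instructions.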