FireAct: Toward Language Agent Fine-tuning

FireAct is a fine-tuning approach that enables smaller language models to perform agentic tasks at levels approaching GPT-4 by training on diverse trajectories generated by stronger models. Introduced by Chen et al. (2023), FireAct demonstrates that multi-task, multi-method trajectory data is the key to effective agent fine-tuning.¹⁾²⁾³⁾

Overview

Prompting-based agents (ReAct, Reflexion, CoT) are limited by the base model's capacity and require expensive few-shot demonstrations at inference. FireAct shows that fine-tuning on GPT-4-generated trajectories allows 7B-parameter models to match or exceed prompted GPT-3.5 on agent tasks, with greater robustness to noisy tool outputs.

The core insight: data diversity across tasks and methods matters more than data volume.

Methodology

graph TD A[Multiple Task Datasets] --> B[GPT-4 Trajectory Generation] B --> C1[CoT Trajectories] B --> C2[ReAct Trajectories] B --> C3[Reflexion Trajectories] C1 --> D[Convert to ReAct Format] C2 --> D C3 --> D D --> E[Filter Successful Trajectories] E --> F[Fine-tune Smaller Model] F --> G[Agent LM - No Few-shot Needed]

The training process:

Trajectory Generation: GPT-4 solves tasks from multiple datasets using CoT, ReAct, and Reflexion prompting
Format Unification: All successful trajectories are converted to the ReAct format (interleaved Thought/Action/Observation)
Supervised Fine-tuning: Smaller models (Llama2-7B, GPT-3.5) are fine-tuned on the unified trajectory data

The fine-tuning objective:

<latex>\mathcal{L} = -\sum_{t=1}^{T} \log P_\theta(a_t | x, a_{<t})</latex>

where <latex>a_t</latex> are the thought-action tokens and <latex>x</latex> is the task input. Diversity is controlled by mixing trajectories from <latex>K</latex> tasks and <latex>M</latex> methods:

<latex>\mathcal{D}_{\text{train}} = \bigcup_{k=1}^{K} \bigcup_{m=1}^{M} \mathcal{T}_{k,m}</latex>

Models and Benchmarks

Model	Role
GPT-4	Teacher: generates training trajectories
GPT-3.5	Student: fine-tuned on trajectories
Llama2-7B	Student: fine-tuned on trajectories

Benchmark	Task Type
HotpotQA	Multi-hop question answering
StrategyQA	Strategic reasoning
Bamboogle	Complex QA
MMLU	Broad knowledge evaluation

Key Results

Llama2-7B on HotpotQA: 77% performance increase after fine-tuning with 500 GPT-4 trajectories
GPT-3.5 on HotpotQA: EM score 31.4 → 39.2 (+25% improvement)
GPT-3.5 on Bamboogle: EM 40.8 → 44.0, outperforming prompted GPT-3.5
Robustness: Fine-tuned agents show only 14.2% performance drop with noisy tools vs. 33.8% for prompted agents
Multi-task training consistently outperforms single-task training
Fine-tuned agents require no few-shot examples at inference, reducing cost

Code Example

# FireAct-style trajectory generation and fine-tuning pipeline
from datasets import load_dataset
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments
 
# Step 1: Generate trajectories with GPT-4
def generate_trajectory(task, method='react'):
    prompt = build_agent_prompt(task, method)
    trajectory = gpt4_agent.run(prompt, tools=['search', 'lookup'])
    if trajectory.success:
        return convert_to_react_format(trajectory)
    return None
 
# Step 2: Collect multi-task, multi-method trajectories
trajectories = []
for dataset_name in ['hotpotqa', 'strategyqa', 'bamboogle']:
    dataset = load_dataset(dataset_name, split='train[:500]')
    for method in ['cot', 'react', 'reflexion']:
        for sample in dataset:
            traj = generate_trajectory(sample, method)
            if traj:
                trajectories.append(traj)
 
# Step 3: Fine-tune smaller model
model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-2-7b-hf')
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir='./fireact-7b', num_train_epochs=3),
    train_dataset=trajectories,
)
trainer.train()

References

¹⁾

Chen et al. (2023) - FireAct: Toward Language Agent Fine-tuning

²⁾

FireAct Project Page (Princeton NLP)

³⁾

FireAct Demo and Resources

AI Agent Knowledge Base

Sidebar

Table of Contents

FireAct: Toward Language Agent Fine-tuning

Overview

Methodology

Models and Benchmarks

Key Results

Code Example

See Also

References

AI Agent Knowledge Base

User Tools

Site Tools

Sidebar

Table of Contents

FireAct: Toward Language Agent Fine-tuning

Overview

Methodology

Models and Benchmarks

Key Results

Code Example

See Also

References

Page Tools