Core Concepts
Reasoning Techniques
Memory Systems
Retrieval
Agent Types
Design Patterns
Training & Alignment
Frameworks
Tools & Products
Safety & Governance
Evaluation
Research
Development
Meta
AgentBench is a comprehensive multi-dimensional benchmark introduced by Liu et al. (2023) for evaluating Large Language Models (LLMs) as autonomous agents. Published at ICLR 2024, it comprises 8 distinct interactive environments that assess LLM reasoning, decision-making, and tool-use abilities in multi-turn, open-ended settings. The benchmark revealed a significant performance gap between commercial and open-source models, establishing a rigorous standard for LLM agent evaluation.
AgentBench evaluates agents across eight interactive environments requiring different cognitive and operational skills:

- **Operating System (OS)** – executing Linux shell commands
- **Database (DB)** – answering questions via SQL queries
- **Knowledge Graph (KG)** – multi-hop querying over a large knowledge graph
- **Digital Card Game (DCG)** – strategic play in a turn-based card battle
- **Lateral Thinking Puzzles (LTP)** – solving riddles through yes/no questions
- **House-Holding (ALFWorld)** – completing household tasks in a text environment
- **Web Shopping (WebShop)** – finding and purchasing items matching a specification
- **Web Browsing (Mind2Web)** – navigating real-world websites to complete tasks
Each environment uses a task-specific success rate as its primary metric: the proportion of tasks completed correctly. Evaluation follows a multi-turn interaction protocol in which the agent receives an observation and must select an action at each turn. The benchmark uses few-shot prompting and aggregates scores across environments for an overall assessment.
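For illustration, the per-environment metric reduces to a simple fraction. This is a hypothetical sketch; `success_rate` and the example results are not taken from the benchmark code:

```python
def success_rate(task_results):
    """Proportion of tasks completed correctly; results are booleans."""
    return sum(task_results) / len(task_results)

# e.g., an agent that solves 7 of 10 database tasks:
print(success_rate([True] * 7 + [False] * 3))  # → 0.7
```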
The overall score is computed as:
<latex> S_{\text{overall}} = \frac{1}{8} \sum_{i=1}^{8} S_i </latex>
where S_i is the success rate in environment i.
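In code, the aggregation is just an unweighted mean over the eight per-environment success rates. The scores below are made-up placeholders, not reported results:

```python
# Illustrative per-environment success rates (placeholder values)
env_scores = {
    "os_interaction": 0.42, "db": 0.36, "knowledge_graph": 0.58,
    "card_game": 0.25, "ltp": 0.12, "alfworld": 0.78,
    "webshop": 0.61, "web_browsing": 0.20,
}

# S_overall = (1/8) * sum_i S_i
overall = sum(env_scores.values()) / len(env_scores)
print(f"{overall:.3f}")  # → 0.415
```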
The most striking finding is the dramatic performance disparity between commercial and open-source models:
| Model | Type | Overall Score | Notable Strengths |
|---|---|---|---|
| GPT-4 (0613) | Commercial | Highest overall | Strong across all 8 environments |
| Claude | Commercial | Second tier | Competitive in reasoning tasks |
| GPT-3.5-turbo | Commercial | Mid-range | Decent on simpler environments |
| CodeLlama-34B | Open-source | Low | Only competitive on OS/DB tasks |
| Vicuna-13B | Open-source | Very Low | Near-zero on most environments |
| LLaMA-2-70B | Open-source | Very Low | Struggles with multi-turn interaction |
GPT-4 outperformed open-source competitors by 2-5x on multi-turn tasks. Open-source models (up to 70B parameters) scored near zero in demanding environments like DCG and WebBrowsing.
The benchmark identified three primary failure modes: emitting invalid or malformed actions (instruction-following failures), repeating the same action in a loop until the task turn limit is exceeded, and exceeding the context window in long multi-turn interactions.
```python
# AgentBench evaluation setup (simplified)
# call_llm, parse_action, load_environment, and load_tasks are assumed helpers.

ENVIRONMENTS = [
    "os_interaction",   # Linux shell tasks
    "db",               # SQL database queries
    "knowledge_graph",  # Cypher graph queries
    "card_game",        # Digital card game
    "ltp",              # Lateral thinking puzzles
    "alfworld",         # Household tasks
    "webshop",          # Web shopping
    "web_browsing",     # Web browsing
]

def evaluate_agent(model_name, env_name, tasks):
    env = load_environment(env_name)
    results = []
    for task in tasks:
        env.reset(task)
        conversation = [{"role": "system", "content": task["system_prompt"]}]
        for turn in range(task["max_turns"]):
            response = call_llm(model_name, conversation)
            action = parse_action(response)
            observation = env.step(action)
            conversation.append({"role": "assistant", "content": response})
            conversation.append({"role": "user", "content": observation})
            if env.is_done():
                break
        results.append(env.evaluate())  # per-task success (0 or 1)
    return sum(results) / len(results)  # environment success rate

scores = {}
for env_name in ENVIRONMENTS:
    tasks = load_tasks(env_name)
    scores[env_name] = evaluate_agent("gpt-4", env_name, tasks)

overall = sum(scores.values()) / len(scores)
print(f"Overall AgentBench Score: {overall:.3f}")
```
AgentBench uses a unified evaluation pipeline with: