====== AgentBench ======
AgentBench is a comprehensive multi-dimensional benchmark introduced by Liu et al. (2023) for evaluating Large Language Models (LLMs) as autonomous agents. Published at ICLR 2024, it comprises 8 distinct interactive environments that assess LLM reasoning, decision-making, and tool-use abilities in multi-turn, open-ended settings. The benchmark revealed a significant performance gap between commercial and open-source models, establishing a rigorous standard for LLM agent evaluation.
===== The Eight Environments =====
AgentBench evaluates agents across diverse interactive tasks requiring different cognitive and operational skills:
- **Operating System (OS)**: Agents interact with a Linux bash shell to perform file operations, process management, system queries, and scripting tasks.
- **Database (DB)**: Agents query MySQL databases using SQL for data retrieval, manipulation, and multi-step analytical reasoning.
- **Knowledge Graph (KG)**: Agents query a large knowledge base (Freebase) through a constrained set of graph-query tools to explore entity relationships and answer complex questions.
- **Digital Card Game (DCG)**: Agents play a turn-based card battle game (based on the Aquawar framework), requiring strategic planning, opponent modeling, and reasoning under incomplete information.
- **Lateral Thinking Puzzles (LTP)**: Agents solve puzzles through sequential yes/no questioning, testing creative and deductive reasoning.
- **HouseHolding (HH)**: Agents manage simulated household tasks using the ALFWorld environment, combining embodied reasoning with natural language.
- **Web Shopping (WS)**: Agents browse simulated e-commerce sites to search products, compare options, and complete purchases matching specifications.
- **Web Browsing (WB)**: Agents navigate real web pages to find information, follow links, and extract data from dynamic content.
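Across all eight settings the agent receives a textual observation and replies with a textual action. A hypothetical single turn in the OS environment might look like the following (the field names are illustrative, not AgentBench's official trajectory schema):

```python
# One hypothetical turn in the OS environment. Field names are
# illustrative; they are not AgentBench's official trajectory schema.
turn = {
    "observation": "bash$ ls /var/log\nauth.log  kern.log  syslog",
    "thought": "Failed logins are recorded in auth.log.",
    "action": "grep -c 'Failed password' /var/log/auth.log",
}

# The same turn rendered as chat messages for an LLM API:
messages = [
    {"role": "user", "content": turn["observation"]},
    {"role": "assistant", "content": turn["action"]},
]
print(messages[1]["content"])
```

The observation/action pair maps directly onto the user/assistant roles of a chat API, which is what makes a single evaluation harness workable across all eight environments.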
===== Evaluation Framework =====
Each environment uses **task-specific success rates** as the primary metric -- the proportion of tasks completed correctly. Evaluation follows a multi-turn interaction protocol where agents receive observations and must select actions. The benchmark uses few-shot prompting and aggregates scores across environments for overall assessment.
The overall score is computed as:
S_{\text{overall}} = \frac{1}{8} \sum_{i=1}^{8} S_i
where ''S_i'' is the success rate in environment ''i''.
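The unweighted average above is straightforward to reproduce; the per-environment success rates below are illustrative placeholders, not the paper's reported numbers:

```python
def overall_score(env_scores):
    """Unweighted mean of per-environment success rates (S_overall)."""
    return sum(env_scores.values()) / len(env_scores)

# Hypothetical success rates in [0, 1] for the eight environments
scores = {"os": 0.42, "db": 0.32, "kg": 0.58, "dcg": 0.26,
          "ltp": 0.16, "hh": 0.78, "ws": 0.61, "wb": 0.29}
print(f"S_overall = {overall_score(scores):.4f}")  # → S_overall = 0.4275
```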
===== Key Results: Commercial vs. Open-Source Gap =====
The most striking finding is the dramatic performance disparity between commercial and open-source models:
^ Model ^ Type ^ Overall Score ^ Notable Strengths ^
| GPT-4 (0613) | Commercial | Highest overall | Strong across all 8 environments |
| Claude-2 | Commercial | Second tier | Competitive in reasoning tasks |
| GPT-3.5-turbo | Commercial | Mid-range | Decent on simpler environments |
| CodeLlama-34B | Open-source | Low | Only competitive on OS/DB tasks |
| Vicuna-13B | Open-source | Very Low | Near-zero on most environments |
| LLaMA-2-70B | Open-source | Very Low | Struggles with multi-turn interaction |
GPT-4 outperformed open-source competitors by 2-5x on multi-turn tasks. Open-source models (up to 70B parameters) scored near zero in demanding environments like DCG and Web Browsing.
===== Failure Analysis =====
The benchmark identified three primary failure modes:
* **Poor long-term reasoning**: Agents lose track of goals over extended interaction sequences, failing to maintain coherent plans across 10+ turns.
* **Weak decision-making**: Incorrect tool selection, suboptimal action sequences, and inability to recover from errors (e.g., wrong SQL syntax in DB, invalid bash commands in OS).
* **Instruction following deficits**: Misinterpreting prompts or environmental feedback, especially prevalent in open-source models that lack instruction-tuning quality.
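The error-recovery failure mode in particular can be partially mitigated by feeding the environment's error message back to the model and retrying. A minimal sketch of that loop, using hypothetical `call_llm` and environment interfaces (not AgentBench's real APIs):

```python
def act_with_retry(call_llm, env, conversation, max_retries=2):
    """Feed environment error messages back to the model so it can
    self-correct. `call_llm` and `env` are hypothetical stand-ins."""
    observation = ""
    for _ in range(max_retries + 1):
        action = call_llm(conversation)
        ok, observation = env.step(action)  # ok=False on e.g. bad SQL
        conversation.append({"role": "assistant", "content": action})
        conversation.append({"role": "user", "content": observation})
        if ok:
            break
    return observation

class FakeEnv:
    """Toy environment: the first action fails, the second succeeds."""
    def __init__(self):
        self.calls = 0
    def step(self, action):
        self.calls += 1
        if self.calls == 1:
            return False, "ERROR 1064: syntax error near 'SELEC'"
        return True, "count: 3"

def fake_llm(conversation):
    # A real agent would read the error from `conversation` and revise.
    return "SELECT COUNT(*) FROM users;"

print(act_with_retry(fake_llm, FakeEnv(), []))  # → count: 3
```

Open-source models in the benchmark often failed exactly here: even with the error message in context, they repeated the same invalid action instead of revising it.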
===== Code Example =====
# AgentBench evaluation loop (simplified sketch; make_env, call_llm,
# parse_action, and load_tasks are placeholders, not the benchmark's API)
ENVIRONMENTS = [
    "os_interaction",   # Linux shell tasks
    "db",               # SQL database queries
    "knowledge_graph",  # knowledge-base queries
    "card_game",        # digital card game
    "ltp",              # lateral thinking puzzles
    "alfworld",         # household tasks
    "webshop",          # web shopping
    "webarena",         # web browsing
]

def evaluate_agent(model_name, env_name, tasks):
    results = []
    for task in tasks:
        env = make_env(env_name)           # fresh environment per task
        observation = env.reset(task)      # initial textual observation
        conversation = [
            {"role": "system", "content": task["system_prompt"]},
            {"role": "user", "content": observation},
        ]
        for turn in range(task["max_turns"]):
            response = call_llm(model_name, conversation)
            action = parse_action(response)
            observation = env.step(action)
            conversation.append({"role": "assistant", "content": response})
            conversation.append({"role": "user", "content": observation})
            if env.is_done():
                break
        results.append(env.evaluate())     # 1.0 on success, 0.0 otherwise
    return sum(results) / len(results)

scores = {}
for env_name in ENVIRONMENTS:
    tasks = load_tasks(env_name)
    scores[env_name] = evaluate_agent("gpt-4", env_name, tasks)
overall = sum(scores.values()) / len(scores)
print(f"Overall AgentBench Score: {overall:.3f}")
===== Architecture and Design =====
AgentBench uses a unified evaluation pipeline with:
* **Containerized environments**: Each environment runs in Docker for reproducibility (e.g., a MySQL server for the DB task)
* **Standardized agent interface**: All environments use a common text-in/text-out protocol
* **Function-calling support**: The v2 (AgentBench FC) edition supports modern function-calling APIs
* **Integrated with AgentRL**: The latest version supports end-to-end multi-task, multi-turn RL training
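The common text-in/text-out protocol can be captured by a small interface. The class names below are a sketch of the idea, not AgentBench's actual code:

```python
from abc import ABC, abstractmethod

class AgentEnvironment(ABC):
    """Uniform text-in/text-out contract shared by all environments
    (illustrative sketch, not AgentBench's real class)."""

    @abstractmethod
    def reset(self, task: dict) -> str:
        """Start a task; return the initial textual observation."""

    @abstractmethod
    def step(self, action: str) -> tuple[str, bool]:
        """Apply a textual action; return (observation, done)."""

class EchoEnv(AgentEnvironment):
    """Trivial concrete example: echoes actions, ends after 3 steps."""
    def reset(self, task):
        self.turns = 0
        return task.get("prompt", "")
    def step(self, action):
        self.turns += 1
        return f"echo: {action}", self.turns >= 3
```

Because every environment speaks the same protocol, adding a new task domain only requires implementing these two methods; the evaluation loop stays unchanged.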
===== Insights for Agent Development =====
* Improving instruction following through high-quality multi-round alignment data yields the largest gains
* Training on code has ambivalent effects -- it helps OS/DB tasks but can hurt creative reasoning tasks like LTP
* Scale alone is insufficient; 70B open-source models still trail commercial models significantly
* Long-context handling and error recovery are critical bottlenecks for all models
===== References =====
* [[https://arxiv.org/abs/2308.03688|Liu et al. (2023) - AgentBench: Evaluating LLMs as Agents]]
* [[https://github.com/THUDM/AgentBench|Official AgentBench Repository (THUDM)]]
* [[https://openreview.net/forum?id=zAdUB0aCTQ|ICLR 2024 Proceedings]]
===== See Also =====
* [[tau_bench|tau-bench]] - Complementary benchmark focusing on Tool-Agent-User interaction
* [[llm_as_judge|LLM-as-a-Judge]] - Automated evaluation methodology for LLM outputs
* [[taskweaver|TaskWeaver]] - Code-first agent framework evaluated on similar tasks