AgentBench

AgentBench is a comprehensive multi-dimensional benchmark introduced by Liu et al. (2023) for evaluating Large Language Models (LLMs) as autonomous agents. Published at ICLR 2024, it comprises 8 distinct interactive environments that assess LLM reasoning, decision-making, and tool-use abilities in multi-turn, open-ended settings. The benchmark revealed a significant performance gap between commercial and open-source models, establishing a rigorous standard for LLM agent evaluation.

The Eight Environments

AgentBench evaluates agents across diverse interactive tasks requiring different cognitive and operational skills:

  1. Operating System (OS): Agents interact with a Linux bash shell to perform file operations, process management, system queries, and scripting tasks.
  2. Database (DB): Agents query MySQL databases using SQL for data retrieval, manipulation, and multi-step analytical reasoning.
  3. Knowledge Graph (KG): Agents answer complex questions over a large knowledge graph (Freebase) by issuing structured queries to explore entity relationships.
  4. Digital Card Game (DCG): Agents play a turn-based card game (Aquawar), requiring strategic planning, opponent modeling, and reasoning under incomplete information.
  5. Lateral Thinking Puzzles (LTP): Agents solve puzzles through sequential yes/no questioning, testing creative and deductive reasoning.
  6. HouseHolding (HH): Agents manage simulated household tasks using the ALFWorld environment, combining embodied reasoning with natural language.
  7. Web Shopping (WS): Agents browse simulated e-commerce sites to search products, compare options, and complete purchases matching specifications.
  8. Web Browsing (WB): Agents navigate real web pages to find information, follow links, and extract data from dynamic content.
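
To make the environment descriptions above concrete, the sketch below shows what a single task and its multi-turn interaction trace might look like. The field names and the `is_success` check are illustrative assumptions, not the benchmark's actual schema:

```python
# Hypothetical sketch of a single AgentBench-style task and its interaction
# trace; field names are illustrative, not the benchmark's real schema.
os_task = {
    "environment": "os_interaction",
    "instruction": "Count the files in /etc whose names end in .conf",
    "max_turns": 8,
}

# A multi-turn episode alternates agent actions with environment observations.
trace = [
    {"role": "agent", "action": "bash: ls /etc/*.conf | wc -l"},
    {"role": "env", "observation": "27"},
    {"role": "agent", "action": "answer: 27"},
]

def is_success(trace, expected="27"):
    """The task succeeds if the agent's final answer matches the reference."""
    return trace[-1]["action"] == f"answer: {expected}"

print(is_success(trace))  # True
```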

Evaluation Framework

Each environment uses task-specific success rates as the primary metric – the proportion of tasks completed correctly. Evaluation follows a multi-turn interaction protocol where agents receive observations and must select actions. The benchmark uses few-shot prompting and aggregates scores across environments for overall assessment.

The overall score is computed as:

<latex> S_{\text{overall}} = \frac{1}{8} \sum_{i=1}^{8} S_i </latex>

where S_i is the success rate in environment i.
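
Plugged into Python, this aggregation is an unweighted mean over the eight per-environment success rates (the rates below are invented numbers for illustration):

```python
# Invented per-environment success rates, for illustration only.
success_rates = {
    "OS": 0.40, "DB": 0.32, "KG": 0.24, "DCG": 0.08,
    "LTP": 0.16, "HH": 0.48, "WS": 0.56, "WB": 0.20,
}

# S_overall = (1/8) * sum_i S_i
s_overall = sum(success_rates.values()) / len(success_rates)
print(round(s_overall, 3))  # 0.305
```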

Key Results: Commercial vs. Open-Source Gap

The most striking finding is the dramatic performance disparity between commercial and open-source models:

Model          Type         Overall Score    Notable Strengths
GPT-4 (0613)   Commercial   Highest overall  Strong across all 8 environments
Claude         Commercial   Second tier      Competitive in reasoning tasks
GPT-3.5-turbo  Commercial   Mid-range        Decent on simpler environments
CodeLlama-34B  Open-source  Low              Only competitive on OS/DB tasks
Vicuna-13B     Open-source  Very low         Near zero on most environments
LLaMA-2-70B    Open-source  Very low         Struggles with multi-turn interaction
GPT-4 outperformed the best open-source models by roughly 2-5x on multi-turn tasks, and open-source models of up to 70B parameters scored near zero in the most demanding environments, such as the Digital Card Game and Web Browsing.

Failure Analysis

The benchmark identified three primary failure modes:

  • Poor long-term reasoning: Agents lose track of goals over extended interaction sequences, failing to maintain coherent plans across 10+ turns.
  • Weak decision-making: Incorrect tool selection, suboptimal action sequences, and inability to recover from errors (e.g., wrong SQL syntax in DB, invalid bash commands in OS).
  • Instruction following deficits: Misinterpreting prompts or environmental feedback, a failure especially prevalent in open-source models with weaker instruction tuning.
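
These decision-making failures are often mundane: a single malformed action can derail an entire episode. A common mitigation (a generic pattern, not part of AgentBench itself) is a validate-and-retry guard around action parsing, sketched here with hypothetical helpers and a toy action format:

```python
import re

def parse_action(response: str):
    """Accept only actions of the form 'bash: <cmd>' or 'answer: <text>'.
    Returns None for malformed output so the caller can request a retry."""
    match = re.match(r"^(bash|answer):\s*(.+)$", response.strip(), re.DOTALL)
    return (match.group(1), match.group(2)) if match else None

def act_with_retry(generate, prompt, max_retries=2):
    """Call the model; on a malformed action, append an error hint and retry."""
    for _ in range(max_retries + 1):
        response = generate(prompt)
        action = parse_action(response)
        if action is not None:
            return action
        prompt += "\n[error] Reply with 'bash: <cmd>' or 'answer: <text>'."
    return ("answer", "FAILED")  # give up after exhausting retries

# Toy model that produces a malformed reply once, then complies.
replies = iter(["let me think...", "bash: ls /tmp"])
print(act_with_retry(lambda p: next(replies), "List /tmp"))  # ('bash', 'ls /tmp')
```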

Code Example

# AgentBench evaluation setup (simplified sketch; call_llm, parse_action,
# make_env, and load_tasks are placeholder helpers, not the real harness API)

ENVIRONMENTS = [
    "os_interaction",    # Linux shell tasks
    "db",                # SQL database queries
    "knowledge_graph",   # Knowledge graph queries
    "card_game",         # Digital card game (Aquawar)
    "ltp",               # Lateral thinking puzzles
    "alfworld",          # Household tasks
    "webshop",           # Web shopping
    "webarena",          # Web browsing
]

def evaluate_agent(model_name, env_name, tasks):
    results = []
    for task in tasks:
        env = make_env(env_name, task)  # fresh environment instance per task
        conversation = [{"role": "system", "content": task["system_prompt"]}]
        for turn in range(task["max_turns"]):
            response = call_llm(model_name, conversation)
            action = parse_action(response)
            observation = env.step(action)
            conversation.append({"role": "assistant", "content": response})
            conversation.append({"role": "user", "content": observation})
            if env.is_done():
                break
        results.append(env.evaluate())  # 1.0 for success, 0.0 otherwise
    return sum(results) / len(results)  # per-environment success rate

scores = {}
for env_name in ENVIRONMENTS:
    tasks = load_tasks(env_name)
    scores[env_name] = evaluate_agent("gpt-4", env_name, tasks)

overall = sum(scores.values()) / len(scores)
print(f"Overall AgentBench Score: {overall:.3f}")

Architecture and Design

AgentBench uses a unified evaluation pipeline with:

  • Containerized environments: Each environment runs in Docker for reproducibility (e.g., MySQL for the DB tasks)
  • Standardized agent interface: All environments use a common text-in/text-out protocol
  • Function-calling support: The v2 (AgentBench FC) edition supports modern function-calling APIs
  • Integrated with AgentRL: The latest version supports end-to-end multi-task, multi-turn RL training
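
The common text-in/text-out protocol can be captured as a minimal abstract interface. This is a sketch of the idea only, with invented class and method names rather than AgentBench's actual definitions:

```python
from abc import ABC, abstractmethod

class TextEnvironment(ABC):
    """Minimal text-in/text-out protocol: every environment, whether it wraps
    a bash container, a SQL database, or a web page, exposes the same API."""

    @abstractmethod
    def reset(self, task: dict) -> str:
        """Start a task and return the initial observation as text."""

    @abstractmethod
    def step(self, action: str) -> tuple[str, bool]:
        """Apply a text action; return (observation, done)."""

class EchoEnvironment(TextEnvironment):
    """Trivial concrete environment for exercising the interface."""
    def reset(self, task):
        self.turns = 0
        return task.get("instruction", "")
    def step(self, action):
        self.turns += 1
        return f"echo: {action}", self.turns >= 3

env = EchoEnvironment()
obs = env.reset({"instruction": "say hi"})
obs, done = env.step("hi")
print(obs, done)  # echo: hi False
```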

Insights for Agent Development

  • Improving instruction following through high-quality multi-round alignment data yields the largest gains
  • Training on code has ambivalent effects – it helps OS/DB tasks but can hurt creative reasoning tasks like LTP
  • Scale alone is insufficient; 70B open-source models still trail commercial models significantly
  • Long-context handling and error recovery are critical bottlenecks for all models

See Also

  • tau-bench - Complementary benchmark focusing on Tool-Agent-User interaction
  • LLM-as-a-Judge - Automated evaluation methodology for LLM outputs
  • TaskWeaver - Code-first agent framework evaluated on similar tasks