====== AgentBench ======
AgentBench is a comprehensive multi-dimensional benchmark introduced by Liu et al. (2023) for evaluating Large Language Models (LLMs) as autonomous agents. Published at ICLR 2024, it comprises 8 distinct interactive environments that assess LLM reasoning, decision-making, and tool-use abilities in multi-turn, open-ended settings. The benchmark revealed a significant performance gap between commercial and open-source models, establishing a rigorous standard for LLM agent evaluation.
===== The Eight Environments =====
AgentBench evaluates agents across diverse interactive tasks requiring different cognitive and operational skills:
- **Operating System (OS)**: Agents interact with a Linux bash shell to perform file operations, process management, system queries, and scripting tasks.
- **Database (DB)**: Agents query MySQL databases using SQL for data retrieval, manipulation, and multi-step analytical reasoning.
- **Knowledge Graph (KG)**: Agents query a large knowledge base (Freebase) through a constrained set of graph-query tools to explore entity relationships and answer complex questions.
- **Digital Card Game (DCG)**: Agents play a turn-based card battle game (based on the Aquawar framework), requiring strategic planning, opponent modeling, and reasoning under incomplete information.
- **Lateral Thinking Puzzles (LTP)**: Agents solve puzzles through sequential yes/no questioning, testing creative and deductive reasoning.
- **HouseHolding (HH)**: Agents manage simulated household tasks using the ALFWorld environment, combining embodied reasoning with natural language.
- **Web Shopping (WS)**: Agents browse simulated e-commerce sites to search products, compare options, and complete purchases matching specifications.
- **Web Browsing (WB)**: Agents navigate real web pages to find information, follow links, and extract data from dynamic content.
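Across all eight settings the agent receives a textual observation and replies with a textual action. A hypothetical single turn in the OS environment might look like the following (the field names are illustrative, not AgentBench's official trajectory schema):

```python
# One hypothetical turn in the OS environment. Field names are
# illustrative; they are not AgentBench's official trajectory schema.
turn = {
    "observation": "bash$ ls /var/log\nauth.log  kern.log  syslog",
    "thought": "Failed logins are recorded in auth.log.",
    "action": "grep -c 'Failed password' /var/log/auth.log",
}

# The same turn rendered as chat messages for an LLM API:
messages = [
    {"role": "user", "content": turn["observation"]},
    {"role": "assistant", "content": turn["action"]},
]
print(messages[1]["content"])
```

The observation/action pair maps directly onto the user/assistant roles of a chat API, which is what makes a single evaluation harness workable across all eight environments.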
===== Evaluation Framework =====
Each environment uses **task-specific success rates** as the primary metric -- the proportion of tasks completed correctly. Evaluation follows a multi-turn interaction protocol where agents receive observations and must select actions. The benchmark uses few-shot prompting and aggregates scores across environments for overall assessment.
The overall score is computed as:
S_{\text{overall}} = \frac{1}{8} \sum_{i=1}^{8} S_i
where ''S_i'' is the success rate in environment ''i''.
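The unweighted average above is straightforward to reproduce; the per-environment success rates below are illustrative placeholders, not the paper's reported numbers:

```python
def overall_score(env_scores):
    """Unweighted mean of per-environment success rates (S_overall)."""
    return sum(env_scores.values()) / len(env_scores)

# Hypothetical success rates in [0, 1] for the eight environments
scores = {"os": 0.42, "db": 0.32, "kg": 0.58, "dcg": 0.26,
          "ltp": 0.16, "hh": 0.78, "ws": 0.61, "wb": 0.29}
print(f"S_overall = {overall_score(scores):.4f}")  # → S_overall = 0.4275
```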
===== Key Results: Commercial vs. Open-Source Gap =====
The most striking finding is the dramatic performance disparity between commercial and open-source models:
^ Model ^ Type ^ Overall Score ^ Notable Strengths ^
| GPT-4 (0613) | Commercial | Highest overall | Strong across all 8 environments |
| Claude-2 | Commercial | Second tier | Competitive in reasoning tasks |
| GPT-3.5-turbo | Commercial | Mid-range | Decent on simpler environments |
| CodeLlama-34B | Open-source | Low | Only competitive on OS/DB tasks |
| Vicuna-13B | Open-source | Very Low | Near-zero on most environments |
| LLaMA-2-70B | Open-source | Very Low | Struggles with multi-turn interaction |
GPT-4 outperformed open-source competitors by 2-5x on multi-turn tasks. Open-source models (up to 70B parameters) scored near zero in demanding environments like DCG and Web Browsing.
===== Failure Analysis =====
The benchmark identified three primary failure modes:
* **Poor long-term reasoning**: Agents lose track of goals over extended interaction sequences, failing to maintain coherent plans across 10+ turns.
* **Weak decision-making**: Incorrect tool selection, suboptimal action sequences, and inability to recover from errors (e.g., wrong SQL syntax in DB, invalid bash commands in OS).
* **Instruction following deficits**: Misinterpreting prompts or environmental feedback, especially prevalent in open-source models that lack instruction-tuning quality.
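The error-recovery failure mode in particular can be partially mitigated by feeding the environment's error message back to the model and retrying. A minimal sketch of that loop, using hypothetical `call_llm` and environment interfaces (not AgentBench's real APIs):

```python
def act_with_retry(call_llm, env, conversation, max_retries=2):
    """Feed environment error messages back to the model so it can
    self-correct. `call_llm` and `env` are hypothetical stand-ins."""
    observation = ""
    for _ in range(max_retries + 1):
        action = call_llm(conversation)
        ok, observation = env.step(action)  # ok=False on e.g. bad SQL
        conversation.append({"role": "assistant", "content": action})
        conversation.append({"role": "user", "content": observation})
        if ok:
            break
    return observation

class FakeEnv:
    """Toy environment: the first action fails, the second succeeds."""
    def __init__(self):
        self.calls = 0
    def step(self, action):
        self.calls += 1
        if self.calls == 1:
            return False, "ERROR 1064: syntax error near 'SELEC'"
        return True, "count: 3"

def fake_llm(conversation):
    # A real agent would read the error from `conversation` and revise.
    return "SELECT COUNT(*) FROM users;"

print(act_with_retry(fake_llm, FakeEnv(), []))  # → count: 3
```

Open-source models in the benchmark often failed exactly here: even with the error message in context, they repeated the same invalid action instead of revising it.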
===== Code Example =====
# AgentBench evaluation loop (simplified sketch; make_env, call_llm,
# parse_action, and load_tasks are placeholders, not the benchmark's API)
ENVIRONMENTS = [
    "os_interaction",   # Linux shell tasks
    "db",               # SQL database queries
    "knowledge_graph",  # knowledge-base queries
    "card_game",        # digital card game
    "ltp",              # lateral thinking puzzles
    "alfworld",         # household tasks
    "webshop",          # web shopping
    "webarena",         # web browsing
]

def evaluate_agent(model_name, env_name, tasks):
    results = []
    for task in tasks:
        env = make_env(env_name)           # fresh environment per task
        observation = env.reset(task)      # initial textual observation
        conversation = [
            {"role": "system", "content": task["system_prompt"]},
            {"role": "user", "content": observation},
        ]
        for turn in range(task["max_turns"]):
            response = call_llm(model_name, conversation)
            action = parse_action(response)
            observation = env.step(action)
            conversation.append({"role": "assistant", "content": response})
            conversation.append({"role": "user", "content": observation})
            if env.is_done():
                break
        results.append(env.evaluate())     # 1.0 on success, 0.0 otherwise
    return sum(results) / len(results)

scores = {}
for env_name in ENVIRONMENTS:
    tasks = load_tasks(env_name)
    scores[env_name] = evaluate_agent("gpt-4", env_name, tasks)
overall = sum(scores.values()) / len(scores)
print(f"Overall AgentBench Score: {overall:.3f}")
===== Architecture and Design =====
AgentBench uses a unified evaluation pipeline with:
* **Containerized environments**: Each environment runs in Docker for reproducibility (e.g., a MySQL server for the DB task)
* **Standardized agent interface**: All environments use a common text-in/text-out protocol
* **Function-calling support**: The v2 (AgentBench FC) edition supports modern function-calling APIs
* **Integrated with AgentRL**: The latest version supports end-to-end multi-task, multi-turn RL training
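The common text-in/text-out protocol can be captured by a small interface. The class names below are a sketch of the idea, not AgentBench's actual code:

```python
from abc import ABC, abstractmethod

class AgentEnvironment(ABC):
    """Uniform text-in/text-out contract shared by all environments
    (illustrative sketch, not AgentBench's real class)."""

    @abstractmethod
    def reset(self, task: dict) -> str:
        """Start a task; return the initial textual observation."""

    @abstractmethod
    def step(self, action: str) -> tuple[str, bool]:
        """Apply a textual action; return (observation, done)."""

class EchoEnv(AgentEnvironment):
    """Trivial concrete example: echoes actions, ends after 3 steps."""
    def reset(self, task):
        self.turns = 0
        return task.get("prompt", "")
    def step(self, action):
        self.turns += 1
        return f"echo: {action}", self.turns >= 3
```

Because every environment speaks the same protocol, adding a new task domain only requires implementing these two methods; the evaluation loop stays unchanged.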
===== Insights for Agent Development =====
* Improving instruction following through high-quality multi-round alignment data yields the largest gains
* Training on code has ambivalent effects -- it helps OS/DB tasks but can hurt creative reasoning tasks like LTP
* Scale alone is insufficient; 70B open-source models still trail commercial models significantly
* Long-context handling and error recovery are critical bottlenecks for all models
===== References =====
* [[https://arxiv.org/abs/2308.03688|Liu et al. (2023) - AgentBench: Evaluating LLMs as Agents]]
* [[https://github.com/THUDM/AgentBench|Official AgentBench Repository (THUDM)]]
* [[https://openreview.net/forum?id=zAdUB0aCTQ|ICLR 2024 Proceedings]]
===== See Also =====
* [[tau_bench|tau-bench]] - Complementary benchmark focusing on Tool-Agent-User interaction
* [[llm_as_judge|LLM-as-a-Judge]] - Automated evaluation methodology for LLM outputs
* [[taskweaver|TaskWeaver]] - Code-first agent framework evaluated on similar tasks