====== AgentBench ======

AgentBench is a comprehensive multi-dimensional benchmark introduced by Liu et al. (2023) for evaluating Large Language Models (LLMs) as autonomous agents. Published at ICLR 2024, it comprises 8 distinct interactive environments that assess LLM reasoning, decision-making, and tool-use abilities in multi-turn, open-ended settings. The benchmark revealed a significant performance gap between commercial and open-source models, establishing a rigorous standard for LLM agent evaluation.

===== The Eight Environments =====

AgentBench evaluates agents across diverse interactive tasks requiring different cognitive and operational skills:

- **Operating System (OS)**: Agents interact with a Linux bash shell to perform file operations, process management, system queries, and scripting tasks.
- **Database (DB)**: Agents query MySQL databases using SQL for data retrieval, manipulation, and multi-step analytical reasoning.
- **Knowledge Graph (KG)**: Agents navigate a large knowledge base (Freebase) via structured queries to explore entity relationships and answer complex questions.
- **Digital Card Game (DCG)**: Agents play a turn-based card battle game (Aquawar), requiring strategic planning, opponent modeling, and reasoning under incomplete information.
- **Lateral Thinking Puzzles (LTP)**: Agents solve puzzles through sequential yes/no questioning, testing creative and deductive reasoning.
- **HouseHolding (HH)**: Agents manage simulated household tasks using the ALFWorld environment, combining embodied reasoning with natural language.
- **Web Shopping (WS)**: Agents browse simulated e-commerce sites to search for products, compare options, and complete purchases matching given specifications.
- **Web Browsing (WB)**: Agents navigate real web pages to find information, follow links, and extract data from dynamic content.

===== Evaluation Framework =====

Each environment uses a **task-specific success rate** as its primary metric -- the proportion of tasks completed correctly.
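The per-environment metric can be sketched as a plain success rate. This is an illustrative helper, not the official harness (which computes task-specific variants of this score):

```python
def success_rate(task_results):
    """Success rate over one environment.

    task_results: list of booleans, True if a task was solved correctly.
    Returns the fraction of solved tasks (0.0 for an empty list).
    """
    if not task_results:
        return 0.0
    return sum(task_results) / len(task_results)

# e.g. 7 of 10 OS tasks solved
print(success_rate([True] * 7 + [False] * 3))  # -> 0.7
```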
Evaluation follows a multi-turn interaction protocol in which agents receive observations and must select actions. The benchmark uses few-shot prompting and aggregates scores across environments for overall assessment. The overall score is computed as:

S_{\text{overall}} = \frac{1}{8} \sum_{i=1}^{8} S_i

where ''S_i'' is the success rate in environment ''i''.

===== Key Results: Commercial vs. Open-Source Gap =====

The most striking finding is the dramatic performance disparity between commercial and open-source models:

^ Model ^ Type ^ Overall Score ^ Notable Strengths ^
| GPT-4 (0613) | Commercial | Highest overall | Strong across all 8 environments |
| Claude-2 | Commercial | Second tier | Competitive on reasoning tasks |
| GPT-3.5-turbo | Commercial | Mid-range | Decent on simpler environments |
| CodeLlama-34B | Open-source | Low | Only competitive on OS/DB tasks |
| Vicuna-13B | Open-source | Very low | Near zero on most environments |
| LLaMA-2-70B | Open-source | Very low | Struggles with multi-turn interaction |

GPT-4 outperformed open-source competitors by 2-5x on multi-turn tasks. Open-source models (up to 70B parameters) scored near zero in demanding environments such as DCG and Web Browsing.

===== Failure Analysis =====

The benchmark identified three primary failure modes:

* **Poor long-term reasoning**: Agents lose track of goals over extended interaction sequences, failing to maintain coherent plans across 10+ turns.
* **Weak decision-making**: Incorrect tool selection, suboptimal action sequences, and inability to recover from errors (e.g., wrong SQL syntax in DB, invalid bash commands in OS).
* **Instruction-following deficits**: Misinterpreting prompts or environmental feedback, especially prevalent in open-source models with weaker instruction tuning.
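The error-recovery weakness can be addressed at the agent-loop level by feeding execution errors back for a corrected action. The sketch below is hypothetical -- ''run_with_recovery'', the executor, and the error format are illustrative assumptions, not part of AgentBench:

```python
def run_with_recovery(execute, action, max_retries=2):
    """Hypothetical retry wrapper: execute an action; on failure, adopt a
    corrected action and try again. In a real agent loop, the error text
    would be appended to the conversation and the LLM asked for a fix."""
    for attempt in range(max_retries + 1):
        ok, result = execute(action)
        if ok:
            return result
        # Stand-in for "ask the model to repair its own action".
        action = result.get("suggested_fix", action)
    raise RuntimeError(f"action failed after {max_retries + 1} attempts")

# Toy executor: rejects a 'SELCT' typo and proposes the corrected SQL.
def toy_sql_executor(sql):
    if sql.startswith("SELCT"):
        return False, {"error": "syntax error",
                       "suggested_fix": sql.replace("SELCT", "SELECT")}
    return True, {"rows": [("ok",)]}

print(run_with_recovery(toy_sql_executor, "SELCT 1"))  # recovers on retry
```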
===== Code Example =====

<code python>
# AgentBench evaluation loop (simplified sketch). Helper functions such as
# make_env, call_llm, parse_action, and load_tasks stand in for the real harness.

ENVIRONMENTS = [
    "os_interaction",   # Linux shell tasks
    "db",               # SQL database queries
    "knowledge_graph",  # Knowledge base queries
    "card_game",        # Digital card game (Aquawar)
    "ltp",              # Lateral thinking puzzles
    "alfworld",         # Household tasks
    "webshop",          # Web shopping
    "web_browsing",     # Web browsing
]

def evaluate_agent(model_name, env_name, tasks):
    env = make_env(env_name)  # construct the interactive environment
    results = []
    for task in tasks:
        env.reset(task)
        conversation = [{"role": "system", "content": task["system_prompt"]}]
        for turn in range(task["max_turns"]):
            response = call_llm(model_name, conversation)
            action = parse_action(response)
            observation = env.step(action)
            conversation.append({"role": "assistant", "content": response})
            conversation.append({"role": "user", "content": observation})
            if env.is_done():
                break
        results.append(env.evaluate())  # task-specific score in [0, 1]
    return sum(results) / len(results)

scores = {}
for env_name in ENVIRONMENTS:
    tasks = load_tasks(env_name)
    scores[env_name] = evaluate_agent("gpt-4", env_name, tasks)

overall = sum(scores.values()) / len(scores)
print(f"Overall AgentBench Score: {overall:.3f}")
</code>

===== Architecture and Design =====

AgentBench uses a unified evaluation pipeline with:

* **Containerized environments**: Each environment runs in Docker for reproducibility (e.g., MySQL for DB, a hosted knowledge base for KG)
* **Standardized agent interface**: All environments use a common text-in/text-out protocol
* **Function-calling support**: The v2 edition (AgentBench FC) supports modern function-calling APIs
* **Integration with AgentRL**: The latest version supports end-to-end multi-task, multi-turn RL training

===== Insights for Agent Development =====

* Improving instruction following through high-quality multi-round alignment data yields the largest gains
* Training on code has mixed effects -- it helps OS/DB tasks but can hurt creative reasoning tasks like LTP
* Scale alone is insufficient; 70B open-source models still trail commercial models significantly
* Long-context handling and error recovery are critical bottlenecks for all models

===== References =====

* [[https://arxiv.org/abs/2308.03688|Liu et al. (2023) - AgentBench: Evaluating LLMs as Agents]]
* [[https://github.com/THUDM/AgentBench|Official AgentBench Repository (THUDM)]]
* [[https://openreview.net/forum?id=zAdUB0aCTQ|ICLR 2024 Proceedings]]

===== See Also =====

* [[tau_bench|tau-bench]] - Complementary benchmark focusing on Tool-Agent-User interaction
* [[llm_as_judge|LLM-as-a-Judge]] - Automated evaluation methodology for LLM outputs
* [[taskweaver|TaskWeaver]] - Code-first agent framework evaluated on similar tasks