====== Agent Evaluation ======

**Agent evaluation** encompasses the benchmarks, metrics, and methodologies used to assess the capabilities of AI agents across domains including software engineering, web navigation, code generation, tool use, and general reasoning. As of 2025, standardized benchmarks have become critical for comparing agent frameworks and tracking progress in autonomous AI capabilities.

===== SWE-Bench =====

**SWE-Bench** tests AI agents on real-world software engineering tasks derived from GitHub issues. Agents must edit codebases, run tests, and resolve bugs in repositories like Django, SymPy, and scikit-learn, interacting via bash tools in Dockerized environments. **SWE-Bench Verified** is a curated subset of 500 tasks with human-verified fixes for stricter evaluation, addressing concerns about ambiguous or flawed test cases in the original benchmark.

^ Metric ^ Value ^
| Task Source | Real GitHub issues and PRs |
| Environment | Dockerized repository snapshots |
| Top Scores (2025) | >60% resolution rate |
| Key Innovation | End-to-end coding + testing |

Top-performing agents achieve over 60% resolution through high-level planners, specialized training, and memory-augmented architectures.

===== GAIA =====

**GAIA** (General AI Assistants) assesses zero-shot reasoning across question-answering, tool use, and multi-step planning with real-world tasks. It includes 466 tasks across three difficulty levels, requiring agents to integrate web search, code execution, and interpretation without task-specific training data.

^ Level ^ Description ^ Top Scores (2025) ^
| Level 1 | Simple factual questions | ~70-80% |
| Level 2 | Multi-step reasoning | ~60-70% |
| Level 3 | Complex multi-tool tasks | ~50-60% |

===== WebArena =====

**WebArena** benchmarks web-browsing agents in realistic simulations of e-commerce sites, social forums, and content management systems.
It contains 804 tasks across four categories: Web Shopping, Web Search, Social Interaction, and Content Editing. Agents use browser tools for navigation, form-filling, and decision-making. Early GPT-4 agents scored approximately 14%, improving to over 60% by 2025. **IBM CUGA leads at 61.7%** as of early 2025.

===== AgentBench =====

**AgentBench** is a comprehensive suite testing language agents on decision-making, reasoning, and tool usage across 8 diverse environments:

  * Operating system interaction
  * Database querying
  * Web browsing
  * Knowledge graph navigation
  * Lateral thinking puzzles
  * Digital card games
  * Household simulation
  * Web shopping

The benchmark includes 2,000+ tasks with success measured by goal completion rates across all environments.

===== HumanEval =====

**HumanEval** evaluates code generation by prompting models to complete 164 Python functions from docstrings. Scoring uses **pass@k** — the probability that at least one of k generated solutions passes all unit tests. While originally designed for LLM evaluation rather than agents, HumanEval has been adapted for tool-augmented coding scenarios. Top 2025 models exceed 90% pass@1.
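The pass@k metric described above can be estimated without bias using the combinatorial estimator from Chen et al. (2021): generate n samples per task, count the c that pass all tests, and compute 1 - C(n-c, k)/C(n, k). A minimal sketch (the function name and numerically stable product form are illustrative choices, not part of any benchmark's official tooling):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples generated per task
    c: number of samples that pass all unit tests
    k: evaluation budget
    Returns the probability that at least one of k randomly
    drawn samples (without replacement) passes.
    """
    if n - c < k:
        # Too few failing samples to fill a k-draw with failures,
        # so at least one success is guaranteed.
        return 1.0
    # 1 - C(n-c, k) / C(n, k), expanded as a product for stability.
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))
```

For example, with n=2 samples of which c=1 passes, pass@1 is 0.5, since drawing one sample at random succeeds half the time; a benchmark-wide score averages this estimate over all tasks.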
===== Other Notable Benchmarks =====

  * **CUB** (Computer Use Benchmark) — 106 end-to-end workflows across 7 industries for GUI agents; top score 10.4%
  * **OSWorld** — Realistic operating system environment for multimodal desktop agents
  * **Mind2Web** — 2,350 tasks on 137 live websites for web agent evaluation
  * **BFCL v4** (Berkeley Function-Calling Leaderboard) — Multi-step tool use evaluation
  * **Terminal-Bench** — Terminal-based task completion
  * **tau-Bench** — Multi-turn workflow evaluation
  * **ALFWorld** — Household simulation tasks

===== Leaderboard Summary (2025) =====

^ Benchmark ^ Top Performer ^ Score ^ Notes ^
| SWE-Bench Verified | Advanced planners | >60% | End-to-end software engineering |
| WebArena | IBM CUGA | 61.7% | Web browsing autonomy |
| GAIA Level 3 | Leading LLMs | ~50-60% | General reasoning |
| HumanEval | Top LLMs | >90% pass@1 | Code generation |
| CUB | Writer Action Agent | 10.4% | Computer use (very challenging) |
| AgentBench | Domain-specific | ~50-70% avg | Multi-environment |

===== Code Example =====

<code python>
# Simple evaluation harness pattern
from typing import Callable

def evaluate_agent(
    agent_fn: Callable,
    benchmark: list[dict],
    metric_fn: Callable
) -> dict:
    """Evaluate an agent against a benchmark dataset."""
    results = []
    for task in benchmark:
        prediction = agent_fn(task['input'])
        score = metric_fn(prediction, task['expected'])
        results.append({
            'task_id': task['id'],
            'score': score,
            'prediction': prediction
        })
    total = len(results)
    passed = sum(1 for r in results if r['score'] >= 1.0)
    return {
        'total_tasks': total,
        'passed': passed,
        'pass_rate': passed / total,
        'results': results
    }

# Example usage
scores = evaluate_agent(
    agent_fn=my_coding_agent,
    benchmark=swe_bench_tasks,
    metric_fn=test_pass_metric
)
print(f'Pass rate: {scores["pass_rate"]:.1%}')
</code>

===== References =====

  * [[https://arxiv.org/abs/2310.06770|Jimenez et al., 2023 — SWE-Bench: Can Language Models Resolve Real-World GitHub Issues?]]
  * [[https://arxiv.org/abs/2311.12983|Mialon et al., 2023 — GAIA: A Benchmark for General AI Assistants]]
  * [[https://arxiv.org/abs/2307.13854|Zhou et al., 2023 — WebArena: A Realistic Web Environment for Building Autonomous Agents]]
  * [[https://arxiv.org/abs/2308.03688|Liu et al., 2023 — AgentBench: Evaluating LLMs as Agents]]
  * [[https://arxiv.org/abs/2107.03374|Chen et al., 2021 — Evaluating Large Language Models Trained on Code (HumanEval)]]
  * [[https://www.swebench.com|SWE-Bench Leaderboard]]
  * [[https://webarena.dev|WebArena Leaderboard]]

===== See Also =====

  * [[computer_use]] — Computer Use and GUI agents
  * [[devin]] — Devin AI software engineer (SWE-Bench participant)
  * [[multi_agent_systems]] — Multi-agent system architectures
  * [[claude_code]] — Claude Code CLI