====== Terminal-Bench ======

Terminal-Bench is a benchmark for evaluating AI agents on realistic command-line interface (CLI) and DevOps tasks. Developed by **Stanford University** and the **Laude Institute**, it tests whether AI agents can reliably perform multi-step terminal operations in sandboxed environments, including compiling code, configuring systems, running tools, and navigating filesystems.

===== Overview =====

Traditional AI benchmarks focus on text generation or question answering. Terminal-Bench instead measures **end-to-end operational reliability** in real terminal environments. Each task provides a unique environment with human-written solutions and automated verification tests, ensuring reproducible and meaningful evaluation.

The benchmark targets agent systems rather than raw language models, evaluating the complete pipeline of perception, reasoning, planning, and execution in CLI contexts.

===== Terminal-Bench 2.0 =====

Terminal-Bench 2.0 is the harder successor to the original benchmark, featuring **89 curated real-world tasks** across multiple domains:

  * **Software Engineering** - Building, debugging, and deploying applications
  * **System Administration** - Configuring servers, managing services, networking
  * **Scientific Workflows** - Data-processing pipelines, computational tasks
  * **Security** - Input validation, vulnerability assessment (e.g. SQL injection, CSRF/SSRF per CWE classifications)
  * **Model Training** - ML pipeline setup, inference configuration

Every task in 2.0 was manually verified by domain experts, filtered for quality over quantity, and inspired by real practitioner workflows.

===== Evaluation Methodology =====

Tasks are evaluated in isolated, sandboxed terminal environments. Agents receive a natural-language goal and must autonomously execute the required steps. Verification is performed through automated test suites that check for correct outputs, file states, and system configurations, as sketched below.
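The following is a minimal sketch, in Python, of the kind of check such a verification suite might perform. The file paths, expected contents, and the ''verify_task'' helper are illustrative assumptions for this article, not part of the official harness.

<code python>
"""Illustrative sketch of a Terminal-Bench-style verification step.

Assumption: the (hypothetical) task asked the agent to produce
/app/results.csv and apply a configuration change; the paths and
expected values below are invented for illustration only.
"""
import csv
import pathlib
import sys


def verify_task() -> bool:
    results = pathlib.Path("/app/results.csv")

    # 1. The expected output file must exist (file-state check).
    if not results.is_file():
        return False

    # 2. Its contents must match what the task demanded (output check).
    with results.open() as f:
        rows = list(csv.reader(f))
    if not rows or rows[0] != ["id", "score"]:
        return False

    # 3. A configuration change must have been applied (system-state check).
    if not pathlib.Path("/etc/myservice/enabled").exists():
        return False

    return True


if __name__ == "__main__":
    # The harness reads the exit code: 0 = task passed, non-zero = failed.
    sys.exit(0 if verify_task() else 1)
</code>

In the benchmark itself, each task ships its own test suite (for example the ''run_tests.sh'' entry point shown in the task definition below), and the harness treats a zero exit code as a pass.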
Key evaluation dimensions include:

  * Multi-step reasoning and planning
  * Tool discovery and usage
  * Error recovery and adaptation
  * Environment comprehension

===== Results =====

Frontier AI models and agent systems score **less than 65%** on Terminal-Bench 2.0, revealing significant gaps in:

  * Chaining complex multi-step operations
  * Recovering from unexpected errors
  * Adapting to unfamiliar tool configurations

<code python>
# Example: programmatic evaluation with Terminal-Bench.
# Tasks are defined with environment specs and verification scripts.
task = {
    "id": "pytorch-model-cli",
    "description": "Implement a CLI tool for PyTorch model inference on MNIST",
    "environment": "ubuntu-22.04-python3.10",
    "timeout": 600,
    "verification": "run_tests.sh",
}

# The agent receives the task description and interacts with the terminal.
# Success is determined by the verification script's exit code.
</code>

===== Architecture =====

Terminal-Bench uses a client-server architecture where:

  - The **server** provisions isolated environments (containers) per task
  - The **agent client** connects via a terminal interface
  - **Verification scripts** run post-execution to determine pass/fail
  - Results are aggregated into leaderboard scores

===== References =====

  * [[https://arxiv.org/abs/2601.11868|Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Computer Terminal Environments (arXiv:2601.11868)]]
  * [[https://www.tbench.ai|Terminal-Bench Official Site and Leaderboard]]
  * [[https://github.com/harbor-framework/terminal-bench|Terminal-Bench GitHub Repository]]
  * [[https://www.laude.org/updates/terminal-bench|Laude Institute - Terminal-Bench]]
  * [[https://snorkel.ai/blog/terminal-bench-2-0-raising-the-bar-for-ai-agent-evaluation/|Snorkel AI - Terminal-Bench 2.0]]

===== See Also =====

  * [[gaia_benchmark]] - General AI assistant benchmark with multi-step real-world tasks
  * [[computer_use_benchmark]] - GUI interaction benchmarks for agent computer proficiency
  * [[humanitys_last_exam]] - Expert-level question benchmark for frontier models
  * [[agent_simulation_environments]] - Environments for training and evaluating agents