Terminal-Bench is a benchmark for evaluating AI agents on realistic command-line interface (CLI) and DevOps tasks. Developed by Stanford University and the Laude Institute, it tests whether AI agents can reliably perform multi-step terminal operations in sandboxed environments, including compiling code, configuring systems, running tools, and navigating filesystems.
Whereas many traditional AI benchmarks focus on text generation or question answering, Terminal-Bench measures end-to-end operational reliability in real terminal environments. Each task provides its own environment, a human-written solution, and automated verification tests, so results are reproducible and tied to whether the work was actually completed.
The benchmark targets agent systems rather than raw language models, evaluating the complete pipeline of perception, reasoning, planning, and execution in CLI contexts.
Terminal-Bench 2.0 is the harder successor to the original benchmark, featuring 89 curated real-world tasks across multiple domains:
Every task in 2.0 was manually verified by domain experts, filtered for quality over quantity, and inspired by real practitioner workflows.
Tasks are evaluated in isolated, sandboxed terminal environments. Agents receive a natural language goal and must autonomously execute the required steps. Verification is performed through automated test suites that check for correct outputs, file states, and system configurations.
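To make the idea of checking file states concrete, here is a minimal sketch of what one such check might look like. It is not Terminal-Bench's own test code: the function names `check_file_state` and `verify`, and the example file `out.txt`, are hypothetical and chosen only for illustration.

```python
from pathlib import Path
from typing import Iterable, Tuple


def check_file_state(root: Path, relative_path: str, expected_text: str) -> bool:
    """Return True if the file exists under root and contains expected_text."""
    target = root / relative_path
    return target.is_file() and expected_text in target.read_text()


def verify(root: Path, checks: Iterable[Tuple[str, str]]) -> bool:
    """Hypothetical verifier: the task passes only if every check passes."""
    return all(check_file_state(root, path, text) for path, text in checks)
```

A call such as `verify(Path("/workdir"), [("out.txt", "accuracy")])` would pass only if the agent left behind an `out.txt` containing the word "accuracy". A real test suite would likely also inspect command output and system configuration, as described above.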
Key evaluation dimensions include:
Frontier AI models and agent systems score less than 65% on Terminal-Bench 2.0, revealing significant gaps in:
# Example: Programmatic evaluation with Terminal-Bench
# Tasks are defined with environment specs and verification scripts
task = {
    "id": "pytorch-model-cli",
    "description": "Implement a CLI tool for PyTorch model inference on MNIST",
    "environment": "ubuntu-22.04-python3.10",
    "timeout": 600,
    "verification": "run_tests.sh"
}
# Agent receives the task description and interacts with the terminal
# Success is determined by the verification script exit code
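The last comment in the example, that success is determined by the verification script's exit code, can be sketched with Python's standard `subprocess` module. This is an illustrative assumption about how a harness could apply the rule, not the benchmark's actual implementation; `run_verification` is a hypothetical helper, and treating a timeout as a failure is a design choice made here for the sketch.

```python
import subprocess
from typing import Sequence


def run_verification(command: Sequence[str], timeout_seconds: float) -> bool:
    """Run a verification command; pass only if it exits with code 0 in time."""
    try:
        completed = subprocess.run(
            list(command),
            timeout=timeout_seconds,
            capture_output=True,  # keep the script's output out of the harness log
        )
    except subprocess.TimeoutExpired:
        # Assumption for this sketch: exceeding the time limit counts as failure.
        return False
    return completed.returncode == 0
```

With the task definition above, a harness might call `run_verification(["bash", task["verification"]], task["timeout"])` inside the task's sandbox, assuming the script is present in the working directory.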
Terminal-Bench uses a client-server architecture where: