Terminal-Bench

Terminal-Bench is a benchmark for evaluating AI agents on realistic command-line interface (CLI) and DevOps tasks. Developed by Stanford University and the Laude Institute, it tests whether AI agents can reliably perform multi-step terminal operations in sandboxed environments, including compiling code, configuring systems, running tools, and navigating filesystems.

Overview

Traditional AI benchmarks focus on text generation or question answering. Terminal-Bench instead measures end-to-end operational reliability in real terminal environments. Each task provides a unique environment with human-written solutions and automated verification tests, ensuring reproducible and meaningful evaluation.

The benchmark targets agent systems rather than raw language models, evaluating the complete pipeline of perception, reasoning, planning, and execution in CLI contexts.

Terminal-Bench 2.0

Terminal-Bench 2.0 is the harder successor to the original benchmark, featuring 89 curated real-world tasks across multiple domains.

Every task in 2.0 was manually verified by domain experts, filtered for quality over quantity, and inspired by real practitioner workflows.

Evaluation Methodology

Tasks are evaluated in isolated, sandboxed terminal environments. Agents receive a natural language goal and must autonomously execute the required steps. Verification is performed through automated test suites that check for correct outputs, file states, and system configurations.
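The pass/fail verification step described above can be sketched in Python. This is hypothetical harness code, not Terminal-Bench's actual implementation; the script name `run_tests.sh` follows the example task spec shown later in this article.

```python
import subprocess


def verify_task(workdir: str, script: str = "run_tests.sh") -> bool:
    """Run a task's verification script inside its sandbox directory.

    Pass/fail is read from the script's exit code: zero means pass.
    """
    result = subprocess.run(
        ["bash", script],
        cwd=workdir,           # script paths resolve inside the sandbox
        capture_output=True,
        text=True,
        timeout=600,           # hypothetical per-task time limit
    )
    return result.returncode == 0
```

Because success is defined purely by the exit code, the same harness works for any kind of check, whether it compares file contents, inspects system configuration, or re-runs a compiled binary.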

Key evaluation dimensions include end-to-end task completion, correctness of the resulting file and system state, and reproducibility of results.

Results

Frontier AI models and agent systems score less than 65% on Terminal-Bench 2.0, revealing significant gaps in multi-step planning and reliable execution of terminal operations.

# Example: Programmatic evaluation with Terminal-Bench
# Tasks are defined with environment specs and verification scripts
task = {
    "id": "pytorch-model-cli",
    "description": "Implement a CLI tool for PyTorch model inference on MNIST",
    "environment": "ubuntu-22.04-python3.10",
    "timeout": 600,
    "verification": "run_tests.sh"
}
# Agent receives the task description and interacts with the terminal
# Success is determined by the verification script exit code
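A harness consuming such a spec could model it as a typed record. The following is a sketch, not Terminal-Bench's actual API; the field names are taken directly from the example dictionary above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSpec:
    """Typed view of a task definition (fields mirror the example above)."""
    id: str
    description: str
    environment: str
    timeout: int
    verification: str

    @classmethod
    def from_dict(cls, spec: dict) -> "TaskSpec":
        # Raises TypeError on missing or unexpected keys,
        # so malformed task definitions fail fast at load time.
        return cls(**spec)
```

Loading each task dictionary through `TaskSpec.from_dict` keeps task definitions consistent across the suite, since any missing or extra key is rejected immediately.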

Architecture

Terminal-Bench uses a client-server architecture where:

  1. The server provisions isolated environments (containers) per task
  2. The agent client connects via terminal interface
  3. Verification scripts run post-execution to determine pass/fail
  4. Results are aggregated into leaderboard scores
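Step 4 above, aggregation, amounts to computing a pass rate over per-task results. A minimal sketch (the task names in the example are illustrative, not real benchmark tasks):

```python
def aggregate_score(results: dict[str, bool]) -> float:
    """Turn per-task pass/fail results into a leaderboard percentage."""
    if not results:
        return 0.0
    return 100.0 * sum(results.values()) / len(results)


# Illustrative task names; one of two tasks passed.
score = aggregate_score({"pytorch-model-cli": True, "fix-build": False})  # 50.0
```

Because each task is strictly pass/fail, a leaderboard score is simply the percentage of tasks whose verification scripts exited with code zero.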
