Terminal-Bench

Terminal-Bench is a benchmark for evaluating AI agents on realistic command-line interface (CLI) and DevOps tasks. Developed by Stanford University and the Laude Institute, it tests whether AI agents can reliably perform multi-step terminal operations in sandboxed environments, including compiling code, configuring systems, running tools, and navigating filesystems.

Overview

Traditional AI benchmarks focus on text generation or question answering. Terminal-Bench instead measures end-to-end operational reliability in real terminal environments. Each task provides a unique environment with human-written solutions and automated verification tests, ensuring reproducible and meaningful evaluation.

The benchmark targets agent systems rather than raw language models, evaluating the complete pipeline of perception, reasoning, planning, and execution in CLI contexts.

Terminal-Bench 2.0

Terminal-Bench 2.0 is the harder successor to the original benchmark, featuring 89 curated real-world tasks across multiple domains:

Software Engineering - Building, debugging, and deploying applications
System Administration - Configuring servers, managing services, networking
Scientific Workflows - Data processing pipelines, computational tasks
Security - Input validation, vulnerability assessment (SQL injection, CSRF/SSRF per CWE standards)
Model Training - ML pipeline setup, inference configuration

Every task in 2.0 was manually verified by domain experts, filtered for quality over quantity, and inspired by real practitioner workflows.

Evaluation Methodology

Tasks are evaluated in isolated, sandboxed terminal environments. Agents receive a natural language goal and must autonomously execute the required steps. Verification is performed through automated test suites that check for correct outputs, file states, and system configurations.

Key evaluation dimensions include:

Multi-step reasoning and planning
Tool discovery and usage
Error recovery and adaptation
Environment comprehension

Results

Frontier AI models and agent systems score less than 65% on Terminal-Bench 2.0, revealing significant gaps in:

Chaining complex multi-step operations
Recovering from unexpected errors
Adapting to unfamiliar tool configurations

# Example: Programmatic evaluation with Terminal-Bench
# Tasks are defined with environment specs and verification scripts
task = {
    "id": "pytorch-model-cli",
    "description": "Implement a CLI tool for PyTorch model inference on MNIST",
    "environment": "ubuntu-22.04-python3.10",
    "timeout": 600,
    "verification": "run_tests.sh"
}
# Agent receives the task description and interacts with the terminal
# Success is determined by the verification script exit code

Architecture

Terminal-Bench uses a client-server architecture where:

The server provisions isolated environments (containers) per task
The agent client connects via terminal interface
Verification scripts run post-execution to determine pass/fail
Results are aggregated into leaderboard scores

AI Agent Knowledge Base

Sidebar

Table of Contents

Terminal-Bench

Overview

Terminal-Bench 2.0

Evaluation Methodology

Results

Architecture

References

See Also

AI Agent Knowledge Base

User Tools

Site Tools

Sidebar

Table of Contents

Terminal-Bench

Overview

Terminal-Bench 2.0

Evaluation Methodology

Results

Architecture

References

See Also

Page Tools