Terminal-Bench 2 Benchmark

Terminal-Bench 2 is a specialized benchmark designed to evaluate the performance of coding agents and their system prompts, known as “harnesses,” in executing command-line programming tasks. The benchmark measures how effectively automated harness engineering techniques can improve agent performance compared to manually designed baselines.

Overview and Purpose

Terminal-Bench 2 serves as a quantitative evaluation framework for assessing coding agent capabilities in terminal-based environments. Rather than focusing solely on agent architecture or model capabilities, the benchmark specifically measures the effectiveness of harness evolution—the automated optimization of system prompts and execution frameworks that guide agent behavior. This represents a shift from traditional static prompt design toward dynamic, learned prompt optimization 1).

The benchmark evaluates agents on their ability to understand and execute complex terminal commands, parse output, handle errors, and maintain context across multi-step programming tasks. Performance is measured through success rates on representative coding challenges executed in terminal environments.

Performance Metrics and Results

Terminal-Bench 2 demonstrates significant improvements through automated harness optimization. The benchmark shows that automatically evolved harnesses improved baseline performance from 69.7% to 77.0% across ten iterations of optimization, representing a 7.3 percentage point improvement 2).
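As a point of reference for how such figures are derived, the reported scores reduce to simple task-level arithmetic. The following is an illustrative sketch; the function names are not part of the benchmark itself:

```python
def success_rate(outcomes):
    """Fraction of benchmark tasks solved (outcomes: list of booleans)."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0

def percentage_point_gain(baseline, evolved):
    """Gain reported in percentage points, e.g. 69.7% -> 77.0% is +7.3 pp."""
    return round((evolved - baseline) * 100, 1)
```

For example, `percentage_point_gain(0.697, 0.770)` yields the 7.3 percentage point improvement cited above.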

The evolved harness also surpasses hand-crafted systems, notably outperforming Codex-CLI, a manually designed baseline that achieved 71.9% accuracy. The iterative evolution approach demonstrates that systematic optimization of prompt engineering can yield measurable gains in agent effectiveness, suggesting that manually designed harnesses may leave significant performance on the table.

Harness Evolution Methodology

The benchmark evaluates harness evolution—an automated approach to optimizing the system prompts and execution strategies that guide agent behavior. Rather than relying on expert-crafted prompts, harness evolution uses iterative techniques to automatically refine how agents approach terminal-based tasks. This process typically involves:

* Iteration cycles: Multiple rounds of optimization where promising harness variants are tested and refined
* Performance feedback: Measurement against benchmark tasks to identify which harness modifications improve results
* Variant exploration: Testing different prompt structures, instruction orderings, and behavioral constraints
* Convergence tracking: Monitoring improvement trajectories across successive optimization iterations
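The cycle described above can be sketched as a simple hill-climbing loop. This is a minimal illustration, not the benchmark's actual algorithm; `mutate` and `evaluate` are hypothetical stand-ins for a prompt-variation operator and a full benchmark run:

```python
def evolve_harness(base_prompt, mutate, evaluate, iterations=10):
    """Iteratively refine a harness prompt, keeping only improving variants.

    mutate(prompt)  -> a candidate prompt variant (hypothetical operator)
    evaluate(prompt) -> success rate on benchmark tasks, in [0.0, 1.0]
    """
    best_prompt = base_prompt
    best_score = evaluate(base_prompt)
    history = [best_score]          # convergence tracking across iterations
    for _ in range(iterations):     # iteration cycles
        candidate = mutate(best_prompt)      # variant exploration
        score = evaluate(candidate)          # performance feedback
        if score > best_score:               # keep only improving variants
            best_prompt, best_score = candidate, score
        history.append(best_score)
    return best_prompt, history
```

Because losing variants are discarded, the recorded score trajectory is monotonically non-decreasing, matching the cumulative-gain pattern described below.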

The ten-iteration progression on Terminal-Bench 2 demonstrates that systematic evolution can progressively improve agent performance, with cumulative gains emerging from incremental refinements 3).

Applications and Implications

Terminal-Bench 2 has direct relevance for several areas in AI development:

* Prompt optimization research: The benchmark provides empirical evidence that automated harness engineering outperforms manual prompt design, suggesting new directions for prompt optimization research
* Agent development practices: Results indicate that organizations developing coding agents should consider automated harness optimization alongside model selection and architecture choices
* Baseline comparison: The benchmark establishes empirical performance targets that future coding agent systems can be measured against
* Meta-prompt engineering: The approach demonstrates that the prompts guiding agent behavior can themselves be systematically optimized, rather than treated as fixed manual constructs

Challenges and Limitations

While Terminal-Bench 2 demonstrates improvements in automated harness optimization, several considerations remain relevant:

* Task scope: The benchmark focuses specifically on terminal-based coding tasks, and generalization to other agent domains (web interaction, file systems, API usage) requires further validation
* Computational cost: Iterative harness evolution requires multiple evaluation cycles, creating computational overhead compared to static prompt selection
* Transferability: Harnesses optimized on Terminal-Bench 2 may not transfer effectively to different terminal environments, operating systems, or command sets
* Baseline selection: Performance gains are measured relative to hand-crafted baselines; comparison with other automated optimization approaches would provide additional context

References