====== Terminal-Bench 2 Benchmark ======

**[[terminal_bench|Terminal-Bench]] 2** is a specialized benchmark designed to evaluate the performance of coding agents and their system prompts, known as "harnesses," on command-line programming tasks. The benchmark measures how effectively automated harness engineering techniques improve agent performance compared to manually designed baselines.

===== Overview and Purpose =====

Terminal-Bench 2 serves as a quantitative evaluation framework for assessing coding agent capabilities in terminal-based environments. Rather than focusing solely on agent architecture or model capabilities, the benchmark specifically measures the effectiveness of **harness evolution**, the automated optimization of the system prompts and execution frameworks that guide agent behavior. This represents a shift from traditional static prompt design toward dynamic, learned prompt optimization (([[https://cobusgreyling.substack.com/p/auto-agentic-harness-engineering|Greyling - Auto-Agentic Harness Engineering (2026)]])).

The benchmark evaluates agents on their ability to understand and execute complex terminal commands, parse output, handle errors, and maintain context across multi-step programming tasks. Performance is measured as the success rate on representative coding challenges executed in terminal environments.

===== Performance Metrics and Results =====

Terminal-Bench 2 demonstrates significant gains from automated harness optimization. **Automatically evolved harnesses** improved baseline performance from 69.7% to 77.0% across ten iterations of optimization, an improvement of 7.3 percentage points (([[https://cobusgreyling.substack.com/p/auto-agentic-harness-engineering|Greyling - Auto-Agentic Harness Engineering (2026)]])). This exceeds the performance of hand-crafted baseline systems, notably outperforming **[[codex_cli|Codex-CLI]]**, a manually designed baseline that achieved 71.9% accuracy.

This result suggests that systematic optimization of prompt engineering yields measurable gains in agent effectiveness, and that manually designed harnesses may leave significant performance on the table.

===== Harness Evolution Methodology =====

The benchmark evaluates harness evolution, an automated approach to optimizing the system prompts and execution strategies that guide agent behavior. Rather than relying on expert-crafted prompts, harness evolution iteratively refines how agents approach terminal-based tasks. The process typically involves:

  * **Iteration cycles**: multiple rounds of optimization in which promising harness variants are tested and refined
  * **Performance feedback**: measurement against benchmark tasks to identify which harness modifications improve results
  * **Variant exploration**: testing different prompt structures, instruction orderings, and behavioral constraints
  * **Convergence tracking**: monitoring improvement trajectories across successive optimization iterations

The ten-iteration progression shown in Terminal-Bench 2 demonstrates that systematic evolution can progressively improve agent performance, with cumulative gains emerging from incremental refinements (([[https://cobusgreyling.substack.com/p/auto-agentic-harness-engineering|Greyling - Auto-Agentic Harness Engineering (2026)]])). A minimal sketch of such a loop follows.
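To make the loop concrete, the following is a minimal Python sketch of a hill-climbing evolution loop over a system prompt. Everything here is hypothetical illustration: the source does not publish Terminal-Bench 2's evolution code, ''evaluate'' merely simulates success rates, and the baseline prompt and mutation fragments are invented placeholders.

<code python>
import random

# Hypothetical baseline system prompt standing in for a hand-crafted harness.
BASELINE_PROMPT = "You are a coding agent. Execute terminal tasks step by step."

# Invented instruction fragments a mutation step might splice in
# ("variant exploration" in the terminology above).
FRAGMENTS = [
    "Verify each command's exit code before proceeding.",
    "Re-read error output and retry with a corrected command.",
    "Summarize the remaining work before each multi-step action.",
]

def evaluate(prompt: str) -> float:
    """Stand-in scorer: a real harness would run the agent with this system
    prompt on every benchmark task and return the fraction solved."""
    rng = random.Random(prompt)           # deterministic per prompt variant
    return 0.65 + 0.15 * rng.random()     # simulated success rate

def mutate(prompt: str, rng: random.Random) -> str:
    """One illustrative mutation operator: append an extra instruction."""
    return prompt + " " + rng.choice(FRAGMENTS)

def evolve(prompt: str, iterations: int = 10) -> tuple[str, float]:
    """Accept-if-better loop: test one variant per iteration and keep
    whichever prompt has scored best so far ("performance feedback")."""
    rng = random.Random(0)
    best_prompt, best_score = prompt, evaluate(prompt)
    for i in range(1, iterations + 1):
        candidate = mutate(best_prompt, rng)
        score = evaluate(candidate)
        if score > best_score:            # keep a variant only if it improves
            best_prompt, best_score = candidate, score
        print(f"iteration {i}: best success rate {best_score:.3f}")
    return best_prompt, best_score

if __name__ == "__main__":
    evolve(BASELINE_PROMPT)
</code>

The printed per-iteration best score corresponds to the "convergence tracking" described above. A production system would likely explore many variants per iteration and use richer mutation operators, but this accept-if-better skeleton conveys the basic idea.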
===== Applications and Implications =====

Terminal-Bench 2 has direct relevance for several areas of AI development:

  * **Prompt optimization research**: the benchmark provides empirical evidence that automated harness engineering can outperform manual prompt design, suggesting new directions for prompt optimization research
  * **Agent development practices**: results indicate that organizations developing coding agents should consider automated harness optimization alongside model selection and architecture choices
  * **Baseline comparison**: the benchmark establishes empirical performance targets against which future [[coding_agent|coding agent]] systems can be measured
  * **Meta-prompt engineering**: the approach demonstrates that the prompts guiding agent behavior can themselves be systematically optimized, rather than treated as fixed manual constructs

===== Challenges and Limitations =====

While Terminal-Bench 2 demonstrates measurable gains from automated harness optimization, several considerations remain:

  * **Task scope**: the benchmark focuses specifically on terminal-based coding tasks; generalization to other agent domains (web interaction, file systems, API usage) requires further validation
  * **Computational cost**: iterative harness evolution requires multiple evaluation cycles, creating computational overhead compared to static prompt selection
  * **Transferability**: harnesses optimized on Terminal-Bench 2 may not transfer effectively to different terminal environments, operating systems, or command sets
  * **Baseline selection**: performance gains are measured relative to hand-crafted baselines; comparison with other automated optimization approaches would provide additional context

===== See Also =====

  * [[terminal_bench|Terminal-Bench]]
  * [[codex_cli|Codex-CLI]]
  * [[swe_bench|SWE-Bench]]
  * [[claude_opus_4_6_forgecode_vs_capy|Claude Opus 4.6 with ForgeCode vs Capy]]
  * [[core_bench|CORE-Bench]]

===== References =====