====== Terminal-Bench 2 Benchmark ======

**[[terminal_bench|Terminal-Bench]] 2** is a specialized benchmark designed to evaluate the performance of coding agents and their system prompts, known as "harnesses," on command-line programming tasks. The benchmark measures how effectively automated harness engineering techniques improve agent performance compared to manually designed baselines.

===== Overview and Purpose =====

Terminal-Bench 2 serves as a quantitative evaluation framework for assessing coding agent capabilities in terminal-based environments. Rather than focusing solely on agent architecture or model capabilities, the benchmark specifically measures the effectiveness of **harness evolution**, the automated optimization of the system prompts and execution frameworks that guide agent behavior. This represents a shift from traditional static prompt design toward dynamic, learned prompt optimization (([[https://cobusgreyling.substack.com/p/auto-agentic-harness-engineering|Greyling - Auto-Agentic Harness Engineering (2026)]])).

The benchmark evaluates agents on their ability to understand and execute complex terminal commands, parse output, handle errors, and maintain context across multi-step programming tasks. Performance is measured as the success rate on representative coding challenges executed in terminal environments.

===== Performance Metrics and Results =====

Terminal-Bench 2 demonstrates significant gains from automated harness optimization. **Automatically evolved harnesses** improved baseline performance from 69.7% to 77.0% across ten iterations of optimization, an improvement of 7.3 percentage points (([[https://cobusgreyling.substack.com/p/auto-agentic-harness-engineering|Greyling - Auto-Agentic Harness Engineering (2026)]])). This exceeds the performance of hand-crafted baseline systems, notably outperforming **[[codex_cli|Codex-CLI]]**, a manually designed baseline that achieved 71.9% accuracy.

This result suggests that systematic optimization of prompt engineering yields measurable gains in agent effectiveness, and that manually designed harnesses may leave significant performance on the table.

===== Harness Evolution Methodology =====

The benchmark evaluates harness evolution, an automated approach to optimizing the system prompts and execution strategies that guide agent behavior. Rather than relying on expert-crafted prompts, harness evolution iteratively refines how agents approach terminal-based tasks. The process typically involves:

  * **Iteration cycles**: multiple rounds of optimization in which promising harness variants are tested and refined
  * **Performance feedback**: measurement against benchmark tasks to identify which harness modifications improve results
  * **Variant exploration**: testing different prompt structures, instruction orderings, and behavioral constraints
  * **Convergence tracking**: monitoring improvement trajectories across successive optimization iterations

The ten-iteration progression shown in Terminal-Bench 2 demonstrates that systematic evolution can progressively improve agent performance, with cumulative gains emerging from incremental refinements (([[https://cobusgreyling.substack.com/p/auto-agentic-harness-engineering|Greyling - Auto-Agentic Harness Engineering (2026)]])). A minimal sketch of such a loop follows.
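To make the loop concrete, the following is a minimal Python sketch of a hill-climbing evolution loop over a system prompt. Everything here is hypothetical illustration: the source does not publish Terminal-Bench 2's evolution code, ''evaluate'' merely simulates success rates, and the baseline prompt and mutation fragments are invented placeholders.

<code python>
import random

# Hypothetical baseline system prompt standing in for a hand-crafted harness.
BASELINE_PROMPT = "You are a coding agent. Execute terminal tasks step by step."

# Invented instruction fragments a mutation step might splice in
# ("variant exploration" in the terminology above).
FRAGMENTS = [
    "Verify each command's exit code before proceeding.",
    "Re-read error output and retry with a corrected command.",
    "Summarize the remaining work before each multi-step action.",
]

def evaluate(prompt: str) -> float:
    """Stand-in scorer: a real harness would run the agent with this system
    prompt on every benchmark task and return the fraction solved."""
    rng = random.Random(prompt)           # deterministic per prompt variant
    return 0.65 + 0.15 * rng.random()     # simulated success rate

def mutate(prompt: str, rng: random.Random) -> str:
    """One illustrative mutation operator: append an extra instruction."""
    return prompt + " " + rng.choice(FRAGMENTS)

def evolve(prompt: str, iterations: int = 10) -> tuple[str, float]:
    """Accept-if-better loop: test one variant per iteration and keep
    whichever prompt has scored best so far ("performance feedback")."""
    rng = random.Random(0)
    best_prompt, best_score = prompt, evaluate(prompt)
    for i in range(1, iterations + 1):
        candidate = mutate(best_prompt, rng)
        score = evaluate(candidate)
        if score > best_score:            # keep a variant only if it improves
            best_prompt, best_score = candidate, score
        print(f"iteration {i}: best success rate {best_score:.3f}")
    return best_prompt, best_score

if __name__ == "__main__":
    evolve(BASELINE_PROMPT)
</code>

The printed per-iteration best score corresponds to the "convergence tracking" described above. A production system would likely explore many variants per iteration and use richer mutation operators, but this accept-if-better skeleton conveys the basic idea.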
===== Applications and Implications =====

Terminal-Bench 2 has direct relevance for several areas of AI development:

  * **Prompt optimization research**: the benchmark provides empirical evidence that automated harness engineering can outperform manual prompt design, suggesting new directions for prompt optimization research
  * **Agent development practices**: results indicate that organizations developing coding agents should consider automated harness optimization alongside model selection and architecture choices
  * **Baseline comparison**: the benchmark establishes empirical performance targets against which future [[coding_agent|coding agent]] systems can be measured
  * **Meta-prompt engineering**: the approach demonstrates that the prompts guiding agent behavior can themselves be systematically optimized, rather than treated as fixed manual constructs

===== Challenges and Limitations =====

While Terminal-Bench 2 demonstrates measurable gains from automated harness optimization, several considerations remain:

  * **Task scope**: the benchmark focuses specifically on terminal-based coding tasks; generalization to other agent domains (web interaction, file systems, API usage) requires further validation
  * **Computational cost**: iterative harness evolution requires multiple evaluation cycles, creating computational overhead compared to static prompt selection
  * **Transferability**: harnesses optimized on Terminal-Bench 2 may not transfer effectively to different terminal environments, operating systems, or command sets
  * **Baseline selection**: performance gains are measured relative to hand-crafted baselines; comparison with other automated optimization approaches would provide additional context

===== See Also =====

  * [[terminal_bench|Terminal-Bench]]
  * [[codex_cli|Codex-CLI]]
  * [[swe_bench|SWE-Bench]]
  * [[claude_opus_4_6_forgecode_vs_capy|Claude Opus 4.6 with ForgeCode vs Capy]]
  * [[core_bench|CORE-Bench]]

===== References =====