Terminal-Bench 2 is a specialized benchmark designed to evaluate coding agents together with the system prompts and execution scaffolding, known as “harnesses,” that guide them through command-line programming tasks. The benchmark measures how effectively automated harness-engineering techniques can improve agent performance compared to manually designed baselines.
Terminal-Bench 2 serves as a quantitative evaluation framework for assessing coding agent capabilities in terminal-based environments. Rather than focusing solely on agent architecture or model capabilities, the benchmark specifically measures the effectiveness of harness evolution—the automated optimization of system prompts and execution frameworks that guide agent behavior. This represents a shift from traditional static prompt design toward dynamic, learned prompt optimization 1).
The benchmark evaluates agents on their ability to understand and execute complex terminal commands, parse output, handle errors, and maintain context across multi-step programming tasks. Performance is measured through success rates on representative coding challenges executed in terminal environments.
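Performance on such a benchmark reduces to a success rate over a task set. The sketch below is purely illustrative (not the official Terminal-Bench 2 harness); the `TaskResult` type and the example task names are hypothetical:

```python
# Illustrative sketch: scoring an agent by its success rate over a set of
# terminal tasks. Names and task IDs are hypothetical.
from dataclasses import dataclass


@dataclass
class TaskResult:
    task_id: str
    succeeded: bool  # did the agent's final state pass the task's checks?


def success_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks the agent completed successfully."""
    if not results:
        return 0.0
    return sum(r.succeeded for r in results) / len(results)


# Hypothetical results for a three-task run:
results = [
    TaskResult("fix-build", True),
    TaskResult("parse-logs", True),
    TaskResult("migrate-db", False),
]
print(f"success rate: {success_rate(results):.1%}")  # → 66.7%
```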
Terminal-Bench 2 demonstrates significant improvements through automated harness optimization. The benchmark shows that automatically evolved harnesses improved baseline performance from 69.7% to 77.0% across ten iterations of optimization, representing a 7.3 percentage point improvement 2).
This improvement exceeds the performance of hand-crafted baseline systems, notably outperforming Codex-CLI, a manually designed baseline that achieved 71.9% accuracy. The iterative evolution approach demonstrates that systematic optimization of prompt engineering can yield measurable gains in agent effectiveness, suggesting that manually designed harnesses may leave significant performance on the table.
The benchmark evaluates harness evolution—an automated approach to optimizing the system prompts and execution strategies that guide agent behavior. Rather than relying on expert-crafted prompts, harness evolution uses iterative techniques to automatically refine how agents approach terminal-based tasks. This process typically involves:
* Iteration cycles: Multiple rounds of optimization where promising harness variants are tested and refined
* Performance feedback: Measurement against benchmark tasks to identify which harness modifications improve results
* Variant exploration: Testing different prompt structures, instruction orderings, and behavioral constraints
* Convergence tracking: Monitoring improvement trajectories across successive optimization iterations
The ten-iteration progression shown in Terminal-Bench 2 demonstrates that systematic evolution can progressively improve agent performance, with cumulative gains emerging from incremental refinements 3).
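The cycle described above can be sketched as a simple hill-climbing loop. This is a minimal illustration under assumed interfaces (`mutate` proposes a prompt variant, `evaluate` returns a benchmark score), not the actual Terminal-Bench 2 evolution procedure:

```python
# Minimal sketch of a harness-evolution loop. `mutate` and `evaluate` are
# assumed callables: one proposes a variant, the other scores it on the
# benchmark. Real systems may evolve populations rather than a single prompt.
def evolve_harness(base_prompt, mutate, evaluate, iterations=10):
    """Keep the best-scoring harness variant across successive iterations."""
    best_prompt, best_score = base_prompt, evaluate(base_prompt)
    history = [best_score]
    for _ in range(iterations):
        candidate = mutate(best_prompt)   # variant exploration
        score = evaluate(candidate)       # performance feedback
        if score > best_score:            # keep only improvements
            best_prompt, best_score = candidate, score
        history.append(best_score)        # convergence tracking
    return best_prompt, history
```

Because only improvements are retained, the score history is non-decreasing, mirroring the cumulative gains reported across the ten optimization iterations.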
Terminal-Bench 2 has direct relevance for several areas in AI development:
* Prompt optimization research: The benchmark provides empirical evidence that automated harness engineering outperforms manual prompt design, suggesting new directions for prompt optimization research
* Agent development practices: Results indicate that organizations developing coding agents should consider automated harness optimization alongside model selection and architecture choices
* Baseline comparison: The benchmark establishes empirical performance targets that future coding agent systems can be measured against
* Meta-prompt engineering: The approach demonstrates that the prompts guiding agent behavior can themselves be systematically optimized, rather than treated as fixed manual constructs
While Terminal-Bench 2 demonstrates improvements in automated harness optimization, several considerations remain relevant:
* Task scope: The benchmark focuses specifically on terminal-based coding tasks, and generalization to other agent domains (web interaction, file systems, API usage) requires further validation
* Computational cost: Iterative harness evolution requires multiple evaluation cycles, creating computational overhead compared to static prompt selection
* Transferability: Harnesses optimized on Terminal-Bench 2 may not transfer effectively to different terminal environments, operating systems, or command sets
* Baseline selection: Performance gains are measured relative to hand-crafted baselines; comparison with other automated optimization approaches would provide additional context