Codex-CLI is a hand-crafted baseline coding-agent system evaluated on Terminal-Bench 2, a benchmark for assessing autonomous coding-agent capabilities. The system achieved a 71.9% success rate on this benchmark, serving as a comparison point for evaluating more sophisticated agent architectures and automatically evolved harness systems. 1)
Codex-CLI represents a manually engineered approach to terminal-based coding tasks. As a reference implementation, it establishes a performance floor against which more complex or automatically optimized coding agents can be measured. Its 71.9% result on Terminal-Bench 2 provides a concrete point of comparison for different agent design philosophies and optimization strategies.
The system functions within the context of autonomous coding-agent evaluation, where benchmarks like Terminal-Bench 2 measure the ability of AI systems to execute terminal commands, interpret outputs, and navigate complex coding workflows without human intervention. 2)
Codex-CLI serves as a critical comparison point in agent evaluation studies. When measured against automatically evolved harness systems, the hand-crafted baseline reached 71.9% on Terminal-Bench 2, while automatically generated variants reached 77.0% on the same benchmark 3), a 5.1-percentage-point improvement from evolutionary optimization techniques.
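For concreteness, the gap between the two systems can be expressed both in absolute percentage points and as a relative improvement over the baseline (a quick sketch using only the two reported success rates):

```python
# Reported Terminal-Bench 2 success rates.
baseline = 71.9   # hand-crafted Codex-CLI baseline (%)
evolved = 77.0    # automatically evolved harness (%)

# Absolute gap, in percentage points.
delta_pp = evolved - baseline
# Relative improvement over the baseline.
rel_gain = delta_pp / baseline * 100

print(f"{delta_pp:.1f} pp absolute, {rel_gain:.1f}% relative")
```

The 5.1-point absolute gap corresponds to roughly a 7% relative gain over the hand-crafted baseline.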
This comparison illustrates a broader trend in AI research where manual engineering baselines are increasingly being supplemented or superseded by automatically optimized systems. The performance delta between the hand-crafted and evolved approaches suggests that systematic optimization of agent harness architectures can yield meaningful improvements in coding task execution.
As a baseline coding agent, Codex-CLI likely incorporates fundamental components common to terminal-based AI agents: integration with command-line interfaces, the ability to parse and interpret terminal output, and sequential decision-making for multi-step coding tasks. The hand-crafted nature of the system implies careful manual design of control flow, error handling, and task decomposition strategies.
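Codex-CLI's internals are not documented here, but the components listed above suggest a familiar execute-observe-decide control loop. The sketch below is purely illustrative: the `propose_command` stub, the transcript format, and the stopping logic are assumptions, not the actual implementation.

```python
import subprocess

def propose_command(task, transcript):
    """Stand-in for a model call that picks the next shell command.

    Here it just replays a scripted plan; a real harness would query
    an LLM with the task description and the transcript so far.
    """
    plan = ["echo start", "echo done"]
    return plan[len(transcript)] if len(transcript) < len(plan) else None

def run_agent(task, max_steps=10):
    transcript = []  # (command, exit_code, output) tuples
    for _ in range(max_steps):
        cmd = propose_command(task, transcript)
        if cmd is None:  # the model signals task completion
            break
        # Execute in a shell and capture combined output for the next step.
        proc = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        transcript.append((cmd, proc.returncode, proc.stdout + proc.stderr))
    return transcript

history = run_agent("print two lines")
for cmd, code, out in history:
    print(f"$ {cmd} -> exit {code}: {out.strip()}")
```

The `max_steps` cap and the transcript fed back into each decision reflect the sequential, output-driven nature of terminal tasks described above.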
The distinction between Codex-CLI and automatically evolved harness systems highlights different approaches to agent engineering. Hand-crafted systems rely on domain expertise and manual optimization, while evolved systems use algorithmic techniques to discover effective architectures without explicit human design of all components.
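The evolved-harness idea can be illustrated with a toy random-mutation search over a harness configuration. Everything here is a hypothetical stand-in: the config fields, the mutation operator, and especially the `fitness` function, which a real system would replace with benchmark accuracy measured on held-out tasks.

```python
import random

random.seed(0)  # deterministic toy run

# A toy "harness configuration": the kind of knobs a search might tune.
def random_config():
    return {"max_steps": random.randint(5, 50),
            "retry_on_error": random.choice([True, False]),
            "context_lines": random.randint(10, 200)}

def mutate(cfg):
    """Randomly perturb one field of the configuration."""
    child = dict(cfg)
    key = random.choice(list(child))
    if key == "retry_on_error":
        child[key] = not child[key]
    else:
        child[key] = max(1, child[key] + random.randint(-5, 5))
    return child

def fitness(cfg):
    """Stand-in for benchmark accuracy; a real search would score each
    candidate harness by running it against actual benchmark tasks."""
    score = -abs(cfg["max_steps"] - 30) - abs(cfg["context_lines"] - 120) / 10
    return score + (5 if cfg["retry_on_error"] else 0)

# A (1+1) evolution strategy: keep the mutant only if it is no worse.
start = random_config()
best = start
for _ in range(200):
    child = mutate(best)
    if fitness(child) >= fitness(best):
        best = child

print("start:", start, fitness(start))
print("best: ", best, fitness(best))
```

Because a candidate is accepted only when it scores at least as well as the incumbent, the final configuration is never worse than the starting one — the same monotone-improvement property that lets an evolved harness overtake a hand-crafted baseline.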
Performance evaluation on Terminal-Bench 2 involves assessing the system's ability to handle real-world terminal operations, including command execution, output interpretation, and conditional task branching. The 71.9% success rate indicates that while Codex-CLI handles a substantial majority of benchmark tasks, significant edge cases or complex workflows remain challenging.
Such baseline systems are valuable in the AI research pipeline for establishing performance floors, enabling fair comparison of novel techniques, and identifying specific failure modes that more advanced approaches should address. The publicly available benchmark comparison provides transparency in agent capability assessment.