Capy is an alternative code execution harness designed for evaluating large language model (LLM) performance on code generation and execution tasks. It is one of several frameworks for assessing how well language models can produce, understand, and execute code across different computational environments.
Capy functions as a code execution harness: a testing framework that enables systematic evaluation of language models' ability to generate and execute code. Unlike execution environments optimized for a particular model architecture or inference stack, Capy takes a more generalized approach to code execution evaluation 1).
The harness is particularly notable for demonstrating how strongly model performance can depend on the optimization level of the execution environment. Testing with Claude Opus 4.6 showed that a non-optimized harness can produce measurable performance degradation relative to specialized implementations: reported accuracy on Terminal-Bench 2.0 reached 75.3%, a result that highlights the sensitivity of code execution tasks to infrastructure-level design choices 2).
As a code execution harness, Capy operates by the following steps; a minimal code sketch follows the list:

* Accepting code generation outputs from language models
* Executing the generated code in a controlled environment
* Comparing execution results against expected outputs
* Providing quantitative performance metrics
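Capy's actual interfaces are not documented here, so the following is only a minimal Python sketch of this generate-execute-compare loop; the `Task` dataclass, `run_generated_code`, and `evaluate` names, and the stdout-equality scoring rule, are all hypothetical.

```python
import subprocess
import tempfile
from dataclasses import dataclass

@dataclass
class Task:
    """One evaluation task: a prompt plus the output expected from running the code."""
    prompt: str
    expected_output: str

def run_generated_code(code: str, timeout: float = 30.0) -> str:
    """Execute model-generated Python code in a subprocess and capture stdout."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run(
        ["python", path], capture_output=True, text=True, timeout=timeout
    )
    return result.stdout.strip()

def evaluate(tasks: list[Task], generate) -> float:
    """Score a model (the `generate` callable) on a task list; return the pass rate."""
    passed = 0
    for task in tasks:
        code = generate(task.prompt)           # model produces code for the prompt
        try:
            output = run_generated_code(code)  # controlled execution
        except subprocess.TimeoutExpired:
            continue                           # timeouts count as failures
        if output == task.expected_output:     # compare against the expected output
            passed += 1
    return passed / len(tasks)
```

A production harness would add sandboxing, resource limits, and richer scoring, but this control flow is the core pattern shared by most code execution harnesses.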
The framework appears to support evaluation on Terminal-Bench 2.0, a benchmark focused on terminal and command-line interface execution tasks. This suggests Capy is designed to handle evaluation scenarios involving shell commands, system operations, and command-line tool interactions, areas where precise code generation and execution are critical for model assessment 3).
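Terminal-Bench defines its own task format, which is not reproduced here; the sketch below only illustrates the state-based checking pattern common to terminal benchmarks, where a verification command (rather than stdout matching) decides success. The `run_shell_task` helper and the example task are hypothetical.

```python
import subprocess

def run_shell_task(command: str, check: str, timeout: float = 60.0) -> bool:
    """Run a model-proposed shell command, then a verification command.

    `command` is the model's output; `check` is a shell snippet that exits 0
    only if the task's post-conditions hold in the resulting system state.
    """
    subprocess.run(["bash", "-c", command], capture_output=True, timeout=timeout)
    verdict = subprocess.run(["bash", "-c", check], capture_output=True, timeout=timeout)
    return verdict.returncode == 0

# Hypothetical task: the model must create a small directory tree.
ok = run_shell_task(
    command="mkdir -p project/src && touch project/src/main.py",
    check="test -f project/src/main.py",
)
print("task passed" if ok else "task failed")
```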
A significant finding regarding Capy concerns the relationship between harness optimization and measured model performance. The 75.3% Terminal-Bench 2.0 accuracy recorded for Claude Opus 4.6 suggests that a non-optimized code execution harness can constrain measured results below a model's potential maximum 4).
This observation underscores an important principle in LLM evaluation: the choice of execution framework and infrastructure can meaningfully affect measured performance. Optimized harnesses, tuned for a particular model architecture or its inference requirements, may enable higher accuracy than more generalized approaches. The distinction is relevant for researchers and practitioners comparing code execution capabilities across models, since harness-level differences can confound direct performance comparisons.
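For example, something as small as the harness's output-comparison policy can shift measured accuracy. The toy illustration below (not Capy's actual logic) shows the same model output passing under a lenient comparator and failing under a strict one.

```python
def strict_match(actual: str, expected: str) -> bool:
    """Byte-for-byte comparison: any trailing whitespace or newline difference fails."""
    return actual == expected

def normalized_match(actual: str, expected: str) -> bool:
    """Lenient comparison: ignore surrounding whitespace and trailing spaces per line."""
    def norm(s: str) -> str:
        return "\n".join(line.rstrip() for line in s.strip().splitlines())
    return norm(actual) == norm(expected)

# The same model output scores differently under the two harness policies.
actual, expected = "42 \n", "42"
print(strict_match(actual, expected))      # False -> counted as a failure
print(normalized_match(actual, expected))  # True  -> counted as a pass
```

Timeouts, retry policies, and environment setup can introduce analogous discrepancies, which is why cross-harness accuracy numbers should be compared with care.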
Code execution harnesses like Capy are employed in several contexts:
* Model evaluation: Assessing code generation quality and correctness across standardized benchmarks
* Comparative analysis: Understanding relative performance differences between models
* Infrastructure optimization: Testing how execution environment design affects measured capabilities
* Benchmark development: Supporting the creation of reproducible code execution evaluation standards
The ability to systematically evaluate code execution performance is particularly important as language models increasingly handle programming tasks, code review, and automated code generation in production environments.
Code execution evaluation represents a specialized area within broader LLM benchmarking. Other evaluation approaches focus on different dimensions of model capability, from general knowledge tasks to reasoning and multi-step problem solving. The emergence of harnesses like Capy reflects the field's recognition that code execution—being deterministic and measurable—requires specialized evaluation methodologies distinct from other capability assessments.