The MirrorCode Benchmark is a software engineering evaluation framework designed to test AI agents' ability to reimplement complex software systems using only command-line access and test cases as guidance. Developed by METR and Epoch, it measures an AI model's capacity to reverse-engineer and reconstruct functional software without access to source code, a task that typically requires weeks of human engineering effort. METR, an AI measurement organization focused on evaluating the capabilities and risks of frontier AI models, collaborates on this benchmark to assess AI performance on long-horizon, autonomous software engineering tasks.
Unlike traditional code-understanding benchmarks, MirrorCode tests whether AI agents can reverse-engineer and recreate source code by observing program behavior, outputs, and test cases, simulating a realistic software engineering challenge. The benchmark constrains agents to execute-only access to target programs, forcing them to deduce implementation details through black-box testing rather than direct source inspection. This mirrors real-world scenarios where engineers must understand and replicate software behavior without access to the original source.
The benchmark operates by providing AI agents with execute-only, command-line access to a target program, together with the test cases that a reimplementation must pass.
Agents must analyze the tool's behavior through experimentation, infer its logic, and then write working reimplementations that pass all test cases. This tests multiple capabilities simultaneously: program synthesis, system design, debugging, and long-horizon reasoning.
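The experimentation step can be sketched minimally in Python. This is an assumed illustration, not the benchmark's actual harness: the `probe` and `gather_observations` helpers are hypothetical, and `/bin/echo` stands in for the hidden, execute-only binary.

```python
import subprocess

# Hypothetical stand-in for the execute-only target; in the benchmark the
# agent would know only the binary's path, never its source code.
TARGET = "/bin/echo"

def probe(args, stdin_text=""):
    """Run the black-box target once and capture its observable behavior."""
    result = subprocess.run(
        [TARGET, *args], input=stdin_text, capture_output=True, text=True
    )
    return result.returncode, result.stdout

def gather_observations(input_sets):
    """Collect input -> (exit code, stdout) pairs. This behavioral record
    is all the agent has to guide its reimplementation."""
    return {tuple(args): probe(args) for args in input_sets}

obs = gather_observations([["hello"], ["a", "b"]])
```

In practice an agent would iterate: probe with chosen inputs, form a hypothesis about the tool's logic, write a candidate reimplementation, and compare its outputs against the recorded observations until the test cases pass.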
The MirrorCode suite spans a diverse range of software domains.
This diversity ensures the benchmark evaluates generalization across different problem types rather than optimization for a single domain.
Agents are evaluated on both functional correctness (whether the reimplemented code passes all test cases) and implementation quality (code clarity, efficiency, and maintainability).
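As a rough illustration of the functional-correctness measure, pass rate over a test suite can be computed as below. This is an assumed sketch, not METR's actual scoring code; the token-sorting tool and its test cases are invented for the example.

```python
def score(candidate_fn, test_cases):
    """Fraction of (input, expected_output) cases the candidate passes."""
    passed = sum(1 for inp, expected in test_cases if candidate_fn(inp) == expected)
    return passed / len(test_cases)

# Invented example: suppose the hidden tool sorts whitespace-separated tokens.
def candidate(line):
    return " ".join(sorted(line.split()))

cases = [("b a", "a b"), ("3 1 2", "1 2 3"), ("x", "x")]
print(score(candidate, cases))  # a correct reimplementation passes every case
```

Implementation quality (clarity, efficiency, maintainability) is harder to automate and would require separate review criteria beyond a pass-rate metric like this one.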
MirrorCode is significant for several reasons. It evaluates practical software engineering capabilities beyond pattern matching or code completion, assessing whether AI systems can perform genuine software archaeology and reverse-engineering. The benchmark also reveals gaps in programming autonomy, as reimplementation requires deeper understanding than generating code from a specification.
The benchmark demonstrates that modern large language models (LLMs) can handle multi-week engineering tasks autonomously, with important implications for evaluating the capabilities and risks of frontier AI models.
Unlike isolated programming tasks, MirrorCode requires agents to maintain context, iterate on failures, and apply reasoning across multiple hours of model computation—characteristics essential for practical AI agents. Results provide insights into AI readiness for complex, open-ended software engineering tasks that require problem decomposition, debugging, and iterative refinement.