====== Theory of Code Space ======

Theory of Code Space (ToCS) is a benchmark introduced by Grigory Sapunov of Intento (arXiv:2603.00601) that evaluates whether AI code agents can construct, maintain, and update coherent architectural beliefs during codebase exploration. Inspired by the Theory of Space benchmark for spatial reasoning in multimodal models, ToCS transplants the concept of cognitive-map building to software engineering. Agents explore procedurally generated Python codebases under partial observability and must externalize their architectural understanding as structured JSON at periodic intervals, producing a time series of belief states rather than a single final snapshot.

===== Motivation: Beyond Code Generation =====

Current code benchmarks (SWE-bench, HumanEval, MBPP) test output correctness: does the patch compile, does the test pass? None of them measures whether an agent understands how modules relate, where architectural boundaries lie, or what invariants must be preserved. ToCS addresses this gap by directly probing the agent's evolving internal model of software architecture.

When human developers navigate large codebases, they build mental models of module dependencies, data-flow patterns, and design constraints, and they update those models as they read more files. ToCS measures whether code agents can do the same.

===== Benchmark Design =====

ToCS makes three key design choices:

**1. Procedural Codebase Generation**: Medium-complexity Python projects are generated with controlled structure, featuring four typed edge categories:

  * **Syntactic imports**: direct ''import'' and ''from X import Y'' statements
  * **Runtime dispatch**: dynamic method calls resolved at runtime
  * **Data flow dependencies**: shared data structures passed between modules
  * **Config-driven dynamic wiring**: dependencies specified in configuration files, not in code

Each generated codebase includes planted architectural constraints (invariants) with verified ground truth.

**2. Partial Observability Harness**: Agents explore under a file budget, opening one file at a time. This ensures that exploration strategy matters and prevents the trivial solution of simply reading everything.

**3. Periodic Belief Probing**: Agents periodically externalize their belief state as structured JSON, including component status (observed/inferred/unknown), exported symbols, typed dependency edges with confidence scores, discovered invariants, and unexplored regions.

<code python>
# Illustration of the ToCS belief-state structure
belief_state = {
    "components": {
        "auth_module": {
            "status": "observed",
            "files": ["auth/handler.py", "auth/tokens.py"],
            "exports": ["authenticate", "refresh_token"]
        },
        "database_layer": {
            "status": "inferred",
            "files": ["db/connector.py"],
            "exports": ["query", "execute"]
        }
    },
    "dependencies": [
        {"source": "auth_module", "target": "database_layer",
         "type": "runtime_dispatch", "confidence": 0.7,
         "evidence": "dynamic call in handler.py:42"}
    ],
    "invariants": [
        {"type": "forbidden_dependency",
         "description": "auth_module must not import ui_layer directly",
         "confidence": 0.9}
    ],
    "unexplored": ["utils/", "config/"],
    "budget_remaining": 5
}
</code>

===== Key Findings =====

Experiments with four baselines and six frontier LLMs (including GPT-5.3-Codex, Gemini, and Claude Sonnet 4.6) reveal three model-dependent phenomena:

==== 1. The Active-Passive Gap ====

The Active-Passive Gap (APG) measures the difference between an agent's architectural understanding when actively exploring (choosing which files to open) and when passively receiving all files at once:

  \text{APG} = F_1^{\text{passive}} - F_1^{\text{active}}

This gap is **model-dependent**: GPT builds better maps through active exploration than from seeing all files at once (negative APG: active exceeds passive), while Gemini shows the opposite pattern. Active exploration is thus itself a non-trivial capability that some models lack.

==== 2. Self-Scaffolding via Belief Maps ====

Retaining the structured belief-map JSON in context between probes acts as self-scaffolding: external memory that helps the agent maintain a coherent understanding. GPT benefits substantially (+14 F1 points), but Gemini shows minimal improvement. The probe itself thus becomes an active intervention that shapes subsequent exploration decisions.

==== 3. Catastrophic Belief Collapse ====

A smaller model maintains perfectly stable beliefs across probes, while its larger counterpart suffers catastrophic belief collapse, forgetting previously discovered components between consecutive probes. This phenomenon is invisible in final-snapshot evaluations and can only be detected through time-series analysis of belief states.

===== Metrics =====

  * **Dependency F1**: exact matching of source-target-type edges against ground truth, requiring file-level specificity
  * **Architectural Constraint Discovery**: detection of planted invariants (forbidden dependencies, validation chains, etc.)
  * **Belief Stability**: consistency of belief states across consecutive probes

===== Implications for Code Agent Development =====

ToCS reveals that current code agents lack fundamental architectural reasoning capabilities:

  * Models that excel at code generation may fail at architectural comprehension
  * Active exploration capability is not guaranteed even in frontier models
  * Belief persistence across long contexts is unreliable
  * Time-series evaluation catches failure modes that are invisible to snapshot benchmarks

===== References =====

  * [[https://arxiv.org/abs/2603.00601|Sapunov, "Theory of Code Space: Do Code Agents Understand Software Architecture?" arXiv:2603.00601, 2026]]
  * [[https://github.com/che-shr-cat/tocs|ToCS Open-Source Repository]]
  * [[https://gonzoml.substack.com/p/do-code-agents-actually-understand|"Do Code Agents Actually Understand the Code They're Working With?" - GonzoML Blog]]

===== See Also =====

  * [[world_of_workflows_benchmark|World of Workflows Benchmark]]
  * [[instruction_following_evaluation|Instruction Following Evaluation (IF-CRITIC)]]
  * [[personalized_agents_human_feedback|Personalized Agents from Human Feedback (PAHF)]]
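===== Example: Dependency F1 Scoring =====

The exact-match edge scoring described in the Metrics section can be sketched in a few lines of Python. This is an illustrative assumption about how the score is computed, not the official harness code: the function name ''dependency_f1'' and the sample edges are hypothetical, though the edge fields mirror the belief-state structure shown earlier in this article.

<code python>
# Hypothetical sketch of ToCS Dependency F1: predicted dependency edges are
# matched exactly on (source, target, type) triples against ground truth.
# Names and data here are illustrative, not taken from the benchmark repo.

def dependency_f1(predicted, ground_truth):
    """Exact-match F1 over (source, target, type) dependency edges."""
    pred = {(e["source"], e["target"], e["type"]) for e in predicted}
    gold = {(e["source"], e["target"], e["type"]) for e in ground_truth}
    tp = len(pred & gold)  # edges correct in all three fields
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [
    {"source": "auth_module", "target": "database_layer", "type": "runtime_dispatch"},
    {"source": "ui_layer", "target": "auth_module", "type": "syntactic_import"},
]
pred = [
    {"source": "auth_module", "target": "database_layer", "type": "runtime_dispatch"},
    # Wrong direction on the second edge, so it does not match:
    {"source": "auth_module", "target": "ui_layer", "type": "syntactic_import"},
]
print(dependency_f1(pred, gold))  # 0.5
</code>

Because matching is exact, a reversed edge or a mislabeled edge type scores zero credit, which is why the metric rewards precise architectural beliefs rather than vague "these modules are related" claims.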