Core Concepts
Reasoning
Memory & Retrieval
Agent Types
Design Patterns
Training & Alignment
Frameworks
Tools
Safety & Security
Evaluation
Meta
Theory of Code Space (ToCS) is a benchmark introduced by Grigory Sapunov from Intento (arXiv:2603.00601) that evaluates whether AI code agents can construct, maintain, and update coherent architectural beliefs during codebase exploration. Inspired by the Theory of Space benchmark for spatial reasoning in multimodal models, ToCS transplants the concept of cognitive map building to software engineering. Agents explore procedurally generated Python codebases under partial observability and must externalize their architectural understanding as structured JSON at periodic intervals, producing a time-series of belief states rather than a single final snapshot.
Current code benchmarks (SWE-bench, HumanEval, MBPP) test output correctness: does the patch compile, does the test pass? None measure whether an agent understands how modules relate, where architectural boundaries lie, or what invariants must be preserved. ToCS addresses this gap by directly probing the agent's evolving internal model of software architecture.
When human developers navigate large codebases, they build mental models of module dependencies, data flow patterns, and design constraints. They update these models as they read more files. ToCS measures whether code agents can do the same.
ToCS makes three key design choices:
1. Procedural Codebase Generation: Medium-complexity Python projects are generated with controlled structure, featuring four typed edge categories (among them plain import and from X import Y statements). Each generated codebase includes planted architectural constraints (invariants) with verified ground truth.
2. Partial Observability Harness: Agents explore under a file budget, opening one file at a time. This ensures that exploration strategy matters and prevents trivial solutions from simply reading everything.
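A harness enforcing such a file budget could be sketched as follows. The paper does not publish its harness code, so the class and method names here are assumptions, not the actual ToCS implementation:

```python
from pathlib import Path


class ExplorationHarness:
    """Minimal sketch of a partial-observability harness with a file budget.

    Hypothetical API for illustration -- not the actual ToCS harness.
    """

    def __init__(self, codebase_root: str, file_budget: int):
        self.root = Path(codebase_root)
        self.budget = file_budget
        self.opened: list[str] = []

    def list_files(self) -> list[str]:
        # The agent can always see the file tree, but not file contents.
        return sorted(
            str(p.relative_to(self.root)) for p in self.root.rglob("*.py")
        )

    def open_file(self, rel_path: str) -> str:
        # Each open consumes budget, so exploration strategy matters.
        if self.budget <= 0:
            raise RuntimeError("file budget exhausted")
        self.budget -= 1
        self.opened.append(rel_path)
        return (self.root / rel_path).read_text()
```

Because only the file tree is free, the agent must decide which files are most informative to open next rather than reading everything.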
3. Periodic Belief Probing: Agents periodically externalize their belief state as structured JSON, including component status (observed/inferred/unknown), exported symbols, typed dependency edges with confidence scores, discovered invariants, and unexplored regions.
# Illustration of the ToCS belief state structure
belief_state = {
    "components": {
        "auth_module": {
            "status": "observed",
            "files": ["auth/handler.py", "auth/tokens.py"],
            "exports": ["authenticate", "refresh_token"]
        },
        "database_layer": {
            "status": "inferred",
            "files": ["db/connector.py"],
            "exports": ["query", "execute"]
        }
    },
    "dependencies": [
        {"source": "auth_module", "target": "database_layer",
         "type": "runtime_dispatch", "confidence": 0.7,
         "evidence": "dynamic call in handler.py:42"}
    ],
    "invariants": [
        {"type": "forbidden_dependency",
         "description": "auth_module must not import ui_layer directly",
         "confidence": 0.9}
    ],
    "unexplored": ["utils/", "config/"],
    "budget_remaining": 5
}
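A natural way to score such a belief state against the planted ground truth is edge-level precision, recall, and F1 over the typed dependency edges. The sketch below is one plausible scoring rule, not necessarily the paper's exact metric:

```python
def edge_f1(predicted: list, ground_truth: list) -> float:
    """Sketch of edge-level F1 over typed dependency edges.

    An edge counts as correct only when source, target, and type all
    match the ground truth. (Assumed scoring rule for illustration.)
    """
    pred = {(e["source"], e["target"], e["type"]) for e in predicted}
    gold = {(e["source"], e["target"], e["type"]) for e in ground_truth}
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    precision = tp / len(pred)
    recall = tp / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```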
Experiments with four baselines and six frontier LLMs (GPT-5.3-Codex, Gemini, Claude Sonnet 4.6) reveal three model-dependent phenomena:
The Active-Passive Gap measures the difference between an agent's architectural understanding when actively exploring (choosing which files to open) versus passively receiving all files at once.
<latex>
\text{APG} = F_1^{\text{passive}} - F_1^{\text{active}}
</latex>
This gap is model-dependent: GPT builds better maps through active exploration than from seeing all files at once (negative APG, active exceeds passive), while Gemini shows the opposite pattern, revealing that active exploration is itself a non-trivial capability that some models lack.
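The sign convention can be checked with hypothetical scores (these numbers are illustrative, not the paper's reported results):

```python
def active_passive_gap(f1_passive: float, f1_active: float) -> float:
    # APG = F1_passive - F1_active; a negative APG means the model builds
    # a better map through active exploration than from passive reading.
    return f1_passive - f1_active


# Hypothetical scores for illustration only:
gpt_like = active_passive_gap(f1_passive=0.55, f1_active=0.62)     # negative
gemini_like = active_passive_gap(f1_passive=0.60, f1_active=0.48)  # positive
```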
Retaining structured belief map JSON in context between probes acts as self-scaffolding – external memory that helps the agent maintain coherent understanding. GPT benefits substantially (+14 F1 points), but Gemini shows minimal improvement. The probe itself becomes an active intervention that shapes subsequent exploration decisions.
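Mechanically, self-scaffolding amounts to re-injecting the previous probe's JSON into the next prompt so the agent updates its map instead of rebuilding it from scratch. The prompt wording below is an assumption, not the paper's:

```python
import json
from typing import Optional


def build_probe_prompt(task: str, prior_belief: Optional[dict] = None) -> str:
    """Sketch of self-scaffolding: carry the prior belief-state JSON
    forward in context. (Hypothetical prompt template.)"""
    parts = [task]
    if prior_belief is not None:
        parts.append("Your previous belief state was:")
        parts.append(json.dumps(prior_belief, indent=2))
        parts.append("Update it based on the files you have read since.")
    parts.append("Output the updated belief state as JSON.")
    return "\n\n".join(parts)
```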
A smaller model maintains perfectly stable beliefs across probes, while its larger counterpart suffers catastrophic belief collapse – forgetting previously discovered components between consecutive probes. This phenomenon is invisible in final-snapshot evaluations and can only be detected through time-series analysis of belief states.
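Belief collapse of this kind is detectable by diffing consecutive probes in the time-series. A sketch of such a check (hypothetical helper, not from the paper):

```python
def forgotten_components(probe_series: list) -> list:
    """For each consecutive pair of belief probes, return the set of
    components previously marked 'observed' that have since vanished --
    the signature of belief collapse between probes."""
    drops = []
    for prev, curr in zip(probe_series, probe_series[1:]):
        prev_observed = {
            name for name, comp in prev["components"].items()
            if comp.get("status") == "observed"
        }
        drops.append(prev_observed - set(curr["components"]))
    return drops
```

A final-snapshot evaluation would only see the last probe; this per-pair diff is what makes the collapse visible.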
ToCS reveals that current code agents lack fundamental architectural reasoning capabilities: