Core Concepts
Reasoning
Memory & Retrieval
Agent Types
Design Patterns
Training & Alignment
Frameworks
Tools
Safety & Security
Evaluation
Meta
Theory of Code Space (ToCS) is a benchmark introduced by Grigory Sapunov from Intento (arXiv:2603.00601) that evaluates whether AI code agents can construct, maintain, and update coherent architectural beliefs during codebase exploration. Inspired by the Theory of Space benchmark for spatial reasoning in multimodal models, ToCS transplants the concept of cognitive map building to software engineering. Agents explore procedurally generated Python codebases under partial observability and must externalize their architectural understanding as structured JSON at periodic intervals, producing a time-series of belief states rather than a single final snapshot.
Current code benchmarks (SWE-bench, HumanEval, MBPP) test output correctness: does the patch compile, does the test pass? None measure whether an agent understands how modules relate, where architectural boundaries lie, or what invariants must be preserved. ToCS addresses this gap by directly probing the agent's evolving internal model of software architecture.
When human developers navigate large codebases, they build mental models of module dependencies, data flow patterns, and design constraints. They update these models as they read more files. ToCS measures whether code agents can do the same.
ToCS makes three key design choices:
1. Procedural Codebase Generation: Medium-complexity Python projects are generated with controlled structure, featuring four typed edge categories (among them plain import and from X import Y statements). Each generated codebase includes planted architectural constraints (invariants) with verified ground truth.
2. Partial Observability Harness: Agents explore under a file budget, opening one file at a time. This ensures that exploration strategy matters and prevents trivial solutions from simply reading everything.
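A harness enforcing such a file budget could be sketched as follows. The paper does not publish its harness code, so the class and method names here are assumptions, not the actual ToCS implementation:

```python
from pathlib import Path


class ExplorationHarness:
    """Minimal sketch of a partial-observability harness with a file budget.

    Hypothetical API for illustration -- not the actual ToCS harness.
    """

    def __init__(self, codebase_root: str, file_budget: int):
        self.root = Path(codebase_root)
        self.budget = file_budget
        self.opened: list[str] = []

    def list_files(self) -> list[str]:
        # The agent can always see the file tree, but not file contents.
        return sorted(
            str(p.relative_to(self.root)) for p in self.root.rglob("*.py")
        )

    def open_file(self, rel_path: str) -> str:
        # Each open consumes budget, so exploration strategy matters.
        if self.budget <= 0:
            raise RuntimeError("file budget exhausted")
        self.budget -= 1
        self.opened.append(rel_path)
        return (self.root / rel_path).read_text()
```

Because only the file tree is free, the agent must decide which files are most informative to open next rather than reading everything.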
3. Periodic Belief Probing: Agents periodically externalize their belief state as structured JSON, including component status (observed/inferred/unknown), exported symbols, typed dependency edges with confidence scores, discovered invariants, and unexplored regions.
# Illustration of the ToCS belief state structure
belief_state = {
    "components": {
        "auth_module": {
            "status": "observed",
            "files": ["auth/handler.py", "auth/tokens.py"],
            "exports": ["authenticate", "refresh_token"]
        },
        "database_layer": {
            "status": "inferred",
            "files": ["db/connector.py"],
            "exports": ["query", "execute"]
        }
    },
    "dependencies": [
        {"source": "auth_module", "target": "database_layer",
         "type": "runtime_dispatch", "confidence": 0.7,
         "evidence": "dynamic call in handler.py:42"}
    ],
    "invariants": [
        {"type": "forbidden_dependency",
         "description": "auth_module must not import ui_layer directly",
         "confidence": 0.9}
    ],
    "unexplored": ["utils/", "config/"],
    "budget_remaining": 5
}
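A natural way to score such a belief state against the planted ground truth is edge-level precision, recall, and F1 over the typed dependency edges. The sketch below is one plausible scoring rule, not necessarily the paper's exact metric:

```python
def edge_f1(predicted: list, ground_truth: list) -> float:
    """Sketch of edge-level F1 over typed dependency edges.

    An edge counts as correct only when source, target, and type all
    match the ground truth. (Assumed scoring rule for illustration.)
    """
    pred = {(e["source"], e["target"], e["type"]) for e in predicted}
    gold = {(e["source"], e["target"], e["type"]) for e in ground_truth}
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    precision = tp / len(pred)
    recall = tp / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```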
Experiments with four baselines and six frontier LLMs (GPT-5.3-Codex, Gemini, Claude Sonnet 4.6) reveal three model-dependent phenomena:
The Active-Passive Gap measures the difference between an agent's architectural understanding when actively exploring (choosing which files to open) versus passively receiving all files at once.
<latex>
\text{APG} = F_1^{\text{passive}} - F_1^{\text{active}}
</latex>
This gap is model-dependent: GPT builds better maps through active exploration than from seeing all files at once (negative APG, active exceeds passive), while Gemini shows the opposite pattern, revealing that active exploration is itself a non-trivial capability that some models lack.
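The sign convention can be checked with hypothetical scores (these numbers are illustrative, not the paper's reported results):

```python
def active_passive_gap(f1_passive: float, f1_active: float) -> float:
    # APG = F1_passive - F1_active; a negative APG means the model builds
    # a better map through active exploration than from passive reading.
    return f1_passive - f1_active


# Hypothetical scores for illustration only:
gpt_like = active_passive_gap(f1_passive=0.55, f1_active=0.62)     # negative
gemini_like = active_passive_gap(f1_passive=0.60, f1_active=0.48)  # positive
```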
Retaining structured belief map JSON in context between probes acts as self-scaffolding – external memory that helps the agent maintain coherent understanding. GPT benefits substantially (+14 F1 points), but Gemini shows minimal improvement. The probe itself becomes an active intervention that shapes subsequent exploration decisions.
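Mechanically, self-scaffolding amounts to re-injecting the previous probe's JSON into the next prompt so the agent updates its map instead of rebuilding it from scratch. The prompt wording below is an assumption, not the paper's:

```python
import json
from typing import Optional


def build_probe_prompt(task: str, prior_belief: Optional[dict] = None) -> str:
    """Sketch of self-scaffolding: carry the prior belief-state JSON
    forward in context. (Hypothetical prompt template.)"""
    parts = [task]
    if prior_belief is not None:
        parts.append("Your previous belief state was:")
        parts.append(json.dumps(prior_belief, indent=2))
        parts.append("Update it based on the files you have read since.")
    parts.append("Output the updated belief state as JSON.")
    return "\n\n".join(parts)
```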
A smaller model maintains perfectly stable beliefs across probes, while its larger counterpart suffers catastrophic belief collapse – forgetting previously discovered components between consecutive probes. This phenomenon is invisible in final-snapshot evaluations and can only be detected through time-series analysis of belief states.
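Belief collapse of this kind is detectable by diffing consecutive probes in the time-series. A sketch of such a check (hypothetical helper, not from the paper):

```python
def forgotten_components(probe_series: list) -> list:
    """For each consecutive pair of belief probes, return the set of
    components previously marked 'observed' that have since vanished --
    the signature of belief collapse between probes."""
    drops = []
    for prev, curr in zip(probe_series, probe_series[1:]):
        prev_observed = {
            name for name, comp in prev["components"].items()
            if comp.get("status") == "observed"
        }
        drops.append(prev_observed - set(curr["components"]))
    return drops
```

A final-snapshot evaluation would only see the last probe; this per-pair diff is what makes the collapse visible.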
ToCS reveals that current code agents lack fundamental architectural reasoning capabilities: