Repository-Centric Learning

Repository-Centric Learning (RCL) is a training paradigm for small language models that prioritizes deep vertical mastery of individual software repositories over broad horizontal exposure across many codebases. Introduced through SWE-Spot by Peng et al. (2026), RCL proposes that compact models must internalize the 'physics' of a target software environment through parametric knowledge acquisition rather than relying on costly inference-time search.

graph LR
    A[Repository] --> B[Unit 1: Design Patterns]
    A --> C[Unit 2: Implementation Details]
    A --> D[Unit 3: Evolution History]
    A --> E[Unit 4: Runtime Behavior]
    B --> F[RCL Curriculum]
    C --> F
    D --> F
    E --> F
    F --> G[Fine-tune Base Model]
    G --> H[Repo-Expert SLM]

The Problem with Task-Centric Learning

The prevailing approach to training coding models follows a Task-Centric Learning (TCL) paradigm: expose the model to as many diverse repositories and tasks as possible, hoping it learns generalizable coding skills. This works for large frontier models with enormous parameter budgets, but fails for Small Language Models (SLMs) due to a fundamental capability gap.

SLMs trained with TCL spread their limited parameter budget thinly across thousands of repositories, acquiring shallow familiarity with each but deep mastery of none, and must compensate at inference time with costly retrieval and search.

The RCL Paradigm Shift

RCL inverts the TCL assumption. Instead of learning a little about many repositories, the model learns everything about a specific repository:

Dimension          | Task-Centric Learning (TCL) | Repository-Centric Learning (RCL)
Breadth vs Depth   | Horizontal (many repos)     | Vertical (single repo)
Knowledge Location | Inference-time search       | Parametric (in weights)
Generalization     | Cross-repo transfer         | Repo-specific mastery
Inference Cost     | High (RAG, search)          | Low (direct generation)
Cold Start         | Every new task              | One-time training

Four-Unit Repository-Centric Experience

RCL transforms static codebases into interactive learning signals through a structured four-unit curriculum:

Unit 1: Design. The model learns the repository's high-level architectural patterns – module organization, dependency structures, design decisions, and API contracts. This builds understanding of why the code is structured as it is.

Unit 2: Implementation. Focused on code-level details – writing, debugging, understanding function implementations, class hierarchies, and coding idioms specific to the project.

Unit 3: Evolution. The model studies the repository's version history – commit patterns, refactoring trajectories, how features were added over time, and how bugs were fixed. This captures the temporal dynamics of software development.

Unit 4: Runtime. Incorporates execution traces, test behaviors, and dynamic properties that cannot be inferred from static code alone. This grounds the model's understanding in actual program behavior.
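As a toy illustration of the kind of signal Unit 3 mines from version history, the sketch below parses a simplified commit log into per-file change counts; the log format and helper name are illustrative assumptions, not SWE-Spot's actual data layout.

```python
from collections import Counter

# Toy commit log: one commit header per line, touched files indented
# beneath it. This format is an illustrative assumption.
SAMPLE_LOG = """\
commit a1b2c3 Fix off-by-one in pagination
  src/pagination.py
commit d4e5f6 Refactor pagination helpers
  src/pagination.py
  src/utils.py
commit 789abc Add caching layer
  src/cache.py
"""

def extract_change_counts(log_text):
    """Count how often each file appears across commits.

    Frequently changed files mark the repository's refactoring
    hotspots -- one temporal signal a Unit 3 curriculum could use.
    """
    counts = Counter()
    for line in log_text.splitlines():
        if line.startswith("  "):  # indented lines list touched files
            counts[line.strip()] += 1
    return counts

counts = extract_change_counts(SAMPLE_LOG)
print(counts.most_common(1))  # src/pagination.py is the hotspot
```

A real pipeline would draw on `git log` output rather than a string literal, but the shape of the signal, files ranked by change frequency, is the same.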

# Conceptual illustration of the RCL training pipeline.
# Repository and fine_tune are assumed helpers, not defined here.
class RepositoryCentricExperience:
    def __init__(self, repo_path):
        self.repo = Repository(repo_path)
 
    def generate_design_examples(self):
        # Unit 1: Architecture and design patterns
        return self.repo.extract_module_relationships()
 
    def generate_implementation_examples(self):
        # Unit 2: Code writing and debugging
        return self.repo.extract_function_implementations()
 
    def generate_evolution_examples(self):
        # Unit 3: Version history and change patterns
        return self.repo.extract_commit_trajectories()
 
    def generate_runtime_examples(self):
        # Unit 4: Execution traces and test behaviors
        return self.repo.extract_test_execution_traces()
 
    def train_repo_expert(self, base_model):
        # Train a repo-specialized expert
        curriculum = (
            self.generate_design_examples() +
            self.generate_implementation_examples() +
            self.generate_evolution_examples() +
            self.generate_runtime_examples()
        )
        return fine_tune(base_model, curriculum)
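To make the conceptual class above concrete, here is a minimal self-contained variant in which the repository accessor and the training call are stand-in stubs; every name and return value below is a placeholder assumed for illustration, not a real SWE-Spot component.

```python
# Minimal runnable sketch of the pipeline above. StubRepository and
# stub_fine_tune are placeholders, not real SWE-Spot components.
class StubRepository:
    def extract_module_relationships(self):
        return [("design", "auth imports db")]

    def extract_function_implementations(self):
        return [("implementation", "def login(user): ...")]

    def extract_commit_trajectories(self):
        return [("evolution", "commit 42: refactor login")]

    def extract_test_execution_traces(self):
        return [("runtime", "test_login passed in 12ms")]

def stub_fine_tune(base_model, curriculum):
    # Pretend training step: just record how many examples were consumed.
    return {"base": base_model, "examples_seen": len(curriculum)}

repo = StubRepository()
curriculum = (
    repo.extract_module_relationships()
    + repo.extract_function_implementations()
    + repo.extract_commit_trajectories()
    + repo.extract_test_execution_traces()
)
expert = stub_fine_tune("slm-base", curriculum)
print(expert["examples_seen"])  # 4 -- one example per curriculum unit
```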

Internalizing Repository Physics

The central metaphor of RCL is that each software repository has its own 'physics' – a set of core rules, dependency patterns, idioms, conventions, and dynamics that govern how the codebase behaves and evolves. Just as a physics engine must understand gravity and collision to simulate a world, a coding agent must understand a repository's internal logic to operate effectively within it.

RCL embeds this physics directly into model weights during training, eliminating the need for inference-time discovery through RAG or search. The model develops an intuitive understanding analogous to how experienced developers build deep familiarity with codebases they work on daily.
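The contrast between inference-time discovery and parametric knowledge can be caricatured in a few lines. In the sketch below, the word-overlap search and the dictionary standing in for "knowledge in the weights" are both toy assumptions, meant only to show where the lookup cost lives in each paradigm.

```python
# Toy contrast: inference-time search (RAG-style) vs. parametric
# recall (RCL-style). The dict standing in for internalized weights
# is an illustration, not a real model.
DOCS = [
    "The cache layer invalidates entries on every write.",
    "Pagination uses 1-based page indices.",
    "Auth tokens expire after 15 minutes.",
]

def rag_answer(question, docs):
    """Inference-time search: scan every document for word overlap."""
    q_words = set(question.lower().split())
    return max(docs, key=lambda d: len(q_words & set(d.lower().split())))

# RCL analogue: the fact is already "in the weights" -- no search step.
PARAMETRIC_KNOWLEDGE = {
    "pagination indexing": "Pagination uses 1-based page indices.",
}

def parametric_answer(topic):
    return PARAMETRIC_KNOWLEDGE[topic]

# Both paths recover the same fact, but only one scans the corpus.
assert rag_answer("how does pagination indexing work", DOCS) == \
       parametric_answer("pagination indexing")
```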

Key Results

SWE-Spot-4B, trained with RCL, achieves results on repository-level software engineering tasks well beyond what its parameter count would predict.

These results break established scaling trends, demonstrating that repository mastery is a distinct capability dimension that complements general coding ability.

Theoretical Implications

RCL suggests that for building efficient intelligence in constrained settings, the path forward is not always scale – it is depth. A small model that deeply understands its operational environment can outperform a much larger model that has only shallow familiarity.

$$\text{Effectiveness} = f(\text{depth}_{\text{repo}}) \gg g(\text{breadth}_{\text{tasks}}) \quad \text{for SLMs}$$
