Agent Context Files

Agent context files such as AGENTS.md and CLAUDE.md are repository-level instruction documents designed to guide AI coding agents. Gloaguen et al. (2026) present the first rigorous empirical investigation of their effectiveness, finding that these files tend to reduce task success rates while increasing inference costs by over 20%. This counterintuitive result challenges the widespread adoption of context files and suggests that minimal, carefully curated instructions outperform verbose guidance.

Background

As AI coding agents (Claude Code, Codex, Qwen Code) become standard development tools, practitioners have adopted the convention of placing instruction files such as AGENTS.md and CLAUDE.md in repository roots.

These files typically contain coding conventions, architectural guidelines, testing requirements, and tool usage preferences. Despite widespread adoption, no prior work had rigorously measured whether they actually improve agent performance.

Methodology

The study evaluated coding agents under three conditions:

  1. No context – Agent operates without any context file
  2. LLM-generated context – Context files generated by an LLM following official prompts from OpenAI and Anthropic (averaging ~641 words, 9.7 sections)
  3. Human-written context – Developer-committed context files from real repositories
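For readers replicating this kind of ablation, the three conditions can be sketched as a small dispatch. The `ContextCondition` enum and `context_for_condition` helper below are illustrative scaffolding, not the study's actual harness:

```python
from enum import Enum

class ContextCondition(Enum):
    # The three experimental conditions compared in the study.
    NO_CONTEXT = "no_context"
    LLM_GENERATED = "llm_generated"
    HUMAN_WRITTEN = "human_written"

def context_for_condition(condition, repo):
    # Hypothetical dispatcher: returns the context text an agent
    # would receive under each condition (repo is a plain dict here).
    if condition is ContextCondition.NO_CONTEXT:
        return None
    if condition is ContextCondition.LLM_GENERATED:
        return repo.get("generated_context")
    return repo.get("committed_context")

repo = {
    "generated_context": "LLM-written AGENTS.md text",
    "committed_context": "developer-committed CLAUDE.md text",
}
print(context_for_condition(ContextCondition.NO_CONTEXT, repo))    # None
print(context_for_condition(ContextCondition.HUMAN_WRITTEN, repo))
```

Keeping the condition explicit like this makes it easy to run the same task set once per condition and diff success rates.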

Benchmarks

Two evaluation settings were used:

Agent traces were analyzed by classifying each tool call into categories (e.g., Edit/sed, Read/cat) and intents (e.g., install dependencies, run tests, explore files) using LLM-based classification.
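The study used an LLM for this classification; a simplified rule-based stand-in illustrates the scheme. The keyword lists below are assumptions for demonstration, not the paper's taxonomy:

```python
# Tool-name categories, mirroring the Edit/sed and Read/cat groupings.
TOOL_CATEGORIES = {
    "edit": {"Edit", "sed"},
    "read": {"Read", "cat"},
}

# Intent keywords (illustrative, not the study's classifier).
INTENT_KEYWORDS = {
    "install dependencies": ("pip install", "npm install", "apt-get"),
    "run tests": ("pytest", "npm test", "unittest"),
    "explore files": ("ls ", "find ", "grep "),
}

def classify_tool_call(tool_name, command=""):
    # Map a tool call to a (category, intent) pair; unknowns fall
    # through to "other" so every call in a trace gets a label.
    category = next(
        (c for c, names in TOOL_CATEGORIES.items() if tool_name in names),
        "other",
    )
    intent = next(
        (i for i, kws in INTENT_KEYWORDS.items()
         if any(k in command for k in kws)),
        "other",
    )
    return category, intent

print(classify_tool_call("sed"))                # ('edit', 'other')
print(classify_tool_call("Bash", "pytest -q"))  # ('other', 'run tests')
```

Aggregating these labels over a trace yields the per-condition behavioral profiles the study compares.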

Repository statistics: codebases averaged ~3337 files with 75% test coverage; evaluated PRs edited ~2.5 files and ~118.9 lines on average.

Key Results

The findings challenge common assumptions about context file utility:

Performance Impact

On AGENTbench:

However, broader analysis across conditions showed that context files lowered success rates overall while raising computational costs (more tool calls, longer traces).

Behavioral Shifts

Context files induced measurable behavioral changes:

Prompt Insensitivity

Matching the context-generation prompt to the agent's provider (OpenAI vs. Anthropic) yielded no consistent advantage:

Analysis

The core problem is that context files introduce unnecessary requirements that constrain agent behavior suboptimally. A simple multiplicative model captures the effect:

$$P(\text{success} \mid \text{context}) = P(\text{success} \mid \text{no context}) \cdot \frac{P(\text{helpful})}{P(\text{helpful}) + P(\text{harmful})}$$

Whenever harmful constraints are present, this factor falls below 1 and context files degrade expected performance; the degradation becomes severe once harmful instructions outnumber helpful ones. The study suggests this balance is frequently unfavorable for verbose context files.
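The model is easy to sanity-check numerically. The probabilities below are illustrative placeholders, not figures from the study:

```python
def success_with_context(p_base, p_helpful, p_harmful):
    # P(success | context) under the multiplicative model: the baseline
    # success rate scaled by the helpful/(helpful + harmful) factor.
    return p_base * p_helpful / (p_helpful + p_harmful)

# Illustrative 40% baseline success rate:
print(success_with_context(0.40, 0.5, 0.5))  # balanced instructions halve it
print(success_with_context(0.40, 0.2, 0.8))  # mostly harmful: worse still
```

Even a balanced mix of helpful and harmful instructions halves the predicted success rate under this model, which is why trimming context files aggressively matters.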

The inference cost overhead is significant:

$$\Delta C = C_{\text{context}} - C_{\text{baseline}} \approx 0.2 \cdot C_{\text{baseline}}$$

representing a 20%+ increase in API costs with no corresponding performance gain.
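A quick arithmetic check of the overhead formula; the $100 baseline budget is illustrative:

```python
def context_cost_overhead(baseline_cost, overhead_fraction=0.2):
    # Delta C = C_context - C_baseline, with C_context ≈ 1.2 * C_baseline,
    # so the overhead is simply overhead_fraction * C_baseline.
    return baseline_cost * overhead_fraction

# Illustrative: a $100 baseline API spend implies roughly $20 of extra
# cost attributable to the context file, with no measured success gain.
print(context_cost_overhead(100.0))
```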

Code Example

from pathlib import Path
 
CONTEXT_FILES = ["CLAUDE.md", "AGENTS.md", ".cursorrules", "COPILOT.md"]
MAX_RECOMMENDED_WORDS = 300
 
def load_agent_context(repo_root):
    # Load agent context file with minimal-first strategy
    for filename in CONTEXT_FILES:
        path = Path(repo_root) / filename
        if path.exists():
            content = path.read_text()
            word_count = len(content.split())
            if word_count > MAX_RECOMMENDED_WORDS:
                print(f"Warning: {filename} has {word_count} words, "
                      f"exceeds recommended {MAX_RECOMMENDED_WORDS}")
            return content
    return None
 
def create_minimal_context():
    # Generate minimal context following study recommendations
    # Focus only on non-obvious, repo-specific conventions
    return "\n".join([
        "# Project Context",
        "- Language: Python 3.12",
        "- Test runner: pytest",
        "- Style: ruff format",
        "- Do not modify generated files in src/generated/",
    ])

Recommendations

Based on the findings, the authors recommend:

References

See Also