SWE-bench

SWE-bench is the standard benchmark for evaluating large language models and AI agents on real-world software engineering tasks collected from GitHub. It consists of 2,294 real GitHub issues from popular open-source Python repositories, providing a rigorous evaluation framework for coding agents.

Task Format

Each SWE-bench task places a coding agent in a Docker environment with a checkout of the codebase from just before an issue was resolved. The agent must:

  1. Read and understand the issue description
  2. Navigate and comprehend the relevant codebase
  3. Modify source code to resolve the issue
  4. Submit a patch that passes the real unit tests from the pull request that originally closed the issue
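Concretely, each task ships as a structured record pairing the issue with its repository state and tests. A minimal sketch of such a record (the field names follow the published SWE-bench dataset schema; every value below is invented for illustration, not a real task):

```python
# Sketch of a SWE-bench task instance. Field names follow the published
# dataset schema; the values are illustrative placeholders only.
task = {
    "instance_id": "example__repo-1234",    # hypothetical identifier
    "repo": "example/repo",                 # GitHub repository
    "base_commit": "abc123",                # checkout just before the fix landed
    "problem_statement": "Crash when ...",  # the issue text the agent reads
    "patch": "diff --git ...",              # gold patch (hidden from the agent)
    "test_patch": "diff --git ...",         # tests added by the original PR
    "FAIL_TO_PASS": ["test_bug_fixed"],     # must flip from failing to passing
    "PASS_TO_PASS": ["test_existing"],      # must keep passing (no regressions)
}

# The agent only sees the repository checkout and the problem statement;
# the gold patch and the test lists are reserved for evaluation.
agent_view = {
    k: task[k]
    for k in ("instance_id", "repo", "base_commit", "problem_statement")
}
```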

The evaluation uses functional correctness — patches must pass hidden test suites extracted from the actual fix, ensuring agents produce working solutions rather than superficially plausible code.
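The correctness check can be sketched as: apply the candidate patch, run the tests the fixing PR added (the fail-to-pass set) plus a regression set (pass-to-pass), and count the instance resolved only if every test passes. A toy sketch, where `run_tests` is a hypothetical stand-in for the real Docker-based harness:

```python
def evaluate_patch(run_tests, fail_to_pass, pass_to_pass):
    """Return True iff the patched repo passes all required tests.

    run_tests: callable mapping a test name to True (passed) or False.
    A stand-in for SWE-bench's real containerized test harness.
    """
    # Tests from the fixing PR must now pass...
    fixed = all(run_tests(t) for t in fail_to_pass)
    # ...and previously passing tests must not regress.
    no_regressions = all(run_tests(t) for t in pass_to_pass)
    return fixed and no_regressions

# Toy example: the patch fixes the bug and keeps existing behaviour intact.
results = {"test_bug_fixed": True, "test_existing": True}
resolved = evaluate_patch(results.get, ["test_bug_fixed"], ["test_existing"])
```

A superficially plausible patch that breaks an existing test fails the pass-to-pass check and is scored as unresolved, which is what makes the metric functional rather than stylistic.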

Benchmark Variants

The SWE-bench ecosystem has expanded into multiple variants to address different evaluation needs:

  - SWE-bench Lite: a smaller subset of tasks for faster, cheaper evaluation runs.
  - SWE-bench Verified: a 500-task subset whose issues and tests were human-validated to be solvable and fairly scored.
  - SWE-bench Multimodal: tasks whose issues include visual elements such as screenshots.
  - SWE-bench Pro: a harder set with a held-out private split, designed to resist data contamination.
  - Multi-SWE-bench: tasks drawn from languages beyond Python, such as Java.

Leaderboard and Performance

Performance varies significantly across benchmark variants. As of early 2026, top models score around 23% on the SWE-bench Pro public set, compared to 70%+ on SWE-bench Verified, demonstrating that Pro provides a more discriminative measure of agent capability. Multi-SWE-bench (Java) results similarly trail Python results by a wide margin.

How Agents Solve GitHub Issues

Modern SWE agents employ sophisticated multi-component architectures:

  1. Localization: A component identifies where changes are needed in the codebase and why, using file search, code understanding, and issue analysis.
  2. Editing: A separate component applies targeted edits based on the localization output.
  3. Verification: A scorer LLM assigns scores to proposed patches and selects the best candidates through tournament-style comparison.
  4. Iteration: Agents may generate multiple candidate patches and compare them to find the strongest solution.

Notably, research shows that SWE-bench Verified is not particularly sensitive to agent toolkit design — powerful toolsets do not necessarily translate into higher benchmark scores, suggesting that reasoning ability matters more than tool sophistication.

# Simplified SWE-bench agent loop. The helper components referenced below
# (search_codebase, analyze_issue, edit_strategies, score_patch) stand for
# the localization, editing, and verification modules described above.
class SWEAgent:
    def solve_issue(self, repo_path, issue_description):
        """Solve a GitHub issue in a Docker-isolated environment."""
        # Phase 1: Localize the problem
        relevant_files = self.search_codebase(repo_path, issue_description)
        root_cause = self.analyze_issue(relevant_files, issue_description)

        # Phase 2: Generate candidate patches
        patches = []
        for strategy in self.edit_strategies:
            patch = strategy.generate_patch(root_cause, relevant_files)
            patches.append(patch)

        # Phase 3: Score and select the best patch
        scored = [(self.score_patch(p, issue_description), p) for p in patches]
        best_patch = max(scored, key=lambda x: x[0])[1]
        return best_patch
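The scoring in phase 3 can also be run as a pairwise tournament, matching the tournament-style comparison described earlier: a judge compares two candidate patches at a time and the winner advances. A minimal sketch, where `judge` is a hypothetical stand-in for an LLM comparator:

```python
def tournament_select(patches, judge):
    """Single-elimination tournament over candidate patches.

    judge(a, b) returns the preferred patch of the pair; in a real agent
    this would be an LLM comparing two diffs against the issue.
    """
    candidates = list(patches)
    while len(candidates) > 1:
        next_round = []
        # Pair candidates off; an odd one out gets a bye to the next round.
        for i in range(0, len(candidates) - 1, 2):
            next_round.append(judge(candidates[i], candidates[i + 1]))
        if len(candidates) % 2 == 1:
            next_round.append(candidates[-1])
        candidates = next_round
    return candidates[0]

# Toy judge: prefer the shorter diff (a crude stand-in for an LLM scorer).
best = tournament_select(
    ["--- long diff ---", "-- mid --", "-x-"],
    judge=lambda a, b: min(a, b, key=len),
)
```

Pairwise comparison sidesteps the need for calibrated absolute scores: the judge only has to rank two concrete patches, which is typically an easier task for an LLM than assigning a consistent number to each one in isolation.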

Data Contamination Concerns

A significant concern has emerged regarding Python-focused benchmarks: mounting evidence suggests that the latest frontier models may have been exposed to benchmark data during training, degrading community confidence in benchmark validity. This has motivated expansion into other languages such as Java, where agents scoring in the 20-30% range leave meaningful room for improvement compared to Python's saturated 70-80% range.
