====== SWE-bench ======

**SWE-bench** is the standard benchmark for evaluating large language models and AI agents on real-world software engineering tasks collected from GitHub. It consists of 2,294 real GitHub issues from popular open-source Python repositories, providing a rigorous evaluation framework for coding agents.

===== Task Format =====

Each SWE-bench task places a coding agent in a Docker environment with a checkout of the codebase from just before an issue was resolved. The agent must:

- Read and understand the issue description
- Navigate and comprehend the relevant codebase
- Modify source code to resolve the issue
- Submit a patch that passes the real unit tests from the pull request that originally closed the issue

The evaluation uses **functional correctness** — patches must pass hidden test suites extracted from the actual fix, ensuring agents produce working solutions rather than superficially plausible code.

===== Benchmark Variants =====

The SWE-bench ecosystem has expanded into multiple variants to address different evaluation needs:

* **SWE-bench (Original)** — 2,294 tasks from 12 Python repositories. Some tasks may be unsolvable without additional context beyond the issue description.
* **SWE-bench Verified** — A human-validated 500-problem subset reviewed by experienced developers to ensure solvability. It became the standard benchmark for approximately one year before saturation by frontier models.
* **SWE-bench Lite** — A smaller, streamlined subset for faster evaluation cycles during development.
* **SWE-bench Pro** — Released in late 2025, it contains 1,865 problems from 41 actively maintained repositories. It features long-horizon tasks requiring hours to days for professional engineers, with patches spanning multiple files.
* **Multi-SWE-bench** — Extended evaluations across multiple programming languages (notably Java), addressing concerns about Python overfitting and potential data contamination.
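The functional-correctness criterion above can be sketched in a few lines. This is an illustrative simplification, not the official evaluation harness: the two test lists mirror SWE-bench's real FAIL_TO_PASS and PASS_TO_PASS fields, but the `Task` record, the `evaluate` helper, and the Django instance shown are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Task:
    """Simplified stand-in for a SWE-bench task instance."""
    instance_id: str
    repo: str
    base_commit: str
    problem_statement: str
    fail_to_pass: list   # tests that must flip from failing to passing
    pass_to_pass: list   # tests that must keep passing (no regressions)

def evaluate(task, test_results):
    """Functional correctness: a patch resolves the issue only if every
    FAIL_TO_PASS test now passes and no PASS_TO_PASS test regresses."""
    return (all(test_results.get(t) for t in task.fail_to_pass)
            and all(test_results.get(t) for t in task.pass_to_pass))

task = Task(
    instance_id="django__django-12345",  # hypothetical instance ID
    repo="django/django",
    base_commit="abc123",
    problem_statement="QuerySet.union() crashes on empty querysets",
    fail_to_pass=["test_union_empty"],
    pass_to_pass=["test_union_basic"],
)

# A patch that fixes the new test but breaks an existing one still fails.
print(evaluate(task, {"test_union_empty": True, "test_union_basic": True}))
print(evaluate(task, {"test_union_empty": True, "test_union_basic": False}))
```

The two checks together are what makes the benchmark resistant to superficially plausible patches: the agent cannot pass by fixing the reported symptom while silently breaking existing behavior.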
===== Leaderboard and Performance =====

Performance varies significantly across benchmark variants (as of early 2026):

**SWE-bench Pro (Public Set):**
* Auggie CLI: ~52%
* Claude Opus 4.5 via SWE-Agent: ~46%

**SWE-bench Verified:**
* Claude Opus 4.5 + Live-SWE-agent: ~79%
* mini-SWE-agent: ~74% (in just 100 lines of Python)

**Multi-SWE-bench (Java):**
* IBM iSWE-Agent (Claude 4.5 Sonnet): ~33%
* Gemini 2.5 Pro: ~29%

At its release, top models scored only around 23% on the SWE-bench Pro public set, compared to 70%+ on SWE-bench Verified, demonstrating that Pro provides a more discriminative measure of agent capability.

===== How Agents Solve GitHub Issues =====

Modern SWE agents employ sophisticated multi-component architectures:

- **Localization**: A component identifies where changes are needed in the codebase and why, using file search, code understanding, and issue analysis.
- **Editing**: A separate component applies targeted edits based on the localization output.
- **Verification**: A scorer LLM assigns scores to proposed patches and selects the best candidates through tournament-style comparison.
- **Iteration**: Agents may generate multiple candidate patches and compare them to find the strongest solution.

Notably, research shows that SWE-bench Verified is not particularly sensitive to agent toolkit design — powerful toolsets do not necessarily translate into higher benchmark scores, suggesting that reasoning ability matters more than tool sophistication.
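The tournament-style verification step described above can be sketched as a single-elimination bracket over candidate patches. This is a hypothetical illustration rather than any real agent's implementation: `judge` stands in for the scorer LLM, and here it is a trivial stub that prefers shorter patches purely so the example runs.

```python
def judge(patch_a, patch_b):
    """Stand-in for a scorer LLM: return the 'stronger' patch.
    Stub heuristic for illustration only: prefer the shorter patch."""
    return patch_a if len(patch_a) <= len(patch_b) else patch_b

def tournament(patches):
    """Single-elimination pairwise comparison over candidate patches."""
    candidates = list(patches)
    while len(candidates) > 1:
        next_round = []
        # Compare candidates in pairs; an odd one out advances unopposed.
        for i in range(0, len(candidates) - 1, 2):
            next_round.append(judge(candidates[i], candidates[i + 1]))
        if len(candidates) % 2 == 1:
            next_round.append(candidates[-1])
        candidates = next_round
    return candidates[0]

patches = [
    "--- a/f.py\n+++ b/f.py\n@@ fix A plus an unrelated refactor @@",
    "--- a/f.py\n+++ b/f.py\n@@ fix B @@",
    "--- a/f.py\n+++ b/f.py\n@@ fix C, longer still @@",
]
print(tournament(patches))  # the judge stub picks the shortest candidate
```

A tournament needs only pairwise comparisons, which tend to be easier for a scorer LLM than assigning calibrated absolute scores to each patch in isolation.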
<code python>
# Simplified SWE-bench agent loop
class SWEAgent:
    def solve_issue(self, repo_path, issue_description):
        """Solve a GitHub issue in a Docker-isolated environment."""
        # Phase 1: Localize the problem
        relevant_files = self.search_codebase(repo_path, issue_description)
        root_cause = self.analyze_issue(relevant_files, issue_description)

        # Phase 2: Generate candidate patches
        patches = []
        for strategy in self.edit_strategies:
            patch = strategy.generate_patch(root_cause, relevant_files)
            patches.append(patch)

        # Phase 3: Score and select the best patch
        scored = [(self.score_patch(p, issue_description), p) for p in patches]
        best_patch = max(scored, key=lambda x: x[0])[1]
        return best_patch
</code>

===== Data Contamination Concerns =====

A significant concern has emerged regarding Python-focused benchmarks: mounting evidence suggests that the latest frontier models may have been exposed to benchmark data during training, degrading community confidence in benchmark validity. This has motivated expansion into other languages such as Java, where agents scoring in the 20-30% range leave meaningful room for improvement compared to Python's saturated 70-80% range.

===== References =====

* [[https://arxiv.org/abs/2310.06770|SWE-bench: Can Language Models Resolve Real-World GitHub Issues? (arXiv:2310.06770)]]
* [[https://www.swebench.com|SWE-bench Official Website]]
* [[https://arxiv.org/abs/2509.16941|SWE-bench Pro (arXiv:2509.16941)]]
* [[https://github.com/SWE-bench/SWE-smith|SWE-smith: Training Toolkit for SWE Agents]]

===== See Also =====

* [[agent_as_a_judge|Agent-as-a-Judge]]
* [[web_arena_benchmark|WebArena Benchmark]]
* [[agent_index|AI Agent Index]]