Core Concepts
Reasoning Techniques
Memory Systems
Retrieval
Agent Types
Design Patterns
Training & Alignment
Frameworks
Tools & Products
Safety & Governance
Evaluation
Research
Development
Meta
SWE-bench is the standard benchmark for evaluating large language models and AI agents on real-world software engineering tasks. It consists of 2,294 issues collected from popular open-source Python repositories on GitHub, providing a rigorous evaluation framework for coding agents.
Each SWE-bench task places a coding agent in a Docker environment with a checkout of the codebase from just before the issue was resolved. The agent must understand the issue from its natural-language description, navigate the repository to find the relevant code, and produce a patch that resolves the issue.
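Concretely, each task instance bundles the repository state and the tests later used for grading. A minimal sketch of the instance format (the field names follow the published SWE-bench dataset; the dataclass itself is illustrative, not part of the official harness):

```python
from dataclasses import dataclass, field

@dataclass
class SWEBenchInstance:
    # Illustrative container mirroring the key fields of a SWE-bench task.
    instance_id: str        # unique id, formatted "<owner>__<repo>-<PR number>"
    repo: str               # GitHub repository the issue came from
    base_commit: str        # commit checked out from just before the fix landed
    problem_statement: str  # the issue text shown to the agent
    FAIL_TO_PASS: list = field(default_factory=list)  # tests the patch must make pass
    PASS_TO_PASS: list = field(default_factory=list)  # tests that must keep passing
```

The agent only sees the checked-out repository and the problem statement; the two test lists are withheld and used exclusively for grading.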
The evaluation uses functional correctness — patches must pass hidden test suites extracted from the actual fix, ensuring agents produce working solutions rather than superficially plausible code.
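A patch counts as resolved only if it makes the previously failing tests pass without breaking the previously passing ones. A simplified sketch of that grading rule (`run_tests` is a hypothetical helper standing in for the real Docker test harness):

```python
def is_resolved(run_tests, fail_to_pass, pass_to_pass):
    """Grade a candidate patch by functional correctness.

    run_tests(test_ids) -> dict mapping each test id to True (pass) or
    False (fail); in the real harness this executes the repository's
    test suite inside the Docker container after applying the patch.
    """
    results = run_tests(fail_to_pass + pass_to_pass)
    # Every test that failed before the patch must now pass...
    fixed = all(results[t] for t in fail_to_pass)
    # ...and no test that passed before the patch may regress.
    no_regressions = all(results[t] for t in pass_to_pass)
    return fixed and no_regressions
```

Because both conditions must hold, a patch that fixes the issue but breaks unrelated behavior is graded as unresolved.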
The SWE-bench ecosystem has expanded into multiple variants to address different evaluation needs:
Performance varies significantly across benchmark variants (as of early 2026):

SWE-bench Verified: top models score above 70%
SWE-bench Pro (Public Set): top models score around 23%
Multi-SWE-bench (Java): agents score in the 20-30% range

The gap between roughly 23% on the SWE-bench Pro public set and 70%+ on SWE-bench Verified demonstrates that Pro provides a more discriminative measure of agent capability.
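Leaderboard figures like these are simple resolution rates: the percentage of benchmark instances whose patch was graded as resolved. A trivial sketch (the 350/500 split is a hypothetical run; SWE-bench Verified does contain 500 instances):

```python
def resolution_rate(resolved_flags):
    """Percentage of instances whose patch was graded as resolved."""
    if not resolved_flags:
        return 0.0
    return 100.0 * sum(resolved_flags) / len(resolved_flags)

# Hypothetical run over the 500 Verified instances, 350 resolved:
print(resolution_rate([True] * 350 + [False] * 150))  # 70.0
```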
Modern SWE agents employ sophisticated multi-component architectures, typically combining fault localization, candidate patch generation, and patch scoring and selection stages.
Notably, research shows that SWE-bench Verified is not particularly sensitive to agent toolkit design — powerful toolsets do not necessarily translate into higher benchmark scores, suggesting that reasoning ability matters more than tool sophistication.
# Simplified SWE-bench agent loop
class SWEAgent:
    def solve_issue(self, repo_path, issue_description):
        """Solve a GitHub issue in a Docker-isolated environment."""
        # Phase 1: Localize the problem
        relevant_files = self.search_codebase(repo_path, issue_description)
        root_cause = self.analyze_issue(relevant_files, issue_description)

        # Phase 2: Generate candidate patches
        patches = []
        for strategy in self.edit_strategies:
            patch = strategy.generate_patch(root_cause, relevant_files)
            patches.append(patch)

        # Phase 3: Score and select the best patch
        scored = [(self.score_patch(p, issue_description), p) for p in patches]
        best_patch = max(scored, key=lambda x: x[0])[1]
        return best_patch
A significant concern has emerged regarding Python-focused benchmarks: mounting evidence suggests that the latest frontier models may have been exposed to benchmark data during training, eroding community confidence in benchmark validity. This has motivated expansion into other languages such as Java, where agents scoring in the 20-30% range leave meaningful room for improvement compared to Python's saturated 70-80% range.