====== Benchmark Leaderboard ======

Current top scores across major AI benchmarks. Data sourced from official leaderboards and research trackers.

**Last updated:** March 25, 2026

===== SWE-bench Verified =====

Software engineering benchmark -- resolving real GitHub issues from popular Python repos.

Source: [[https://www.swebench.com|swebench.com]], [[https://llm-stats.com/benchmarks/swe-bench-verified|llm-stats.com]]

^ Rank ^ Agent / Model ^ Score (% Resolved) ^
| 1 | Claude Opus 4.5 (Anthropic) | 80.9% |
| 2 | MiniMax M2.5 (MiniMax, 230B) | 80.2% |
| 3 | GPT-5.2 (OpenAI) | 80.0% |
| 4 | Claude Sonnet 4.6 (Anthropic) | 79.6% |
| 5 | Gemini 3 Flash (Google) | 78.0% |
| 6 | GLM-5 (Zhipu AI, 744B) | 77.8% |
| 7 | Kimi K2.5 (Moonshot AI, 1T) | 76.8% |
| 8 | Seed 2.0 Pro (ByteDance) | 76.5% |
| 9 | Claude Sonnet 4.5 (Anthropic) | 75.2% |
| 10 | DeepSeek-R1 (DeepSeek) | 74.0% |

===== GAIA (General AI Assistants) =====

Multi-step real-world tasks requiring tool use, web browsing, and reasoning.

Source: [[https://huggingface.co/spaces/gaia-benchmark/leaderboard|HuggingFace GAIA Leaderboard]], [[https://awesomeagents.ai/leaderboards/agentic-ai-benchmarks-leaderboard/|Awesome Agents]]

^ Rank ^ Agent / Model ^ Score (% Overall) ^
| 1 | Claude Sonnet 4.5 (Anthropic) | 74.6% |
| 2 | Claude Opus 4.5 (Anthropic) | 72.1% |
| 3 | Claude Sonnet 4 (Anthropic) | 69.8% |
| 4 | GPT-5 Mini (OpenAI) | 44.8% |
| 5 | Claude 3.7 Sonnet Thinking | 43.9% |
| 6 | Claude 3.7 Sonnet | 43.9% |
| 7 | Gemini 2.5 Pro (Google) | 33.3% |
| 8 | DeepSeek R1 0528 | 27.9% |
| 9 | Mistral Medium 3.1 | 23.3% |
| 10 | Tongyi DeepResearch 30B (Alibaba) | 20.6% |

//Note: Scores vary significantly by evaluation harness. Awesome Agents reports higher scores using agentic scaffolding (Claude Sonnet 4.5 at 74.6%) vs. the LayerLens/PricePerToken tracker, which measures base model capability.//

===== BFCL V4 (Function Calling) =====

Berkeley Function Calling Leaderboard -- accuracy of tool/function calling.

Source: [[https://awesomeagents.ai/leaderboards/agentic-ai-benchmarks-leaderboard/|Awesome Agents]]

^ Rank ^ Model ^ Score (%) ^
| 1 | GLM-4.5 (Zhipu AI) | 70.9% |
| 2 | Claude Opus 4.1 (Anthropic) | 70.4% |
| 3 | Claude Sonnet 4 (Anthropic) | 69.8% |
| 4 | GPT-5 (OpenAI) | 68.5% |
| 5 | Gemini 2.5 Pro (Google) | 67.2% |

===== HumanEval (Code Generation) =====

Python code generation from function docstrings -- 164 problems.

Source: [[https://pricepertoken.com/leaderboards/benchmark/humaneval|PricePerToken]], [[https://llm-stats.com/benchmarks/humaneval|LLM Stats]]

^ Rank ^ Model ^ Score (% pass@1) ^
| 1 | Claude Sonnet 4.5 Thinking (Anthropic) | 97.6% |
| 2 | DeepSeek-R1 | 97.4% |
| 3 | Grok 4 (xAI) | 97.0% |
| 4 | Claude Sonnet 4.5 (Anthropic) | 97.0% |
| 5 | Gemini 3 Pro Preview (Google) | 97.0% |
| 6 | Claude Opus 4.5 (Anthropic) | 97.0% |
| 7 | Claude Opus 4.6 (Anthropic) | 97.0% |
| 8 | GLM-5 (Zhipu AI) | 97.0% |
| 9 | o4-mini High (OpenAI) | 96.3% |
| 10 | Claude Sonnet 4 (Anthropic) | 96.3% |
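The ''pass@1'' column is the standard HumanEval metric: the estimated probability that a single sampled completion passes the problem's held-out unit tests. Leaderboards commonly compute it with the unbiased pass@k estimator from the original Codex paper (Chen et al., 2021); a minimal sketch:

<code python>
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total completions sampled for a problem
    c: completions that pass the unit tests
    k: the k in pass@k
    """
    if n - c < k:
        return 1.0  # every size-k subset contains a passing sample
    # 1 - C(n-c, k) / C(n, k), computed as a numerically stable product
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# For k=1 the estimator reduces to the plain pass rate c/n:
print(pass_at_k(n=200, c=130, k=1))  # 0.65
</code>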
===== MATH (Mathematical Problem Solving) =====

Algebra, geometry, number theory, and calculus competition problems.

Source: [[https://pricepertoken.com/leaderboards/benchmark/math|PricePerToken]]

^ Rank ^ Model ^ Score (%) ^
| 1 | Claude Opus 4.6 (Anthropic) | 95.6% |
| 2 | o4-mini High (OpenAI) | 94.6% |
| 3 | GLM-5 (Zhipu AI) | 94.0% |
| 4 | o3-mini (OpenAI) | 93.1% |
| 5 | Qwen3 30B A3B (Alibaba) | 93.0% |
| 6 | DeepSeek-R1 | 92.7% |
| 7 | QwQ 32B (Alibaba) | 92.1% |
| 8 | Grok 3 Beta (xAI) | 92.0% |
| 9 | Claude Opus 4 (Anthropic) | 91.2% |
| 10 | Gemini 2.0 Flash (Google) | 90.7% |

===== Tau2-bench (Multi-turn Customer Service) =====

Multi-turn conversations with tool use in customer service scenarios.

Source: [[https://awesomeagents.ai/leaderboards/agentic-ai-benchmarks-leaderboard/|Awesome Agents]]

^ Rank ^ Model ^ Telecom ^ Retail ^
| 1 | Claude Opus 4.6 (Anthropic) | 99.3% | 91.9% |
| 2 | Claude Sonnet 4.5 (Anthropic) | 98.1% | 89.4% |
| 3 | GPT-5 (OpenAI) | 96.7% | 87.2% |

===== Key Takeaways =====

  * **Anthropic Claude** models dominate most agentic benchmarks (SWE-bench, GAIA, Tau2-bench)
  * **Code generation** (HumanEval) is near-saturated -- the top 8 models all score 97%+
  * **Math reasoning** is led by Claude Opus 4.6 and OpenAI's o-series models
  * **Open-source** models (GLM-5, Qwen3, DeepSeek-R1) compete strongly at a fraction of the cost
  * **Function calling** (BFCL) -- the open-source GLM-4.5 narrowly beats proprietary models
  * Scores vary by evaluation harness -- always check methodology when comparing (see the sketch at the end of this page)

//Benchmarks are point-in-time snapshots. Check the linked sources for the most current data.//
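On that last caveat, the quickest sanity check is to read a benchmark's tasks yourself. A minimal sketch for SWE-bench Verified, assuming the HuggingFace ''datasets'' library and the public ''princeton-nlp/SWE-bench_Verified'' dataset (the field names follow the published SWE-bench schema; consult the dataset card if they have changed):

<code python>
from datasets import load_dataset  # pip install datasets

# SWE-bench Verified: 500 human-validated GitHub-issue tasks
ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
print(len(ds))

task = ds[0]
print(task["repo"], task["instance_id"])  # source repository and task id
print(task["problem_statement"][:300])    # the GitHub issue text
print(task["FAIL_TO_PASS"])               # tests a correct fix must make pass
</code>

Reading a handful of ''problem_statement'' fields makes it much easier to judge whether a headline score reflects the kind of work you care about.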