====== Benchmark Leaderboard ======

Current top scores across major AI benchmarks. Data sourced from official leaderboards and research trackers.

**Last updated:** March 25, 2026

===== SWE-bench Verified =====

Software engineering benchmark -- resolving real GitHub issues from popular Python repos.

Source: [[https://www.swebench.com|swebench.com]], [[https://llm-stats.com/benchmarks/swe-bench-verified|llm-stats.com]]

^ Rank ^ Agent / Model ^ Score (% Resolved) ^
| 1 | Claude Opus 4.5 (Anthropic) | 80.9% |
| 2 | MiniMax M2.5 (MiniMax, 230B) | 80.2% |
| 3 | GPT-5.2 (OpenAI) | 80.0% |
| 4 | Claude Sonnet 4.6 (Anthropic) | 79.6% |
| 5 | Gemini 3 Flash (Google) | 78.0% |
| 6 | GLM-5 (Zhipu AI, 744B) | 77.8% |
| 7 | Kimi K2.5 (Moonshot AI, 1T) | 76.8% |
| 8 | Seed 2.0 Pro (ByteDance) | 76.5% |
| 9 | Claude Sonnet 4.5 (Anthropic) | 75.2% |
| 10 | DeepSeek-R1 (DeepSeek) | 74.0% |

===== GAIA (General AI Assistants) =====

Multi-step real-world tasks requiring tool use, web browsing, and reasoning.

Source: [[https://huggingface.co/spaces/gaia-benchmark/leaderboard|HuggingFace GAIA Leaderboard]], [[https://awesomeagents.ai/leaderboards/agentic-ai-benchmarks-leaderboard/|Awesome Agents]]

^ Rank ^ Agent / Model ^ Score (% Overall) ^
| 1 | Claude Sonnet 4.5 (Anthropic) | 74.6% |
| 2 | Claude Opus 4.5 (Anthropic) | 72.1% |
| 3 | Claude Sonnet 4 (Anthropic) | 69.8% |
| 4 | GPT-5 Mini (OpenAI) | 44.8% |
| 5 | Claude 3.7 Sonnet Thinking | 43.9% |
| 6 | Claude 3.7 Sonnet | 43.9% |
| 7 | Gemini 2.5 Pro (Google) | 33.3% |
| 8 | DeepSeek R1 0528 | 27.9% |
| 9 | Mistral Medium 3.1 | 23.3% |
| 10 | Tongyi DeepResearch 30B (Alibaba) | 20.6% |

//Note: Scores vary significantly by evaluation harness. Awesome Agents reports higher scores using agentic scaffolding (Claude Sonnet 4.5 at 74.6%) vs. the LayerLens/PricePerToken tracker, which measures base model capability.//

===== BFCL V4 (Function Calling) =====

Berkeley Function Calling Leaderboard -- accuracy of tool/function calling.

Source: [[https://awesomeagents.ai/leaderboards/agentic-ai-benchmarks-leaderboard/|Awesome Agents]]

^ Rank ^ Model ^ Score (%) ^
| 1 | GLM-4.5 (Zhipu AI) | 70.9% |
| 2 | Claude Opus 4.1 (Anthropic) | 70.4% |
| 3 | Claude Sonnet 4 (Anthropic) | 69.8% |
| 4 | GPT-5 (OpenAI) | 68.5% |
| 5 | Gemini 2.5 Pro (Google) | 67.2% |

===== HumanEval (Code Generation) =====

Python code generation from function docstrings -- 164 problems.

Source: [[https://pricepertoken.com/leaderboards/benchmark/humaneval|PricePerToken]], [[https://llm-stats.com/benchmarks/humaneval|LLM Stats]]

^ Rank ^ Model ^ Score (% pass@1) ^
| 1 | Claude Sonnet 4.5 Thinking (Anthropic) | 97.6% |
| 2 | DeepSeek-R1 | 97.4% |
| 3 | Grok 4 (xAI) | 97.0% |
| 4 | Claude Sonnet 4.5 (Anthropic) | 97.0% |
| 5 | Gemini 3 Pro Preview (Google) | 97.0% |
| 6 | Claude Opus 4.5 (Anthropic) | 97.0% |
| 7 | Claude Opus 4.6 (Anthropic) | 97.0% |
| 8 | GLM-5 (Zhipu AI) | 97.0% |
| 9 | o4-mini High (OpenAI) | 96.3% |
| 10 | Claude Sonnet 4 (Anthropic) | 96.3% |
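The ''pass@1'' column is the standard HumanEval metric: the estimated probability that a single sampled completion passes the problem's held-out unit tests. Leaderboards commonly compute it with the unbiased pass@k estimator from the original Codex paper (Chen et al., 2021); a minimal sketch:

<code python>
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total completions sampled for a problem
    c: completions that pass the unit tests
    k: the k in pass@k
    """
    if n - c < k:
        return 1.0  # every size-k subset contains a passing sample
    # 1 - C(n-c, k) / C(n, k), computed as a numerically stable product
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# For k=1 the estimator reduces to the plain pass rate c/n:
print(pass_at_k(n=200, c=130, k=1))  # 0.65
</code>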
===== MATH (Mathematical Problem Solving) =====

Algebra, geometry, number theory, and calculus competition problems.

Source: [[https://pricepertoken.com/leaderboards/benchmark/math|PricePerToken]]

^ Rank ^ Model ^ Score (%) ^
| 1 | Claude Opus 4.6 (Anthropic) | 95.6% |
| 2 | o4-mini High (OpenAI) | 94.6% |
| 3 | GLM-5 (Zhipu AI) | 94.0% |
| 4 | o3-mini (OpenAI) | 93.1% |
| 5 | Qwen3 30B A3B (Alibaba) | 93.0% |
| 6 | DeepSeek-R1 | 92.7% |
| 7 | QwQ 32B (Alibaba) | 92.1% |
| 8 | Grok 3 Beta (xAI) | 92.0% |
| 9 | Claude Opus 4 (Anthropic) | 91.2% |
| 10 | Gemini 2.0 Flash (Google) | 90.7% |

===== Tau2-bench (Multi-turn Customer Service) =====

Multi-turn conversations with tool use in customer service scenarios.

Source: [[https://awesomeagents.ai/leaderboards/agentic-ai-benchmarks-leaderboard/|Awesome Agents]]

^ Rank ^ Model ^ Telecom ^ Retail ^
| 1 | Claude Opus 4.6 (Anthropic) | 99.3% | 91.9% |
| 2 | Claude Sonnet 4.5 (Anthropic) | 98.1% | 89.4% |
| 3 | GPT-5 (OpenAI) | 96.7% | 87.2% |

===== Key Takeaways =====

  * **Anthropic Claude** models dominate most agentic benchmarks (SWE-bench, GAIA, Tau2-bench)
  * **Code generation** (HumanEval) is near-saturated -- the top 8 models all score 97%+
  * **Math reasoning** is led by Claude Opus 4.6 and OpenAI's o-series models
  * **Open-source** models (GLM-5, Qwen3, DeepSeek-R1) compete strongly at a fraction of the cost
  * **Function calling** (BFCL) -- the open-source GLM-4.5 narrowly beats proprietary models
  * Scores vary by evaluation harness -- always check methodology when comparing (see the sketch at the end of this page)

//Benchmarks are point-in-time snapshots. Check the linked sources for the most current data.//
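On that last caveat, the quickest sanity check is to read a benchmark's tasks yourself. A minimal sketch for SWE-bench Verified, assuming the HuggingFace ''datasets'' library and the public ''princeton-nlp/SWE-bench_Verified'' dataset (the field names follow the published SWE-bench schema; consult the dataset card if they have changed):

<code python>
from datasets import load_dataset  # pip install datasets

# SWE-bench Verified: 500 human-validated GitHub-issue tasks
ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
print(len(ds))

task = ds[0]
print(task["repo"], task["instance_id"])  # source repository and task id
print(task["problem_statement"][:300])    # the GitHub issue text
print(task["FAIL_TO_PASS"])               # tests a correct fix must make pass
</code>

Reading a handful of ''problem_statement'' fields makes it much easier to judge whether a headline score reflects the kind of work you care about.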