Current top scores across major AI benchmarks. Data sourced from official leaderboards and research trackers.
Last updated: March 25, 2026
SWE-bench – software engineering benchmark measuring the ability to resolve real GitHub issues drawn from popular Python repositories.
Source: swebench.com, llm-stats.com
| Rank | Agent / Model | Score (% Resolved) |
|---|---|---|
| 1 | Claude Opus 4.5 (Anthropic) | 80.9% |
| 2 | MiniMax M2.5 (MiniMax, 230B) | 80.2% |
| 3 | GPT-5.2 (OpenAI) | 80.0% |
| 4 | Claude Sonnet 4.6 (Anthropic) | 79.6% |
| 5 | Gemini 3 Flash (Google) | 78.0% |
| 6 | GLM-5 (Zhipu AI, 744B) | 77.8% |
| 7 | Kimi K2.5 (Moonshot AI, 1T) | 76.8% |
| 8 | Seed 2.0 Pro (ByteDance) | 76.5% |
| 9 | Claude Sonnet 4.5 (Anthropic) | 75.2% |
| 10 | DeepSeek-R1 (DeepSeek) | 74.0% |
GAIA – multi-step, real-world tasks requiring tool use, web browsing, and reasoning.
Source: HuggingFace GAIA Leaderboard, Awesome Agents
| Rank | Agent / Model | Score (% Overall) |
|---|---|---|
| 1 | Claude Sonnet 4.5 (Anthropic) | 74.6% |
| 2 | Claude Opus 4.5 (Anthropic) | 72.1% |
| 3 | Claude Sonnet 4 (Anthropic) | 69.8% |
| 4 | GPT-5 Mini (OpenAI) | 44.8% |
| 5 | Claude 3.7 Sonnet Thinking | 43.9% |
| 6 | Claude 3.7 Sonnet | 43.9% |
| 7 | Gemini 2.5 Pro (Google) | 33.3% |
| 8 | DeepSeek R1 0528 | 27.9% |
| 9 | Mistral Medium 3.1 | 23.3% |
| 10 | Tongyi DeepResearch 30B (Alibaba) | 20.6% |
Note: Scores vary significantly by evaluation harness: Awesome Agents reports higher scores using agentic scaffolding (Claude Sonnet 4.5 at 74.6%), while the LayerLens/PricePerToken tracker measures base-model capability without scaffolding.
Berkeley Function Calling Leaderboard – accuracy of tool/function calling.
Source: Awesome Agents
| Rank | Model | Score (%) |
|---|---|---|
| 1 | GLM-4.5 (Zhipu AI) | 70.9% |
| 2 | Claude Opus 4.1 (Anthropic) | 70.4% |
| 3 | Claude Sonnet 4 (Anthropic) | 69.8% |
| 4 | GPT-5 (OpenAI) | 68.5% |
| 5 | Gemini 2.5 Pro (Google) | 67.2% |
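BFCL scores a model's emitted tool calls against reference calls (its real harness uses AST-based matching across several categories). As a rough illustration of what "function-calling accuracy" means, here is a toy scorer; the helper name `call_matches` and the example tool schema are hypothetical, not part of BFCL:

```python
import json

def call_matches(pred: dict, gold: dict) -> bool:
    """Toy scorer: a predicted tool call counts as correct when the
    function name matches and every gold argument is reproduced exactly.
    (BFCL's real evaluation is more nuanced, e.g. optional parameters.)"""
    return (pred.get("name") == gold["name"]
            and all(pred.get("arguments", {}).get(k) == v
                    for k, v in gold["arguments"].items()))

# Hypothetical example: model output parsed from JSON vs. the reference call.
gold = {"name": "get_weather", "arguments": {"city": "Berlin", "unit": "celsius"}}
pred = json.loads('{"name": "get_weather", '
                  '"arguments": {"city": "Berlin", "unit": "celsius"}}')
print(call_matches(pred, gold))  # True
```

Leaderboard accuracy is then simply the fraction of test prompts whose predicted call matches the reference.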
HumanEval – Python code generation from function docstrings (164 problems).
Source: PricePerToken, LLM Stats
| Rank | Model | Score (% pass@1) |
|---|---|---|
| 1 | Claude Sonnet 4.5 Thinking (Anthropic) | 97.6% |
| 2 | DeepSeek-R1 | 97.4% |
| 3 | Grok 4 (xAI) | 97.0% |
| 4 | Claude Sonnet 4.5 (Anthropic) | 97.0% |
| 5 | Gemini 3 Pro Preview (Google) | 97.0% |
| 6 | Claude Opus 4.5 (Anthropic) | 97.0% |
| 7 | Claude Opus 4.6 (Anthropic) | 97.0% |
| 8 | GLM-5 (Zhipu AI) | 97.0% |
| 9 | o4-mini High (OpenAI) | 96.3% |
| 10 | Claude Sonnet 4 (Anthropic) | 96.3% |
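The pass@1 metric above comes from the standard unbiased pass@k estimator introduced with HumanEval: given n generated samples per problem, of which c pass the unit tests, pass@k = 1 − C(n−c, k)/C(n, k), which at k = 1 reduces to c/n. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn (without replacement) from n generations, c of which
    are correct, passes the tests.  pass@k = 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer incorrect samples than k: some draw must include a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 10 samples and 7 passing, pass@1 is just the fraction correct: 0.7
print(pass_at_k(10, 7, 1))  # 0.7
```

Per-problem estimates are averaged over the 164 problems to produce the leaderboard number.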
Competition mathematics – algebra, geometry, number theory, and calculus problems.
Source: PricePerToken
| Rank | Model | Score (%) |
|---|---|---|
| 1 | Claude Opus 4.6 (Anthropic) | 95.6% |
| 2 | o4-mini High (OpenAI) | 94.6% |
| 3 | GLM-5 (Zhipu AI) | 94.0% |
| 4 | o3-mini (OpenAI) | 93.1% |
| 5 | Qwen3 30B A3B (Alibaba) | 93.0% |
| 6 | DeepSeek-R1 | 92.7% |
| 7 | QwQ 32B (Alibaba) | 92.1% |
| 8 | Grok 3 Beta (xAI) | 92.0% |
| 9 | Claude Opus 4 (Anthropic) | 91.2% |
| 10 | Gemini 2.0 Flash (Google) | 90.7% |
Multi-turn customer-service conversations requiring tool use, scored separately on Telecom and Retail domains.
Source: Awesome Agents
| Rank | Model | Telecom (%) | Retail (%) |
|---|---|---|---|
| 1 | Claude Opus 4.6 (Anthropic) | 99.3% | 91.9% |
| 2 | Claude Sonnet 4.5 (Anthropic) | 98.1% | 89.4% |
| 3 | GPT-5 (OpenAI) | 96.7% | 87.2% |
Benchmarks are point-in-time snapshots. Check the linked sources for the most current data.