Benchmark Leaderboard

Current top scores across major AI benchmarks. Data sourced from official leaderboards and research trackers.

Last updated: March 25, 2026

SWE-bench Verified

Software engineering benchmark – resolving real GitHub issues from popular Python repos.

Rank	Agent / Model	Score (% Resolved)
1	Claude Opus 4.5 (Anthropic)	80.9%
2	MiniMax M2.5 (MiniMax, 230B)	80.2%
3	GPT-5.2 (OpenAI)	80.0%
4	Claude Sonnet 4.6 (Anthropic)	79.6%
5	Gemini 3 Flash (Google)	78.0%
6	GLM-5 (Zhipu AI, 744B)	77.8%
7	Kimi K2.5 (Moonshot AI, 1T)	76.8%
8	Seed 2.0 Pro (ByteDance)	76.5%
9	Claude Sonnet 4.5 (Anthropic)	75.2%
10	DeepSeek-R1 (DeepSeek)	74.0%

GAIA (General AI Assistants)

Multi-step real-world tasks requiring tool use, web browsing, and reasoning.

Source: HuggingFace GAIA Leaderboard, Awesome Agents

Rank	Agent / Model	Score (% Overall)
1	Claude Sonnet 4.5 (Anthropic)	74.6%
2	Claude Opus 4.5 (Anthropic)	72.1%
3	Claude Sonnet 4 (Anthropic)	69.8%
4	GPT-5 Mini (OpenAI)	44.8%
5	Claude 3.7 Sonnet Thinking	43.9%
6	Claude 3.7 Sonnet	43.9%
7	Gemini 2.5 Pro (Google)	33.3%
8	DeepSeek R1 0528	27.9%
9	Mistral Medium 3.1	23.3%
10	Tongyi DeepResearch 30B (Alibaba)	20.6%

Note: Scores vary significantly by evaluation harness. Awesome Agents reports higher scores using agentic scaffolding (Claude Sonnet 4.5 at 74.6%) vs. the LayerLens/PricePerToken tracker which measures base model capability.

BFCL V4 (Function Calling)

Berkeley Function Calling Leaderboard – accuracy of tool/function calling.

Source: Awesome Agents

Rank	Model	Score (%)
1	GLM-4.5 (Zhipu AI)	70.9%
2	Claude Opus 4.1 (Anthropic)	70.4%
3	Claude Sonnet 4 (Anthropic)	69.8%
4	GPT-5 (OpenAI)	68.5%
5	Gemini 2.5 Pro (Google)	67.2%

HumanEval (Code Generation)

Python code generation from function docstrings – 164 problems.

Source: PricePerToken, LLM Stats

Rank	Model	Score (% pass@1)
1	Claude Sonnet 4.5 Thinking (Anthropic)	97.6%
2	DeepSeek-R1	97.4%
3	Grok 4 (xAI)	97.0%
4	Claude Sonnet 4.5 (Anthropic)	97.0%
5	Gemini 3 Pro Preview (Google)	97.0%
6	Claude Opus 4.5 (Anthropic)	97.0%
7	Claude Opus 4.6 (Anthropic)	97.0%
8	GLM-5 (Zhipu AI)	97.0%
9	o4-mini High (OpenAI)	96.3%
10	Claude Sonnet 4 (Anthropic)	96.3%

MATH (Mathematical Problem Solving)

Algebra, geometry, number theory, and calculus competition problems.

Source: PricePerToken

Rank	Model	Score (%)
1	Claude Opus 4.6 (Anthropic)	95.6%
2	o4-mini High (OpenAI)	94.6%
3	GLM-5 (Zhipu AI)	94.0%
4	o3-mini (OpenAI)	93.1%
5	Qwen3 30B A3B (Alibaba)	93.0%
6	DeepSeek-R1	92.7%
7	QwQ 32B (Alibaba)	92.1%
8	Grok 3 Beta (xAI)	92.0%
9	Claude Opus 4 (Anthropic)	91.2%
10	Gemini 2.0 Flash (Google)	90.7%

Tau2-bench (Multi-turn Customer Service)

Multi-turn conversations with tool use in customer service scenarios.

Source: Awesome Agents

Rank	Model	Telecom	Retail
1	Claude Opus 4.6 (Anthropic)	99.3%	91.9%
2	Claude Sonnet 4.5 (Anthropic)	98.1%	89.4%
3	GPT-5 (OpenAI)	96.7%	87.2%

Key Takeaways

Anthropic Claude models dominate most agentic benchmarks (SWE-bench, GAIA, Tau2-bench)
Code generation (HumanEval) is near-saturated – top 8 models all score 97%+
Math reasoning is led by Claude Opus 4.6 and OpenAI o-series models
Open-source models (GLM-5, Qwen3, DeepSeek-R1) compete strongly at fraction of the cost
Function calling (BFCL) – open-source GLM-4.5 beats proprietary models
Scores vary by evaluation harness – always check methodology when comparing

Benchmarks are point-in-time snapshots. Check the linked sources for the most current data.

AI Agent Knowledge Base

Sidebar

Table of Contents

Benchmark Leaderboard

SWE-bench Verified

GAIA (General AI Assistants)

BFCL V4 (Function Calling)

HumanEval (Code Generation)

MATH (Mathematical Problem Solving)

Tau2-bench (Multi-turn Customer Service)

Key Takeaways

AI Agent Knowledge Base

User Tools

Site Tools

Sidebar

Table of Contents

Benchmark Leaderboard

SWE-bench Verified

GAIA (General AI Assistants)

BFCL V4 (Function Calling)

HumanEval (Code Generation)

MATH (Mathematical Problem Solving)

Tau2-bench (Multi-turn Customer Service)

Key Takeaways

Page Tools