Core Concepts
Reasoning Techniques
Memory Systems
Retrieval
Agent Types
Design Patterns
Training & Alignment
Frameworks
Tools & Products
Safety & Governance
Evaluation
Research
Development
Meta
Humanity's Last Exam (HLE) is one of the most challenging AI benchmarks ever created, featuring 2,500 expert-level questions across more than 100 academic subjects. Designed by a global consortium of nearly 1,000 subject experts, organized by the Center for AI Safety (CAIS) and Scale AI, it tests the frontier of AI reasoning, where top models score between 30% and 46%.
As benchmarks like MMLU and GPQA became saturated, with frontier models scoring above 90%, HLE was created to provide a meaningful ceiling for AI evaluation. Every question was rigorously vetted: any question solvable by leading models at the time of creation was discarded, ensuring the benchmark sits just beyond current AI capabilities.
HLE consists of a public set of 2,500 questions and a hidden private test set of 500 questions that guards against overfitting. Evaluation is zero-shot with strict exact-match grading; chain-of-thought prompting is not allowed.
HLE spans an extraordinary breadth and depth of human knowledge. The emphasis is on depth of reasoning, niche knowledge, and multi-step logical deduction rather than broad surface-level recall.
Frontier models score far below human expert performance (roughly 90% within their own domains), leaving a persistent AI-human gap of about 2.5x:
| Model | Accuracy |
|---|---|
| Gemini 3.1 Pro Preview | 45.9% |
| GPT-5.4 (xhigh) | 41.6% |
| Gemini 3 Pro | 37.5% |
| Claude Opus 4.6 (Thinking) | 34.4% |
| GPT-5 Pro | 31.6% |
| Grok 4 | 24.5% |
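As a sanity check, the gap can be computed directly from the leaderboard above, taking the ~90% in-domain expert accuracy cited in this article as the reference point (a minimal sketch; the exact expert figure is approximate):

```python
# Ratio of approximate human-expert accuracy (~90% in-domain) to each
# model's HLE score, using the leaderboard numbers above.
human_expert = 90.0

scores = {
    "Gemini 3.1 Pro Preview": 45.9,
    "GPT-5.4 (xhigh)": 41.6,
    "Gemini 3 Pro": 37.5,
    "Claude Opus 4.6 (Thinking)": 34.4,
    "GPT-5 Pro": 31.6,
    "Grok 4": 24.5,
}

gaps = {name: human_expert / score for name, score in scores.items()}
mean_gap = sum(gaps.values()) / len(gaps)

for name, gap in gaps.items():
    print(f"{name}: {gap:.1f}x below expert level")
print(f"Mean gap: {mean_gap:.1f}x")
```

Across these six models the mean ratio works out to roughly 2.6x, consistent with the "about 2.5x" gap described above; the best model sits near 2x, the weakest near 3.7x.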
Scores have risen substantially from early 2025 baselines (GPT-4o at 2.7%, Claude 3.5 Sonnet at 4.1%, o1 at 8%), indicating rapid but still insufficient progress.
HLE reveals several important patterns in AI capabilities. The evaluation flow itself is straightforward and can be sketched as follows:
```python
# HLE evaluation structure (simplified)
class HLEEvaluator:
    def __init__(self, questions, private_set_size=500):
        self.public_questions = questions[:2500]
        self.private_questions = questions[2500:2500 + private_set_size]

    def normalize(self, text):
        # Case- and whitespace-insensitive comparison for exact-match grading
        return " ".join(text.lower().split())

    def evaluate(self, model):
        correct = 0
        for q in self.private_questions:
            # Zero-shot, no chain-of-thought
            response = model.generate(
                prompt=q.question,
                temperature=0,
                max_tokens=100,  # Short factual answers only
            )
            # Strict exact-match grading
            if self.normalize(response) == self.normalize(q.answer):
                correct += 1
        accuracy = correct / len(self.private_questions)
        # HLE also reports how well a model's stated confidence
        # matches its actual accuracy (calibration)
        calibration = self.compute_calibration_error(model)
        return {"accuracy": accuracy, "calibration_error": calibration}
```
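The calibration step above is not fleshed out in the simplified sketch. HLE asks models to state a confidence alongside each answer and reports a calibration error measuring how far stated confidence drifts from actual accuracy. One common formulation is expected calibration error (ECE) over confidence bins; the function below is an illustrative sketch under that assumption, not HLE's exact implementation:

```python
# Illustrative expected-calibration-error (ECE) computation.
# `results` pairs each question's stated confidence (0..1) with correctness.
def expected_calibration_error(results, n_bins=10):
    """results: list of (confidence, is_correct) tuples."""
    bins = [[] for _ in range(n_bins)]
    for conf, correct in results:
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0
        bins[idx].append((conf, correct))

    total = len(results)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        # Weight each bin's |confidence - accuracy| by its share of questions
        ece += (len(b) / total) * abs(avg_conf - accuracy)
    return ece

# A well-calibrated sample: 50% confident, right half the time
sample = [(0.5, True)] * 5 + [(0.5, False)] * 5
print(expected_calibration_error(sample))
```

A model that is always 100% confident but frequently wrong scores a large ECE, which is exactly the overconfidence pattern early HLE results highlighted.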