Humanity's Last Exam

Humanity's Last Exam (HLE) is the most challenging AI benchmark ever created, featuring 2,500 expert-level questions across more than 100 academic subjects. Designed by a global consortium of nearly 1,000 researchers including the Center for AI Safety (CAIS) and Scale AI, it tests the absolute frontier of AI reasoning where top models score between 30-46%.

Overview

As benchmarks like MMLU and GPQA became saturated – with frontier models scoring above 90% – HLE was created to provide a meaningful ceiling for AI evaluation. Every question was rigorously vetted: any question solvable by leading models at the time of creation was discarded, ensuring the benchmark sits just beyond current AI capabilities.

HLE consists of a public set of 2,500 questions and a hidden private test set of 500 questions to prevent overfitting. Evaluation uses zero-shot mode with strict exact-match grading and no chain-of-thought prompts allowed.

Subjects Covered

HLE spans an extraordinary breadth and depth of human knowledge:

Mathematics - Advanced proofs, combinatorics, number theory
Natural Sciences - Physics, chemistry, biology at research level
Humanities - Medieval philology, ancient languages, philosophy
Specialized Domains - Advanced organic chemistry, conceptual physics, world-class competition mathematics
Cross-disciplinary - Questions requiring synthesis across multiple fields

The emphasis is on depth of reasoning, niche knowledge, and multi-step logical deduction rather than broad surface-level recall.

Results

Frontier models score dramatically below human expert performance (~90% in their domains), revealing a persistent 2.5x AI-human gap:

Model	Accuracy
Gemini 3.1 Pro Preview	45.9%
GPT-5.4 (xhigh)	41.6%
Gemini 3 Pro	37.5%
Claude Opus 4.6 (Thinking)	34.4%
GPT-5 Pro	31.6%
Grok 4	24.5%

Scores have risen substantially from early 2025 baselines (GPT-4o at 2.7%, Claude 3.5 Sonnet at 4.1%, o1 at 8%), indicating rapid but still insufficient progress.

Key Findings

HLE reveals several important patterns in AI capabilities:

High calibration error - Models are significantly overconfident. Gemini 3 Pro shows 57.2% calibration error, meaning it reports much higher confidence than its actual accuracy warrants.
Esoteric knowledge gaps - On highly specialized topics, AI performance approaches random guessing while human experts maintain 80-90% accuracy.
Multi-step reasoning failures - Questions requiring long chains of deduction remain disproportionately difficult.
Multi-modal challenges - Questions involving images, diagrams, or notation add significant difficulty.

# HLE evaluation structure (simplified)
class HLEEvaluator:
    def __init__(self, questions, private_set_size=500):
        self.public_questions = questions[:2500]
        self.private_questions = questions[2500:2500 + private_set_size]
 
    def evaluate(self, model):
        correct = 0
        for q in self.private_questions:
            # Zero-shot, no chain-of-thought
            response = model.generate(
                prompt=q.question,
                temperature=0,
                max_tokens=100  # Short factual answers only
            )
            # Strict exact-match grading
            if self.normalize(response) == self.normalize(q.answer):
                correct += 1
 
        accuracy = correct / len(self.private_questions)
        calibration = self.compute_calibration_error(model)
        return {"accuracy": accuracy, "calibration_error": calibration}

Design Principles

Adversarial filtering - Questions solvable by any leading model at creation time were removed
Expert authorship - Nearly 1,000 researchers contributed domain-specific questions
Private holdout - 500 hidden questions prevent benchmark contamination
Zero-shot evaluation - No prompt engineering or chain-of-thought allowed
Exact-match grading - Eliminates subjectivity in evaluation

AI Agent Knowledge Base

Sidebar

Table of Contents

Humanity's Last Exam

Overview

Subjects Covered

Results

Key Findings

Design Principles

References

See Also

AI Agent Knowledge Base

User Tools

Site Tools

Sidebar

Table of Contents

Humanity's Last Exam

Overview

Subjects Covered

Results

Key Findings

Design Principles

References

See Also

Page Tools