Humanity's Last Exam

Humanity's Last Exam (HLE) is the most challenging AI benchmark ever created, featuring 2,500 expert-level questions across more than 100 academic subjects. Designed by a global consortium of nearly 1,000 researchers including the Center for AI Safety (CAIS) and Scale AI, it tests the absolute frontier of AI reasoning where top models score between 30-46%.

Overview

As benchmarks like MMLU and GPQA became saturated – with frontier models scoring above 90% – HLE was created to provide a meaningful ceiling for AI evaluation. Every question was rigorously vetted: any question solvable by leading models at the time of creation was discarded, ensuring the benchmark sits just beyond current AI capabilities.

HLE consists of a public set of 2,500 questions and a hidden private test set of 500 questions to prevent overfitting. Evaluation uses zero-shot mode with strict exact-match grading and no chain-of-thought prompts allowed.
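Exact-match grading typically normalizes both the model's answer and the reference before comparing them. The helper below is an illustrative sketch; the specific normalization rules (lowercasing, whitespace collapsing, trailing-punctuation stripping) are assumptions, not HLE's official grader:

```python
import re

def normalize(text: str) -> str:
    """Illustrative normalization: lowercase, trim, collapse whitespace,
    strip trailing periods. (Assumed rules, not HLE's actual grader.)"""
    text = re.sub(r"\s+", " ", text.strip().lower())
    return text.rstrip(".")

def exact_match(response: str, answer: str) -> bool:
    return normalize(response) == normalize(answer)

print(exact_match("  The Riemann Hypothesis. ", "the riemann hypothesis"))  # True
print(exact_match("42", "42.0"))  # False: numeric equivalence is not handled
```

Note the second case: strict string matching treats "42" and "42.0" as different answers, which is part of why answer formats on such benchmarks must be tightly specified.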

Subjects Covered

HLE spans an extraordinary breadth of human knowledge, from mathematics and the natural sciences to the humanities and social sciences. The emphasis is on depth of reasoning, niche knowledge, and multi-step logical deduction rather than broad surface-level recall.

Results

Frontier models score dramatically below human expert performance (~90% within their own domains): even the best current model trails experts by roughly a factor of two, and weaker models by a factor of three or more:

Model                          Accuracy
Gemini 3.1 Pro Preview         45.9%
GPT-5.4 (xhigh)                41.6%
Gemini 3 Pro                   37.5%
Claude Opus 4.6 (Thinking)     34.4%
GPT-5 Pro                      31.6%
Grok 4                         24.5%

Scores have risen substantially from early 2025 baselines (GPT-4o at 2.7%, Claude 3.5 Sonnet at 4.1%, o1 at 8%), indicating rapid but still insufficient progress.
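The rate of progress and the remaining gap can be read directly off the scores above with a quick back-of-the-envelope calculation:

```python
# Scores taken from the table and baselines above, as fractions
expert = 0.90            # approximate human expert accuracy in-domain
early_2025 = 0.027       # GPT-4o baseline
best_current = 0.459     # Gemini 3.1 Pro Preview

improvement = best_current / early_2025
gap = expert / best_current
print(f"improvement over early 2025: {improvement:.0f}x")  # ~17x
print(f"remaining gap to experts:    {gap:.1f}x")          # ~2.0x
```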

Key Findings

HLE reveals several important patterns in AI capabilities, most notably that frontier models are not only inaccurate on expert-level questions but also poorly calibrated, reporting high confidence on answers they get wrong. The evaluation harness measures both:

# HLE evaluation structure (simplified)
import re

class HLEEvaluator:
    def __init__(self, questions, private_set_size=500):
        self.public_questions = questions[:2500]
        self.private_questions = questions[2500:2500 + private_set_size]

    def normalize(self, text):
        # Case-insensitive, whitespace-collapsed comparison
        return re.sub(r"\s+", " ", text.strip().lower())

    def evaluate(self, model):
        correct = 0
        for q in self.private_questions:
            # Zero-shot, no chain-of-thought
            response = model.generate(
                prompt=q.question,
                temperature=0,
                max_tokens=100,  # short factual answers only
            )
            # Strict exact-match grading
            if self.normalize(response) == self.normalize(q.answer):
                correct += 1

        accuracy = correct / len(self.private_questions)
        # Calibration: gap between the model's stated confidence and its accuracy
        calibration = self.compute_calibration_error(model)
        return {"accuracy": accuracy, "calibration_error": calibration}
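The calibration term above is commonly computed as an expected calibration error (ECE). Below is a minimal sketch under the assumption that the model reports a confidence in [0, 1] alongside each answer; the binning scheme and function name are illustrative, not HLE's exact implementation:

```python
def expected_calibration_error(confidences, corrects, n_bins=10):
    """ECE: average |confidence - accuracy| over confidence bins,
    weighted by bin size. confidences in [0, 1], corrects are booleans."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, corrects):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        ece += (len(bucket) / n) * abs(avg_conf - accuracy)
    return ece

# An overconfident model: high stated confidence, mostly wrong answers
ece = expected_calibration_error([0.95, 0.90, 0.92, 0.88],
                                 [False, False, True, False])
print(ece)  # high ECE indicates overconfidence
```

A perfectly calibrated model (e.g. 70% confidence answers that are right 70% of the time) would score near zero; HLE-era frontier models instead tend to score very high on this metric.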
