Humanity's Last Exam

Humanity's Last Exam (HLE) is the most challenging AI benchmark ever created, featuring 2,500 expert-level questions across more than 100 academic subjects. Designed by a global consortium of nearly 1,000 researchers including the Center for AI Safety (CAIS) and Scale AI, it tests the absolute frontier of AI reasoning where top models score between 30-46%.

Overview

As benchmarks like MMLU and GPQA became saturated – with frontier models scoring above 90% – HLE was created to provide a meaningful ceiling for AI evaluation. Every question was rigorously vetted: any question solvable by leading models at the time of creation was discarded, ensuring the benchmark sits just beyond current AI capabilities.

HLE consists of a public set of 2,500 questions and a hidden private test set of 500 questions to prevent overfitting. Evaluation uses zero-shot mode with strict exact-match grading and no chain-of-thought prompts allowed.
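Exact-match grading typically normalizes both the model's answer and the reference before comparing them. The helper below is an illustrative sketch; the specific normalization rules (lowercasing, whitespace collapsing, trailing-punctuation stripping) are assumptions, not HLE's official grader:

```python
import re

def normalize(text: str) -> str:
    """Illustrative normalization: lowercase, trim, collapse whitespace,
    strip trailing periods. (Assumed rules, not HLE's actual grader.)"""
    text = re.sub(r"\s+", " ", text.strip().lower())
    return text.rstrip(".")

def exact_match(response: str, answer: str) -> bool:
    return normalize(response) == normalize(answer)

print(exact_match("  The Riemann Hypothesis. ", "the riemann hypothesis"))  # True
print(exact_match("42", "42.0"))  # False: numeric equivalence is not handled
```

Note the second case: strict string matching treats "42" and "42.0" as different answers, which is part of why answer formats on such benchmarks must be tightly specified.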

Subjects Covered

HLE spans an extraordinary breadth of human knowledge, from mathematics and the natural sciences to the humanities and social sciences. The emphasis is on depth of reasoning, niche knowledge, and multi-step logical deduction rather than broad surface-level recall.

Results

Frontier models score dramatically below human expert performance (~90% within their own domains): even the best current model trails experts by roughly a factor of two, and weaker models by a factor of three or more:

Model                          Accuracy
Gemini 3.1 Pro Preview         45.9%
GPT-5.4 (xhigh)                41.6%
Gemini 3 Pro                   37.5%
Claude Opus 4.6 (Thinking)     34.4%
GPT-5 Pro                      31.6%
Grok 4                         24.5%

Scores have risen substantially from early 2025 baselines (GPT-4o at 2.7%, Claude 3.5 Sonnet at 4.1%, o1 at 8%), indicating rapid but still insufficient progress.
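The rate of progress and the remaining gap can be read directly off the scores above with a quick back-of-the-envelope calculation:

```python
# Scores taken from the table and baselines above, as fractions
expert = 0.90            # approximate human expert accuracy in-domain
early_2025 = 0.027       # GPT-4o baseline
best_current = 0.459     # Gemini 3.1 Pro Preview

improvement = best_current / early_2025
gap = expert / best_current
print(f"improvement over early 2025: {improvement:.0f}x")  # ~17x
print(f"remaining gap to experts:    {gap:.1f}x")          # ~2.0x
```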

Key Findings

HLE reveals several important patterns in AI capabilities, most notably that frontier models are not only inaccurate on expert-level questions but also poorly calibrated, reporting high confidence on answers they get wrong. The evaluation harness measures both:

# HLE evaluation structure (simplified)
import re

class HLEEvaluator:
    def __init__(self, questions, private_set_size=500):
        self.public_questions = questions[:2500]
        self.private_questions = questions[2500:2500 + private_set_size]

    def normalize(self, text):
        # Case-insensitive, whitespace-collapsed comparison
        return re.sub(r"\s+", " ", text.strip().lower())

    def evaluate(self, model):
        correct = 0
        for q in self.private_questions:
            # Zero-shot, no chain-of-thought
            response = model.generate(
                prompt=q.question,
                temperature=0,
                max_tokens=100,  # short factual answers only
            )
            # Strict exact-match grading
            if self.normalize(response) == self.normalize(q.answer):
                correct += 1

        accuracy = correct / len(self.private_questions)
        # Calibration: gap between the model's stated confidence and its accuracy
        calibration = self.compute_calibration_error(model)
        return {"accuracy": accuracy, "calibration_error": calibration}
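The calibration term above is commonly computed as an expected calibration error (ECE). Below is a minimal sketch under the assumption that the model reports a confidence in [0, 1] alongside each answer; the binning scheme and function name are illustrative, not HLE's exact implementation:

```python
def expected_calibration_error(confidences, corrects, n_bins=10):
    """ECE: average |confidence - accuracy| over confidence bins,
    weighted by bin size. confidences in [0, 1], corrects are booleans."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, corrects):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        ece += (len(bucket) / n) * abs(avg_conf - accuracy)
    return ece

# An overconfident model: high stated confidence, mostly wrong answers
ece = expected_calibration_error([0.95, 0.90, 0.92, 0.88],
                                 [False, False, True, False])
print(ece)  # high ECE indicates overconfidence
```

A perfectly calibrated model (e.g. 70% confidence answers that are right 70% of the time) would score near zero; HLE-era frontier models instead tend to score very high on this metric.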
