====== Humanity's Last Exam ======

Humanity's Last Exam (HLE) is among the most challenging AI benchmarks created to date, featuring **2,500 expert-level questions** across more than 100 academic subjects. Designed by a global consortium of nearly 1,000 researchers coordinated by the **Center for AI Safety (CAIS)** and **Scale AI**, it tests the frontier of AI reasoning, where top models score between roughly 30% and 46%.

===== Overview =====

As benchmarks like MMLU and GPQA became saturated -- with frontier models scoring above 90% -- HLE was created to provide a meaningful ceiling for AI evaluation. Every question was rigorously vetted: any question solvable by leading models at the time of creation was discarded, ensuring the benchmark sits just beyond current AI capabilities.

HLE consists of a public set of 2,500 questions and a hidden private test set of 500 questions to prevent overfitting. Evaluation uses **zero-shot mode** with strict exact-match grading and no chain-of-thought prompts allowed.

===== Subjects Covered =====

HLE spans an extraordinary breadth and depth of human knowledge:

  * **Mathematics** - Advanced proofs, combinatorics, number theory
  * **Natural Sciences** - Physics, chemistry, biology at research level
  * **Humanities** - Medieval philology, ancient languages, philosophy
  * **Specialized Domains** - Advanced organic chemistry, conceptual physics, world-class competition mathematics
  * **Cross-disciplinary** - Questions requiring synthesis across multiple fields

The emphasis is on **depth of reasoning**, niche knowledge, and multi-step logical deduction rather than broad surface-level recall.
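The strict exact-match grading described above can be sketched in a few lines. The specific normalization steps shown here (lowercasing, stripping punctuation, collapsing whitespace) are illustrative assumptions, not HLE's published procedure:

<code python>
import re

def normalize(answer):
    """Reduce an answer to a canonical form so trivially different
    spellings of the same answer compare equal."""
    answer = answer.strip().lower()
    answer = re.sub(r"[^\w\s]", "", answer)  # drop punctuation
    return " ".join(answer.split())          # collapse whitespace

def exact_match(response, gold):
    """Strict exact-match grading: normalized strings must be identical."""
    return normalize(response) == normalize(gold)

print(exact_match("  The answer is 42. ", "the answer is 42"))  # True
print(exact_match("approximately 42", "42"))                    # False
</code>

Note that under exact-match grading a response that is correct but phrased differently from the gold answer scores zero, which is why HLE questions are written to have short, unambiguous answers.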
===== Results =====

Frontier models score dramatically below human expert performance (~90% in their domains), revealing a persistent 2.5x AI-human gap:

^ Model ^ Accuracy ^
| **Gemini 3.1 Pro Preview** | 45.9% |
| **GPT-5.4 (xhigh)** | 41.6% |
| **Gemini 3 Pro** | 37.5% |
| **Claude Opus 4.6 (Thinking)** | 34.4% |
| **GPT-5 Pro** | 31.6% |
| **Grok 4** | 24.5% |

Scores have risen substantially from early 2025 baselines (GPT-4o at 2.7%, Claude 3.5 Sonnet at 4.1%, o1 at 8%), indicating rapid but still insufficient progress.

===== Key Findings =====

HLE reveals several important patterns in AI capabilities:

  * **High calibration error** - Models are significantly overconfident. Gemini 3 Pro shows 57.2% calibration error, meaning it reports much higher confidence than its actual accuracy warrants.
  * **Esoteric knowledge gaps** - On highly specialized topics, AI performance approaches random guessing while human experts maintain 80-90% accuracy.
  * **Multi-step reasoning failures** - Questions requiring long chains of deduction remain disproportionately difficult.
  * **Multi-modal challenges** - Questions involving images, diagrams, or notation add significant difficulty.
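The calibration-error figures above quantify the gap between a model's stated confidence and its actual accuracy. One common formulation is binned expected calibration error (ECE); this is a generic sketch of that metric, not necessarily the exact computation used by the HLE leaderboard:

<code python>
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned expected calibration error: the weighted mean absolute gap
    between stated confidence and observed accuracy within each bin.
    A well-calibrated model has ECE near 0."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / n) * abs(avg_conf - accuracy)
    return ece

# An overconfident model: claims ~90% confidence but is right half the time
confs = [0.9] * 10
hits = [True, False] * 5
print(round(expected_calibration_error(confs, hits), 2))  # 0.4
</code>

By this kind of measure, a calibration error of 57.2% means the model's reported confidence exceeds its real accuracy by well over half on average.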
A simplified sketch of the evaluation structure:

<code python>
# HLE evaluation structure (simplified)
class HLEEvaluator:
    def __init__(self, questions, private_set_size=500):
        self.public_questions = questions[:2500]
        self.private_questions = questions[2500:2500 + private_set_size]

    @staticmethod
    def normalize(text):
        # Case- and whitespace-insensitive form for exact matching
        return " ".join(text.lower().split())

    def compute_calibration_error(self, model):
        # Placeholder: would compare stated confidence to observed accuracy
        return None

    def evaluate(self, model):
        correct = 0
        for q in self.private_questions:
            # Zero-shot, no chain-of-thought
            response = model.generate(
                prompt=q.question,
                temperature=0,
                max_tokens=100,  # short factual answers only
            )
            # Strict exact-match grading
            if self.normalize(response) == self.normalize(q.answer):
                correct += 1
        accuracy = correct / len(self.private_questions)
        calibration = self.compute_calibration_error(model)
        return {"accuracy": accuracy, "calibration_error": calibration}
</code>

===== Design Principles =====

  * **Adversarial filtering** - Questions solvable by any leading model at creation time were removed
  * **Expert authorship** - Nearly 1,000 researchers contributed domain-specific questions
  * **Private holdout** - 500 hidden questions prevent benchmark contamination
  * **Zero-shot evaluation** - No prompt engineering or chain-of-thought allowed
  * **Exact-match grading** - Eliminates subjectivity in evaluation

===== References =====

  * [[https://agi.safe.ai|Center for AI Safety - HLE Overview and Results]]
  * [[https://scale.com/leaderboard/humanitys_last_exam_text_only_preview|Scale AI SEAL Leaderboard - HLE]]
  * [[https://artificialanalysis.ai/evaluations/humanitys-last-exam|Artificial Analysis - HLE Evaluation]]
  * [[https://en.wikipedia.org/wiki/Humanity%27s_Last_Exam|Wikipedia - Humanity's Last Exam]]
  * [[https://lastexam.ai|Official HLE Site]]

===== See Also =====

  * [[gaia_benchmark]] - General AI assistant benchmark testing real-world tool use
  * [[terminal_bench]] - CLI/DevOps agent benchmark from Stanford
  * [[computer_use_benchmark]] - GUI interaction benchmarks for agents
  * [[agent_simulation_environments]] - Environments for evaluating agent capabilities