====== Humanity's Last Exam ======

Humanity's Last Exam (HLE) is among the most challenging AI benchmarks created to date, featuring **2,500 expert-level questions** across more than 100 academic subjects. Designed by a global consortium of nearly 1,000 researchers coordinated by the **Center for AI Safety (CAIS)** and **Scale AI**, it tests the frontier of AI reasoning, where top models score between roughly 30% and 46%.

===== Overview =====

As benchmarks like MMLU and GPQA became saturated -- with frontier models scoring above 90% -- HLE was created to provide a meaningful ceiling for AI evaluation. Every question was rigorously vetted: any question solvable by leading models at the time of creation was discarded, ensuring the benchmark sits just beyond current AI capabilities.

HLE consists of a public set of 2,500 questions and a hidden private test set of 500 questions to prevent overfitting. Evaluation uses **zero-shot mode** with strict exact-match grading and no chain-of-thought prompts allowed.

===== Subjects Covered =====

HLE spans an extraordinary breadth and depth of human knowledge:

  * **Mathematics** - Advanced proofs, combinatorics, number theory
  * **Natural Sciences** - Physics, chemistry, biology at research level
  * **Humanities** - Medieval philology, ancient languages, philosophy
  * **Specialized Domains** - Advanced organic chemistry, conceptual physics, world-class competition mathematics
  * **Cross-disciplinary** - Questions requiring synthesis across multiple fields

The emphasis is on **depth of reasoning**, niche knowledge, and multi-step logical deduction rather than broad surface-level recall.
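The strict exact-match grading described above can be sketched in a few lines. The specific normalization steps shown here (lowercasing, stripping punctuation, collapsing whitespace) are illustrative assumptions, not HLE's published procedure:

<code python>
import re

def normalize(answer):
    """Reduce an answer to a canonical form so trivially different
    spellings of the same answer compare equal."""
    answer = answer.strip().lower()
    answer = re.sub(r"[^\w\s]", "", answer)  # drop punctuation
    return " ".join(answer.split())          # collapse whitespace

def exact_match(response, gold):
    """Strict exact-match grading: normalized strings must be identical."""
    return normalize(response) == normalize(gold)

print(exact_match("  The answer is 42. ", "the answer is 42"))  # True
print(exact_match("approximately 42", "42"))                    # False
</code>

Note that under exact-match grading a response that is correct but phrased differently from the gold answer scores zero, which is why HLE questions are written to have short, unambiguous answers.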
===== Results =====

Frontier models score dramatically below human expert performance (~90% in their domains), revealing a persistent 2.5x AI-human gap:

^ Model ^ Accuracy ^
| **Gemini 3.1 Pro Preview** | 45.9% |
| **GPT-5.4 (xhigh)** | 41.6% |
| **Gemini 3 Pro** | 37.5% |
| **Claude Opus 4.6 (Thinking)** | 34.4% |
| **GPT-5 Pro** | 31.6% |
| **Grok 4** | 24.5% |

Scores have risen substantially from early 2025 baselines (GPT-4o at 2.7%, Claude 3.5 Sonnet at 4.1%, o1 at 8%), indicating rapid but still insufficient progress.

===== Key Findings =====

HLE reveals several important patterns in AI capabilities:

  * **High calibration error** - Models are significantly overconfident. Gemini 3 Pro shows 57.2% calibration error, meaning it reports much higher confidence than its actual accuracy warrants.
  * **Esoteric knowledge gaps** - On highly specialized topics, AI performance approaches random guessing while human experts maintain 80-90% accuracy.
  * **Multi-step reasoning failures** - Questions requiring long chains of deduction remain disproportionately difficult.
  * **Multi-modal challenges** - Questions involving images, diagrams, or notation add significant difficulty.
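The calibration-error figures above quantify the gap between a model's stated confidence and its actual accuracy. One common formulation is binned expected calibration error (ECE); this is a generic sketch of that metric, not necessarily the exact computation used by the HLE leaderboard:

<code python>
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned expected calibration error: the weighted mean absolute gap
    between stated confidence and observed accuracy within each bin.
    A well-calibrated model has ECE near 0."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / n) * abs(avg_conf - accuracy)
    return ece

# An overconfident model: claims ~90% confidence but is right half the time
confs = [0.9] * 10
hits = [True, False] * 5
print(round(expected_calibration_error(confs, hits), 2))  # 0.4
</code>

By this kind of measure, a calibration error of 57.2% means the model's reported confidence exceeds its real accuracy by well over half on average.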
A simplified sketch of the evaluation structure:

<code python>
# HLE evaluation structure (simplified)
class HLEEvaluator:
    def __init__(self, questions, private_set_size=500):
        self.public_questions = questions[:2500]
        self.private_questions = questions[2500:2500 + private_set_size]

    @staticmethod
    def normalize(text):
        # Case- and whitespace-insensitive form for exact matching
        return " ".join(text.lower().split())

    def compute_calibration_error(self, model):
        # Placeholder: would compare stated confidence to observed accuracy
        return None

    def evaluate(self, model):
        correct = 0
        for q in self.private_questions:
            # Zero-shot, no chain-of-thought
            response = model.generate(
                prompt=q.question,
                temperature=0,
                max_tokens=100,  # short factual answers only
            )
            # Strict exact-match grading
            if self.normalize(response) == self.normalize(q.answer):
                correct += 1
        accuracy = correct / len(self.private_questions)
        calibration = self.compute_calibration_error(model)
        return {"accuracy": accuracy, "calibration_error": calibration}
</code>

===== Design Principles =====

  * **Adversarial filtering** - Questions solvable by any leading model at creation time were removed
  * **Expert authorship** - Nearly 1,000 researchers contributed domain-specific questions
  * **Private holdout** - 500 hidden questions prevent benchmark contamination
  * **Zero-shot evaluation** - No prompt engineering or chain-of-thought allowed
  * **Exact-match grading** - Eliminates subjectivity in evaluation

===== References =====

  * [[https://agi.safe.ai|Center for AI Safety - HLE Overview and Results]]
  * [[https://scale.com/leaderboard/humanitys_last_exam_text_only_preview|Scale AI SEAL Leaderboard - HLE]]
  * [[https://artificialanalysis.ai/evaluations/humanitys-last-exam|Artificial Analysis - HLE Evaluation]]
  * [[https://en.wikipedia.org/wiki/Humanity%27s_Last_Exam|Wikipedia - Humanity's Last Exam]]
  * [[https://lastexam.ai|Official HLE Site]]

===== See Also =====

  * [[gaia_benchmark]] - General AI assistant benchmark testing real-world tool use
  * [[terminal_bench]] - CLI/DevOps agent benchmark from Stanford
  * [[computer_use_benchmark]] - GUI interaction benchmarks for agents
  * [[agent_simulation_environments]] - Environments for evaluating agent capabilities