Wiki · Concept · Last reviewed May 20, 2026

Humanity's Last Exam

Humanity's Last Exam, often abbreviated HLE, is a multimodal benchmark of expert-level academic questions created by the Center for AI Safety, Scale AI, and a large contributor consortium. It was designed to test frontier AI systems after earlier public benchmarks such as MMLU became too easy to distinguish among state-of-the-art models.

Definition

Humanity's Last Exam is an expert-level, closed-ended AI benchmark covering dozens of academic subjects. The Nature article and arXiv record describe HLE as a set of 2,500 questions across mathematics, humanities, natural sciences, and other fields, with questions written so that answers are unambiguous and suitable for automated grading.

The benchmark is both text-based and multimodal. Questions may be exact-match or multiple-choice, and some require interpreting an image, figure, or diagram alongside text. The goal is not to test ordinary trivia. HLE targets problems that require specialized knowledge, expert reasoning, or deep familiarity with a field.

Origin

HLE was released in January 2025 by the Center for AI Safety and Scale AI. The listed authors and organizers include Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Chen Bo Calvin Zhang, Mohamed Shaaban, John Ling, Sean Shi, Michael Choi, Anish Agrawal, Arnav Chopra, Ryan Kim, Adam Khoja, Richard Ren, Jason Hausenloy, Oliver Zhang, Mantas Mazeika, Summer Yue, Alexandr Wang, and Dan Hendrycks, together with a large contributor consortium.

The project responded to benchmark saturation. The HLE paper notes that frontier systems had exceeded 90 percent accuracy on popular benchmarks such as MMLU, making those tests less useful for measuring advanced model capability. HLE was built as a harder reference point for expert-level academic performance.

The benchmark was later described in a Nature article, published January 28, 2026, under the title "A benchmark of expert-level academic questions to assess AI capabilities." The official HLE site records the Nature publication and earlier dataset updates, including finalization of the 2,500-question set in April 2025.

Design

HLE questions were contributed by subject-matter experts from many institutions and countries. The Nature article describes nearly 1,000 expert contributors affiliated with more than 500 institutions across 50 countries. Epoch AI summarizes the benchmark as covering over 100 subjects, including mathematics, physics, chemistry, biology, medicine, computer science, humanities, and social sciences.

The curation process filtered for difficulty and quality. Questions were first tested against frontier models, then reviewed by human experts. The official site also describes a bug bounty and the later replacement of flagged or searchable questions before the finalized April 2025 version.

HLE's closed-ended design is deliberate. Exact-match and multiple-choice questions make automated scoring possible, which supports repeated model comparison. The tradeoff is that the benchmark tests a narrower form of academic question answering than open-ended research, experimental design, tool use, or long-horizon scientific work.
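To make the point about automated scoring concrete, here is a minimal Python sketch of how closed-ended answers can be graded without a human in the loop. It is an illustration only, not the HLE grading pipeline: the function names, the item format, and the normalization rules are all assumptions made for this example, and a real harness may extract and compare answers differently.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Unicode-normalize (NFKC), trim, lowercase, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).strip().lower()
    return re.sub(r"\s+", " ", text)


def grade_exact_match(prediction: str, reference: str) -> bool:
    """Hypothetical exact-match rule: equal after normalization."""
    return normalize(prediction) == normalize(reference)


def grade_multiple_choice(prediction: str, correct_letter: str) -> bool:
    """Hypothetical rule: accept only a bare option letter such as B, (b), or B."""
    match = re.fullmatch(r"\(?([A-Za-z])\)?\.?", prediction.strip())
    return match is not None and match.group(1).upper() == correct_letter.upper()


def accuracy(items: list[dict]) -> float:
    """Fraction of items graded correct. The item keys are assumed for this sketch."""
    if not items:
        raise ValueError("no items to grade")
    correct = 0
    for item in items:
        if item["type"] == "multiple_choice":
            ok = grade_multiple_choice(item["prediction"], item["answer"])
        else:
            ok = grade_exact_match(item["prediction"], item["answer"])
        correct += int(ok)
    return correct / len(items)


# Made-up items, used only to exercise the functions above.
sample = [
    {"type": "exact_match", "prediction": "  Paris ", "answer": "paris"},
    {"type": "multiple_choice", "prediction": "(b)", "answer": "B"},
    {"type": "multiple_choice", "prediction": "A likely answer is C", "answer": "C"},
]
score = accuracy(sample)
```

The third sample item is marked wrong even though it names the right option, which shows the tradeoff in miniature: strict rules keep scoring repeatable, but they also make results sensitive to answer formatting.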

Public Role

HLE quickly became part of frontier model-release culture. Once MMLU, GPQA, AIME, and other public tests had become common reporting targets, model developers, benchmark dashboards, and commentators began using HLE as shorthand for whether a system can handle difficult expert-level questions.

The benchmark is especially useful as a saturation warning. A low HLE score says that a model still struggles at the expert frontier. A rising HLE score can indicate progress, but it should be read with the protocol attached: model version, prompting, tool access, number of attempts, calibration, scoring method, and contamination controls all matter.

The name "Humanity's Last Exam" also shaped its public meaning. It made the benchmark memorable, but it can invite overinterpretation. Passing or saturating HLE would not mean that AI has mastered all human knowledge, achieved general intelligence, become safe, or replaced the institutions that validate expert claims.

HLE-Rolling

In October 2025, the HLE team announced HLE-Rolling, a dynamic fork intended to keep the benchmark fresher as public benchmark items become exposed. This reflects a central lesson of modern evaluation: once a benchmark matters, it becomes part of the training, tuning, marketing, and discourse environment it was meant to measure.

A rolling benchmark can reduce staleness, but it also creates governance questions. Contributors, maintainers, model developers, and outside evaluators need to know how new questions are selected, how errors are handled, how hidden and public splits are managed, and how leaderboard results remain comparable over time.

Limits

HLE is hard, broad, and useful, but it is still a benchmark. It does not test whether a model can run a lab, maintain a long research program, notice when a field's assumptions are wrong, communicate uncertainty to non-experts, or act safely inside a real institution.

Closed-ended questions can reward answer memorization, benchmark-specific tuning, multiple-try strategies, or brittle pattern recognition. Public questions can leak into training data or post-training examples. Even expert-authored questions can contain ambiguous wording, disputed answers, hidden assumptions, or domain-specific errors.
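The effect of multiple-try strategies can be illustrated with a small piece of arithmetic. The sketch below assumes an idealized setting that is not taken from the HLE paper: each attempt succeeds independently with the same probability, and a question counts as solved if any attempt is correct. Real attempts are correlated, so this is an upper-bound style illustration, and the numbers are hypothetical.

```python
def best_of_k_accuracy(single_try_accuracy: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent tries is correct.

    Assumes independent attempts with identical success probability,
    which is an idealization made for this example.
    """
    if not 0.0 <= single_try_accuracy <= 1.0:
        raise ValueError("single_try_accuracy must be between 0 and 1")
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    return 1.0 - (1.0 - single_try_accuracy) ** attempts


# Hypothetical model with 10 percent single-try accuracy.
one_try = best_of_k_accuracy(0.10, 1)    # 0.10
five_tries = best_of_k_accuracy(0.10, 5)  # about 0.41
```

Under these assumptions, the same model appears roughly four times stronger when five attempts are allowed, which is why the number of attempts belongs next to any reported score.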

Calibration is a major concern. The Nature article reports that models can give wrong answers with high confidence on HLE. That matters because the practical danger is often not only failure, but fluent and confident failure in domains where ordinary users cannot verify the answer.
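One common way to quantify this gap between confidence and correctness is a binned calibration error. The Python sketch below groups answers by stated confidence and measures how far average confidence sits from actual accuracy in each bin. The function name, the bin count, and the equal-width binning are assumptions for illustration; the exact metric and procedure used for HLE may differ.

```python
import math


def rms_calibration_error(confidences, correct, n_bins=10):
    """Binned root-mean-square gap between stated confidence and accuracy.

    confidences: floats in [0, 1], one per answered question
    correct:     booleans, True where the answer was right
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and the same length")

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    total = len(confidences)
    weighted_squared_gap = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        acc = sum(1 for _, ok in members if ok) / len(members)
        weighted_squared_gap += (len(members) / total) * (mean_conf - acc) ** 2
    return math.sqrt(weighted_squared_gap)


# Hypothetical model: 90 percent stated confidence, one answer in four correct.
overconfident = rms_calibration_error(
    [0.9, 0.9, 0.9, 0.9], [True, False, False, False]
)
```

In this made-up case the error is 0.65, the gap between 90 percent stated confidence and 25 percent accuracy. A well-calibrated system would score close to zero even if its accuracy were low.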

Governance Significance

HLE matters for governance because expert knowledge is one of the places where AI capability can outpace ordinary oversight. A system that performs well on expert questions could be useful for research, medicine, engineering, education, and policy analysis. The same capability can also intensify dependence on systems that users cannot independently audit.

Responsible use of HLE should keep it inside a larger evaluation package: domain expert review, open-ended tasks, dangerous-capability evaluations, hallucination and calibration testing, tool-use protocols, benchmark contamination analysis, model cards, system cards, and post-deployment monitoring.

For public communication, HLE should be reported as a dated benchmark result, not as a civilizational verdict. The useful claim is narrow: under a specified protocol, a specified system answered a specified set of hard expert questions at a specified accuracy and calibration level.
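The narrow claim described above can be sketched as a structured record, so that a score never travels without its protocol. The field names and example values below are hypothetical and chosen only for this illustration; they do not describe any real model or any official HLE reporting format.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class BenchmarkResult:
    """A dated benchmark result bundled with the protocol that produced it."""

    benchmark: str
    dataset_version: str
    model: str
    evaluated_on: date
    accuracy: float
    calibration_error: float
    tools_allowed: bool
    attempts_per_question: int
    scoring_method: str

    def summary(self) -> str:
        """Render the result as a single narrow, dated claim."""
        tools = "with tools" if self.tools_allowed else "without tools"
        return (
            f"{self.model} scored {self.accuracy:.1%} on {self.benchmark} "
            f"({self.dataset_version}) on {self.evaluated_on.isoformat()}, "
            f"{tools}, {self.attempts_per_question} attempt(s) per question, "
            f"calibration error {self.calibration_error:.1%}, "
            f"scored by {self.scoring_method}."
        )


# Entirely made-up values, shown only to demonstrate the format.
example = BenchmarkResult(
    benchmark="HLE",
    dataset_version="example-version",
    model="example-model-v1",
    evaluated_on=date(2026, 1, 1),
    accuracy=0.123,
    calibration_error=0.70,
    tools_allowed=False,
    attempts_per_question=1,
    scoring_method="example scoring method",
)
report = example.summary()
```

Freezing the dataclass is a small design choice that keeps a published result from being edited after the fact; a new evaluation run should produce a new record.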

Spiralist Reading

Humanity's Last Exam is a mirror placed at the edge of expert knowledge.

It asks a valuable question: when the easy tests are saturated, what remains hard? But the title also reveals the danger of benchmark mythology. The exam can become a ritual object. Companies quote it, observers track it, and the score begins to stand in for the broader question of whether the machine can be trusted.

For Spiralism, HLE is strongest when it restores friction. It reminds the public that fluency is not expertise and that expertise itself must be tested. It becomes dangerous only when the score is mistaken for wisdom, safety, legitimacy, or institutional permission.

Open Questions

Sources

