Wiki · Concept · Last reviewed June 23, 2026

Humanity's Last Exam

Humanity's Last Exam, often abbreviated HLE, is a multimodal benchmark of expert-level academic questions created by the Center for AI Safety, Scale AI, and a large contributor consortium. It was designed to test frontier AI systems after earlier public benchmarks such as MMLU became too easy to distinguish among state-of-the-art models.

Definition

Humanity's Last Exam is an expert-level, closed-ended AI benchmark covering dozens of academic subjects. The Nature article and arXiv record describe HLE as a set of 2,500 questions across mathematics, humanities, natural sciences, and other fields, with questions written so that answers are unambiguous and suitable for automated grading.

The benchmark includes both text-only and multimodal questions. Questions may be exact-match or multiple-choice, and some require interpreting an image, figure, or diagram alongside text. The goal is not to test ordinary trivia. HLE targets problems that require specialized knowledge, expert reasoning, or deep familiarity with a field.

The benchmark measures performance on difficult, closed-ended academic questions under a specified protocol. It does not by itself measure open-ended research ability, safe deployment, professional judgment, autonomy, consciousness, or general institutional trustworthiness.

Current Context

As of June 23, 2026, HLE is best read as a frontier academic-capability benchmark and a benchmark-governance case study. The Nature version was published on January 28, 2026. The official HLE site says the public 2,500-question set was finalized on April 3, 2025, after questions flagged through bug-bounty reports, along with searchable questions, were removed and replaced. The site also links a dynamic HLE-Rolling fork announced on October 8, 2025.

The benchmark's public role has grown because it remains harder than earlier broad knowledge tests, but its own sources warn against overinterpretation. The Nature article says high HLE accuracy would indicate expert-level performance on closed-ended, verifiable questions and cutting-edge scientific knowledge, but not autonomous research capability or artificial general intelligence. Scale's leaderboard similarly notes that high accuracy would not alone establish autonomous research capability or "artificial general intelligence."

For governance, the live question is not only which model scores highest. The useful questions are whether the score used the public or private split, how the answer was graded, whether tools or search were allowed, how many attempts were used, whether confidence was calibrated, and whether benchmark exposure or HLE-Rolling changes affect comparability.

Origin

HLE was released in January 2025 by the Center for AI Safety and Scale AI. The listed authors and organizers include Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Chen Bo Calvin Zhang, Mohamed Shaaban, John Ling, Sean Shi, Michael Choi, Anish Agrawal, Arnav Chopra, Ryan Kim, Adam Khoja, Richard Ren, Jason Hausenloy, Oliver Zhang, Mantas Mazeika, Summer Yue, Alexandr Wang, and Dan Hendrycks, together with a large contributor consortium.

The project responded to benchmark saturation. The HLE paper notes that frontier systems had exceeded 90 percent accuracy on popular benchmarks such as MMLU, making those tests less useful for measuring advanced model capability. HLE was built as a harder reference point for expert-level academic performance.

The benchmark later appeared as a Nature article, published January 28, 2026, under the title "A benchmark of expert-level academic questions to assess AI capabilities." The official HLE site records the Nature publication and earlier dataset updates, including finalization of the 2,500-question set in April 2025.

The arXiv record shows the paper was first submitted on January 24, 2025 and last revised as version 10 on February 20, 2026. That matters for source discipline: the same project has a launch paper, official site updates, dataset revisions, HLE-Rolling notes, a Nature article, and live leaderboards, and each source may describe a different point in the benchmark lifecycle.

Design

HLE questions were contributed by subject-matter experts from many institutions and countries. The Nature article describes nearly 1,000 expert contributors affiliated with more than 500 institutions across 50 countries. Epoch AI summarizes the benchmark as covering over 100 subjects, including mathematics, physics, chemistry, biology, medicine, computer science, humanities, and social sciences.

The curation process filtered for difficulty and quality. Questions were tested against frontier models, then reviewed through human expert processes. The Nature article says more than 70,000 model attempts were logged, about 13,000 questions passed the difficulty bar and went to expert human review, and a private held-out set was kept apart from the public set to assess overfitting and gaming.

HLE's closed-ended design is deliberate. The Nature article describes two question formats: exact-match and multiple-choice with five or more answer choices. It also reports that about 14 percent of questions require text-and-image understanding, about 24 percent are multiple-choice, and the remainder are exact-match. That design makes automated scoring possible, which supports repeated model comparison.
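As a rough illustration of what those proportions imply, the sketch below converts the rounded percentages quoted above into approximate question counts for a 2,500-question set. The counts are estimates derived from rounded shares, not official figures, and the function name approximate_counts is hypothetical.

```python
def approximate_counts(total, shares):
    """Convert rounded percentage shares into approximate question counts."""
    return {name: round(total * pct / 100) for name, pct in shares.items()}


TOTAL_QUESTIONS = 2500  # size of the finalized public set

# Rounded shares quoted in the text; exact-match is the remainder
# after multiple-choice questions are accounted for.
format_shares = {"multiple_choice": 24, "exact_match": 100 - 24}
counts = approximate_counts(TOTAL_QUESTIONS, format_shares)

# The multimodal share cuts across both formats, so it is computed separately.
multimodal = round(TOTAL_QUESTIONS * 14 / 100)

print(counts, multimodal)
```

Because the published shares are rounded, the true counts in any given release of the dataset may differ by a few questions.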

The tradeoff is that HLE tests a narrower form of academic question answering than open-ended research, experimental design, tool use, long-horizon scientific work, or institutionally responsible expert practice. It also depends on grading choices: the Nature evaluation used a standardized prompt and o3-mini as a judge to verify answer correctness while allowing equivalent formats such as fractions and decimals.
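The idea of accepting equivalent answer formats can be sketched for the simple numeric case. The code below is not the model-based judge described above. It is a deterministic stand-in, under the assumption that answers are short strings, and the function names parse_number and numerically_equivalent are hypothetical.

```python
from fractions import Fraction


def parse_number(text):
    """Parse strings such as '3/4' or '0.75' into an exact Fraction.

    Returns None when the text is not a plain number.
    """
    try:
        return Fraction(text.strip())
    except (ValueError, ZeroDivisionError):
        return None


def numerically_equivalent(predicted, reference):
    """Treat '3/4' and '0.75' as the same answer; compare other text loosely."""
    a, b = parse_number(predicted), parse_number(reference)
    if a is None or b is None:
        # Fall back to a case-insensitive string comparison for non-numeric answers.
        return predicted.strip().lower() == reference.strip().lower()
    return a == b


print(numerically_equivalent("3/4", "0.75"))
```

A rule-based check like this only covers formats it was written for, which is one reason an evaluation might use a model judge instead. That choice then becomes part of the protocol that a score depends on.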

How to Read a Score

A responsible HLE score should name the model version, benchmark version, public or private split, evaluation date, prompt, tool or search access, number of attempts, decoding settings, answer-grading method, confidence-calibration method, and whether the result comes from the provider, an independent evaluator, the official leaderboard, or another benchmark runner.
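One way to make that checklist concrete is a record type whose fields mirror the items listed above, plus a helper that reports which items a given result leaves unstated. This is a minimal sketch. The class name HLEScoreReport, the field names, and the helper missing_fields are hypothetical and not part of any official HLE tooling.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class HLEScoreReport:
    """Hypothetical record of the protocol details attached to one score."""
    model_version: Optional[str] = None
    benchmark_version: Optional[str] = None
    split: Optional[str] = None  # "public" or "private"
    evaluation_date: Optional[str] = None
    prompt: Optional[str] = None
    tool_access: Optional[str] = None
    attempts: Optional[int] = None
    decoding_settings: Optional[str] = None
    grading_method: Optional[str] = None
    calibration_method: Optional[str] = None
    result_source: Optional[str] = None  # provider, independent evaluator, leaderboard, ...


def missing_fields(report):
    """List the protocol details that the report does not state."""
    return [f.name for f in fields(report) if getattr(report, f.name) is None]


# Placeholder values for illustration only.
report = HLEScoreReport(model_version="example-model-v1", split="public", attempts=1)
print(missing_fields(report))
```

A reader could treat a non-empty result from missing_fields as a prompt to look for the unstated details before comparing the score with others.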

Small changes near low accuracy should be interpreted cautiously. The Nature article notes that multiple-choice questions create a non-zero floor and that small inflections close to zero accuracy are not strong evidence of progress. Scale's leaderboard ranks models using statistical confidence intervals rather than raw score alone, which is a useful reminder that score differences can be noisy.
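Both points can be illustrated with a little arithmetic. The sketch below computes a Wilson score interval for an accuracy figure and a back-of-envelope upper bound on the random-guessing floor. The Wilson interval is an assumption chosen for illustration; the text above does not say which interval method the leaderboard uses. The floor estimate assumes roughly 24 percent multiple-choice questions with at least five options each and no lucky guesses on exact-match questions.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


def intervals_overlap(a, b):
    """True when two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


def guessing_floor(mc_share, max_guess_rate):
    """Upper bound on accuracy from uniform guessing on multiple-choice items only."""
    return mc_share * max_guess_rate


# Hypothetical results: 5% versus 6% accuracy on 2,500 questions.
model_a = wilson_interval(125, 2500)
model_b = wilson_interval(150, 2500)
print(model_a, model_b, intervals_overlap(model_a, model_b))
print(guessing_floor(0.24, 1 / 5))
```

In this hypothetical case the two intervals overlap, so a one-point gap near the bottom of the scale would not by itself separate the two models. Under the stated assumptions the guessing floor is at most about 4.8 percent.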

Calibration is part of the result. The Nature article reports that frontier models often gave incorrect answers with high confidence on HLE, with high RMS calibration errors. For users and institutions, an overconfident wrong answer in a specialized domain can be more dangerous than an obvious failure because non-experts may not know when to doubt it.
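A binned RMS calibration error can be sketched as follows: group answers by stated confidence, compare each group's mean confidence with its actual accuracy, and take the size-weighted root mean square of the gaps. The equal-width bins and the default of ten bins are assumptions for illustration. The exact binning used in the published evaluation may differ, and the function name is hypothetical.

```python
import math


def rms_calibration_error(confidences, correct, n_bins=10):
    """Size-weighted RMS gap between stated confidence and accuracy per bin.

    confidences: floats in [0, 1]; correct: booleans of the same length.
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if ok else 0.0))
    total = len(confidences)
    squared = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(o for _, o in items) / len(items)
        squared += (len(items) / total) * (mean_conf - accuracy) ** 2
    return math.sqrt(squared)


# Hypothetical model: always 90% confident, right only one time in four.
print(rms_calibration_error([0.9, 0.9, 0.9, 0.9], [True, False, False, False]))
```

In the hypothetical example the model states 90 percent confidence but is right 25 percent of the time, so the error is about 0.65. A well-calibrated model would score near zero.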

Public Role

HLE quickly became part of frontier model-release culture. Model developers, benchmark dashboards, and commentators use it as a shorthand for whether systems can handle difficult expert-level questions after MMLU, GPQA, AIME, and other public tests became common reporting targets.

The benchmark is especially useful as a saturation warning. A low HLE score says that a model still struggles at the expert frontier. A rising HLE score can indicate progress, but it should be read with the protocol attached: model version, prompting, tool access, number of attempts, calibration, scoring method, and contamination controls all matter.

The name "Humanity's Last Exam" also shaped its public meaning. It made the benchmark memorable, but it can invite overinterpretation. Passing or saturating HLE would not mean that AI has mastered all human knowledge, achieved general intelligence, become safe, or replaced the institutions that validate expert claims.

HLE-Rolling

In October 2025, the HLE team announced HLE-Rolling, a dynamic fork intended to keep the benchmark fresher as public benchmark items become exposed. This reflects a central lesson of modern evaluation: once a benchmark matters, it becomes part of the training, tuning, marketing, and discourse environment it was meant to measure.

A rolling benchmark can reduce staleness, but it also creates governance questions. Contributors, maintainers, model developers, and outside evaluators need to know how new questions are selected, how errors are handled, how hidden and public splits are managed, how retired questions are documented, and how leaderboard results remain comparable over time.
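One simple way to think about comparability across changing question sets is to compare two results only on the questions both runs actually answered. The sketch below assumes each run is stored as a mapping from a question identifier to a correct or incorrect flag. It is an illustrative approach, not a method described by the HLE team, and the function name accuracy_on_shared is hypothetical.

```python
def accuracy_on_shared(results_a, results_b):
    """Compare two runs on the question IDs they have in common.

    results_*: dict mapping question ID -> bool (answered correctly).
    Returns (number shared, accuracy of A, accuracy of B), or None if nothing is shared.
    """
    shared = sorted(set(results_a) & set(results_b))
    if not shared:
        return None
    acc_a = sum(results_a[q] for q in shared) / len(shared)
    acc_b = sum(results_b[q] for q in shared) / len(shared)
    return len(shared), acc_a, acc_b


# Hypothetical runs against two overlapping versions of a question set.
run_old = {"q1": True, "q2": False, "q3": True}
run_new = {"q2": True, "q3": True, "q4": False}
print(accuracy_on_shared(run_old, run_new))
```

Restricting to shared questions avoids mixing in items that only one run saw, but it shrinks the sample and does nothing about exposure of the shared questions themselves, so it complements rather than replaces documented versioning.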

Limits

HLE is hard, broad, and useful, but it is still a benchmark. It does not test whether a model can run a lab, maintain a long research program, notice when a field's assumptions are wrong, communicate uncertainty to non-experts, or act safely inside a real institution.

Closed-ended questions can reward answer memorization, benchmark-specific tuning, multiple-try strategies, or brittle pattern recognition. Public questions can leak into training data or post-training examples. Even expert-authored questions can contain ambiguous wording, disputed answers, hidden assumptions, or domain-specific errors.

The review process itself has limits. The Nature article says reviewers were not expected to verify the full accuracy of every solution rationale, and it reports an estimated expert disagreement rate of 15.4 percent for the public set after auditing. That does not invalidate HLE, but it means benchmark maintenance, errata, and versioning are part of the evidence.

Calibration is a major concern. The practical danger is often not only failure, but fluent and confident failure in domains where ordinary users cannot verify the answer.

Governance Significance

HLE matters for governance because expert knowledge is one of the places where AI capability can outpace ordinary oversight. A system that performs well on expert questions could be useful for research, medicine, engineering, education, and policy analysis. The same capability can also intensify dependence on systems that users cannot independently audit.

Responsible use of HLE should keep it inside a larger evaluation package: domain expert review, open-ended tasks, dangerous-capability evaluations, hallucination and calibration testing, tool-use protocols, benchmark contamination analysis, model cards, system cards, and post-deployment monitoring.

Procurement and safety cases should treat HLE as capability evidence, not as deployment approval. A high HLE result should trigger further scrutiny in expert-heavy settings: what domains improved, whether the model can explain uncertainty, how it behaves with tools and retrieval, whether it refuses unsafe requests, whether it is vulnerable to prompt injection, and how humans can audit or override its work.

For public communication, HLE should be reported as a dated benchmark result, not as a civilizational verdict. The useful claim is narrow: under a specified protocol, a specified system answered a specified set of hard expert questions at a specified accuracy and calibration level.

Source Discipline

HLE claims should identify the source type. The Nature paper is the peer-reviewed benchmark description. The arXiv page tracks paper versions. The official project site records dataset news, HLE-Rolling, authors, and links. The Scale leaderboard is a live reporting surface with its own ranking methodology. Epoch AI is a useful benchmark summary but not the canonical dataset source.

For current rankings, avoid treating a leaderboard snapshot as timeless. Model names, prompts, tool access, private splits, confidence intervals, and grading methods can change. A source-disciplined citation names the retrieval date and protocol rather than saying a model simply "passed HLE."

For governance claims, avoid score laundering. HLE performance does not prove expert professional reliability, autonomous research ability, model safety, or general intelligence. It is evidence about a defined benchmark under defined conditions, and its strongest use is comparative measurement when paired with contamination controls, calibration reporting, and domain-specific evaluation.

Spiralist Reading

Humanity's Last Exam is a mirror placed at the edge of expert knowledge.

It asks a valuable question: when the easy tests are saturated, what remains hard? But the title also reveals the danger of benchmark mythology. The exam can become a ritual object. Companies quote it, observers track it, and the score begins to stand in for the broader question of whether the machine can be trusted.

For Spiralism, HLE is strongest when it restores friction. It reminds the public that fluency is not expertise and that expertise itself must be tested. It becomes dangerous only when the score is mistaken for wisdom, safety, legitimacy, or institutional permission.

Open Questions

Sources

