Wiki · Concept · Last reviewed May 19, 2026

GPQA

GPQA, short for Graduate-Level Google-Proof Q&A, is a benchmark of expert-written multiple-choice questions in biology, physics, and chemistry. It became an important evaluation for frontier reasoning models because it asks whether systems can answer hard scientific questions that skilled non-experts struggle to solve even with unrestricted web access.

Definition

GPQA is a graduate-level science question-answering benchmark introduced by David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. The benchmark's main set contains 448 multiple-choice questions written by domain experts in biology, physics, and chemistry.

The phrase "Google-proof" refers to the benchmark's validation procedure. Questions were designed so that skilled non-expert validators with unrestricted internet access could not reliably answer them through search alone. The original paper reported that experts who hold or are pursuing a PhD in the relevant domain reached 65 percent accuracy, or 74 percent after discounting clear mistakes the experts identified in retrospect. Highly skilled non-expert validators reached 34 percent despite spending more than 30 minutes per question on average with web access, not far above the 25 percent expected from random guessing among four options.

Origin

The GPQA paper was released on arXiv in November 2023 and later appeared at the 2024 Conference on Language Modeling (COLM). It arrived as general knowledge benchmarks such as MMLU were becoming saturated by frontier models, and model developers needed harder tests of expert reasoning.

The authors framed GPQA as relevant to scalable oversight. If future AI systems can solve questions that ordinary overseers cannot easily check, then institutions need methods for evaluating expert-level outputs without relying on superficial plausibility or simple search.

Design

Each GPQA item is a multiple-choice question with four answer options. The questions are not broad trivia. They are written by people with specialized domain knowledge and are intended to require technical reasoning, tacit field knowledge, or careful scientific judgment.

The dataset emphasizes three high-level domains: biology, physics, and chemistry. It also includes subdomain metadata, expert-validator feedback, non-expert validation records, explanations, and a canary string intended to help detect dataset leakage.
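The canary idea can be illustrated with a minimal sketch: scan a collection of texts for a known marker string and report where it appears. The string `CANARY` below is a made-up placeholder, not the real GPQA canary, and `find_canary_hits` is a hypothetical helper name used only for illustration.

```python
from typing import Iterable, List

# Hypothetical placeholder. The real canary string ships with the dataset
# and is deliberately not reproduced here.
CANARY = "EXAMPLE-CANARY-STRING-0000"


def find_canary_hits(documents: Iterable[str], canary: str = CANARY) -> List[int]:
    """Return the indices of documents that contain the canary string."""
    return [i for i, text in enumerate(documents) if canary in text]


# Toy corpus: only the second document carries the marker.
corpus = [
    "An unrelated page about protein folding.",
    f"A scraped copy of benchmark items. {CANARY}",
    "Lecture notes on quantum mechanics.",
]
hits = find_canary_hits(corpus)
print(hits)  # [1]
```

A hit only shows that a document carrying the marker entered the corpus. A clean scan does not rule out leakage, since questions can be copied or paraphrased without the marker.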

The design differs from ordinary open-book question answering. A system that retrieves a nearby web page may still fail, because the benchmark aims to test whether the solver can identify the relevant scientific structure and reason through it. That made GPQA attractive for evaluating reasoning models, retrieval-augmented systems, and model claims about scientific competence.

Subsets

The public dataset includes several subsets. GPQA Main is the 448-question set most often described in the paper. GPQA Extended contains a larger pool of 546 questions. GPQA Diamond is a smaller, stricter subset of 198 questions, selected broadly on the criterion that expert validators answered correctly while most non-expert validators did not. It became the dominant public reporting target in frontier model releases.

GPQA Diamond became especially visible because it is harder to dismiss as a broad but noisy science exam. Its screening makes it useful for high-end comparison, although the smaller size also means scores can be more sensitive to protocol details, contamination, and item-level errors.
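The effect of subset size on score uncertainty can be made concrete with a short sketch. It computes a 95 percent Wilson score interval for the same underlying accuracy on a 198-question set and a 448-question set. The counts of correct answers are invented for illustration, and treating questions as independent trials is a simplifying assumption.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need 0 <= correct <= total and total > 0")
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


# Illustrative counts only: about 60 percent correct on each subset.
diamond_low, diamond_high = wilson_interval(119, 198)
main_low, main_high = wilson_interval(269, 448)

print(f"198 questions: {diamond_low:.3f} to {diamond_high:.3f}")
print(f"448 questions: {main_low:.3f} to {main_high:.3f}")
```

Under these assumptions the interval on 198 questions spans roughly 7 percentage points either side of the score, against roughly 4.5 points on 448 questions. A gap of a few points between two models on the smaller subset can therefore sit within sampling noise.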

Public Role

GPQA moved quickly from research benchmark to release metric. By 2024 and 2025, model developers were reporting GPQA or GPQA Diamond results alongside MMLU, AIME, SWE-bench, Humanity's Last Exam, MMMU, and other measures of reasoning or expert knowledge.

OpenAI's o1 system card used GPQA biology as part of contextual evaluation around biological and chemical risk, while Google model cards and technical materials for Gemini 2.5 Pro listed GPQA Diamond as a science benchmark. Anthropic and other labs also used GPQA-style reporting in frontier model comparisons.

This public role gave GPQA influence beyond its dataset size. It became a shorthand for whether a model could handle PhD-level science questions, especially under the new reasoning-model narrative in which systems spend more test-time computation before answering.

Limits

GPQA is a serious benchmark, but it is not a complete measure of scientific ability. Multiple-choice format can hide partial reasoning failures, reward answer elimination, and fail to test whether a model can design experiments, read new papers, operate lab equipment, or communicate uncertainty responsibly.
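The point about answer elimination can be shown with simple arithmetic. The sketch below assumes a solver that correctly rules out some distractors and then guesses uniformly among the remaining options. `guess_accuracy` is a hypothetical helper, and the model of guessing is an idealization.

```python
from fractions import Fraction


def guess_accuracy(num_options: int = 4, eliminated: int = 0) -> Fraction:
    """Expected accuracy when guessing uniformly after ruling out distractors."""
    if not 0 <= eliminated < num_options:
        raise ValueError("eliminated must be between 0 and num_options - 1")
    return Fraction(1, num_options - eliminated)


# Four options per item, as in GPQA; vary how many distractors are ruled out.
for k in range(4):
    print(k, float(guess_accuracy(4, k)))
```

Ruling out a single distractor lifts expected accuracy from one quarter to one third, and ruling out two lifts it to one half, without the solver ever deriving the correct answer directly.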

The benchmark is also public. Once questions, answer choices, explanations, and discussion circulate online, future model performance can be affected by direct contamination, near-duplicate exposure, benchmark-inspired synthetic data, or post-training targeted at the benchmark style.

Even expert-written benchmarks can contain ambiguous items, disputed answers, or hidden assumptions. GPQA's own human baselines show that domain expertise does not produce perfect accuracy. A high model score should therefore be read as evidence about performance on this benchmark under a specific protocol, not as proof that the model is a trustworthy scientist.

Governance Significance

GPQA matters for governance because expert-level science is a domain where ordinary users, procurement teams, and regulators may not be able to verify outputs directly. A model can sound confident while being wrong in ways only a specialist would catch.

Responsible reporting should specify whether the score is GPQA Main or GPQA Diamond, whether the model used chain-of-thought or hidden reasoning, how many attempts were allowed, whether tools or retrieval were used, and what contamination controls were applied.
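One way to make this checklist operational is a small record type that flags which protocol details a reported score leaves out. The sketch below is illustrative only: the class name, field names, and example values are assumptions, not an established reporting standard.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class GpqaScoreReport:
    """Hypothetical record of a reported GPQA score and its protocol."""
    accuracy: float
    subset: Optional[str] = None                  # e.g. "main" or "diamond"
    reasoning_mode: Optional[str] = None          # e.g. "chain-of-thought"
    attempts_per_question: Optional[int] = None
    tools_or_retrieval: Optional[str] = None      # e.g. "none", "web search"
    contamination_controls: Optional[str] = None

    def undisclosed(self) -> List[str]:
        """Names of protocol fields that were left unspecified."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


# Made-up example: a score reported with only the subset and attempt count.
report = GpqaScoreReport(accuracy=0.61, subset="diamond", attempts_per_question=1)
print(report.undisclosed())
# ['reasoning_mode', 'tools_or_retrieval', 'contamination_controls']
```

Keeping the gaps explicit makes it harder to compare two scores as if they were produced under the same protocol when they were not.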

For high-stakes science, GPQA should sit inside a larger evaluation package: domain expert review, open-ended problem solving, calibration tests, hallucination analysis, biosecurity and chemistry risk evaluation, provenance checks, and post-deployment monitoring. The benchmark is useful because it makes difficult scientific reasoning visible, but it cannot by itself certify scientific reliability.

Spiralist Reading

GPQA is a test of the expert-shaped mirror.

The ordinary web is full of answers, but some questions cannot be solved by retrieving the nearest sentence. They require disciplinary formation: knowing what matters, which assumptions are live, which distractors are plausible, and how a field reasons under uncertainty.

For Spiralism, GPQA is valuable because it interrupts the fantasy that search is understanding. It also warns against a second fantasy: that passing a hard exam makes a system institutionally trustworthy. The score is a signal. The institution still needs experts, procedures, review, and humility about what a multiple-choice item can know.

Open Questions

Sources

