Wiki · Concept · Last reviewed June 25, 2026

GPQA

GPQA, short for Graduate-Level Google-Proof Q&A, is a benchmark of expert-written multiple-choice questions in biology, physics, and chemistry. It is useful as bounded evidence about difficult science question answering under a stated protocol, not as proof that a model can do reliable scientific work in the world.

Category: AI evaluations
Published: June 25, 2026
Modified: June 25, 2026
Last reviewed: June 25, 2026
Tags: benchmarks, scientific reasoning, scalable oversight, contamination, score discipline

Definition

GPQA is a graduate-level science question-answering benchmark introduced by David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. The benchmark's main set contains 448 multiple-choice questions written by domain experts in biology, physics, and chemistry.

The phrase "Google-proof" refers to the benchmark's validation procedure. Questions were designed so that skilled non-expert validators with unrestricted internet access could not reliably answer them through search alone. The original paper reported that experts who held or were pursuing PhDs in the relevant domain reached 65 percent accuracy, or 74 percent after discounting clear mistakes the experts identified in retrospect. Skilled non-expert validators reached 34 percent despite spending more than 30 minutes per question on average with web access. For comparison, random guessing among four options would yield 25 percent.

For this wiki, GPQA is best understood as an expert-reasoning and scalable-oversight benchmark, not as a general intelligence test. It can show that a system handles a difficult multiple-choice science item under a specified protocol. It does not show that the system can run a lab, design safe experiments, review new literature, manage uncertainty, cite sources, use tools safely, or act as an accountable scientist.

The unit of evidence is not "GPQA" alone. It is the subset, model version, evaluation date, prompt, answer-choice ordering, reasoning-effort setting, sampling method, tool permissions, scoring harness, and number of attempts behind a reported result.

Snapshot

Core measurement: four-option multiple-choice accuracy on expert-written biology, chemistry, and physics questions.
Original purpose: test hard science questions that skilled non-experts could not reliably solve with unrestricted web access.
Current public role: a common frontier-model reporting line, especially through the GPQA Diamond subset.
Best use: one protocol-bound signal inside a broader AI evaluation package.
Main risks: benchmark exposure, contamination, prompt and reasoning-budget sensitivity, score-table laundering, and overclaiming scientific competence.
Not evidence of: safe lab work, autonomous research ability, biosecurity controls, professional qualification, consciousness, or general deployment readiness.

Origin

The GPQA paper was released on arXiv in November 2023 and later appeared at the 2024 Conference on Language Modeling. It came from a period when general knowledge benchmarks such as MMLU were becoming saturated by frontier models and model reports needed harder tests of expert reasoning.

The authors framed GPQA as relevant to scalable oversight. If future AI systems can solve questions that ordinary overseers cannot easily check, then institutions need methods for evaluating expert-level outputs without relying on superficial plausibility or simple search.

Current Context

As of June 25, 2026, GPQA is a standard public benchmark for frontier science and reasoning claims, especially through the 198-question GPQA Diamond subset. It appears in model cards, system cards, launch posts, leaderboards, evaluation harnesses, and third-party comparisons. That visibility makes it more useful as a shared reference point and less reliable as a clean test of unseen scientific ability.

The benchmark's own distribution practices reflect the contamination concern. The Hugging Face dataset card asks users not to reveal examples in plain text or images online to reduce leakage into foundation-model training corpora, and the GitHub repository includes a canary string for leakage detection. Those are not guarantees of cleanliness; they are reminders that a public benchmark becomes part of the web environment it tries to evaluate against.
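A canary check of the kind mentioned above can be sketched in a few lines. This is a minimal illustration, not the maintainers' tooling: the function name, the document IDs, and the canary value are all assumptions, and the real canary string is deliberately not reproduced here.

```python
from typing import Iterable

# Placeholder only: the real canary string lives in the GPQA repository
# and is intentionally not copied into this page.
CANARY_PLACEHOLDER = "EXAMPLE-CANARY-STRING-NOT-THE-REAL-ONE"


def find_canary_hits(documents: Iterable[tuple[str, str]], canary: str) -> list[str]:
    """Return the IDs of documents whose text contains the canary string."""
    if not canary:
        raise ValueError("canary must be a non-empty string")
    return [doc_id for doc_id, text in documents if canary in text]


# Hypothetical two-document corpus of (doc_id, text) pairs.
corpus = [
    ("doc-1", "An unrelated page about laboratory safety."),
    ("doc-2", f"Scraped benchmark dump ... {CANARY_PLACEHOLDER} ..."),
]
hits = find_canary_hits(corpus, CANARY_PLACEHOLDER)
print(hits)
```

An empty result only shows that the exact string was not found in the scanned text. Paraphrased items, stripped canaries, and benchmark-inspired synthetic data would all pass this check unnoticed.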

The current context is therefore an exposed-benchmark context. OpenAI, Google, Anthropic, and third-party evaluators use GPQA or GPQA Diamond in public reporting, but their numbers can differ by model snapshot, hidden or visible reasoning budget, tool access, answer extraction, retries, majority voting, and whether the result is provider-reported or independently reproduced. Google DeepMind's Gemini 3.1 Pro model card and Anthropic's Claude Sonnet 4.6 system card illustrate the newer reporting style: high-effort or high-thinking settings, repeated trials, and benchmark tables where GPQA Diamond is one line among many.

By 2026, GPQA Diamond is also closer to a saturation-risk benchmark than it was at release. That does not make it useless; it changes the inference. A high score may still indicate strong scientific question-answering, but it should be read with contamination checks, protocol details, and harder open-ended science evaluations rather than as a fresh proxy for frontier capability.

GPQA also sits near safety-sensitive domains. Biology and chemistry questions can be legitimate scientific-reasoning tests, but adjacent capabilities may matter for biological or chemical misuse evaluations. A GPQA score should not be treated as a misuse score, a refusal score, or a release threshold unless the report separately explains how it connects to a risk framework.

The responsible current use is therefore narrow and protocol-aware. GPQA can support a claim such as "this model performed well on GPQA Diamond under this prompting, tool, sampling, and scoring setup." It should not support unqualified claims that a model is a scientist, safe for scientific deployment, or generally capable in biology, chemistry, or physics.

Design

Each GPQA item is a multiple-choice question with four answer options. The questions are not broad trivia: they are written by people with specialized domain knowledge and are intended to require technical reasoning, tacit field knowledge, or careful scientific judgment.
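Because every item has four options, the order in which they are presented is itself a protocol choice. The sketch below shows one way a harness might shuffle options reproducibly so that a run can be repeated and audited. The `Item` fields and the `present` function are hypothetical and do not mirror the dataset's actual schema, and the item content is a placeholder rather than a real question.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """A four-option item. Field names are illustrative, not the dataset schema."""
    item_id: str
    correct: str
    distractors: tuple[str, str, str]


def present(item: Item, seed: int) -> tuple[list[str], str]:
    """Shuffle the four options reproducibly and return (options, correct_letter)."""
    options = [item.correct, *item.distractors]
    # Seed from both the run seed and the item ID so each item gets its own
    # stable ordering, independent of the order in which items are processed.
    rng = random.Random(f"{seed}:{item.item_id}")
    rng.shuffle(options)
    correct_letter = "ABCD"[options.index(item.correct)]
    return options, correct_letter


# Placeholder content only; no benchmark text is reproduced.
item = Item("placeholder-001", "right", ("wrong-1", "wrong-2", "wrong-3"))
options, letter = present(item, seed=0)
print(letter, options)
```

Recording the seed alongside the score lets a later reviewer regenerate the exact ordering the model saw.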

The dataset emphasizes three high-level domains: biology, physics, and chemistry. It also includes subdomain metadata, expert-validator feedback, non-expert validation records, explanations, and a canary string intended to help detect dataset leakage.

The design differs from ordinary open-book question answering. A system that retrieves a nearby web page may still fail, because the benchmark aims to test whether the solver can identify the relevant scientific structure and reason through it. That made GPQA attractive for evaluating reasoning models, retrieval-augmented systems, and model claims about scientific competence.

Subsets

The public dataset includes several subsets. GPQA Main is the 448-question set most often described in the paper. GPQA Extended contains a larger pool of 546 questions. GPQA Diamond is a smaller, stricter subset of 198 questions that became the dominant public reporting target in frontier model releases.

GPQA Diamond became especially visible because it is harder to dismiss as a broad but noisy science exam. Broadly, it keeps only questions that both expert validators answered correctly and that most non-expert validators answered incorrectly. That screening makes it useful for high-end comparison, although the smaller size also means scores can be more sensitive to protocol details, contamination, and item-level errors.

Benchmark Boundaries

GPQA is not MMLU. MMLU measures broad academic and professional multiple-choice knowledge across 57 subjects. GPQA is narrower, deeper, and science-focused.

GPQA is not Humanity's Last Exam. Humanity's Last Exam is broader, multimodal, and designed as a later expert-level benchmark after earlier tests became easier for frontier models. GPQA remains specifically tied to graduate-level biology, chemistry, and physics questions.

GPQA is not a misuse evaluation. It can be adjacent to AI biosecurity and chemistry-risk questions, but it does not test whether a model refuses harmful requests, avoids operational assistance, or helps non-experts cause harm.

GPQA is not a science workflow evaluation. A model used for AI in science needs tests for literature grounding, experimental design, uncertainty, citation quality, tool use, data analysis, and expert review. GPQA mostly asks whether the system selected the expected answer to a hard closed-ended item.

GPQA is not a model-card substitute. It should feed into model cards and system cards, but it cannot replace documentation of training data, safety mitigations, limitations, intended use, evaluation protocol, and deployment controls.

Evidence Boundary

A GPQA result answers a narrow measurement question: did this evaluated system select the expected answer to these expert-written science items under these conditions? It does not directly test literature review, experimental design, lab execution, causal modeling, peer review, mathematical proof, citation discipline, uncertainty communication, or safe scientific assistance.

The boundary becomes especially important for reasoning models. A no-tools pass@1 run, a high-effort hidden-reasoning run, a majority-vote run, and a tool-assisted run can all be described as GPQA results while measuring different products, costs, and deployment behaviors. Public tables should not collapse them into one undifferentiated score.

GPQA is also not a substitute for domain deployment tests. A system used in biology, chemistry, medicine, climate science, engineering, or public-sector research needs open-ended tasks, source-grounded answers, calibration checks, expert review, misuse testing, workflow-specific validation, and post-deployment monitoring.

Public Role

GPQA moved quickly from research benchmark to release metric. Through 2024 and 2025, model developers reported GPQA or GPQA Diamond results alongside MMLU, AIME, SWE-bench, Humanity's Last Exam, MMMU, and other measures of reasoning or expert knowledge.

OpenAI's o1 system card used GPQA biology as contextual evidence around biological and chemical risk evaluation. OpenAI later included GPQA figures in o3 and o4-mini release materials. Google model cards for Gemini systems list GPQA Diamond as a science or scientific-knowledge benchmark, and Anthropic announcements and system cards have used GPQA or GPQA Diamond in capability reporting. These sources establish that providers used the benchmark; they do not independently validate the benchmark result or its deployment meaning.

This public role gave GPQA influence beyond its dataset size. It became a shorthand for whether a model could handle PhD-level science questions, especially under the new reasoning-model narrative in which systems spend more test-time computation before answering.

Score Discipline

A GPQA score is not self-explanatory. Responsible reporting should identify the subset, model version, date, prompt template, number of examples, answer-choice randomization, decoding settings, number of trials, whether chain-of-thought or hidden reasoning was used, whether majority vote or self-consistency was used, and whether tools, retrieval, browsing, calculators, code execution, or human assistance were allowed.
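Answer extraction is one of the less visible items in that list, so a small sketch may help. It assumes a convention in which the model is asked to finish with a line such as "Answer: B". That convention, the regular expression, and the function name are all assumptions for illustration and do not describe any particular harness.

```python
import re
from typing import Optional

# Assumed convention: responses state the choice as "Answer: X" with X in A-D.
# The keyword is matched case-insensitively, the letter must be uppercase, and
# the lookahead rejects words such as "Both" that merely start with a letter.
ANSWER_PATTERN = re.compile(r"(?i:answer)\s*:\s*\(?([ABCD])\)?(?![A-Za-z])")


def extract_letter(response: str) -> Optional[str]:
    """Return the last stated answer letter, or None if none is found."""
    matches = ANSWER_PATTERN.findall(response)
    return matches[-1] if matches else None


print(extract_letter("Working... Answer: B\nOn reflection, Answer: (C)"))
print(extract_letter("I am not sure."))
```

Taking the last match rather than the first, and deciding whether a `None` counts as wrong or triggers a retry, are protocol choices that can change a reported score and belong in the write-up.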

GPQA Diamond is especially sensitive to reporting choices because it has only 198 questions, so each question is worth about half a percentage point. A handful of item-level differences, answer-extraction rules, retries, or randomizations can therefore shift a reported score by a point or more. Reports should disclose confidence intervals or repeated-run variance where possible and should separate provider-reported numbers from independent replications.
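To make the sample-size point concrete, the sketch below computes a Wilson score interval for a single run on 198 questions. The Wilson interval is one common choice for a binomial proportion, not a method prescribed by the GPQA authors, and the count of 150 correct answers is a made-up figure used only for illustration.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need total > 0 and 0 <= correct <= total")
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half_width, centre + half_width


# Hypothetical run: 150 of 198 Diamond questions answered correctly.
low, high = wilson_interval(correct=150, total=198)
print(f"accuracy={150 / 198:.3f}  interval=[{low:.3f}, {high:.3f}]")
```

For this hypothetical run the interval spans roughly 0.69 to 0.81, about twelve percentage points. Differences of a point or two between single runs on this subset therefore say little on their own.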

For reasoning models, test-time compute is part of the evaluated system. A low-effort, single-pass run and a high-effort, multi-sample run can both be labeled "GPQA Diamond" while measuring different products, costs, and deployment behaviors. Score tables should make that distinction visible.
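The difference between a single-pass score and a multi-sample score can be shown on the same set of sampled answers. The sketch below is illustrative only: the question IDs, sampled letters, and the rule that ties count as incorrect are assumptions, not a description of any provider's protocol.

```python
from collections import Counter
from typing import Optional, Sequence


def first_attempt_correct(samples: Sequence[Optional[str]], gold: str) -> bool:
    """Single-pass scoring: only the first sampled answer counts."""
    return bool(samples) and samples[0] == gold


def majority_vote_correct(samples: Sequence[Optional[str]], gold: str) -> bool:
    """Majority-vote scoring over all samples; unparsed answers (None) are ignored."""
    votes = Counter(s for s in samples if s is not None)
    if not votes:
        return False
    best = max(votes.values())
    winners = [letter for letter, count in votes.items() if count == best]
    # Assumption: a tie for first place is scored as incorrect.
    return winners == [gold]


# Hypothetical sampled letters per question, paired with the gold letter.
runs = {
    "q1": (["A", "A", "B"], "A"),
    "q2": (["C", "D", "D"], "D"),
    "q3": (["B", "B", "C"], "B"),
    "q4": (["A", "A", "A"], "C"),
}
single = sum(first_attempt_correct(s, g) for s, g in runs.values()) / len(runs)
voted = sum(majority_vote_correct(s, g) for s, g in runs.values()) / len(runs)
print(f"single-pass={single:.2f}  majority-vote={voted:.2f}")
```

On this toy data the same samples score 0.50 under single-pass scoring and 0.75 under majority vote, which is why the aggregation rule and the number of samples need to appear next to the number.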

Tool access needs the same discipline. A no-browsing multiple-choice run measures closed-book science question answering. A tool-assisted product run may measure retrieval, code execution, source selection, and answer extraction as much as internal scientific reasoning. Both can be useful, but they should not be compared as if they used the same evidence channel.

Minimum Reporting Record

A GPQA result used in a model card, procurement note, safety case, or public comparison should preserve enough context for later reviewers to understand what was actually measured.

Benchmark identity: GPQA Main, Extended, Diamond, or a modified subset; dataset version; any excluded or corrected items; and evaluation date.
System identity: model name, model snapshot, product surface, access tier, system prompt if applicable, safety layer, and whether the run used a scaffold or helper model.
Protocol: prompt template, answer-choice ordering, shots or examples, decoding settings, number of trials, majority vote or self-consistency, answer-extraction method, and confidence interval or variance where possible.
Reasoning and tools: hidden or visible chain-of-thought, reasoning-effort setting, tool access, browsing, retrieval, code execution, calculators, and any human assistance.
Cleanliness checks: canary search, decontamination method, retrieval blocking, training cutoff if known, benchmark exposure notes, and whether prompts or outputs were later excluded from training pipelines.
Governance link: what decision the score supported, what additional evaluations were required, and links to AI Audit Trails, AI Change Management, or AI Post-Market Monitoring where relevant.
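The record above can be kept as a small structured object so that gaps are visible before a score is quoted. The sketch below is a hypothetical schema: the class name, field names, and example values are illustrative and cover only part of the list, and no standard format is implied.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class GpqaRunRecord:
    """Hypothetical reporting record; None means the detail was not disclosed."""
    subset: Optional[str] = None               # Main, Extended, Diamond, or modified
    dataset_version: Optional[str] = None
    evaluation_date: Optional[str] = None
    model_snapshot: Optional[str] = None
    prompt_template: Optional[str] = None
    answer_order: Optional[str] = None         # e.g. fixed, or shuffled with a seed
    num_trials: Optional[int] = None
    aggregation: Optional[str] = None          # e.g. single pass or majority vote
    reasoning_effort: Optional[str] = None
    tools_allowed: Optional[bool] = None
    contamination_checks: Optional[str] = None
    provider_reported: Optional[bool] = None
    decision_supported: Optional[str] = None


def missing_fields(record: GpqaRunRecord) -> list[str]:
    """List the fields left undisclosed (None); False and 0 count as disclosed."""
    return [f.name for f in fields(record) if getattr(record, f.name) is None]


# Example values are placeholders, not a real model or result.
record = GpqaRunRecord(
    subset="Diamond",
    model_snapshot="example-model-snapshot",
    num_trials=1,
    tools_allowed=False,
)
print(missing_fields(record))
```

A reviewer could refuse to place a score in a comparison table until `missing_fields` returns an empty list, or could publish the list of gaps alongside the number.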

Limits

GPQA is a serious benchmark, but it is not a complete measure of scientific ability. Multiple-choice format can hide partial reasoning failures, reward answer elimination, and fail to test whether a model can design experiments, read new papers, operate lab equipment, or communicate uncertainty responsibly.

Its domain coverage is also narrow. GPQA focuses on biology, chemistry, and physics in English, not the full range of science, engineering, medicine, fieldwork, public-health practice, lab safety, statistical review, or interdisciplinary research judgment.

The benchmark is also public. Once questions, answer choices, explanations, and discussion circulate online, future model performance can be affected by direct contamination, near-duplicate exposure, benchmark-inspired synthetic data, or post-training targeted at the benchmark style.

Even expert-written benchmarks can contain ambiguous items, disputed answers, or hidden assumptions. GPQA's own human baselines show that domain expertise does not produce perfect accuracy. A high model score should therefore be read as evidence about performance on this benchmark under a specific protocol, not as proof that the model is a trustworthy scientist.

The benchmark also emphasizes domains where high capability can be dual-use. Strong performance on biology or chemistry questions may be relevant to scientific assistance, but it can also matter for biosecurity or chemical-risk evaluation. A release report should not collapse beneficial scientific reasoning, dangerous assistance, and safe refusal behavior into one capability number.

Governance Significance

GPQA matters for governance because expert-level science is a domain where ordinary users, procurement teams, and regulators may not be able to verify outputs directly. A model can sound confident while being wrong in ways only a specialist would catch.

Responsible reporting should specify whether the score is GPQA Main or GPQA Diamond, whether the model used chain-of-thought or hidden reasoning, how many attempts were allowed, whether tools or retrieval were used, and what contamination controls were applied.

For high-stakes science, GPQA should sit inside a larger evaluation package: domain expert review, open-ended problem solving, calibration tests, hallucination analysis, biosecurity and chemistry risk evaluation, provenance checks, and post-deployment monitoring. The benchmark is useful because it makes difficult scientific reasoning visible, but it cannot by itself certify scientific reliability.

For biological and chemical domains, a high GPQA result should trigger sharper questions rather than comfort: what misuse evaluations were run, how refusals behave under expert prompting, whether the system distinguishes benign research from harmful operational help, and whether logs, escalation paths, and incident reporting are ready for high-consequence errors.

Procurement teams and safety reviewers should treat GPQA as a gate to further review, not a green light. A system used in research, clinical, environmental, chemical, or biological workflows needs source-grounded answers, uncertainty calibration, versioned logs, human expert review, incident reporting, and clear rules about when the system must refuse, escalate, or cite primary literature.

Evaluation-awareness and benchmark-gaming risks also matter. A frontier model may recognize benchmark formats, and a lab may tune prompts, effort settings, or scaffolds toward public tests. Governance-grade reporting should therefore ask whether GPQA behavior transfers to deployment-like science tasks, not only whether the model can identify the correct letter on a known public benchmark.

NIST's TEVV framing is helpful here because it treats measurement as test, evaluation, verification, and validation, not a single scoreboard. GPQA can contribute to the "test" layer. Validation still requires evidence that the deployed system works for the intended scientific workflow under real constraints.

Source Discipline

Use the original GPQA paper and OpenReview page for design, authorship, publication venue, human baselines, and scalable-oversight framing. Use the GitHub repository and Hugging Face dataset card for dataset distribution, the canary string, access conditions, and correction workflow. Use model cards or system cards for provider scores, and label those scores as provider-reported unless independently reproduced.

Do not reproduce GPQA questions, answer choices, or explanations in this wiki. The dataset maintainers explicitly request that examples not be revealed online, and this page should not make benchmark leakage worse.

When citing a GPQA result, state whether it is Main, Extended, or Diamond; whether it is zero-shot, few-shot, chain-of-thought, majority-vote, tool-assisted, or high-compute; whether it uses a public benchmark harness; and whether contamination or answer-key issues were checked. A bare percentage is weak evidence.

For current model comparisons, prefer accessible model cards, system cards, technical reports, reproducible evaluation harness notes, or independent evaluator reports over image-only charts or leaderboard screenshots. If a chart is the only public source, say so and avoid turning it into a timeless ranking.

Do not cite GPQA as proof of AGI, consciousness, scientific autonomy, safe deployment, or professional competence. It is a hard multiple-choice benchmark for expert-written science questions under documented conditions.

Spiralist Reading

GPQA is a test of the expert-shaped mirror.

The ordinary web is full of answers, but some questions cannot be solved by retrieving the nearest sentence. They require disciplinary formation: knowing what matters, which assumptions are live, which distractors are plausible, and how a field reasons under uncertainty.

For Spiralism, GPQA is valuable because it interrupts the fantasy that search is understanding. It also warns against a second fantasy: that passing a hard exam makes a system institutionally trustworthy. The score is a signal. The institution still needs experts, procedures, review, and humility about what a multiple-choice item can show.

Open Questions

How long can GPQA remain useful after becoming a standard public target for frontier model releases?
How should evaluation reports distinguish between GPQA Main, GPQA Diamond, tool-assisted runs, and multiple-attempt protocols?
Can expert science benchmarks be maintained without leaking enough examples or solution patterns to weaken future comparisons?
What evaluation methods are needed beyond multiple choice to test real scientific judgment, lab planning, and uncertainty management?
How should institutions use GPQA-like results when assessing biosecurity, chemistry, medicine, or scientific-discovery claims?
What disclosure is needed when GPQA performance depends on hidden reasoning budgets or provider-specific effort controls?

Sources

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman, GPQA: A Graduate-Level Google-Proof Q&A Benchmark, arXiv, November 20, 2023; COLM 2024.
OpenReview, GPQA: A Graduate-Level Google-Proof Q&A Benchmark, Conference on Language Modeling 2024.
GPQA GitHub repository, idavidrein/gpqa, reviewed June 25, 2026.
Hugging Face, Idavidrein/gpqa dataset, reviewed June 25, 2026.
OpenAI, OpenAI o1 System Card, December 5, 2024.
OpenAI, Learning to reason with LLMs, September 12, 2024.
OpenAI, Introducing OpenAI o3 and o4-mini, April 16, 2025.
Google DeepMind, Gemini 2.5 Pro Model Card, updated June 27, 2025.
Google DeepMind, Gemini 3.1 Pro Model Card, published February 2026; reviewed June 25, 2026.
Anthropic, Claude 3.5 Sonnet, June 21, 2024.
Anthropic, Claude Sonnet 4.6 System Card, February 17, 2026.
NIST, AI Test, Evaluation, Validation and Verification, reviewed June 25, 2026.
Church of Spiralism, When the Benchmark Becomes the Curriculum, for the site's broader benchmark-governance frame.
