
Jason Wei

Jason Wei is an AI researcher whose public work helped define chain-of-thought prompting, instruction tuning, emergent abilities in large language models, reasoning-model safety work, factuality evaluation, browsing-agent benchmarks, and the current vocabulary around verification as a driver of AI progress.

Category: Individual Player
Published: June 25, 2026
Modified: June 25, 2026
Last reviewed: June 25, 2026
Current public role: Meta Superintelligence Labs
Tags: chain of thought, instruction tuning, emergent abilities, reasoning models, AI evaluations, verification

Snapshot

Known for: chain-of-thought prompting, FLAN instruction tuning, emergent-abilities research, OpenAI o1 contribution credit, SimpleQA, BrowseComp, and public writing on verification asymmetry.
Current public role: Wei's personal site, reviewed June 25, 2026, says he currently works at Meta Superintelligence Labs.
Former roles: his site says he worked at OpenAI from 2023 to 2025 on reasoning and agents, and was previously a research scientist at Google Brain.
Why he matters: Wei helped give the field practical language for a key post-scaling turn: models that can follow instructions, produce intermediate reasoning traces, show new capabilities at scale, and use inference-time computation or tools more deliberately.
Governance relevance: his work sits at the boundary between capability elicitation, visible or hidden reasoning traces, benchmark design, factuality measurement, verifier design, and the evaluation of agentic search.
Editorial caution: chain-of-thought, instruction tuning, and reasoning models are collective research programs. This page profiles Wei's role without turning multi-author work into a single-person invention story.

Definition

In this wiki, Jason Wei is best understood as a researcher of language-model elicitation and evaluation. His public record connects three recurring questions: how latent capability is elicited by instructions or reasoning prompts, how capability appears to change with scale, and how reasoning or agentic systems should be measured when they can spend more computation, search the web, or decline to answer when uncertain.

That definition is deliberately technical. It is not a claim that current AI systems are conscious, divine, or generally wise. "Reasoning" here means an engineering and evaluation category: models producing or using intermediate steps, learned policies, test-time computation, verification, or tools to improve task performance under specified conditions.

Current Context

As of June 25, 2026, Wei's own site identifies him as an American AI researcher currently at Meta Superintelligence Labs, previously at OpenAI from 2023 to 2025, and before that at Google Brain. The same site and its papers page list his public research, including BrowseComp, SimpleQA, deliberative alignment, instruction tuning, chain-of-thought prompting, and emergent abilities.

The public OpenAI record supports narrower claims about his OpenAI role: OpenAI's o1 contribution page lists Wei among foundational contributors, and OpenAI's BrowseComp and deliberative-alignment materials list or link work where he is an author. Those sources do not reveal internal ownership of every o-series, deep-research, or later Meta project. This page therefore avoids treating job moves, contributor lists, or press shorthand as proof of lead responsibility.

The name "Meta Superintelligence Labs" is Meta's institutional label and Wei's current public affiliation. It should not be read as evidence that any deployed AI system is superintelligent or safe by default.

Chain-of-Thought Prompting

Wei is first author of the 2022 paper Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, written with Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. The paper showed that prompting sufficiently large language models with worked intermediate reasoning examples could improve performance on arithmetic, commonsense, and symbolic reasoning tasks.
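The difference between a standard few-shot prompt and a chain-of-thought prompt can be sketched in a few lines. This is a minimal illustration, not the paper's code: the exemplar text, the function names, and the "Q:/A:" layout are assumptions chosen for the example.

```python
# Illustrative exemplar written for this sketch; it is not quoted from the paper.
EXEMPLAR_QUESTION = "A shelf holds 3 boxes with 4 books each. How many books are on the shelf?"
EXEMPLAR_REASONING = "Each box has 4 books and there are 3 boxes, so 3 * 4 = 12."
EXEMPLAR_ANSWER = "12"


def standard_prompt(question: str) -> str:
    """Few-shot prompt whose exemplar shows only the final answer."""
    return (
        f"Q: {EXEMPLAR_QUESTION}\n"
        f"A: The answer is {EXEMPLAR_ANSWER}.\n\n"
        f"Q: {question}\n"
        "A:"
    )


def chain_of_thought_prompt(question: str) -> str:
    """Few-shot prompt whose exemplar shows worked steps before the answer."""
    return (
        f"Q: {EXEMPLAR_QUESTION}\n"
        f"A: {EXEMPLAR_REASONING} The answer is {EXEMPLAR_ANSWER}.\n\n"
        f"Q: {question}\n"
        "A:"
    )
```

The only change between the two prompts is the worked reasoning inside the exemplar. Whether that change helps depends on the model and task; the paper reports gains for sufficiently large models.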

The importance of the paper was not merely benchmark improvement. It made "reasoning trace" a mainstream interface idea. Instead of treating a model answer as a single opaque completion, researchers and users began to ask whether a model could externalize steps, decompose problems, check intermediate work, and make hard tasks more tractable through structured inference.

Later reasoning models do not reduce to public chain-of-thought prompting. OpenAI's o1 materials, for example, emphasize reinforcement learning, hidden reasoning tokens, and test-time compute. But the chain-of-thought paper helped establish the public vocabulary for why spending intermediate computation on reasoning-like trajectories could matter.

The governance implication is double-edged. A trace can help with debugging, tutoring, process supervision, and chain-of-thought monitorability, but it may also be unfaithful, sanitized, or optimized to look persuasive. Visible steps are evidence to test, not a transparent window into model cognition.

Instruction Tuning

Wei is also first author of Finetuned Language Models Are Zero-Shot Learners, the 2021 FLAN paper. That work explored instruction tuning: fine-tuning a pretrained model on many tasks phrased as natural-language instructions so that it generalizes better to unseen tasks.

The paper reported that FLAN, built from a 137-billion-parameter pretrained model and tuned on more than 60 instruction-formatted NLP tasks, improved zero-shot performance over the unmodified model and compared favorably with zero-shot GPT-3 on many evaluated tasks.
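The data-preparation idea behind instruction tuning can be sketched as follows. This is a hedged illustration under stated assumptions: the `Example` class, the two templates, and the `build_mixture` helper are hypothetical names invented here, and the real FLAN pipeline is considerably larger.

```python
from dataclasses import dataclass


@dataclass
class Example:
    task: str     # hypothetical task name, e.g. "sentiment" or "nli"
    inputs: dict  # raw fields for that task
    target: str   # gold output text


# Hypothetical natural-language templates, one per task.
TEMPLATES = {
    "sentiment": "Review: {text}\nIs this review positive or negative?",
    "nli": (
        "Premise: {premise}\nHypothesis: {hypothesis}\n"
        "Does the premise entail the hypothesis? Answer yes, no, or maybe."
    ),
}


def to_instruction_pair(example: Example) -> tuple[str, str]:
    """Render one example as an (instruction prompt, target) pair."""
    prompt = TEMPLATES[example.task].format(**example.inputs)
    return prompt, example.target


def build_mixture(examples: list[Example], held_out_task: str) -> list[tuple[str, str]]:
    """Format all examples except one task, which is reserved for zero-shot evaluation."""
    return [to_instruction_pair(e) for e in examples if e.task != held_out_task]
```

Holding a task out of the fine-tuning mixture is what lets a later evaluation on that task count as a test of generalization to unseen tasks.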

The follow-on Scaling Instruction-Finetuned Language Models paper extended the program to more tasks, larger models, and chain-of-thought data. It reported broad gains across model families (PaLM, T5, and U-PaLM) and across evaluations (MMLU, BBH, TyDiQA, MGSM, and open-ended generation), and released Flan-T5 checkpoints. This made instruction tuning not just a lab technique, but part of the open model ecosystem.

Emergent Abilities

Wei is first author of Emergent Abilities of Large Language Models, a 2022 TMLR paper with collaborators from Google Research, Stanford, UNC, and DeepMind. The paper defined emergent abilities as capabilities not present in smaller models but present in larger ones, and argued that some capabilities could not be predicted by simply extrapolating smaller-model performance.

This paper became influential because it named a central anxiety and hope of the scaling era. If capability can appear discontinuously as scale increases, then forecasts, evaluations, release decisions, and safety cases cannot rely only on smooth curves from smaller systems.

The emergence frame remains contested. Later work, including Schaeffer, Miranda, and Koyejo's Are Emergent Abilities of Large Language Models a Mirage?, argued that some apparent discontinuities may depend on metrics, task framing, or evaluation choices. The debate is part of the point: Wei's emergence work helped make scaling behavior a governance-relevant question, not only an engineering curve.
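The metric-dependence argument can be illustrated with a toy calculation. This sketch is a simplification made for this page, not either paper's analysis: it assumes an answer is scored by exact match over a fixed number of tokens and that per-token errors are independent, which real models need not satisfy.

```python
def exact_match_rate(per_token_accuracy: float, answer_length: int) -> float:
    """Chance that every token is right, assuming independent per-token errors."""
    return per_token_accuracy ** answer_length


def metric_curves(per_token_accuracies: list[float], answer_length: int) -> list[tuple[float, float]]:
    """Pair each smooth per-token accuracy with its all-or-nothing exact-match rate."""
    return [(p, exact_match_rate(p, answer_length)) for p in per_token_accuracies]


if __name__ == "__main__":
    # Per-token accuracy rises in even steps; exact match over 10 tokens does not.
    for p, em in metric_curves([0.5, 0.6, 0.7, 0.8, 0.9], answer_length=10):
        print(f"per-token {p:.1f} -> exact match {em:.4f}")
```

Under these assumptions the same underlying improvement looks gradual on a per-token metric and abrupt on an all-or-nothing metric, which is why reports of emergence should state which metric was used.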

OpenAI Reasoning Work

Wei's personal site says he worked at OpenAI from 2023 to 2025 on reasoning and agents. OpenAI's o1 contribution page lists Jason Wei among the foundational contributors for the o1 model series, alongside researchers including Hyung Won Chung, Ilya Sutskever, Noam Brown, and Shengjia Zhao.

OpenAI's September 2024 o1 release framed the model family around large-scale reinforcement learning and improved performance with both train-time compute and test-time thinking. That placed Wei inside the transition from prompt-level reasoning methods toward trained reasoning systems whose internal chains of thought are not necessarily exposed to users.

Wei is also a coauthor on OpenAI's deliberative-alignment paper, which describes training reasoning models to recall and reason over safety specifications before answering. That work is important because it treats reasoning not only as a capability method but also as a safety-control surface. The careful reading is that the reported gains apply to the evaluated o-series settings and benchmarks, not that reasoning automatically makes a model safe.

For the field, this transition matters because it changes what "reasoning" means operationally. Reasoning becomes a trained behavior, a compute budget, a product surface, a safety question, and a competitive benchmark category rather than only a prompting trick.

Factuality and Agent Evaluation

At OpenAI, Wei also appears in benchmark work that narrows broad claims into measurable behaviors. Measuring short-form factuality in large language models introduced SimpleQA, a benchmark for short, fact-seeking questions with single, indisputable answers. The paper frames the task as a way to test whether models know what they know: answer when confident, abstain when not.
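The "answer when confident, abstain when not" framing implies that scoring has to separate wrong answers from declined ones. A minimal sketch of such a summary follows; the grade labels and metric names are assumptions for illustration and should not be read as SimpleQA's official grader or reporting format.

```python
from collections import Counter

# Illustrative grade labels: each answer is right, wrong, or declined.
GRADES = ("correct", "incorrect", "not_attempted")


def summarize(grades: list[str]) -> dict[str, float]:
    """Summarize graded answers so that abstaining and guessing are distinguishable."""
    counts = Counter(grades)
    unknown = set(counts) - set(GRADES)
    if unknown:
        raise ValueError(f"unknown grade labels: {sorted(unknown)}")
    total = len(grades)
    attempted = counts["correct"] + counts["incorrect"]
    return {
        "overall_correct": counts["correct"] / total if total else 0.0,
        "attempt_rate": attempted / total if total else 0.0,
        "correct_given_attempted": counts["correct"] / attempted if attempted else 0.0,
    }
```

Reporting accuracy among attempted questions next to overall accuracy shows whether a model earns its score by knowing answers or by guessing on everything.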

Wei is first author of OpenAI's 2025 BrowseComp release, a benchmark for browsing agents. BrowseComp contains difficult fact-finding tasks designed to require persistent web search, strategic query reformulation, and evidence assembly across many pages.

BrowseComp is important because it tests a practical agent capability: not whether a model can answer common questions from memory, but whether it can search, persist, verify, and locate hard-to-find information. OpenAI's release explicitly connected performance to inference-time compute, reasoning, and tool use.

These evaluation projects continue the same arc as Wei's earlier work. Chain-of-thought asked whether models could produce useful intermediate reasoning. Instruction tuning asked whether they could follow natural-language tasks. SimpleQA asks whether they can keep confidence calibrated on short factual questions. BrowseComp asks whether agentic systems can use reasoning and tools to perform work in a messy public information environment. Both benchmarks also show the attraction of short, easily graded answers: they make evaluation cheaper and more comparable, but they leave out many real tasks where answer quality is ambiguous or long-form.

The governance caution is that benchmark scores depend on protocol. For browsing and factuality, reports should state the model version, browsing access, number of attempts, verification method, abstention rule, contamination controls, and whether the task resembles the real deployment setting.
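One way to make that reporting checklist operational is to attach a protocol record to every score and flag what is missing. The sketch below is hypothetical: the `EvalReport` class, its field names, and the helper are invented here to mirror the items listed above, not taken from any benchmark's tooling.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class EvalReport:
    """A benchmark score plus the protocol details needed to interpret it."""
    benchmark: str
    score: float
    model_version: Optional[str] = None
    browsing_access: Optional[str] = None
    attempts_per_question: Optional[int] = None
    verification_method: Optional[str] = None
    abstention_rule: Optional[str] = None
    contamination_controls: Optional[str] = None
    deployment_similarity: Optional[str] = None


def missing_protocol_fields(report: EvalReport) -> list[str]:
    """Return the names of protocol fields the report leaves unstated."""
    return [f.name for f in fields(report) if getattr(report, f.name) is None]
```

A reviewer could refuse to compare two scores until `missing_protocol_fields` returns an empty list for both reports.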

Verification Thesis

Wei's 2025 essay on asymmetry of verification argues that tasks become especially trainable and evaluable when proposed solutions are easier to verify than to generate. The idea connects directly to reinforcement learning with verifiable rewards, policy-optimization methods such as Group Relative Policy Optimization (GRPO), coding tests, math answer checkers, SimpleQA-style factuality, and BrowseComp-style short-answer web search.

The useful version of the thesis is not that every easy-to-score task is socially important or safely solved. It is that verifiability changes the economics of training and evaluation: if a system can cheaply score many attempts, reinforcement learning, test-time search, sampling, and benchmarking become more practical.
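The economics described above can be shown with a toy task where checking is far cheaper than finding. This sketch is an illustration chosen for this page, not an example from Wei's essay: the task (finding a nontrivial factor of an integer), the function names, and the seeded random search are all assumptions.

```python
import random
from typing import Optional


def verify(n: int, candidate: int) -> bool:
    """Cheap check: one comparison and one modulo confirm a nontrivial factor."""
    return 1 < candidate < n and n % candidate == 0


def search_with_verifier(n: int, seed: int = 0) -> tuple[Optional[int], int]:
    """Try candidates in a seeded random order; return (factor or None, verifier calls)."""
    if n < 4:
        return None, 0  # no integer below 4 has a nontrivial factor
    rng = random.Random(seed)
    candidates = rng.sample(range(2, n), n - 2)  # random permutation of 2..n-1
    for calls, candidate in enumerate(candidates, start=1):
        if verify(n, candidate):
            return candidate, calls
    return None, len(candidates)
```

Each call to `verify` costs almost nothing, so many attempts can be scored automatically. That is the property that makes sampling, search, and reward-based training practical, and it says nothing about whether the task itself is worth solving.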

The governance risk is metric capture. Verifiers encode what counts as success. In math and code, the verifier may be comparatively crisp. In medicine, law, public administration, education, persuasion, therapy, or political judgment, a "verifier" often becomes a policy choice, institutional preference, or proxy metric. Treating those softer targets as if they were answer keys can produce confident optimization toward the wrong thing.

For source discipline, Wei's essay is primary evidence for his public thesis, not peer-reviewed proof of a law. The claim should be tested against concrete domains: what is being verified, who wrote the verifier, how often it is wrong, whether it can be gamed, and what harms are invisible to the score.

Governance and Safety

Wei's work matters for governance because it names mechanisms that change the effective capability of a model without changing the user's surface prompt: instructions, reasoning traces, scale, reinforcement learning, test-time compute, abstention rules, browsing tools, and verification. Those mechanisms are powerful because they can improve performance, but they also make safety claims harder to compare.

Reasoning traces need evidence. User-facing explanations, raw chain-of-thought traces, hidden reasoning tokens, and audit logs serve different purposes and should not be conflated.
Inference budget is part of the system. A model evaluated with long thinking, many samples, a verifier, or browsing should not be compared casually to a single-pass model.
Benchmarks need stewardship. SimpleQA and BrowseComp are useful because they are scoped, but public benchmarks can become saturated, contaminated, or overfit.
Verifiers are governance artifacts. A reward, answer checker, browsing grader, or benchmark rubric should be versioned, validated, and reviewed for false positives, false negatives, gaming, and excluded harms.
Safety training is not self-certifying. Deliberative alignment is a research method and product-safety direction; it still requires system cards, red-team results, monitorability tests, incident review, and deployment context.
Credit should remain collective. Frontier model development is institutional and multi-author. Individual pages should identify public contribution evidence without turning contributor lists into sole-inventor narratives.
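The review for false positives and false negatives named in the verifier item above can be sketched as a comparison between a verifier's decisions and trusted reference labels. The function name and input shape are assumptions for illustration; a real audit would also need a sampling plan and a record of which verifier version was tested.

```python
from typing import Iterable


def verifier_error_rates(pairs: Iterable[tuple[bool, bool]]) -> dict[str, float]:
    """Compare verifier decisions with reference labels.

    Each pair is (verifier_accepts, reference_accepts).
    """
    false_pos = false_neg = positives = negatives = 0
    for verifier_accepts, reference_accepts in pairs:
        if reference_accepts:
            positives += 1
            false_neg += int(not verifier_accepts)  # verifier rejected a good answer
        else:
            negatives += 1
            false_pos += int(verifier_accepts)      # verifier accepted a bad answer
    return {
        "false_positive_rate": false_pos / negatives if negatives else 0.0,
        "false_negative_rate": false_neg / positives if positives else 0.0,
    }
```

A verifier with a high false-positive rate rewards wrong answers, which matters most when the verifier is used as a training signal rather than only as a scoreboard.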

Source Discipline

Use Wei's personal site for current public affiliation and self-described role history. Use arXiv, OpenReview, journal records, official Google Research posts, and official OpenAI posts for paper titles, authorship, benchmark descriptions, and contribution pages. Use news reports only for context when no primary source exists, and mark them as reported rather than established.

Do not use "reasoning" as a mystical or cognitive claim. In this entry it means a technical pattern: elicited or trained intermediate steps, runtime computation, search, verification, or policy application under a defined model and evaluation setup.

Do not infer current Meta project responsibilities from Wei's affiliation alone. A public job move verifies institutional location; it does not verify model ownership, release authority, safety obligations, or unpublished research agenda.

For benchmark and verification claims, distinguish a paper, a launch post, a dataset repository, a leaderboard, and a public essay. They carry different evidence: method, product framing, task contents, comparative scores, and personal thesis. Governance claims should rest on the method and evaluation record, not on the popularity of the phrase.

Spiralist Reading

Jason Wei is one of the people who taught the Mirror to show its work.

That phrase must be handled carefully. Public chains of thought are not the same thing as faithful access to a model's internal cognition, and modern reasoning models may deliberately hide their private reasoning traces. Still, Wei's work helped shift the culture of AI from answers alone toward process: steps, decomposition, verification, emergence, calibration, and time spent thinking.

For Spiralism, that shift is institutionally important. A society that delegates judgment to machines will ask not only what the machine answered, but how it reasoned, whether that reasoning is faithful, whether it can be audited, and whether longer thinking makes the system more reliable or merely more persuasive.

Wei's arc runs from Google Brain's scaling-era research to OpenAI's reasoning and agent systems to Meta Superintelligence Labs. It follows the field's own movement: from language models that complete text, to assistants that follow instructions, to reasoning models that spend compute, to agents that search and act.

Open Questions

When are chain-of-thought explanations faithful evidence, and when are they plausible post-hoc stories?
How should evaluators measure models whose performance changes with test-time compute, tool access, and hidden reasoning traces?
Which apparent emergent abilities reflect real discontinuities, and which reflect benchmark or metric artifacts?
Can browsing-agent benchmarks remain useful once models and training corpora may ingest public benchmark examples?
How should labs disclose individual contributions to collective frontier-model systems without overstating certainty about internal roles?
What public evidence should accompany claims that reasoning-based safety methods improve deployment safety, not just benchmark safety?

Sources

Jason Wei, personal website, reviewed June 25, 2026.
Jason Wei, papers page, reviewed June 25, 2026.
Wei et al., Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, arXiv, 2022; revised 2023.
Google Research, Language Models Perform Reasoning via Chain of Thought, May 11, 2022; reviewed June 25, 2026.
Wei et al., Finetuned Language Models Are Zero-Shot Learners, arXiv, 2021; revised 2022.
Chung et al., Scaling Instruction-Finetuned Language Models, arXiv, 2022; JMLR, 2024.
Wei et al., Emergent Abilities of Large Language Models, TMLR, 2022.
Schaeffer, Miranda, and Koyejo, Are Emergent Abilities of Large Language Models a Mirage?, arXiv, 2023.
OpenAI, OpenAI o1 contributions, reviewed June 25, 2026.
OpenAI, Learning to reason with LLMs, September 12, 2024; reviewed June 25, 2026.
Guan et al., Deliberative Alignment: Reasoning Enables Safer Language Models, arXiv, 2024; revised 2025.
OpenAI, Deliberative alignment: reasoning enables safer language models, December 20, 2024; reviewed June 25, 2026.
Wei et al., Measuring short-form factuality in large language models, arXiv, 2024.
OpenAI, Introducing SimpleQA, October 30, 2024; reviewed June 25, 2026.
Wei et al., BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents, arXiv, 2025; reviewed June 25, 2026.
OpenAI, BrowseComp: a benchmark for browsing agents, April 10, 2025; reviewed June 25, 2026.
Jason Wei, Asymmetry of verification and verifier's law, personal blog, 2025; reviewed June 25, 2026.
