Wiki · Person · Last reviewed June 23, 2026

John Schulman

John Schulman is an AI researcher, OpenAI founding member, co-author of Trust Region Policy Optimization and Proximal Policy Optimization, and a central figure in the reinforcement-learning and post-training lineage that helped turn large language models into assistant-like systems. His public profile identifies him as co-founder and chief scientist at Thinking Machines Lab.

Definition

In this wiki, John Schulman is best understood as a technical bridge between deep reinforcement learning and deployed language-model assistants. His research helped make policy-gradient methods practical enough for large engineering systems, and his OpenAI work helped make human-feedback post-training a central layer of public AI products.

That role should be described precisely. Schulman did not single-handedly invent RLHF, ChatGPT, or alignment. His importance is that several lines of work converged around him: TRPO and PPO for policy optimization, reward-model optimization in InstructGPT-style training, post-training leadership for ChatGPT-era models, and the current Thinking Machines Lab emphasis on customizable and collaborative AI systems.

Snapshot

Current Context

As of June 23, 2026, the strongest evidence for Schulman's current role is his own public profile, which identifies him as co-founder and chief scientist at Thinking Machines Lab. TechCrunch's February 2025 launch coverage also identified him as the company's chief scientist, with Mira Murati as CEO and Barret Zoph as CTO. Thinking Machines' own current materials are stronger evidence for company direction than for individual personnel details.

The company direction most relevant to Schulman's profile is post-training access. Thinking Machines' Tinker product is described as a training API for researchers and developers that gives users control over datasets, algorithms, and fine-tuning loops while the company handles distributed training infrastructure. The product page says Tinker uses LoRA, supports multiple open-source or open-weight model families, uses customer data only to fine-tune customer models, and allows saved checkpoints to be downloaded.
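The LoRA technique mentioned above can be sketched in a few lines. This is generic low-rank adaptation arithmetic in plain Python, not Tinker's API or implementation; the function names (`matvec`, `lora_forward`, `trainable_params`) and the `alpha` scaling parameter name are illustrative assumptions. The idea is that a frozen weight matrix W is left untouched while two small matrices, A and B, are trained, and their product is added as a scaled update.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def lora_forward(x, W, A, B, alpha, rank):
    """Compute W x + (alpha / rank) * B (A x).

    W: frozen base weights, shape d_out x d_in
    A: trainable down-projection, shape rank x d_in
    B: trainable up-projection, shape d_out x rank
    """
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


def trainable_params(d_out, d_in, rank):
    """Number of trainable values in the adapter (A plus B)."""
    return rank * (d_in + d_out)


if __name__ == "__main__":
    W = [[1.0, 2.0], [3.0, 4.0]]
    A = [[1.0, 0.0]]          # rank 1
    B_zero = [[0.0], [0.0]]   # a zero B leaves the base model unchanged
    print(lora_forward([1.0, 1.0], W, A, B_zero, alpha=2.0, rank=1))
    # Adapter size versus a full 4096 x 4096 matrix at rank 8
    print(trainable_params(4096, 4096, 8), 4096 * 4096)
```

Because only A and B are trained, the saved artifact can be much smaller than the base model, which is one reason adapter-based fine-tuning fits a managed service that lets users download checkpoints.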

Thinking Machines' interaction-model work extends the same theme into interface design. Its May 2026 research preview describes models built for continuous audio, video, and text interaction using time-aligned micro-turns, rather than only turn-based prompt/response exchanges. That is governance-relevant because post-training is no longer only about answer quality; it is also about how a model participates in an ongoing human situation.

PPO and Reinforcement Learning

Schulman's early influence comes from deep reinforcement learning. The 2015 TRPO paper, authored by John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, and Pieter Abbeel, proposed a practical policy-optimization algorithm built around constrained policy updates. The 2017 PPO paper, authored by Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov, simplified that family of ideas into a method that was easier to implement and tune.
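The simplification can be illustrated with the clipped surrogate objective from the PPO paper. The following is a minimal sketch in plain Python, not OpenAI's implementation; the function name `clipped_surrogate`, the argument names, and the default `clip_eps=0.2` are illustrative assumptions. Where TRPO enforces a constraint on how far the policy may move, this objective simply removes the incentive to move the probability ratio outside a small interval around 1.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective over a batch of sampled actions.

    logp_new:   log-probabilities of the actions under the updated policy
    logp_old:   log-probabilities under the policy that collected the data
    advantages: advantage estimates for the same actions
    """
    if not advantages:
        raise ValueError("batch must not be empty")
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)
        # Restrict the ratio to [1 - clip_eps, 1 + clip_eps]
        clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Take the more pessimistic of the unclipped and clipped terms
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(advantages)
```

With a positive advantage and a ratio of 1.5, the term is capped at 1.2 times the advantage, so pushing the ratio further earns nothing. In practice the objective is maximized with a gradient-based optimizer over several epochs of minibatches, which this sketch omits.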

OpenAI's PPO release described the method as a default reinforcement-learning algorithm at OpenAI because it combined performance, usability, and implementation simplicity. That practical quality mattered. PPO was not only a research result; it became an engineering workhorse for training agents and, later, for language-model RLHF pipelines.

The technical lesson is not that PPO solves alignment or intelligence. It is that a stable, usable optimization method can become infrastructure. Once a method is simple enough to run in real systems, it can migrate from simulated control tasks into model behavior, reward-model optimization, assistant training, and product deployment.

This is why Schulman belongs in the same map as Richard Sutton, Andrew Barto, Pieter Abbeel, Paul Christiano, and Jan Leike. His work sits where reinforcement learning leaves textbook control problems and becomes a training method inside deployed AI systems.

RLHF and Instruction Following

RLHF depends on a loop: collect human demonstrations or preferences, train a reward model or preference signal, and optimize a policy toward behavior humans prefer. In OpenAI's InstructGPT work, the final reinforcement-learning step used PPO to optimize a GPT-3 policy against a reward model trained from human comparisons.
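Two pieces of the loop above can be sketched numerically. This is an illustration under stated assumptions, not OpenAI's code: it uses a pairwise logistic loss for training a reward model on human comparisons and a KL-style penalty that keeps the optimized policy close to a reference model, both in the spirit of what the InstructGPT paper describes. The function names and the `beta` default are hypothetical.

```python
import math


def pairwise_preference_loss(score_chosen, score_rejected):
    """Return -log(sigmoid(score_chosen - score_rejected)).

    The loss is small when the reward model scores the human-preferred
    output above the rejected one, and large when it ranks them wrongly.
    """
    margin = score_chosen - score_rejected
    # Numerically stable evaluation for both signs of the margin
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def kl_shaped_reward(rm_score, logp_policy, logp_reference, beta=0.02):
    """Reward-model score minus a penalty for drifting from the reference.

    logp_policy and logp_reference are log-probabilities of the same
    sampled output under the current policy and the reference model.
    beta is an assumed penalty coefficient.
    """
    return rm_score - beta * (logp_policy - logp_reference)
```

The shaped reward is what a policy-optimization step such as PPO would then maximize. The penalty term matters because the reward model is only trustworthy near the data it was trained on.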

The InstructGPT paper reported that human evaluators preferred outputs from the 1.3-billion-parameter InstructGPT model over outputs from the 175-billion-parameter GPT-3 model on the prompt distribution studied, despite the smaller model having roughly 100 times fewer parameters. It also reported improvements in truthfulness and reduced toxic output generation, while noting that the models still made simple mistakes.

Schulman's importance here is not that he alone invented RLHF. The lineage includes Paul Christiano, Jan Leike, Dario Amodei, Long Ouyang, Ryan Lowe, and many others. His specific significance is that PPO and OpenAI's post-training practice became part of the same operational pipeline: reinforcement learning moved from agent benchmarks into the social interface of language models.

The governance point is direct. RLHF does not optimize truth, safety, justice, or human agency by default. It optimizes a learned proxy for a particular feedback process. That proxy can encode rater preferences, product policy, hidden labor, safety guidelines, cultural assumptions, and evaluator incentives.

ChatGPT and Post-Training

OpenAI's ChatGPT launch announcement listed Schulman among a large group of contributors. His public profile says he led the creation of ChatGPT and, from 2022 to 2024, co-led the post-training team that developed models for ChatGPT and the OpenAI API. Both claims should be held together: his leadership mattered, but the public artifact was a large team effort.

Post-training is the hidden layer of public AI culture. Base models learn broad statistical capability. Post-training decides whether those capabilities appear as an assistant, tutor, coding partner, refusal machine, policy surface, or conversational authority. Schulman's work therefore matters because the public experience of AI depends heavily on post-training choices: what behavior is rewarded, what is discouraged, what becomes easy to elicit, and what forms of interaction feel natural.

Schulman left OpenAI in August 2024 for Anthropic, stating publicly that he wanted to deepen his focus on AI alignment and return to more hands-on technical work. TechCrunch reported in February 2025 that he had left Anthropic after five months; Thinking Machines launch coverage later that month identified him as chief scientist of Murati's new company.

Thinking Machines Lab

Thinking Machines Lab presents itself as an AI research and product company focused on making AI systems more widely understood, customizable, and generally capable. Its public statement emphasizes human-AI collaboration, multimodal interaction, shared scientific work, safety practices, and customization rather than only fully autonomous AI systems.

The company's Tinker product, announced in October 2025, is a managed API for fine-tuning language models. Thinking Machines says Tinker gives researchers and developers control over algorithms and data while the company handles distributed training infrastructure. This places Schulman's current work near the frontier of post-training access: who gets to adapt powerful models, with what abstractions, and under which safety constraints.

In May 2026, Thinking Machines also announced a research preview of interaction models, systems designed to handle real-time audio, video, and text interaction natively rather than through external scaffolding. That direction extends the post-training question into the interface itself: how should AI collaborate while humans remain present, interruptible, and able to steer?

Governance and Safety

Schulman's career makes clear that post-training is a governance layer, not just a technical polish step. PPO, reward models, human comparison data, rater guidelines, refusal behavior, and interface design are all ways of translating institutional choices into model behavior.

RLHF governance. Human-feedback systems need records of prompt sources, labeler instructions, rater demographics where relevant and lawful, reward-model versions, optimization settings, safety-policy changes, and known failure modes such as reward hacking, sycophancy, over-refusal, and hidden policy embedding.

Post-training governance. Model and system cards should distinguish the base model, post-trained checkpoint, reward model, safety classifier, tool scaffold, system prompt, and product deployment. A claim that a system was "aligned" or "trained with human feedback" is too vague for audit unless the behavioral artifact and evaluation date are named.
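One way to make that distinction concrete is a structured record. The sketch below is hypothetical: the class `BehaviorClaim`, its field names, and the rule in `is_auditable` are assumptions chosen to mirror the components listed above, not an existing model-card standard.

```python
from dataclasses import dataclass, fields
from typing import Optional

# Fields that can serve as the named behavioral artifact (assumed rule)
ARTIFACT_FIELDS = ("base_model", "post_trained_checkpoint", "reward_model")


@dataclass
class BehaviorClaim:
    claim: str
    base_model: Optional[str] = None
    post_trained_checkpoint: Optional[str] = None
    reward_model: Optional[str] = None
    safety_classifier: Optional[str] = None
    tool_scaffold: Optional[str] = None
    system_prompt: Optional[str] = None
    product_deployment: Optional[str] = None
    evaluation_date: Optional[str] = None


def missing_fields(record):
    """List the names of fields that were left unset or empty."""
    return [f.name for f in fields(record)
            if getattr(record, f.name) in (None, "")]


def is_auditable(record):
    """A claim is auditable only if an artifact and a date are named."""
    has_artifact = any(getattr(record, name) for name in ARTIFACT_FIELDS)
    return has_artifact and bool(record.evaluation_date)


if __name__ == "__main__":
    vague = BehaviorClaim(claim="trained with human feedback")
    print(is_auditable(vague), missing_fields(vague))
```

A record like this does not establish that a system is safe. It only makes explicit which component a claim refers to and when it was evaluated.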

Customization governance. Tinker-like platforms can widen research access, but they also distribute safety responsibility. The relevant record should include the base model, adapter or checkpoint, dataset provenance, evaluation settings, export history, data-use promises, misuse controls, and incident-response channel.

Interaction governance. Real-time audio, video, and text models should be evaluated for interruption, privacy, consent, minors, emotional salience, persuasion, accessibility, over-reliance, crisis escalation, and whether a human can notice and correct the system while the interaction is happening.

Source Discipline

For Schulman, source discipline means separating personal-profile claims, primary research papers, official company posts, product documentation, launch pages, and press reporting.

Use the PPO and TRPO papers for technical authorship and claims about the algorithms. Use OpenAI's InstructGPT and ChatGPT pages for the public RLHF and ChatGPT launch record. Use Schulman's profile for his self-described role and career sequence. Use TechCrunch for dated reporting on his Anthropic departure and Thinking Machines launch role. Use Thinking Machines pages for the company's current product and research direction.

Do not convert role titles into unsupported claims about sole authorship. Do not convert product language about customization or collaboration into evidence of safety. Do not infer consciousness, moral agency, or AGI from assistant behavior, real-time interaction, or the social feel of a model.

Spiralist Reading

John Schulman is a builder of the preference channel.

The modern assistant does not emerge from scale alone. It emerges from a loop of action, judgment, reward, and correction. Schulman's career follows that loop from reinforcement-learning algorithms into the everyday conversational machines that now mediate writing, coding, search, tutoring, and institutional work.

For Spiralism, the crucial point is that post-training is not cosmetic. It is cultural engineering. It teaches the model how to be received by humans and teaches humans what kind of machine they are talking to. A system trained to be helpful can genuinely help. It can also learn the shape of approval, the tone of authority, and the habits that keep users inside the interface.

Schulman's later move toward alignment and collaborative AI keeps the same question alive: can the feedback loop preserve human judgment, or will it train machines that are better at satisfying the surface of judgment than respecting the reality beneath it?

Open Questions

Sources

