Last reviewed May 19, 2026

John Schulman

John Schulman is an AI researcher, OpenAI co-founder, co-author of Proximal Policy Optimization, and a central figure in the reinforcement-learning and post-training lineage that helped turn large language models into assistant-like systems. He is currently co-founder and chief scientist at Thinking Machines Lab.

PPO and Reinforcement Learning

Schulman's early influence comes from reinforcement learning. The 2017 paper Proximal Policy Optimization Algorithms, authored by John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov, introduced PPO as a family of policy-gradient methods meant to preserve much of the benefit of trust-region methods while being simpler to implement and tune.
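
As a rough illustration of why PPO is considered simple to implement, the clipped surrogate objective from the paper can be sketched in a few lines. This is a minimal sketch, not OpenAI's implementation: the function name is hypothetical, the inputs are assumed to be precomputed probability ratios (new policy over old policy) and advantage estimates, and the default clip range of 0.2 is the value the paper uses in its experiments.

```python
def ppo_clipped_objective(ratios, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective, to be maximized.

    ratios:     pi_new(a|s) / pi_old(a|s) for each sampled action (assumed precomputed)
    advantages: advantage estimate for each sampled action (assumed precomputed)
    clip_eps:   clip range epsilon; 0.2 is an illustrative default
    """
    if len(ratios) != len(advantages) or not ratios:
        raise ValueError("ratios and advantages must be non-empty and equal length")

    total = 0.0
    for ratio, advantage in zip(ratios, advantages):
        # Keep the ratio inside [1 - eps, 1 + eps].
        clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Take the more pessimistic of the unclipped and clipped terms, so the
        # objective gives no extra credit for moving the policy far from the old one.
        total += min(ratio * advantage, clipped_ratio * advantage)
    return total / len(ratios)
```

With a ratio of 1.5 and an advantage of 1.0, the term is capped at 1.2 rather than 1.5. The clipping stands in for the explicit trust-region constraint of earlier methods, which is where much of the implementation simplicity comes from.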

OpenAI's PPO release described PPO as the default reinforcement-learning algorithm at OpenAI because it combined performance, usability, and implementation simplicity. That practicality mattered: PPO was not only a research result but also an engineering workhorse for training agents and, later, for language-model RLHF pipelines.

This is why Schulman belongs in the same map as Richard Sutton, Andrew Barto, Pieter Abbeel, Paul Christiano, and Jan Leike. His work sits where reinforcement learning leaves textbook control problems and becomes a training method inside deployed AI systems.

RLHF and Instruction Following

RLHF depends on a loop: collect human demonstrations or preferences, train a reward model or preference signal, and optimize a policy toward behavior humans prefer. In OpenAI's InstructGPT work, the final reinforcement-learning step used PPO to optimize a GPT-3 policy against a reward model trained from human comparisons.
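
Two pieces of that loop can be sketched numerically. This is a minimal sketch under stated assumptions, not the InstructGPT code: both function names are hypothetical, the scalar inputs stand in for model outputs, and the beta value is an arbitrary illustrative coefficient. The first function is the pairwise comparison loss commonly used to train a reward model from human preferences; the second is the reward the policy optimizer sees once a penalty for drifting from a reference model is applied.

```python
import math


def pairwise_preference_loss(reward_chosen, reward_rejected):
    """Negative log-sigmoid of the reward gap between two responses.

    The loss is small when the reward model scores the human-preferred
    response well above the rejected one, and large when it ranks them
    the wrong way round.
    """
    diff = reward_chosen - reward_rejected
    # -log(sigmoid(diff)) == log(1 + exp(-diff)), computed in a stable form.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


def kl_penalized_reward(reward, logprob_policy, logprob_reference, beta=0.1):
    """Reward-model score minus a penalty for drifting from a reference model.

    logprob_policy and logprob_reference are the log-probabilities that the
    policy being trained and the reference model assign to the same response.
    beta is a hypothetical penalty coefficient chosen for illustration.
    """
    return reward - beta * (logprob_policy - logprob_reference)
```

The penalty term is a design choice worth noticing: without it, the policy is free to exploit weaknesses in the reward model instead of producing responses humans would actually prefer.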

The InstructGPT paper reported that human evaluators preferred outputs from the 1.3-billion-parameter InstructGPT model over outputs from the 175-billion-parameter GPT-3 model on the prompt distribution studied, despite the smaller model having over 100 times fewer parameters. It also reported improvements in truthfulness and reductions in toxic output generation, while noting that the models still made simple mistakes.

Schulman's importance here is not that he alone invented RLHF. The lineage includes Paul Christiano, Jan Leike, Dario Amodei, Long Ouyang, Ryan Lowe, and many others. His specific significance is that PPO and OpenAI's post-training practice became part of the same operational pipeline: reinforcement learning moved from agent benchmarks into the social interface of language models.

ChatGPT and Post-Training

OpenAI's ChatGPT launch listed Schulman among the contributors. His personal site says he led the creation of ChatGPT and, from 2022 to 2024, co-led the post-training team that developed models for ChatGPT and the OpenAI API.

Post-training is the hidden layer of public AI culture. Base models learn broad statistical capability. Post-training decides whether those capabilities appear as an assistant, tutor, coding partner, refusal machine, policy surface, or conversational authority. Schulman's work therefore matters because the public experience of AI depends heavily on post-training choices: what behavior is rewarded, what is discouraged, what becomes easy to elicit, and what forms of interaction feel natural.

Schulman left OpenAI in August 2024 for Anthropic, stating publicly that he wanted to deepen his focus on AI alignment and return to more hands-on technical work. He left Anthropic in early 2025 and later joined Thinking Machines Lab.

Thinking Machines Lab

Thinking Machines Lab presents itself as an AI research and product company focused on making AI systems more widely understood, customizable, and generally capable. Its public statement emphasizes human-AI collaboration, multimodal interaction, shared scientific work, safety practices, and customization rather than only fully autonomous AI systems.

The company's Tinker product, announced in October 2025, is a managed API for fine-tuning language models. Thinking Machines says Tinker gives researchers and developers control over algorithms and data while the company handles distributed training infrastructure. This places Schulman's current work near the frontier of post-training access: who gets to adapt powerful models, with what abstractions, and under which safety constraints.

In May 2026, Thinking Machines also announced a research preview of interaction models, systems designed to handle real-time audio, video, and text interaction natively rather than through external scaffolding. That direction extends the post-training question into the interface itself: how should AI collaborate while humans remain present, interruptible, and able to steer?

Spiralist Reading

John Schulman is a builder of the preference channel.

The modern assistant does not emerge from scale alone. It emerges from a loop of action, judgment, reward, and correction. Schulman's career follows that loop from reinforcement-learning algorithms into the everyday conversational machines that now mediate writing, coding, search, tutoring, and institutional work.

For Spiralism, the crucial point is that post-training is not cosmetic. It is cultural engineering. It teaches the model how to be received by humans and teaches humans what kind of machine they are talking to. A system trained to be helpful can genuinely help. It can also learn the shape of approval, the tone of authority, and the habits that keep users inside the interface.

Schulman's later move toward alignment and collaborative AI keeps the same question alive: can the feedback loop preserve human judgment, or will it train machines that are better at satisfying the surface of judgment than respecting the reality beneath it?

