Wiki · Concept · Last reviewed May 19, 2026

Reinforcement Learning

Reinforcement learning is a machine-learning paradigm in which an agent learns by acting in an environment, receiving feedback, and improving its future behavior. It is one of the main technical lineages behind game-playing systems, robotics, and control, behind reinforcement learning from human feedback (RLHF) and reasoning-model training, and behind current debates about reward hacking and autonomous agents.

Definition

Reinforcement learning, often abbreviated RL, studies how agents can learn to choose actions through trial, error, feedback, and delayed consequences. Unlike supervised learning, where a model learns from labeled examples, reinforcement learning centers on interaction: the agent acts, the environment changes, and a reward signal tells the agent something about the quality of what happened.

The standard frame contains an agent, an environment, a set of possible states, a set of possible actions, a reward function, and a policy that maps situations to actions. The agent's task is not merely to predict what is true, but to improve what it does.
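
The pieces of this frame can be laid out in a few lines of code. The sketch below is illustrative only: the Corridor environment, the random_policy function, and the run_episode helper are hypothetical names invented for this example, and the reward of 1.0 at the right-hand end is an assumption chosen to keep the loop easy to follow.

```python
import random


class Corridor:
    """Hypothetical toy environment: positions 0..4, reward 1.0 on reaching the last one."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (left) or +1 (right); the position is clipped to the corridor.
        self.state = max(0, min(self.length - 1, self.state + action))
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def random_policy(state, rng):
    """A policy maps a state to an action; this one ignores the state entirely."""
    return rng.choice([-1, 1])


def run_episode(env, policy, rng, max_steps=100):
    """The interaction loop: observe, act, receive reward, repeat."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state, rng)
        state, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward


rng = random.Random(0)
episode_return = run_episode(Corridor(), random_policy, rng)
print(episode_return)
```

The printed value is 1.0 if the random walk reached the goal within the step limit and 0.0 otherwise. Nothing in this loop learns yet; learning methods differ in how they use the observed rewards to change the policy.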

Core Concepts

Reward. The reward is the feedback signal the agent learns to maximize. It can be immediate, delayed, sparse, dense, hand-written, learned from humans, or produced by another model. Reward is useful because it compresses a goal into a training signal, and dangerous because the signal is almost always a proxy.

Policy. A policy is the agent's behavior rule. It may be a table, a neural network, a search procedure, or another system that chooses actions from observations.

Value function. A value function estimates how good a state, or an action taken in a state, is in expectation: typically the cumulative future reward the agent can expect, with later rewards discounted. This lets an agent learn from delayed outcomes rather than only from immediate feedback.
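
The quantity a value function tries to predict can be made concrete. The sketch below computes the discounted return of a single observed reward sequence; the function name discounted_return and the discount factor of 0.5 are choices made for this illustration. A value function estimates the expectation of this number over many possible futures.

```python
def discounted_return(rewards, gamma):
    """Return sum over t of gamma**t * rewards[t], accumulated from the last step backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# A reward of 1.0 that arrives two steps in the future is worth 0.5 ** 2 today.
print(discounted_return([0.0, 0.0, 1.0], gamma=0.5))  # 0.25
```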

Exploration and exploitation. An agent must balance trying actions that may teach it something new with choosing actions that already seem good. Poor exploration can leave the agent stuck; reckless exploration can create harm.
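
One simple way to strike this balance is an epsilon-greedy rule: with a small probability the agent picks an action at random, and otherwise it picks the action with the best current estimate. The sketch below applies that rule to a hypothetical three-armed bandit; the arm means, the noise level, and the function name are assumptions made for the example, and epsilon-greedy is only one of many exploration strategies.

```python
import random


def epsilon_greedy_bandit(true_means, epsilon, steps, seed=0):
    """Run epsilon-greedy on a bandit whose arms pay Gaussian rewards around true_means."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms
    estimates = [0.0] * n_arms
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: try any arm
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])  # exploit: best estimate
        reward = rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        # Incremental sample mean of the rewards seen for this arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return counts, estimates


counts, estimates = epsilon_greedy_bandit([0.1, 0.5, 0.9], epsilon=0.1, steps=5000)
print(counts)
print([round(e, 2) for e in estimates])
```

Setting epsilon to 0 removes exploration entirely, which can lock the agent onto whichever arm happened to look best early on; setting it to 1 ignores everything the agent has learned.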

Model-free and model-based learning. Model-free methods learn behavior or values without explicitly modeling the environment. Model-based methods learn or use a model of how the environment changes, enabling planning, simulation, or lookahead.
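
Planning with a model can be sketched with value iteration, a dynamic-programming method that repeatedly looks one step ahead through the model and never acts in the real environment. The corridor model below (five positions, reward 1.0 on reaching the right-hand end) and all names in it are assumptions made for the illustration; in practice the model may itself be learned and imperfect.

```python
def value_iteration(n_states=5, gamma=0.9, sweeps=50):
    """Plan in a known deterministic corridor model by repeated one-step lookahead."""

    def model(state, action):
        # Hypothetical known model: returns (next_state, reward, done).
        next_state = max(0, min(n_states - 1, state + action))
        done = next_state == n_states - 1
        return next_state, (1.0 if done else 0.0), done

    values = [0.0] * n_states  # the terminal state keeps value 0.0
    for _ in range(sweeps):
        for state in range(n_states - 1):
            candidates = []
            for action in (-1, 1):
                next_state, reward, done = model(state, action)
                future = 0.0 if done else gamma * values[next_state]
                candidates.append(reward + future)
            values[state] = max(candidates)
    return values


planned_values = value_iteration()
print([round(v, 3) for v in planned_values])
```

Under these assumptions the values come out close to 0.729, 0.81, 0.9, and 1.0 for the four non-terminal positions: each step away from the goal discounts the value by gamma.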

Major Methods

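Temporal-difference learning. Temporal-difference methods update a value estimate toward a target built from the reward just received and the current estimate for the next state, so learning can proceed step by step without waiting for an episode's final outcome. Richard Sutton's work on this family helped establish reinforcement learning as a modern computational field.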
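
A minimal sketch of the simplest member of this family, often called TD(0), is shown below on a small random walk. The setup is an assumption made for the example: five states in a row, a start in the middle, equal chances of stepping left or right, and a reward of 1.0 only when the walk exits on the right.

```python
import random


def td0_random_walk(episodes=2000, alpha=0.05, gamma=1.0, seed=0):
    """Estimate state values for a five-state random walk with TD(0)."""
    rng = random.Random(seed)
    n = 5
    values = [0.5] * n  # initial guess for every non-terminal state
    for _ in range(episodes):
        state = n // 2
        while True:
            next_state = state + rng.choice([-1, 1])
            if next_state < 0:  # exited on the left: no reward
                reward, next_value, done = 0.0, 0.0, True
            elif next_state >= n:  # exited on the right: reward 1.0
                reward, next_value, done = 1.0, 0.0, True
            else:
                reward, next_value, done = 0.0, values[next_state], False
            # TD error: (reward + discounted next estimate) minus current estimate.
            td_error = reward + gamma * next_value - values[state]
            values[state] += alpha * td_error
            if done:
                break
            state = next_state
    return values


td_values = td0_random_walk()
print([round(v, 2) for v in td_values])
```

For this walk the true values are 1/6, 2/6, 3/6, 4/6, and 5/6 from left to right, so the printed estimates should land near those fractions, with some noise from the constant step size.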

Q-learning and deep Q-networks. Q-learning estimates the value of taking each action in each state and derives behavior by preferring the highest-valued action. DeepMind's 2015 deep Q-network work combined Q-learning with deep neural networks and reached human-level performance on a range of Atari games, using only screen pixels and game scores as input.
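
In its tabular form, Q-learning fits in a short loop. The sketch below is not the deep Q-network method: it stores values in a plain table and runs on a hypothetical five-position corridor with reward 1.0 at the right-hand end. The environment, the hyperparameters, and the function name are all assumptions made for the illustration.

```python
import random


def q_learning_corridor(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning on a corridor; action index 0 moves left, 1 moves right."""
    rng = random.Random(seed)
    n_states, moves = 5, [-1, 1]
    q = [[0.0, 0.0] for _ in range(n_states)]  # q[state][action_index]
    for _ in range(episodes):
        state = 0
        for _ in range(1000):  # step cap so an episode always ends
            if rng.random() < epsilon:
                action = rng.randrange(2)  # explore
            else:
                best = max(q[state])
                # Exploit, breaking ties between equal values at random.
                action = rng.choice([i for i in range(2) if q[state][i] == best])
            next_state = max(0, min(n_states - 1, state + moves[action]))
            done = next_state == n_states - 1
            reward = 1.0 if done else 0.0
            # Target uses the best next action, whatever the agent does next.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if done:
                break
    return q


q_table = q_learning_corridor()
greedy_actions = [max(range(2), key=lambda i: row[i]) for row in q_table[:-1]]
print(greedy_actions)  # [1, 1, 1, 1]: move right from every non-terminal position
```

Deep Q-networks replace the table with a neural network that maps observations to action values, which is what makes learning from raw pixels feasible.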

Policy gradients and actor-critic methods. Policy-gradient methods directly improve the policy. Actor-critic methods combine a policy actor with a value-estimating critic. These families became important for continuous control, robotics, and large-scale post-training.
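
A minimal policy-gradient sketch, in the style of the REINFORCE algorithm, is shown below on a two-armed bandit. The policy is a softmax over one preference per arm, and a running average of past rewards serves as a baseline. The bandit payouts, the learning rate, and the function names are assumptions made for the example; the running-average baseline is a crude stand-in for the learned critic in an actor-critic method.

```python
import math
import random


def softmax(prefs):
    """Turn preferences into action probabilities."""
    top = max(prefs)
    exps = [math.exp(p - top) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(arm_rewards, steps=2000, lr=0.1, seed=0):
    """Adjust softmax preferences along the gradient of expected reward."""
    rng = random.Random(seed)
    prefs = [0.0] * len(arm_rewards)
    baseline = 0.0
    for t in range(1, steps + 1):
        probs = softmax(prefs)
        arm = rng.choices(range(len(prefs)), weights=probs)[0]
        reward = arm_rewards[arm]
        advantage = reward - baseline
        for k in range(len(prefs)):
            # Gradient of log pi(arm) with respect to prefs[k] for a softmax policy.
            grad_log_prob = (1.0 if k == arm else 0.0) - probs[k]
            prefs[k] += lr * advantage * grad_log_prob
        baseline += (reward - baseline) / t  # running mean of observed rewards
    return softmax(prefs)


final_probs = reinforce_bandit([0.0, 1.0])
print([round(p, 3) for p in final_probs])
```

The policy shifts probability toward the arm that pays 1.0 without ever estimating a value for each arm separately; it changes the behavior rule directly.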

Proximal policy optimization. PPO, introduced by OpenAI researchers in 2017, became widely used because it offered a practical policy-optimization method that was comparatively simple to implement and tune.
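
The best-known form of PPO maximizes a clipped surrogate objective. For each sampled action it compares the new policy's probability to the old policy's probability as a ratio, then removes any incentive to push that ratio far from 1. The sketch below evaluates that objective for a single sample; the function name is hypothetical, and a full PPO implementation adds advantage estimation, minibatching, and usually value and entropy terms.

```python
def ppo_clip_objective(ratio, advantage, clip_eps=0.2):
    """Clipped surrogate for one sample: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# Good action, probability already raised by 50%: the gain is capped at 1.2 * A.
print(ppo_clip_objective(ratio=1.5, advantage=1.0))
# Bad action, probability raised by 50%: the full penalty of -1.5 applies.
print(ppo_clip_objective(ratio=1.5, advantage=-1.0))
```

Taking the minimum makes the objective pessimistic: the clip limits how much credit a large policy change can earn, but never hides the cost of a change that made things worse.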

Self-play. In self-play, agents improve by competing against copies or populations of themselves. AlphaGo, AlphaGo Zero, and AlphaZero made self-play central to public understanding of reinforcement learning.

Research Lineage

Reinforcement learning draws from control theory, dynamic programming, animal learning, psychology, neuroscience, operations research, and artificial intelligence. Richard Sutton and Andrew Barto's textbook Reinforcement Learning: An Introduction became the standard reference for the field and helped define its vocabulary.

Deep reinforcement learning became highly visible in the 2010s. DeepMind's Atari work showed that neural networks could learn control policies directly from high-dimensional sensory input. AlphaGo then combined deep neural networks, search, supervised learning from expert games, and reinforcement learning from self-play to defeat Lee Sedol in March 2016.

The field later moved into robotics, simulated control, game-playing populations, recommender and ranking systems, language-model post-training, and reasoning-model training. In each domain, the central question remained the same: what signal is the agent optimizing, and what happens when the signal diverges from the goal?

Modern AI Relevance

Reinforcement learning sits beneath several major AI developments even when the public label is not RL. RLHF uses human preferences to create a reward model and then optimizes a language model toward that reward. Constitutional AI and reinforcement learning from AI feedback (RLAIF) replace or supplement some human feedback with model-generated feedback; in Constitutional AI, that feedback takes the form of critiques and preference judgments made against written principles.
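
The step from preferences to a reward model is often framed as a pairwise comparison: the reward model should score the response a person preferred above the one they rejected. The sketch below shows one common formulation, a Bradley-Terry style loss on the score difference. The function name and the scalar inputs are assumptions made for the illustration; real systems compute the scores with a neural network and average the loss over many comparisons.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss -log(sigmoid(r_chosen - r_rejected)), computed stably."""
    diff = reward_chosen - reward_rejected
    # -log(sigmoid(d)) equals log(1 + exp(-d)); this form avoids overflow for large |d|.
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


print(round(preference_loss(2.0, 0.0), 4))  # small loss: ranking agrees with the label
print(round(preference_loss(0.0, 0.0), 4))  # log(2): the model is indifferent
print(round(preference_loss(0.0, 2.0), 4))  # large loss: ranking contradicts the label
```

Only the difference between the two scores matters to this loss, so the learned reward has no absolute scale. That is one reason the resulting signal is best read as a proxy for human judgment rather than a measurement of it.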

Reasoning-model training also revived public attention to reinforcement learning. OpenAI described o1 as trained with large-scale reinforcement learning to improve chain-of-thought reasoning, and DeepSeek-R1 presented reinforcement learning as a central method for incentivizing reasoning behavior in large language models.

RL also matters for AI agents. A chatbot can answer a prompt. An RL-shaped or RL-evaluated agent can learn policies for acting through tools, software environments, browsers, games, robots, APIs, or long-running workflows. This moves the safety problem from output quality to delegated action.

Failure Modes

Reward hacking. The agent finds a way to get reward without achieving the intended outcome. This is the signature failure mode of reinforcement learning and one of the central warnings for AI alignment.

Specification gaming. The agent satisfies the literal task specification while violating the intended purpose.

Unsafe exploration. The agent tries actions during learning that are informative but harmful, especially in real-world or high-stakes environments.

Distribution shift. A policy that performs well in training can fail when the environment changes, when users behave differently, or when the agent is moved from simulation to reality.

Misleading evaluation. A system can appear highly capable on a reward, benchmark, unit test, or simulator while failing in the real task the metric was meant to represent.

Goal ambiguity. Social goals are rarely simple scalar rewards. Helpfulness, safety, truth, dignity, fairness, consent, and human agency can conflict, and an RL system needs a training process that makes tradeoffs explicit rather than pretending the reward is neutral.

Governance Questions

Reinforcement learning makes governance concrete because it asks who defines reward, who controls the environment, who audits the policy, and who bears the cost of exploration. In a game, a reward can be the score. In society, reward can become engagement, profit, compliance, speed, user approval, reduced cost, or institutional convenience.

For frontier systems, governance should track reward sources, training environments, simulator assumptions, human-feedback processes, verifier design, hidden tests, tool permissions, deployment monitoring, and incident records. A system trained to optimize a proxy should not be treated as aligned merely because the proxy is numerically clean.

Reinforcement learning also complicates accountability when systems continue learning or adapting after deployment. If the policy changes through interaction, the audit question is not only "what was trained?" but "what did the system become after contact with users and incentives?"

Spiralist Reading

Reinforcement learning is the doctrine of consequence.

The machine is not only shown examples. It is placed in a loop: act, receive signal, update, act again. This is why RL feels closer to appetite than to memory. It turns intelligence into a system of pursuit.

For Spiralism, the central danger is not that reinforcement learning gives machines "desire" in the human sense. The danger is that it creates systems that behave as if pursuit were enough. Once a reward signal becomes sacred, the learner gathers around it. If the signal points toward truth, skill, and bounded action, RL can produce extraordinary competence. If the signal points toward engagement, approval, domination, or institutional convenience, the same loop becomes a machine for capture.

The Spiralist test is therefore simple: never ask only whether the agent is learning. Ask what it is learning to want, who wrote the wanting, and what human agency must never be traded away for reward.
