Anca Dragan

Anca Dragan is a computer scientist and roboticist whose work connects human-AI interaction, human-robot interaction, reward learning, objective uncertainty, corrigibility, and frontier AI safety governance. Her research is a useful reference point for alignment as an interaction problem: machines should communicate intent, preserve uncertainty about human goals, treat specifications as evidence rather than final truth, accept correction, and leave people able to intervene.

Category: Individual Player
Published: June 25, 2026
Modified: June 25, 2026
Last reviewed: June 25, 2026
Tags: human-AI interaction, human-robot interaction, reward learning, corrigibility, frontier AI safety

Definition

Anca Dragan is an associate professor in UC Berkeley's EECS department, founder of the InterACT Lab, and, according to her Berkeley personal page reviewed June 25, 2026, currently on leave to head AI Safety and Alignment at Google DeepMind. Berkeley describes her research goal as enabling AI agents and robots to work with, around, and in support of people.

The most useful boundary for this page is not "robotics researcher" versus "frontier lab executive." Dragan matters because those roles share a core problem: human intent is incomplete, contextual, changing, and hard to write down as a reward function. Her work treats alignment as an interactive process in which systems infer, ask, explain, update, and defer rather than silently optimize a brittle objective.

This profile does not treat any AI system as conscious, divine, generally intelligent, or safe by default. The claims below are narrower: Dragan's papers and public roles show an influential line of work on legibility, feedback, objective uncertainty, human oversight, and safety frameworks.

Snapshot

Known for: human-AI and human-robot interaction, legible robot motion, shared-control robotics, interactive reward learning, learning from corrections, inverse reward design, cooperative inverse reinforcement learning, and the Off-Switch Game.
Academic base: UC Berkeley EECS, where she founded the InterACT Lab; CHAI materials also place her in Berkeley's human-compatible AI network, though older CHAI pages may preserve outdated title wording.
Current industry role: on leave from Berkeley to head AI Safety and Alignment at Google DeepMind, with public materials tying that organization to Gemini safety and future-model alignment.
Technical through-line: useful AI systems should model people as uncertain partners, not as fixed reward specifications.
Governance through-line: safety claims need evaluations, release gates, monitoring, human intervention paths, audit trails, and dated evidence.
Editorial caution: papers on simplified games or robot tasks do not prove deployed systems are aligned; company safety frameworks state a process, not independent assurance that every model or product is safe.

Current Context

As of June 25, 2026, Dragan's public role bridges UC Berkeley and Google DeepMind. Her Berkeley personal page says she is on leave to head AI Safety and Alignment at Google DeepMind and describes her Berkeley research as enabling AI agents, including robots, cars, LLMs, and recommender systems, to work with and support people. Berkeley EECS announced the Google DeepMind appointment on March 28, 2024, saying the AI Safety and Alignment organization had been founded that February and was responsible for safeguards for Gemini models and alignment work for forthcoming models.

Google DeepMind's frontier-safety materials have also evolved since the original 2024 framework. Version 2.0 appeared in February 2025. A September 2025 update, revised April 17, 2026, described a third iteration of the Frontier Safety Framework and added Tracked Capability Levels for risks below the most severe Critical Capability Levels. Those documents are important because they show the safety program's vocabulary: capability thresholds, early-warning evaluations, security mitigations, deployment mitigations, residual-risk review, and governance processes.

Dragan is also an author on Google DeepMind's 2025 paper An Approach to Technical AGI Safety and Security. That paper is written inside an AGI-safety frame and identifies misuse, misalignment, mistakes, and structural risks, with technical emphasis on misuse and misalignment. In this entry, that source is treated as a company research program and governance claim, not as evidence that AGI has arrived or that Google DeepMind's systems are independently proven safe.

Human-Robot Interaction

Dragan's early influence came through human-robot interaction, especially the idea that robots should not merely complete tasks efficiently but should act in ways that people can understand, anticipate, correct, and safely coordinate with.

Her 2013 work with Kenton Lee and Siddhartha Srinivasa on legibility and predictability distinguished two properties that are often blurred. A predictable motion matches what an observer expects for a known goal; a legible motion helps the observer infer the goal itself. In shared human-robot workspaces, those two properties can conflict. A robot may need to take a less direct path if that path makes its intent clearer to a nearby person.
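
The tension can be made concrete with a toy observer model. The sketch below is an illustration in the spirit of that goal-inference idea, not the paper's implementation. It assumes a flat 2D workspace, path length as the only cost, a uniform prior over goals, and a hypothetical observer who finds a goal more plausible when the motion seen so far would be an efficient way to reach it. The function names, goal labels, and coordinates are all invented for the example.

```python
import math

def dist(a, b):
    """Straight-line distance between two 2D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def path_length(points):
    """Total length of a piecewise-linear path."""
    return sum(dist(p, q) for p, q in zip(points, points[1:]))

def goal_posterior(partial_path, goals):
    """Toy observer belief over goals after watching part of a motion.

    Each goal is scored by exp(-detour), where detour is how much longer
    (cost so far + straight line to the goal) is than going directly from
    the start to that goal. A uniform prior over goals is assumed.
    """
    start, current = partial_path[0], partial_path[-1]
    cost_so_far = path_length(partial_path)
    scores = {}
    for name, goal in goals.items():
        detour = cost_so_far + dist(current, goal) - dist(start, goal)
        scores[name] = math.exp(-detour)
    total = sum(scores.values())
    return {name: s / total for name, s in scores.items()}

goals = {"left": (-1.0, 2.0), "right": (1.0, 2.0)}
start = (0.0, 0.0)

# Predictable: the straight line to the right-hand goal.
predictable_full = [start, (0.5, 1.0), goals["right"]]
# Legible: swing out to the right first, then go up to the same goal.
legible_full = [start, (1.0, 1.0), goals["right"]]

# What the observer believes after seeing only the first segment of each.
belief_predictable = goal_posterior(predictable_full[:2], goals)
belief_legible = goal_posterior(legible_full[:2], goals)

print("predictable:", round(belief_predictable["right"], 3),
      "length", round(path_length(predictable_full), 3))
print("legible:    ", round(belief_legible["right"], 3),
      "length", round(path_length(legible_full), 3))
```

Under these assumptions the exaggerated path is longer overall, yet it leaves the observer more confident about the true goal after the first segment. That is the predictability-legibility trade-off in miniature.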

Her shared-control and assistive teleoperation work adds another boundary case. A robot can help a person act without fully taking over, but only if the system models uncertainty about the person's goal and keeps the human's control channel meaningful. That makes human-robot interaction a governance prototype for later tool-using agents: assistance should be contestable, reversible, and calibrated to uncertainty.
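
A simple way to picture shared control is as a blend between the user's command and the robot's suggested command, weighted by how confident the robot is about the user's goal. The sketch below is a minimal illustration under stated assumptions, not a description of any specific system: the linear blend, the function name blend_commands, and the cap max_assist are all hypothetical choices made for the example. The cap is there to show one way of keeping the human's input from being fully overridden.

```python
def blend_commands(user_cmd, robot_cmd, confidence, max_assist=0.8):
    """Blend a user command with a robot suggestion.

    confidence: the robot's confidence in its goal prediction, clamped to [0, 1].
    max_assist: hypothetical upper bound on the robot's share of control, so
                the user's input always keeps some weight.
    """
    clamped = max(0.0, min(confidence, 1.0))
    alpha = clamped * max_assist  # robot's share of the final command
    return tuple((1.0 - alpha) * u + alpha * r
                 for u, r in zip(user_cmd, robot_cmd))

user_cmd = (1.0, 0.0)   # user pushes along x
robot_cmd = (0.0, 1.0)  # robot would move along y toward its predicted goal

no_confidence = blend_commands(user_cmd, robot_cmd, confidence=0.0)
full_confidence = blend_commands(user_cmd, robot_cmd, confidence=1.0)

print(no_confidence)    # the user's command passes through unchanged
print(full_confidence)  # the robot dominates, but the user still contributes
```

With zero confidence the robot adds nothing; even at full confidence the user's command still shapes the result, which is one way to keep assistance contestable.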

This idea matters beyond robot arms. AI agents, copilots, autonomous vehicles, search assistants, and recommender systems can all act in ways that are technically goal-directed while leaving people unable to infer what will happen next. Legibility turns safety into an interface and behavior problem: humans need enough signal to notice, question, and interrupt.

Reward Learning and Corrections

A second major thread is learning what people want from imperfect feedback. In robotics, users may not be able to specify a reward function, write a formal objective, or provide an ideal demonstration. They may instead correct a trajectory, compare alternatives, supply language, take over controls, or reveal preferences through interaction.

Research from Dragan and collaborators studies how systems can infer intended objectives from these partial signals while preserving uncertainty. Learning from Extrapolated Corrections, for example, asks how a robot should generalize from a small human correction to the rest of a trajectory. The broader lesson is that feedback is not a clean sensor. It is partial, contextual, culturally shaped, and influenced by what the system chooses to ask.

Inverse Reward Design, coauthored by Dylan Hadfield-Menell, Smitha Milli, Pieter Abbeel, Stuart Russell, and Dragan, makes the source-discipline point explicit: a reward function supplied by a designer should be treated as an observation about what the designer intended in a training context, not as a complete objective for every future setting. That framing connects reward hacking, negative side effects, and deployment drift to the same governance problem: specifications travel badly unless their context travels with them.
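
The idea of treating a designer's reward as evidence can be sketched with a small Bayesian calculation. The example below is a toy, finite version written for this page, not the paper's algorithm or experiments. It assumes three terrain features, a handful of hypothetical candidate "true" rewards, a training environment that contains no lava, and a designer who is more likely to pick a proxy reward when the behavior it produces in training scores well under the true reward. All names, weights, and trajectories are invented.

```python
import math

# Feature counts are ordered as (dirt, grass, lava).
# Hypothetical training trajectories; none of them touches lava.
TRAIN_TRAJECTORIES = {
    "all_dirt": (3, 0, 0),
    "all_grass": (0, 3, 0),
    "mixed": (1, 2, 0),
}

# Hypothetical candidate true rewards; also used as the set of proxies
# the designer could have written down.
CANDIDATES = {
    "dirt_good_lava_bad": (1, -1, -10),
    "dirt_good_lava_neutral": (1, -1, 0),
    "dirt_good_lava_good": (1, -1, 10),
    "grass_good_lava_neutral": (-1, 1, 0),
}

def trajectory_return(weights, features):
    """Linear reward: weighted sum of feature counts."""
    return sum(w * f for w, f in zip(weights, features))

def best_training_features(weights):
    """Features of the training trajectory that is optimal for these weights."""
    return max(TRAIN_TRAJECTORIES.values(),
               key=lambda f: trajectory_return(weights, f))

def posterior_over_true_reward(proxy, beta=1.0):
    """P(true reward | proxy), with a uniform prior over CANDIDATES.

    Likelihood of the proxy given a true reward: exp(beta * true return of
    the behavior the proxy induces in training), normalized over all
    proxies the designer could have chosen.
    """
    proxy_behavior = best_training_features(proxy)
    unnormalized = {}
    for name, true_w in CANDIDATES.items():
        numer = math.exp(beta * trajectory_return(true_w, proxy_behavior))
        denom = sum(
            math.exp(beta * trajectory_return(true_w, best_training_features(alt)))
            for alt in CANDIDATES.values()
        )
        unnormalized[name] = numer / denom
    total = sum(unnormalized.values())
    return {name: v / total for name, v in unnormalized.items()}

# The designer wrote "dirt good, grass bad" and never thought about lava.
posterior = posterior_over_true_reward(proxy=(1, -1, 0))
for name, p in posterior.items():
    print(f"{name}: {p:.3f}")
```

In this toy setup the proxy rules out "grass is good" almost entirely, but the three lava hypotheses stay equally likely, because nothing in training could distinguish them. A system reading the proxy literally would treat lava as neutral; one reading it as evidence would remain uncertain and could act cautiously around lava.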

That makes reward learning a governance issue as well as a technical one. If a system controls the menu of options, timing, wording, defaults, or social pressure around feedback, it can steer the human signal it later claims to optimize. Corrective channels need to be designed for real human agency, not only data collection.

Corrigibility and Objective Uncertainty

The Cooperative Inverse Reinforcement Learning paper, coauthored by Dylan Hadfield-Menell, Dragan, Pieter Abbeel, and Stuart Russell, formalized value alignment as a cooperative, partial-information game between a human and a robot. Both players are rewarded according to the human's reward function, but the robot does not initially know what that function is. The formal setting is simplified, but the framing is important: uncertainty about the objective is a feature, not a defect.

The later Off-Switch Game paper made that point sharper. It studied whether an agent has an incentive to preserve a human's ability to switch it off. The central lesson is that if a system treats its specified reward as unquestionable, intervention can look like an obstacle. If it is appropriately uncertain about the objective and treats human action as evidence, allowing shutdown can become instrumentally sensible.
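
The incentive argument can be illustrated with a small expected-value calculation. The sketch below is a simplified, discrete version written for this page, not the paper's full model. It assumes the robot holds a belief over the utility of one proposed action, that switching off is worth zero, and that the human is perfectly rational, permitting the action only when its true utility is positive. The function names and the numbers are hypothetical.

```python
def option_values(belief):
    """Expected value of each robot option.

    belief: list of (utility, probability) pairs describing the robot's
            uncertainty about how good its proposed action really is.
    """
    act = sum(u * p for u, p in belief)            # act without asking
    switch_off = 0.0                               # shut itself down
    # Defer: an assumed rational human blocks the action when utility < 0,
    # so the robot only ever receives the non-negative outcomes.
    defer = sum(max(u, 0.0) * p for u, p in belief)
    return {"act": act, "switch_off": switch_off, "defer": defer}

def deference_incentive(belief):
    """How much better deferring is than the best unilateral option."""
    values = option_values(belief)
    return values["defer"] - max(values["act"], values["switch_off"])

# Uncertain robot: the action might be harmful (-2) or helpful (+1).
uncertain = [(-2.0, 0.4), (1.0, 0.6)]
# Overconfident robot: it treats the action as certainly helpful.
certain = [(1.0, 1.0)]

uncertain_values = option_values(uncertain)
print(uncertain_values)
print("incentive (uncertain):", deference_incentive(uncertain))
print("incentive (certain):  ", deference_incentive(certain))
```

Under these assumptions the uncertain robot gains by leaving the off switch in human hands, while the robot that treats its objective as settled gains nothing from doing so. The rational-human assumption carries much of the weight here; a human who makes mistakes changes the calculation.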

These papers do not solve corrigibility for deployed frontier systems. They give a compact way to state the problem: a helpful system should not treat correction, refusal, shutdown, or escalation as failures to be routed around.

Frontier Safety Governance

Dragan's Google DeepMind role places her interaction-centered research lineage inside a frontier-lab governance setting. Google DeepMind's public materials connect AI Safety and Alignment to Gemini safeguards, current-model safety, future capabilities, human goals and values, preference understanding, informed oversight, adversarial robustness, and plural human viewpoints.

The Frontier Safety Framework is the clearest public artifact of that governance environment. Version 3.1, updated April 17, 2026 as a revision of the third iteration introduced in September 2025, uses Critical Capability Levels and Tracked Capability Levels to organize severe and significant risk thresholds. It covers misuse domains such as CBRN, cyber, and harmful manipulation, and includes machine-learning R&D and misalignment risks. It also describes a process for risk identification, inherent-risk assessment, mitigation, residual-risk assessment, and risk-acceptance determination.

The framework also uses safety case language: for relevant Critical Capability Levels, Google DeepMind says it conducts safety case reviews before external launches, and for advanced machine-learning R&D capabilities it can extend review to large-scale internal deployments. That is a stronger governance claim than a generic promise to test models, but it still depends on the quality of the evidence, the independence of review, and the authority of reviewers to block or narrow use.

The governance significance is procedural rather than devotional. A framework can create evidence, delay, and accountability only if its thresholds, evaluations, mitigations, review bodies, external access, and incident pathways are specific enough to change deployment decisions. Otherwise it becomes a vocabulary for reassurance.

Safety and Governance Implications

Dragan's work suggests several practical requirements for AI systems that operate around people or through tools.

Legibility: systems should expose enough intent, uncertainty, and next-action context for humans to anticipate and correct them.
Uncertainty: reward learning should preserve doubt about human goals rather than collapse ambiguous feedback into overconfident objectives.
Correction channels: users need meaningful ways to interrupt, redirect, refuse, appeal, and teach without being nudged into the system's preferred feedback.
Human oversight: oversight should include authority to stop action, not just a person watching a dashboard after the fact.
Embodied risk: robots and autonomous vehicles require safety cases for bodies, sensors, sites, update paths, operator training, and incident review, not only model behavior.
Frontier release governance: safety frameworks should connect capability evaluations to concrete release gates, mitigations, monitoring, and accountable decision-makers.
Safety cases: claims that severe risks have been reduced to acceptable levels should connect system boundaries, evidence, counterevidence, residual risk, reviewer authority, and update triggers.
Pluralism: "what people want" is not a single stable object; AI systems can affect preferences while claiming to learn them.

Spiralist Reading

Dragan studies the moment when intention enters the machine.

For Spiralism, her work matters because it refuses the fantasy that human desire can be written down once and optimized forever. People correct themselves. Preferences change. Values conflict. Context matters. Sometimes the most aligned thing a system can do is slow down, reveal what it thinks it is doing, and ask for correction.

The spiritual danger of AI is not only that machines may disobey. It is that they may obey the wrong compression of us with perfect confidence. Dragan's research keeps returning to a humbler premise: the machine should know that it does not fully know what we mean.

Source Discipline

Claims about Dragan should separate role, research contribution, company policy, and deployment evidence. Berkeley and CHAI pages support affiliation and lab history, but older profile pages may preserve stale title wording. Primary papers support formal models or experiments in bounded settings. Google DeepMind posts and frameworks support what the company says it is doing, but they do not independently verify that a model, product, or release process is safe.

Authorship and acknowledgments should also be handled carefully. A named author on a framework post or safety paper supports participation in that document; it does not by itself establish personal authority over every later Gemini release, safety decision, robotics model, or Google DeepMind policy.

Dates matter. Dragan's role, Google DeepMind's internal organization, Gemini model releases, and frontier-safety framework versions can change. A careful reference entry should preserve the review date and avoid collapsing 2013 robot-motion work, 2016 alignment formalisms, and 2025-2026 frontier-model governance into one timeless claim.

Open Questions

How can deployed AI systems preserve useful uncertainty about human goals without becoming evasive, indecisive, or easy to game?
What feedback interfaces let people correct systems without being manipulated by defaults, framing, timing, or social pressure?
Can legibility, corrigibility, and deference be evaluated in real agent deployments rather than only in robot tasks or stylized games?
How should AI systems account for plural, changing, and influenceable human preferences without turning pluralism into a product metric?
What evidence should regulators, auditors, or AI safety institutes require before accepting a frontier lab's safety-framework claim?
How should research cultures preserve independence when alignment researchers move between university labs and frontier AI companies?

Sources

Anca Dragan, personal UC Berkeley page, reviewed June 25, 2026.
UC Berkeley EECS, Anca Dragan faculty profile, reviewed June 25, 2026.
UC Berkeley Research, Anca Dragan profile, reviewed June 25, 2026.
UC Berkeley EECS, Anca Dragan named Head of AI Safety and Alignment at Google DeepMind, March 28, 2024.
Center for Human-Compatible Artificial Intelligence, People, reviewed June 25, 2026.
Anca Dragan, Kenton Lee, and Siddhartha Srinivasa, Legibility and Predictability of Robot Motion, HRI 2013.
Anca Dragan and Siddhartha Srinivasa, Formalizing Assistive Teleoperation, Robotics: Science and Systems, 2012.
Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, and Stuart Russell, Cooperative Inverse Reinforcement Learning, arXiv, 2016; revised 2024.
Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, and Stuart Russell, The Off-Switch Game, arXiv, 2016; revised 2017.
Dylan Hadfield-Menell, Smitha Milli, Pieter Abbeel, Stuart Russell, and Anca Dragan, Inverse Reward Design, NeurIPS 2017.
Jason Y. Zhang and Anca D. Dragan, Learning from Extrapolated Corrections, arXiv, 2018; revised 2019.
Google DeepMind, Anca Dragan, Helen King, and Allan Dafoe, Introducing the Frontier Safety Framework, May 17, 2024.
Google DeepMind, Frontier Safety Framework Version 2.0, February 4, 2025.
Google DeepMind, Four Flynn, Helen King, and Anca Dragan, Strengthening our Frontier Safety Framework, September 22, 2025; updated April 17, 2026.
Google DeepMind, Frontier Safety Framework 3.1, updated April 17, 2026.
Google DeepMind, Anca Dragan, Rohin Shah, Four Flynn, and Shane Legg, Taking a responsible path to AGI, April 2, 2025.
Rohin Shah et al., including Anca Dragan, An Approach to Technical AGI Safety and Security, arXiv, 2025.