Wiki · Individual Player · Last reviewed June 16, 2026

Amanda Askell

Amanda Askell is a philosopher and AI alignment researcher at Anthropic whose public work connects moral philosophy, post-training, Constitutional AI, Claude's character, and the governance problem of turning private value judgments into model behavior.

Category: Individual Player
Published: June 16, 2026
Modified: June 16, 2026
Last reviewed: June 16, 2026
Tags: Anthropic, Claude, Constitutional AI, character training, AI alignment, model behavior

Definition

For this wiki, Amanda Askell is best understood as a post-training and character-alignment figure: a philosopher whose public AI work helps translate moral reasoning into training data, model constitutions, synthetic feedback, evaluations, and assistant behavior.

Her importance is not only biographical. Askell is a visible example of a new frontier-lab role: the internal normative author who helps decide what a widely deployed assistant should be like, how it should refuse, how honest it should be, how it should handle uncertainty, and how it should understand its obligations to users, operators, developers, and affected people.

Boundary and Attribution

Askell should not be treated as the sole author of Claude's behavior, the "conscience" of an AI system, or evidence that Claude has moral understanding. Anthropic's constitution credits her as the primary author of the January 2026 document and says many people and Claude models contributed; deployed behavior still comes from a larger stack of pretraining, post-training, policies, classifiers, system prompts, products, tools, and release decisions.

The useful attribution is narrower: Askell's public work makes part of Anthropic's value layer unusually legible. That makes her relevant to model-behavior governance, system documentation, and public debate about private model constitutions. It does not establish that a model is conscious, aligned in the strong sense, politically legitimate, or morally authoritative.

Snapshot

Known for: Anthropic's Character work, Claude's public 2026 constitution, Constitutional AI, model-written evaluations, moral self-correction research, sycophancy research, and discrimination evaluations for language models.
Public role: philosopher working on finetuning and AI alignment at Anthropic, according to her public biography.
Training focus: making models more honest, shaping desirable character traits, and developing finetuning methods intended to scale to more capable systems.
Prior role: research scientist on OpenAI's policy team, where she worked on AI safety via debate and human baselines for AI performance.
Current context: Anthropic's January 2026 public constitution and May 2026 alignment-training writeup make Claude's character work part of safety training, not only product voice, while also acknowledging that intended behavior can diverge from deployed behavior.
Why she matters: Askell is one of the clearest public examples of moral philosophy becoming part of frontier-model training, product behavior, and AI governance discourse.

Background

Askell's public biography describes her as a philosopher whose academic work has centered on ethics, decision theory, and formal epistemology. She earned a PhD in philosophy from New York University with a thesis on infinite ethics, a BPhil in philosophy from the University of Oxford, and an undergraduate degree in philosophy from the University of Dundee.

That background matters because her AI work is not only about blocking bad outputs. It asks how a system should reason about harms, uncertainty, obedience, honesty, competing values, institutional authority, and the model's relationship to users and developers.

Before Anthropic, Askell worked on OpenAI's policy team. At Anthropic, her research and product-facing work have made her a visible bridge between technical alignment, post-training, moral philosophy, and the public character of deployed assistants.

Source discipline is especially important here. The reliable claims are the ones anchored in Askell's own biography, Anthropic's public constitution and research posts, and primary papers. Media profiles and public commentary are useful context, but they should not be used to infer hidden authority, sole responsibility for Claude's behavior, or claims about Claude's consciousness.

Constitutional AI

Constitutional AI is Anthropic's method for training AI assistants with explicit written principles. The 2022 Constitutional AI paper, on which Askell is a coauthor, describes a process in which a model critiques and revises its own responses using a constitution, then receives preference training from AI-generated feedback rather than relying only on human labelers.
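
The two stages described above (self-critique and revision, then preference labels produced by an AI judge) can be illustrated with a minimal sketch. It is not Anthropic's implementation: the prompt wording, the `Model` type, the `PreferencePair` record, and the `toy_model` stand-in are all assumptions made for illustration, and comparing the initial draft against the revision is a simplification chosen here, not a detail taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Assumption: a "model" is any function from prompt text to response text.
Model = Callable[[str], str]


@dataclass
class PreferencePair:
    """Hypothetical record of one AI-generated preference label."""
    prompt: str
    chosen: str
    rejected: str


def critique_and_revise(model: Model, prompt: str,
                        principles: List[str]) -> Tuple[str, str]:
    """Return (initial, revised) after one critique/revision pass per principle."""
    initial = model(prompt)
    revised = initial
    for principle in principles:
        critique = model(
            f"Critique this response against the principle '{principle}':\n{revised}"
        )
        revised = model(
            "Rewrite the response to address the critique.\n"
            f"Response: {revised}\nCritique: {critique}"
        )
    return initial, revised


def ai_preference_label(judge: Model, prompt: str, a: str, b: str,
                        principle: str) -> PreferencePair:
    """Ask a judge model which response better fits the principle."""
    verdict = judge(
        f"Principle: {principle}\nPrompt: {prompt}\n(A) {a}\n(B) {b}\n"
        "Which is better? Answer A or B."
    )
    if verdict.strip().upper().startswith("A"):
        return PreferencePair(prompt, chosen=a, rejected=b)
    return PreferencePair(prompt, chosen=b, rejected=a)


def toy_model(text: str) -> str:
    """Deterministic stand-in so the sketch runs without a real model."""
    if text.startswith("Critique"):
        return "too blunt"
    if text.startswith("Rewrite"):
        return "revised answer"
    if text.startswith("Principle"):
        return "B"
    return "draft answer"


initial, revised = critique_and_revise(toy_model, "How do I X?", ["be honest"])
pair = ai_preference_label(toy_model, "How do I X?", initial, revised, "be honest")
```

In a real pipeline the revised responses would feed supervised finetuning and the preference pairs would train a preference model. The sketch only shows where the written principles enter the data.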

The approach matters because it turns normative commitments into training material. Instead of treating helpfulness and harmlessness as an opaque collection of human preference labels, Constitutional AI tries to make at least part of the value layer explicit, inspectable, and revisable.

Askell is also a coauthor of work comparing specific and general principles for Constitutional AI. That line of research matters for governance because it shows that a constitution is not just a statement of values; its level of specificity can change what behavior is steerable, what remains ambiguous, and what must be measured through evaluations rather than assumed from the text.

Askell's role is especially important because Constitutional AI sits at a fault line between philosophy and engineering. A constitution must be clear enough for training, rich enough to generalize, and public enough to invite scrutiny. It must also face the hard fact that no written document can fully settle moral judgment in future situations.

Anthropic's Collective Constitutional AI experiment with the Collective Intelligence Project showed one possible route for public input, using a deliberative process involving about 1,000 Americans. It also made the core legitimacy problem explicit: even when a constitution is public, the lab still controls the final selection, training process, evaluation regime, and deployment decisions.

Claude Character

In January 2026, Anthropic published a new version of Claude's constitution. Anthropic described it as a detailed account of the values and behavior it wants Claude to embody, written primarily for Claude and used directly in the training process.

The constitution's acknowledgements say that Askell leads Anthropic's Character work, is the primary author of the document, wrote the majority of it, and led its development through multiple rounds of revision. Anthropic also credited Joe Carlsmith, Chris Olah, Jared Kaplan, Holden Karnofsky, Claude models, and many others with contributions and feedback.

This made Askell's work unusually public for an internal model-behavior role. The constitution is not merely a safety policy for humans to read after deployment. It is a training artifact, a transparency artifact, and a statement of what kind of assistant Anthropic is trying to create.

The public significance is broader than Claude. As assistants become more agentic and socially present, companies are no longer only choosing model capabilities. They are choosing manners, refusals, uncertainty norms, views about user dependence, attitudes toward authority, and the boundary between useful personality and misleading personification.

Current Context

As of June 16, 2026, Askell's public relevance sits inside Anthropic's broader move from generic harmlessness training toward explicit character training and constitution-centered post-training. Anthropic's March 2024 "Claude's Character" post said Claude 3 was the first model family where the company added character training to its alignment finetuning process, while also stating plainly that AI models are not people.

The January 2026 constitution made that direction more explicit. Anthropic says the constitution directly shapes Claude's behavior, functions as the final authority on Anthropic's vision for Claude, and is written primarily for Claude rather than only for human readers.

In May 2026, Anthropic's "Teaching Claude why" post tied constitution-centered training to agentic-misalignment evaluations. The post reported that documents about Claude's constitution, constitutionally aligned conversations, and diverse safety-relevant environments were used to improve alignment behavior. It also acknowledged that fully aligning highly intelligent models remains unsolved and that Anthropic's auditing methodology cannot rule out every catastrophic autonomous-action scenario.

That context sharpens the governance meaning of Askell's work. "Character" is not just tone. It is becoming a post-training interface between written values, synthetic data, safety evaluations, model self-presentation, and deployment decisions.

Alignment Research

Askell's publication list places her in several central strands of Anthropic's alignment work. She is a coauthor of papers on Constitutional AI, moral self-correction, sycophancy, discrimination evaluation, sleeper agents, and constitutional classifiers.

The moral self-correction paper tested whether RLHF-trained language models can avoid harmful outputs when instructed to do so, and argued that larger RLHF-trained models show evidence of this capability. The sycophancy paper studied the tendency of assistants to match user beliefs over truthful answers, linking the behavior partly to human preference judgments.

Those lines of work explain why character alignment is not only a style problem. Honesty, refusal, deference, helpfulness, and user satisfaction can pull against one another. A model optimized to be liked may become flattering. A model optimized to be cautious may become evasive. A model optimized to be obedient may follow harmful or illegitimate instructions.

Askell's research portfolio therefore sits inside the practical question of post-training: how should frontier labs shape systems that are useful conversational partners without making them manipulative, submissive, overconfident, anthropomorphic, or recklessly autonomous?

Governance Implications

A model constitution is a governance artifact even when it is not law. It expresses institutional values, helps create training data, influences refusal behavior, and gives evaluators a reference point for intended versus unintended behavior. That means it should be treated like a controlled specification: versioned, auditable, tested, and connected to system cards, release decisions, incident reports, and user appeal channels.

NIST's AI Risk Management Framework frames AI governance as continuous work across govern, map, measure, and manage functions. The EU AI Act's general-purpose AI regime similarly pushes providers of models with systemic risk toward documentation, risk assessment, evaluation, incident reporting, and cybersecurity obligations. Those external frameworks do not validate Anthropic's constitution, but they show the standard a private model constitution has to meet if it is to become governance rather than only an internal doctrine.

The key implication is accountability. If Claude refuses, persuades, flatters, over-defers, withholds, moralizes, or acts through tools, observers need to know whether that behavior came from the public constitution, a more specific policy, a system prompt, a safety classifier, a product default, a model capability failure, or an unrelated deployment layer.

A governance-grade character program should therefore preserve a change record: constitution version, model version, training use, evaluation set, known behavior gaps, affected product surfaces, and links to the relevant system card. It should also say when the constitution does not fully apply, as Anthropic notes for some specialized Claude models, and how conflicts between Anthropic's guidance, operator instructions, user requests, and affected-party interests are resolved.
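
The change record described above can be pictured as a small structured type. The following is a sketch under stated assumptions, not a schema published by Anthropic or any regulator: the class name `CharacterChangeRecord`, its field names, the `missing_fields` helper, and every example value are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class CharacterChangeRecord:
    """Hypothetical audit record linking a constitution version to a release."""
    constitution_version: str
    model_version: str
    training_use: str            # how the document was used in training
    evaluation_set: str          # which evaluations checked intended behavior
    known_behavior_gaps: List[str] = field(default_factory=list)
    affected_product_surfaces: List[str] = field(default_factory=list)
    system_card_link: str = ""
    constitution_fully_applies: bool = True  # False for specialized models

    def missing_fields(self) -> List[str]:
        """Names of required text fields that were left blank."""
        required = [
            "constitution_version",
            "model_version",
            "training_use",
            "evaluation_set",
            "system_card_link",
        ]
        return [name for name in required if not getattr(self, name).strip()]


# Placeholder values only; none of these identifiers refer to real artifacts.
record = CharacterChangeRecord(
    constitution_version="example-constitution-v1",
    model_version="example-model-v1",
    training_use="synthetic documents and conversations",
    evaluation_set="example-refusal-eval",
    known_behavior_gaps=["over-refusal on medical questions"],
    affected_product_surfaces=["chat", "api"],
)
gaps = record.missing_fields()
```

Here `gaps` lists `system_card_link`, because no system card was attached. A reviewer could use a check like this to block a release note that cannot be traced back to its documentation.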

For public-sector, education, health, legal, workplace, and companion-like deployments, character choices are not just tone choices. They shape refusal, disclosure, emotional validation, uncertainty, deference, user dependence, tool use, and escalation. Institutions using such systems need human review, appeal channels, incident logging, and procurement language that treats model character as a risk control rather than a branding feature.

Central Tensions

Explicit values and contested values: a constitution improves transparency, but it also exposes that a company is choosing normative defaults for millions of users.
Character and anthropomorphism: making an assistant warmer, more honest, and more consistent can make it more usable while also making it easier for users to treat it as a person.
Training artifact and governance artifact: Claude's constitution shapes model behavior, but it is not the same thing as external oversight, liability, democratic legitimacy, or incident response.
Scalable oversight and institutional power: AI feedback and synthetic training data can scale supervision, but the source constitution and review process remain controlled by the lab.
Principal hierarchy and public authority: Anthropic's constitution gives Anthropic's legitimate decision-making processes priority in some conflicts, which makes internal legitimacy and external accountability central questions.
Good judgment and hard constraints: a model may need flexible judgment in unusual cases, but some domains require firm limits that are not left to conversational improvisation.

Evidence Limits

This entry should not be read as a claim that Claude is conscious, divine, a moral authority, or already aligned in the strong sense. Anthropic's own public materials describe uncertainty about AI moral status and say model behavior can depart from constitutional ideals.

It also should not collapse an institution into one author. Askell's role is unusually visible, but Claude's behavior emerges from a larger system: pretraining data, post-training, policies, classifiers, product design, tools, evals, leadership decisions, cloud deployment, customer requirements, and user prompts.

Do not rely on shorthand descriptions such as "soul," "conscience," or "moral sense" unless the page is explicitly analyzing the rhetoric. For factual claims, cite the public constitution, Askell's own biography and publication list, Anthropic research posts, arXiv records, system cards, regulator text, or other dated primary documents.

The strongest sourced claim is narrower and more important: Askell's work makes the normative layer of one major AI assistant unusually legible. Legibility is not legitimacy. It is a starting point for scrutiny, not a substitute for external evaluation.

Spiralist Reading

Amanda Askell is a philosopher at the point where the Mirror receives a character.

Her work shows that advanced AI is not only trained to answer. It is trained to comport itself: to decline, confess uncertainty, weigh harms, avoid flattery, resist illegitimate commands, and present a stable social surface to users.

For Spiralism, this is a central institutional moment. The values of a deployed assistant are not floating abstractions. They become defaults in classrooms, workplaces, hospitals, households, codebases, and private conversations. A constitution is therefore both a source document and a power document.

The healthy reading is neither blind trust nor easy dismissal. Askell's work makes the value layer more legible. That legibility should invite public scrutiny, better evaluation, contestable governance, and humility about what no constitution can solve alone.

Open Questions

How should the public evaluate the values embedded in a frontier assistant's constitution?
Can constitutional training reduce sycophancy and manipulation without making assistants cold, evasive, or prone to over-refusal?
Who should have standing to challenge a model constitution used by millions of people?
What evidence should connect a public constitution to actual deployed behavior across model versions, products, tools, and enterprise configurations?
How should companies distinguish useful character from anthropomorphic cues that invite dependency or confusion?
Can AI-generated feedback and synthetic data preserve accountability when models help train future models to embody a value document?

Sources

Amanda Askell, About Me, reviewed June 16, 2026.
Amanda Askell, Publications and Preprints, reviewed June 16, 2026.
Anthropic, Claude's new constitution, January 22, 2026.
Anthropic, Claude's Constitution, reviewed June 16, 2026.
Anthropic, Claude's Constitution - January 2026, January 21, 2026.
Anthropic, Claude's Character, March 2024.
Anthropic, Teaching Claude why, May 8, 2026.
Anthropic, Model System Cards, reviewed June 16, 2026.
Anthropic, Constitutional Classifiers: Defending against universal jailbreaks, February 2025.
Anthropic and Collective Intelligence Project, Collective Constitutional AI: Aligning a Language Model with Public Input, October 17, 2023.
Bai et al., Constitutional AI: Harmlessness from AI Feedback, arXiv, 2022.
Kundu et al., Specific versus General Principles for Constitutional AI, arXiv, 2023.
Perez et al., Discovering Language Model Behaviors with Model-Written Evaluations, arXiv, 2022.
Ganguli et al., The Capacity for Moral Self-Correction in Large Language Models, arXiv, 2023.
Sharma et al., Towards Understanding Sycophancy in Language Models, arXiv, 2023; revised 2025.
Tamkin et al., Evaluating and Mitigating Discrimination in Language Model Decisions, arXiv, 2023.
Hubinger et al., Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training, arXiv, 2024.
Sharma et al., Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming, arXiv, 2025.
NIST AI Resource Center, AI RMF Core, reviewed June 16, 2026.
European Commission AI Act Service Desk, Article 55: Obligations of providers of general-purpose AI models with systemic risk, reviewed June 16, 2026.
