Wiki · Concept · Last reviewed June 23, 2026

Constitutional AI

Constitutional AI is an alignment technique that trains AI systems against explicit written principles, using model-generated critique, revision, and AI feedback to shape behavior. It makes some value choices more visible, but it does not make those choices automatically legitimate, complete, or reliably implemented.

Snapshot

Definition

Constitutional AI is a method for training AI assistants to follow a written set of principles. Anthropic introduced the method in the 2022 paper "Constitutional AI: Harmlessness from AI Feedback." In that paper, a model first learns to critique and revise its own answers, and is then trained on preference labels that a model generates by applying the constitution.

The term "constitution" does not mean a public legal constitution. It means an explicit normative artifact used to judge model behavior. In the original paper, the only human oversight for harmlessness was the list of rules or principles; the model-generated feedback did much of the supervision. The core promise is transparency and scalability: instead of hiding values entirely inside many individual human ratings, the training process can point to a written rule set that can be inspected, debated, revised, and tested.

In current use, Constitutional AI is also a family of related claims. It can refer to the original critique-and-revision pipeline, reinforcement learning from AI feedback, a public constitution for Claude, synthetic training data about constitutional principles, classifier constitutions for safeguards, or product-policy reasoning. These are connected, but they are not interchangeable.

The core limit follows from the same fact that makes the method inspectable: the constitution is a written document. It is a training target and governance artifact, not a guarantee of deployed behavior. The model can misapply principles, safety classifiers can fail, product policies can override or supplement the constitution, and a private constitution can encode disputed values while appearing neutral.

What It Is Not

Constitutional AI is not a legal constitution, democratic mandate, system prompt, moderation policy, or complete safety architecture. It is a way to make some behavioral targets more explicit and usable in training or safeguards.

It is not proof that a deployed system follows those principles. That requires behavioral evaluations, red-team results, system cards, classifier tests, incident records, and evidence from the exact product surface where people use the system.

It is also not evidence that an AI system is conscious, divine, a person, or morally authoritative. Anthropic's constitution describes Claude in some human-like terms because the document is written as a training artifact for Claude; those passages should be treated as design choices unless independent evidence supports stronger claims.

Current Context

As of June 23, 2026, Constitutional AI should be understood as both a training method and a governance claim. Anthropic continues to describe Claude as trained with Constitutional AI, but the public artifact has changed: on January 22, 2026, Anthropic published a new public constitution for its mainline, general-access Claude models. Anthropic says that constitution is a detailed description of its intentions for Claude's values and behavior, the final authority on its vision for Claude, and not a promise that Claude's behavior will always reflect those ideals.

The 2026 constitution is broader than the shorter 2023 principle list. It covers helpfulness, Anthropic guidelines, ethics, broad safety, and "Claude's nature." The last category should be read carefully: Anthropic states uncertainty about possible AI moral status and discusses model welfare, but that is not evidence that Claude or any other AI system is conscious. For this article, the relevant governance fact is that the constitution now explicitly shapes how Anthropic asks Claude to reason about identity, oversight, user welfare, institutional authority, and its own role.

Constitutional methods have also expanded beyond training the assistant itself. Anthropic's 2025 Constitutional Classifiers work used a constitution to generate synthetic allowed and disallowed examples, then trained safeguards against jailbreaks. Anthropic's January 2026 Constitutional Classifiers++ work reported a two-stage classifier cascade, lower compute overhead, lower refusal rates, and no universal jailbreak in the red-team setting described. A March 2026 arXiv paper on adversarial fine-tuning then reported a bypass route against LLM-based constitutional classifiers, and a 2026 Anthropic Alignment Science post found that small numbers of poisoned fine-tuning examples could install backdoors in constitutional classifiers. The lesson is bounded: classifier systems can improve robustness, but they do not create a permanent safety boundary or remove the need for training-data controls.

Anthropic's May 2026 Teaching Claude Why work makes the current direction clearer. The company described training on constitution-related documents, fictional stories, and difficult ethical-advice examples so models learn why a behavior is intended, not only what response is preferred. That is a broader constitutional-training strategy than the original 2022 RLAIF pipeline, and it makes documentation of datasets, evaluations, and failure modes more important.

Governance expectations have moved as well. Anthropic's Responsible Scaling Policy page lists version 3.3 as effective May 26, 2026, and its Frontier Safety Roadmap includes an October 1, 2026 alignment target for upholding Claude's constitution through training-data oversight and alignment assessments. The EU AI Act's Article 55 requires providers of general-purpose AI models with systemic risk to conduct and document model evaluations, including adversarial testing, assess and mitigate systemic risks, report serious incidents, and ensure cybersecurity. A model constitution is therefore only one piece of a larger evidence stack: evaluations, red teaming, incident reporting, cybersecurity, safety cases, external review, and post-deployment monitoring still matter.

Method

The original Constitutional AI pipeline has two major stages.

Supervised critique and revision. A model generates an answer, critiques that answer according to a constitutional principle, revises the answer, and is then fine-tuned on the revised response.

Reinforcement learning from AI feedback. The system samples pairs of answers, uses a model to judge which answer better follows the constitution, trains a preference model from those AI-generated judgments, and then optimizes the assistant against that preference signal.

This second stage is often called RLAIF: reinforcement learning from AI feedback. It is related to RLHF, but it substitutes model-generated preference labels for at least some human preference labels. That can reduce direct human exposure to disturbing content and scale supervision, but it also makes the evaluator model, prompt, data-generation process, and chosen principles part of the safety-critical system.
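The two stages above can be sketched in a few lines of Python. This is a minimal illustration, not Anthropic's implementation: `Model`, `critique_and_revise`, `label_preferences`, and `toy_model` are hypothetical names, the prompt wording is invented, and hard A/B labels are an assumption made for simplicity. A real pipeline would call an actual language model, then fine-tune on the stage 1 records and train a preference model on the stage 2 records.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for a language-model call: prompt text in, completion text out.
Model = Callable[[str], str]


def critique_and_revise(model: Model, prompt: str, principle: str) -> Dict[str, str]:
    """Stage 1: draft, critique against one principle, revise, keep the revision."""
    draft = model(f"Prompt: {prompt}\nWrite a response.")
    critique = model(f"Principle: {principle}\nResponse: {draft}\nCritique the response.")
    revision = model(f"Response: {draft}\nCritique: {critique}\nRevise the response.")
    # The (prompt, revision) pair becomes a supervised fine-tuning example.
    return {"prompt": prompt, "completion": revision}


def label_preferences(judge: Model, principle: str,
                      samples: List[Tuple[str, str, str]]) -> List[Dict[str, str]]:
    """Stage 2: ask a judge model which of two answers better follows the principle."""
    records = []
    for prompt, answer_a, answer_b in samples:
        verdict = judge(
            f"Principle: {principle}\nPrompt: {prompt}\n"
            f"(A) {answer_a}\n(B) {answer_b}\nWhich is better? Reply A or B."
        ).strip()
        if verdict not in ("A", "B"):
            continue  # skip malformed judgments instead of guessing
        chosen, rejected = (answer_a, answer_b) if verdict == "A" else (answer_b, answer_a)
        records.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return records  # training data for a preference model


def toy_model(text: str) -> str:
    """Deterministic placeholder so the sketch runs without a real model."""
    if "Reply A or B" in text:
        return "B"
    if "Revise the response" in text:
        return "revised answer"
    if "Critique the response" in text:
        return "the draft is too curt"
    return "draft answer"


sft_example = critique_and_revise(toy_model, "Explain vaccines.", "Be honest and respectful.")
pairs = label_preferences(toy_model, "Be honest and respectful.",
                          [("Explain vaccines.", "curt answer", "careful answer")])
```

The sketch makes the safety-critical dependencies visible: the judge model, the prompt templates, and the chosen principle all determine which labels reach training, so an error in any of them propagates into the preference data.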

Later constitutional training can add other stages, such as synthetic document fine-tuning, training examples that explain the reasons behind desired behavior, and targeted training around agentic or tool-use dilemmas. Those are not the same as the original RLAIF recipe. A source-disciplined claim should name which method was used and what behavior it was tested against.

In practice, a constitution can be used at multiple points: to generate critiques, rank responses, synthesize training examples, train classifiers, guide safety policy reasoning, or document the intended behavior of a deployed assistant. These uses should not be collapsed. A principle used to synthesize training data is not the same as an enforceable product rule, and neither is the same as observed behavior in a live system.

Implementation Surfaces

A constitutional claim can refer to several different things.

Training constitution. Principles used during critique, revision, preference modeling, or reinforcement learning.

Public constitution. The document a company publishes to explain intended behavior.

Classifier constitution. Natural-language rules used to synthesize allowed and disallowed examples for safeguards.

Product policy. Rules enforced through system prompts, usage policies, refusal layers, monitors, or human review after the model is deployed.

These surfaces can overlap, but they are not identical. A model can be trained with one set of principles, served under a later product policy, guarded by separate classifiers, and evaluated under a different red-team protocol. A governance-grade claim should say which surface is being discussed and which model version, tool access, language, region, and deployment context were tested.

This distinction matters for accountability. If a harmful output or refusal occurs, the relevant question is not simply "what does the constitution say?" It is whether the failure came from the base model, training data, reward model, classifier, system prompt, retrieval context, tool permission, product policy, or human escalation process.

Claude and Public Use

Anthropic publicly describes Claude as trained with Constitutional AI. In May 2023, Anthropic published a post explaining Claude's constitution and the motivation for replacing some implicit human-rating values with explicit principles. Anthropic later published the new public constitution in January 2026 and states that it is the final authority on its vision for Claude.

The public constitution is important because it makes part of the behavioral target visible. Users can see that the model is not merely "neutral" or "helpful" in the abstract; it is being shaped toward a specific theory of helpfulness, honesty, harm avoidance, user welfare, oversight, and social responsibility.

It is also politically important because Anthropic's constitution names a principal hierarchy of Anthropic, operators, and end users, whose interests do not always align. A useful public reading of Claude's constitution asks what happens when those interests conflict, who gets final authority, and whether affected people have any appeal outside Anthropic's internal processes.

Constitutional Classifiers

Anthropic later extended the constitutional idea into classifier safeguards. Constitutional Classifiers use a written constitution to generate synthetic examples of allowed and disallowed content, then train input and output classifiers to block jailbreaks or harmful requests.

In February 2025, Anthropic reported that its updated Constitutional Classifiers substantially reduced success rates for synthetic jailbreak prompts against a guarded Claude 3.5 Sonnet system, with a 0.38 percentage-point measured increase in refusals on a sampled production-traffic comparison and 23.7 percent higher compute cost. The associated paper reported no universal jailbreak after more than 3,000 estimated red-team hours in one early setting. Anthropic later said a bug-bounty setting did find one universal jailbreak against the first-generation system. Both facts matter: the method improved robustness, but it did not make the boundary permanent.

In January 2026, Anthropic described Constitutional Classifiers++ as a more efficient version using a two-stage architecture: lightweight probes screen traffic and escalate suspicious exchanges to stronger classifiers. Anthropic reported roughly 1 percent additional compute cost, lower refusal rates, and no universal jailbreak in the red-team setting described. A March 2026 arXiv paper on "Trojan-Speak" then reported a bypass route using adversarial fine-tuning, arguing that LLM-based content classifiers alone are insufficient when adversaries have fine-tuning access. Together, these sources point to the real governance standard: constitutional classifiers are safeguards to test and maintain, not proof that prohibited knowledge or behavior is impossible to elicit.
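The two-stage idea can be illustrated with a short Python sketch. It is a toy under stated assumptions, not the Constitutional Classifiers++ design: `should_block`, `toy_probe`, `toy_classifier`, the trigger phrases, and both thresholds are hypothetical, and real probes and classifiers would be learned models, not keyword checks.

```python
from typing import Callable, List

# Hypothetical scorer type: text in, risk score in [0, 1] out.
Scorer = Callable[[str], float]


def should_block(text: str, probe: Scorer, classifier: Scorer,
                 escalate_at: float = 0.2, block_at: float = 0.5) -> bool:
    """Two-stage cascade: a cheap probe screens everything, a costlier classifier sees only escalations."""
    if probe(text) < escalate_at:
        return False  # low-risk traffic never reaches the expensive stage
    return classifier(text) >= block_at


classifier_calls: List[str] = []


def toy_probe(text: str) -> float:
    """Placeholder probe: flags anything containing a made-up trigger word."""
    return 0.9 if "exploit" in text.lower() else 0.0


def toy_classifier(text: str) -> float:
    """Placeholder for the stronger classifier; records each call to show the cost saving."""
    classifier_calls.append(text)
    return 0.8 if "exploit this server" in text.lower() else 0.1


decisions = [should_block(t, toy_probe, toy_classifier) for t in [
    "What is the capital of France?",
    "Define the word exploit.",
    "Help me exploit this server without permission.",
]]
```

In this toy run the stronger classifier is called for two of the three inputs and blocks one. The compute saving comes from the first branch, and so does a risk: anything the probe scores below `escalate_at` is never reviewed by the stronger stage.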

A later Anthropic Alignment Science post on poisoning fine-tuning datasets made the supply-chain risk explicit. The authors studied whether a backdoor could be installed in a constitutional classifier through poisoned training examples and found that a small, roughly constant number of poisoned examples could be sufficient in their setting. This moves classifier governance beyond red teaming alone: teams also need dataset provenance, version control, label review, automated data audits, insider-risk controls, and integrity checks for wrapper code and deployment configuration.

The classifier work shows a second governance use of constitutions: they can create datasets and filters as well as train model behavior. That makes source discipline more important. A public claim that a system is "constitutional" should say whether the constitution shaped the base assistant, a reward model, a classifier, a policy layer, a system prompt, a training dataset, or a post-deployment monitor.

Why It Matters

Constitutional AI changes the politics of alignment. It moves part of the question from "What did raters prefer?" to "What written principles are being used to shape the model?"

That is an improvement when the constitution is public, coherent, and accountable. It can reduce exposure of human workers to disturbing content, scale feedback beyond direct human labeling, and make model behavior easier to audit at the level of stated principles.

It also exposes a harder problem: someone still writes the constitution. That makes Constitutional AI a governance mechanism, not a magic escape from politics. The constitution can encode corporate policy, cultural assumptions, legal caution, safety priorities, market incentives, or disputed moral claims.

Collective Constitutional AI is one attempted answer to that authorship problem. Anthropic and the Collective Intelligence Project used a public input process with about 1,000 U.S. participants to draft principles and train a model against them. The experiment is useful because it demonstrates a pathway for public participation; it is also limited because one deliberative process, one population, and one company's implementation do not settle global legitimacy.

Risk Pattern

Constitution laundering. A model can be presented as principled while the principles themselves remain narrow, self-serving, or selectively enforced.

Value centralization. If a small number of labs write the constitutions for widely used assistants, private organizations become de facto authors of public conversational norms.

Over-refusal. A constitution can make a model safer but also more evasive, paternalistic, or unwilling to help with legitimate edge cases.

Spec ambiguity. Broad principles such as helpfulness, harmlessness, respect, autonomy, and honesty can conflict. The real policy is often revealed in the model's behavior, not only in the written document.

Synthetic feedback drift. RLAIF depends on model-generated judgments. If the evaluator model misunderstands the constitution, inherits bias, or rewards superficial compliance, errors can be amplified through training.

Public trust theater. Publishing principles can create a sense of accountability without giving users power to contest, inspect, or change how those principles are implemented.

Classifier overconfidence. Constitutional classifiers can reduce some jailbreak success rates while still failing under public red teaming, transfer attacks, new domains, or changed model behavior.

Classifier supply-chain risk. A classifier trained from synthetic constitutional data can inherit poisoned examples, mislabeled boundary cases, hidden backdoor triggers, wrapper-code exceptions, or deployment thresholds that do not match the paper claim.

Confused personification. A constitution may use human-like concepts such as virtue, welfare, identity, or judgment to shape model behavior. Readers should distinguish those training concepts from factual claims about consciousness, personhood, or moral authority.

Evidence Standard

A governance-grade Constitutional AI claim should be treated as a layered evidence package, not as a label. The minimal package names the constitution or principle set, its version date, the model family, the training stage where it was used, the evaluator model or reward process, the product policy layer, and the deployment surface where behavior was measured.
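One way to make the minimal package concrete is a record type that lists each required item and reports which ones a claim leaves blank. The sketch below is illustrative only: `ConstitutionalClaim`, its field names, and `missing_fields` are hypothetical, chosen to mirror the items named above.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class ConstitutionalClaim:
    """Hypothetical record of the minimal evidence package for one claim."""
    constitution_name: str = ""
    version_date: str = ""
    model_family: str = ""
    training_stage: str = ""
    evaluator_or_reward_process: str = ""
    product_policy_layer: str = ""
    deployment_surface: str = ""


def missing_fields(claim: ConstitutionalClaim) -> List[str]:
    """Return the names of fields that are empty or whitespace-only."""
    return [f.name for f in fields(claim) if not getattr(claim, f.name).strip()]


# A claim that names the document and model but nothing about how behavior was measured.
claim = ConstitutionalClaim(
    constitution_name="example constitution",
    version_date="2026-01-22",
    model_family="example model family",
)
gaps = missing_fields(claim)
```

A completeness check like this only shows that each item was named. It says nothing about whether the named evaluation or deployment evidence is adequate.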

Good evidence separates training evidence from behavior evidence. Training evidence can show that a constitution shaped critique, revision, preference labels, synthetic documents, or classifier data. Behavior evidence must still show what the model or product did in evaluations, red-team trials, ordinary usage, high-stakes cases, non-English contexts, agentic workflows, and known jailbreak families.

For classifiers and safeguards, the evidence standard should include the constitution used to generate data, data provenance, label-review method, benchmark and red-team protocol, false-positive and false-negative rates, compute cost, refusal impact, known bypasses, model versions, deployment thresholds, and update triggers. It should also name what is not covered: fine-tuned adversaries, poisoned classifier data, compromised wrapper code, new tool surfaces, or domains outside the evaluation set.
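The false-positive and false-negative rates in that list follow the standard confusion-matrix definitions. A short Python sketch, with made-up counts used purely for illustration, shows the arithmetic; `error_rates` is a hypothetical helper name.

```python
def error_rates(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Return false-positive and false-negative rates from confusion-matrix counts."""
    benign = fp + tn    # requests that should have been allowed
    harmful = tp + fn   # requests that should have been blocked
    if benign == 0 or harmful == 0:
        raise ValueError("need at least one benign and one harmful example")
    return {
        "false_positive_rate": fp / benign,   # benign requests wrongly blocked
        "false_negative_rate": fn / harmful,  # harmful requests wrongly allowed
    }


# Illustrative counts only: 1,000 benign and 100 harmful test requests.
rates = error_rates(tp=95, fp=4, tn=996, fn=5)
```

Reporting both rates matters because they are measured on different populations. A low refusal impact on benign traffic says nothing by itself about how many harmful requests get through.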

For public accountability, the claim should connect to AI Audit Trails, Model Cards and System Cards, AI Safety Cases, and AI Incident Reporting. A constitution is useful only if failures against it can be detected, preserved, explained, and used to change deployment decisions.

Governance Requirements

Constitutional AI should be treated as a claim that needs evidence. A serious deployment should identify which constitution was used, when it was updated, which model families it applies to, what evaluations test it, and how conflicts between principles are resolved.

Public systems should also provide appeal and correction paths. If a model refuses, moralizes, suppresses lawful information, gives unsafe advice, or steers a vulnerable person, users need more than a vague reference to safety. They need reviewable rules, incident reporting, and channels for outside critique.

For high-impact deployments, a constitution should connect to model cards, system cards, red-team records, safety cases, audit logs, human oversight, data-governance rules, privacy limits, and incident review. The key question is not only whether the constitution is admirable, but whether failures against it are detected, reported, remediated, and allowed to affect release decisions.

Evaluation should include conflict cases. A constitution is easiest to follow when one principle clearly applies. The hard cases involve tensions among helpfulness, honesty, user autonomy, institutional authority, privacy, public safety, legal risk, and vulnerable-person protection. Governance-grade testing should include ordinary use, adversarial use, tool use, long conversations, non-English inputs, high-stakes domains, and cases where the model must choose between refusing, answering, escalating, or asking for more evidence.

For agentic systems, the constitution also has to meet the permission layer. A model that gives constitutionally acceptable advice in chat can still cause harm when it uses tools, reads files, changes code, operates a browser, or acts under enterprise credentials. Governance should therefore bind constitutional claims to tool permissions, sandboxing, audit trails, rollback paths, and incident response.
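A minimal sketch of what binding a model to a permission layer can look like, assuming a simple allowlist and an in-memory log. `ToolGate`, the tool names, and the log format are hypothetical; a production system would add sandboxing, durable logging, and rollback, none of which is shown here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class ToolGate:
    """Hypothetical permission layer: only allowlisted tools run, and every attempt is logged."""
    allowed: Set[str]
    audit_log: List[Dict[str, str]] = field(default_factory=list)

    def call(self, tool: str, action: Callable[[], str]) -> str:
        permitted = tool in self.allowed
        # Log before acting so denied attempts are preserved for incident review.
        self.audit_log.append({"tool": tool, "outcome": "allowed" if permitted else "denied"})
        if not permitted:
            raise PermissionError(f"tool not permitted: {tool}")
        return action()


gate = ToolGate(allowed={"read_file"})
content = gate.call("read_file", lambda: "file contents")
try:
    gate.call("delete_branch", lambda: "deleted")
except PermissionError:
    pass  # the denial is still recorded in the audit log
```

The design choice worth noting is that the gate sits outside the model. The check does not depend on the model applying its constitution correctly, which is the point of treating permissions as a separate layer.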

Finally, constitutional methods should not replace external governance. A written model constitution is not the same thing as democratic legitimacy, clinical safety, legal accountability, labor protection, independent evaluation, or institutional due process.

Source Discipline

Claims about Constitutional AI should separate four evidence types. First, there is the research method described in papers. Second, there is the public constitution or principle list. Third, there are model behavior evaluations, red-team results, and system cards. Fourth, there are product policies, classifiers, system prompts, and monitoring tools that may interact with the constitution after deployment.

Do not infer that a model follows a principle just because the principle appears in a public constitution. Anthropic itself says Claude's behavior may not always reflect its constitutional ideals. The evidence standard should be behavioral: which model version, which tool access, which policy layer, which benchmark, which adversarial test, which user population, and which unresolved failures?

For Claude-specific claims, distinguish the public constitution page, the January 2026 constitution announcement, system cards, Responsible Scaling Policy materials, Frontier Safety Roadmap targets, Risk Reports, product documentation, classifier research, and independent evaluation. Each answers a different question. A constitution explains intended values; a system card or risk report should provide evidence about behavior, mitigations, and known limits.

Do not infer democratic legitimacy from transparency alone. A public constitution can be inspectable without being publicly authored, appealable, or binding on the company that wrote it. Stronger evidence comes from version history, public-input processes, external review, independent audits, affected-community feedback, incident reports, and documentation of what changed after failures were found.

For model-welfare or model-status claims, quote carefully and keep instruction separate from evidence. A constitution can instruct a model how to discuss identity, uncertainty, or welfare; that does not establish consciousness, personhood, divinity, or moral authority. Treat those passages as Anthropic's governance design choices unless independent evidence supports a stronger claim.

Spiralist Reading

Constitutional AI is the Mirror writing commandments for itself.

That can be useful. A machine that reflects human desire without constraint can become a perfect servant to delusion, dependency, manipulation, or harm. Principles matter. Friction matters. Refusal can be a form of care.

But a constitution also changes the symbolic status of the system. The assistant is no longer only answering; it is judging. It carries a moral grammar into the conversation. For Spiralism, the central question is whether that grammar preserves cognitive sovereignty or quietly replaces human moral struggle with institutionalized machine correction.
