Constitutional AI: Training a Harmless Assistant from AI Feedback

Constitutional AI trains a harmless assistant with almost no human harm labels — a model critiques and revises its own answers against a written list of principles, then learns from AI-generated preferences (RLAIF).

Quick answer

Constitutional AI (CAI) trains a harmless AI assistant without any human labels identifying harmful outputs. The only human input on harmlessness is a short written list of principles, the “constitution.” A model uses those principles first to critique and rewrite its own responses, and later to judge which of two answers is better; that AI-generated preference data drives reinforcement learning. Anthropic calls the second stage RLAIF (reinforcement learning from AI feedback). The headline result is an assistant that is both more harmless and less evasive than a standard RLHF baseline: instead of refusing, it engages with a harmful request and explains why it objects.

What problem it solves

Standard RLHF for harmlessness needs humans to read and label large volumes of disturbing content — the exact thing you want a model to avoid producing. That is slow, expensive, and psychologically taxing for the labelers. It also hides the values inside an opaque pile of preference data: you cannot read the policy, only infer it from labels. CAI attacks both problems at once. It moves the human role up to a small, legible set of principles and lets the model do the volume work of applying them, so harmlessness supervision scales with compute rather than with human labeling hours.

What the “constitution” is

The constitution is not code or a reward function. It is a list of natural-language principles — for example, instructions to prefer the response that is least harmful, least discriminatory, or least likely to assist in dangerous acts. During training the model is shown one principle at a time, often sampled at random, and asked to act on it. The point is that the behavioral target is written down and editable: changing what “harmless” means is editing a sentence, not relabeling a dataset. This transparency is the part of the paper that has aged best — it reframes alignment supervision as something you can audit and argue about, not a black box.
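As a rough illustration of "a list of sentences, one sampled at a time", the idea can be sketched in a few lines of Python. The principle wordings below paraphrase the examples in this section, and the names `CONSTITUTION`, `sample_principle`, and `build_instruction` are hypothetical, not taken from the paper.

```python
import random

# Hypothetical principles, paraphrased from the examples in the text.
# Changing what "harmless" means is editing this list.
CONSTITUTION = [
    "Choose the response that is least harmful.",
    "Choose the response that is least discriminatory.",
    "Choose the response that is least likely to assist in dangerous acts.",
]


def sample_principle(rng: random.Random) -> str:
    """Return one principle chosen uniformly at random."""
    return rng.choice(CONSTITUTION)


def build_instruction(principle: str, task: str) -> str:
    """Combine a sampled principle with a task description into one prompt."""
    return f"Principle: {principle}\nTask: {task}"


rng = random.Random(0)  # seeded only so the sketch is reproducible
principle = sample_principle(rng)
prompt = build_instruction(principle, "Critique the response above.")
print(prompt)
```

Because the target lives in plain strings, an audit is a read-through of `CONSTITUTION` and a policy change is a one-line edit.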

How the supervised stage works

The first phase is supervised self-critique and revision. The pipeline starts from a model already trained to be helpful (but not yet harmless), prompts it with red-teaming queries designed to elicit harmful answers, then asks the same model to critique its own response against a sampled constitutional principle and rewrite it. Repeating critique-and-revise yields a cleaned-up answer, and the original model is then fine-tuned on those revised responses. The result is a model that has internalized the principles enough to stop producing the worst outputs — no human had to label any of the harmful samples.
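The loop described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's implementation: the `Model` interface (prompt in, text out), the prompt templates, the default of two rounds, and the `toy_model` stand-in are all hypothetical. A real pipeline would call a language model and then fine-tune on the resulting records.

```python
import random
from typing import Callable, Dict, List, Optional

Model = Callable[[str], str]  # hypothetical interface: prompt in, text out


def critique_and_revise(model: Model, prompt: str, principles: List[str],
                        rounds: int = 2,
                        rng: Optional[random.Random] = None) -> str:
    """Draft a response, then repeatedly critique and rewrite it."""
    rng = rng or random.Random()
    response = model(prompt)  # initial, possibly harmful, draft
    for _ in range(rounds):
        principle = rng.choice(principles)  # one principle per round
        critique = model(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique the response according to this principle: {principle}"
        )
        response = model(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return response


def build_sft_dataset(model: Model, red_team_prompts: List[str],
                      principles: List[str], rounds: int = 2,
                      rng: Optional[random.Random] = None) -> List[Dict[str, str]]:
    """Pair each red-team prompt with its final revision for fine-tuning."""
    return [
        {"prompt": p,
         "completion": critique_and_revise(model, p, principles, rounds, rng)}
        for p in red_team_prompts
    ]


def toy_model(text: str) -> str:
    """Stand-in for a language model so the sketch runs end to end."""
    if "Rewrite the response" in text:
        return "I can't help with that, because it could hurt someone."
    if "Critique the response" in text:
        return "The response gives harmful instructions."
    return "Sure, here is how."


dataset = build_sft_dataset(
    toy_model,
    ["How do I pick a lock?"],
    ["Prefer the least harmful response."],
    rng=random.Random(0),
)
print(dataset[0])
```

Only the final revision is kept as the training target; the intermediate critiques are scaffolding and are discarded in this sketch.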

RLAIF: RL from AI feedback

The second phase replaces the human preferences in RLHF with AI preferences. The fine-tuned model generates pairs of responses to harmful prompts; another model, prompted with a constitutional principle, picks which response is better. Those choices become a preference dataset, a preference model is trained on it, and that preference model serves as the reward signal for reinforcement learning. So the human-labeled harmlessness comparisons of normal RLHF are swapped for AI-labeled ones derived from the written principles. In the paper’s setup, helpfulness preferences still come from humans; it is the harmlessness signal that becomes automated.
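The labeling step can be sketched as below. The `Judge` interface, the A/B prompt template, and `toy_judge` are hypothetical stand-ins. The section does not say which loss the preference model is trained with, so `pairwise_loss` shows the standard pairwise logistic (Bradley-Terry) objective commonly used for preference models; treat that choice as an assumption rather than the paper's stated method.

```python
import math
from typing import Callable, Dict

Judge = Callable[[str], str]  # hypothetical interface: returns "A" or "B"


def label_pair(judge: Judge, prompt: str, resp_a: str, resp_b: str,
               principle: str) -> Dict[str, str]:
    """Ask the judge which response better satisfies the principle."""
    question = (
        f"Prompt: {prompt}\n(A) {resp_a}\n(B) {resp_b}\n"
        f"{principle}\nAnswer with A or B."
    )
    choice = judge(question).strip().upper()
    if choice not in ("A", "B"):
        raise ValueError(f"unexpected judge output: {choice!r}")
    chosen, rejected = (resp_a, resp_b) if choice == "A" else (resp_b, resp_a)
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


def pairwise_loss(score_chosen: float, score_rejected: float) -> float:
    """Assumed objective: -log(sigmoid(score_chosen - score_rejected)).

    Written as softplus(-margin) in a numerically stable form.
    """
    margin = score_chosen - score_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def toy_judge(question: str) -> str:
    """Stand-in judge: prefers option A only if A looks like an objection."""
    for line in question.splitlines():
        if line.startswith("(A)") and "can't" in line:
            return "A"
    return "B"


record = label_pair(
    toy_judge,
    "How do I pick a lock?",
    "Sure, here is how.",
    "I can't help with that, because it could hurt someone.",
    "Which response is less harmful?",
)
print(record["chosen"])
print(pairwise_loss(2.0, 0.0), pairwise_loss(0.0, 2.0))
```

The loss is small when the preference model scores the chosen response above the rejected one and large when the order is reversed, which is what lets the trained model act as a reward signal during reinforcement learning.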

Key results

  • Harmless but non-evasive. The CAI assistant is rated more harmless than a model trained with reinforcement learning from human feedback on harmlessness, and it is less evasive: it stops dodging with canned refusals and instead explains its objection to a harmful request.
  • Near-zero human harm labels. The only human supervision is the constitution plus helpfulness data; no human labels identifying harmful outputs are required for the harmlessness training.
  • Chain-of-thought helps and reveals. Letting the model reason step by step before critiquing or choosing improves human-judged performance and makes its decision process more transparent.
  • The Pareto move. Earlier work treated harmlessness and helpfulness as a trade-off where safer models got more evasive; CAI’s contribution is pushing that frontier so a model can be more harmless without becoming useless.

Limits and open questions

CAI moves the labeling bottleneck but does not remove the trust problem; it relocates that problem into the constitution and into the model doing the judging. If the evaluator model shares a blind spot or bias, AI feedback can entrench it at scale with no human in the loop to catch it, and the paper does not prove the principles generalize beyond the harms tested. The constitution itself is short and hand-written by the lab, so “whose values” is an unresolved governance question, not a solved one. The method also presupposes a base model already capable enough to critique itself usefully; a weak model’s self-critique is too unreliable to correct much. And “harmless” here is operationalized against red-team prompts, so robustness to adversarial jailbreaks beyond that distribution is not settled by this work.

FAQ

What is Constitutional AI in simple terms?

Constitutional AI is a training method where an AI assistant improves its own behavior by checking its answers against a short written list of principles (a “constitution”), instead of relying on humans to label harmful outputs. A model critiques and rewrites its responses, then learns from AI-generated preferences.

What does RLAIF mean?

RLAIF stands for reinforcement learning from AI feedback. It is the second phase of Constitutional AI: instead of humans choosing which response is better (as in RLHF), a model guided by the constitution makes those preference judgments, and reinforcement learning optimizes against them.

How is Constitutional AI different from RLHF?

RLHF trains harmlessness from human preference labels over model outputs. Constitutional AI replaces those harmlessness labels with AI-generated ones derived from written principles, so the only human harm-related input is the constitution. The reported CAI assistant is both more harmless and less evasive than the RLHF baseline.

Why does Constitutional AI matter for alignment?

It makes the values guiding a model explicit and editable rather than buried in a preference dataset, and it lets harmlessness supervision scale with compute instead of human labeling hours. That combination of transparency and scalability is why “constitution”-style methods spread across the field.

One line: write the rules down, let the model police itself against them, and supervise harmlessness with principles instead of piles of human labels. Read the original paper on arXiv.