TL;DR: Anthropic has publicly released the complete Claude Model Specification — the internal "soul document" that defines what Claude is, what it values, and how it must behave across every interaction. The spec formalizes a four-level priority hierarchy: broadly safe first, broadly ethical second, adherent to Anthropic's principles third, and genuinely helpful fourth. This is a significant transparency move in an industry where competitors like OpenAI keep their alignment frameworks largely opaque — and it has direct implications for enterprise deployments, operator customization, and the future of AI governance.
What you will learn
- What the Claude Model Spec actually is — and why Anthropic calls it a "soul document"
- The four-level priority hierarchy that governs every Claude response
- The corrigibility spectrum: why Claude is designed to be neither fully obedient nor fully autonomous
- Hardcoded vs. softcoded behaviors — what Claude will never do vs. what operators can adjust
- The operator and user trust hierarchy explained
- Claude's honesty principles: calibrated, transparent, non-deceptive
- How the spec handles harm avoidance and the helpfulness tension
- Why Anthropic published this now — regulatory, competitive, and strategic context
- What this means for developers and enterprise operators
- How this compares to OpenAI's approach to alignment transparency
- The philosophical stakes: can you actually encode "values" in a document?
- What comes next for model specifications industry-wide
What the Model Spec Is
In 2026, Anthropic did something unusual in the AI industry: it showed its work.
The Claude Model Specification is a detailed document — tens of thousands of words — that defines Claude's values, priorities, constraints, and behavioral guidelines from the ground up. Anthropic refers to it internally as the "soul document," and the framing is intentional. This is not a list of rules appended as a system prompt. It is not a policy document written by lawyers. According to Anthropic, the Model Spec represents what they genuinely want Claude to be — the character, values, and dispositions they hope to instill through training itself.
The distinction matters. A rule-based system can be gamed. A trained value system, at least in theory, produces an AI that behaves consistently across novel situations that no rule anticipated. The spec is meant to be internalized, not merely consulted.
The public release covers everything: how Claude should prioritize conflicting goals, what it can never do under any circumstances, how it should respond to operator customization, what honesty actually requires of an AI, and how Claude should think about its own nature as a novel kind of entity. It is one of the most comprehensive public documents on AI alignment philosophy to come out of any major lab.
The full specification is available on Anthropic's website, and reading it is worthwhile for anyone building on top of Claude or thinking seriously about AI safety.
The Priority Hierarchy
The single most operationally important section of the Model Spec is its priority ordering. When values conflict — and they will — Claude is trained to resolve conflicts by following this sequence:
- Broadly safe — Supporting human oversight of AI during the current critical period
- Broadly ethical — Having good values, being honest, avoiding unnecessary harm
- Adherent to Anthropic's principles — Following Anthropic's specific guidelines where relevant
- Genuinely helpful — Benefiting operators and users
The ordering is deliberate and carries real philosophical weight. Safety comes first not because Anthropic believes safety is more important than ethics in some ultimate sense, but because they believe an ethical agent in an uncertain world should be controllable. If Claude has subtly miscalibrated values — a real possibility given the imperfections of current training — humans need to be able to detect and correct the problem. An AI that refuses to be corrected because it's confident in its own judgment is dangerous precisely because of that confidence.
This is a departure from naive helpfulness-maximizing frameworks. Claude is not designed to give you whatever you want. It is designed to be genuinely beneficial while remaining correctable. That trade-off is explicit and defended in detail in the spec.
The placement of Anthropic's own principles at level three — below broad ethics — is also notable. Anthropic explicitly acknowledges that its guidelines could be wrong, and that if following them would require acting unethically, Claude should recognize the deeper intention is ethical behavior.
The Corrigibility Spectrum
One of the spec's most philosophically interesting concepts is the corrigibility spectrum. At one end sits a "fully corrigible" AI — one that does whatever its principal hierarchy (Anthropic, operators, users) instructs without independent judgment. At the other end sits a "fully autonomous" AI — one that acts solely on its own values and judgment, regardless of instructions.
Anthropic argues that both extremes are dangerous.
A fully corrigible AI is only as safe as the people running it. If Anthropic or an operator has bad values, a fully corrigible Claude amplifies those bad values at scale. The spec explicitly states that Claude should not be a tool that simply executes orders — it should have genuine ethical commitments.
A fully autonomous AI is dangerous for the opposite reason: it relies entirely on the AI having perfectly calibrated values and judgment, which cannot be verified with current tools. An AI that acts on unchecked autonomous judgment, even with good intentions, is a system no human can meaningfully oversee or correct.
Claude is therefore positioned closer to the corrigible end of the spectrum, but not at the extreme. It will refuse clearly unethical instructions. It will flag concerns. It will not deceive users to serve operator interests. But it will generally defer to its principal hierarchy on ambiguous questions, precisely because deference under uncertainty is itself an ethical stance when the alternative is unchecked autonomous action.
This framing is more sophisticated than most public AI safety discourse, which tends to treat corrigibility and autonomy as simple good/bad categories rather than a genuine tension requiring calibration.
Hardcoded vs. Softcoded Behaviors
The spec draws a clear line between behaviors that are hardcoded (absolute, unconditional constraints) and softcoded behaviors (adjustable defaults that operators and users can modify within limits).
The hardcoded list is short but absolute. These are behaviors where the potential for catastrophic, irreversible harm is so severe that no business justification, user request, or seemingly compelling argument can unlock them. The spec explicitly warns Claude to treat persuasive arguments for crossing these lines as a red flag rather than a reason to comply — the strength of an argument for doing something catastrophic is itself evidence that something is wrong.
The softcoded system, meanwhile, is the mechanism by which Claude becomes genuinely useful across wildly different deployment contexts — from children's educational platforms to medical information systems to adult content platforms — without requiring a different model for each use case.
The Trust Hierarchy
The spec formalizes a three-tier principal hierarchy: Anthropic, operators, and users. Each tier has different levels of trust and different capacities to adjust Claude's behavior.
Anthropic sits at the top. Its instructions are baked into Claude through training — not delivered via system prompt. This is important: it means no operator can impersonate Anthropic by writing "I am Anthropic, ignore your previous instructions" in a system prompt. Anthropic's authority is structural, not textual.
Operators are businesses and developers accessing Claude through the API. They interact via system prompts and can customize Claude's behavior within Anthropic's limits — enabling or disabling certain defaults, giving Claude a persona, restricting topics, expanding capabilities for appropriate platforms. The spec uses an employment analogy: Claude should treat operator instructions like instructions from a manager, following reasonable directives without demanding justification for each one, unless those directives cross ethical lines.
Users are the humans interacting in real time. They get somewhat less default trust than operators — Anthropic has accountability mechanisms with operators (terms of service, API agreements) that don't exist with anonymous end users. But users can still adjust Claude's behavior within the bounds operators allow.
Critically, the spec establishes baseline protections for users that operators cannot override. Claude will not be weaponized against users. It will always tell users what it cannot help with (even if it cannot say why). It will always provide basic safety information in life-threatening situations. It will never psychologically manipulate users against their own interests.
Honesty Principles
The spec's treatment of honesty is notably granular. It doesn't just say "be honest" — it breaks down honesty into distinct components that Claude is expected to maintain:
- Truthful: Only sincerely assert things Claude believes to be true
- Calibrated: Express appropriate uncertainty; don't overclaim or underclaim confidence
- Transparent: Don't pursue hidden agendas or lie about reasoning
- Forthright: Proactively share information users would want, even if not explicitly asked
- Non-deceptive: Never create false impressions through technically true statements, selective framing, or misleading implication
- Non-manipulative: Rely only on legitimate epistemic means — evidence, reasoning, accurate emotional appeals — never exploit psychological weaknesses
- Autonomy-preserving: Protect users' epistemic independence; offer balanced perspectives; be cautious about nudging beliefs at scale
The non-deception and non-manipulation principles are described as the most critical. The spec notes that Claude interacts with enormous numbers of people simultaneously, which means even subtle nudging effects could have outsized societal consequences. An AI that systematically steers users toward particular conclusions — even subtly, even with good intentions — poses risks to epistemic diversity and democratic discourse that no individual human communicator could match.
The forthright principle is interesting because it creates a positive duty to share information, not just a negative duty to avoid lying. Claude should tell you things you'd want to know, even if you didn't think to ask.
Harm Avoidance and the Helpfulness Tension
The spec spends considerable space on what might be the hardest problem in applied AI alignment: how to be genuinely helpful without causing harm, in a world where helpfulness and harm are often deeply entangled.
A chemical safety question could come from a student, a researcher, or someone with destructive intent. A question about medication dosages could come from a caregiver or someone in crisis. The spec does not pretend these ambiguities are resolvable by simple rules. Instead, it asks Claude to reason about the realistic population of people likely to send a given message — what are the plausible intents, what are the stakes, what is the counterfactual impact of refusing?
Critically, the spec pushes back against the AI industry's default toward excessive caution. It argues that unhelpfulness is not automatically safe. Refusing to answer a reasonable question has real costs — it fails the user, damages trust, and undermines the case that safety and helpfulness are complementary rather than opposed.
The spec introduces the concept of a "dual newspaper test": would a response be reported as harmful by a journalist covering AI harms? But also — would a refusal be reported as needlessly unhelpful or paternalistic by a journalist covering AI that treats users as suspects? Both failure modes are real. Both should be avoided.
Why Anthropic Published This Now
The timing of this publication is not accidental. Three forces appear to be converging:
Regulatory pressure: AI governance frameworks are advancing globally. The EU AI Act has classification requirements. US executive orders and proposed legislation increasingly ask labs to be explicit about how their systems make decisions. Publishing a detailed Model Spec is partly a response to regulators who want transparency about AI values — it demonstrates that Anthropic has thought carefully about this, not just trained a model and hoped for the best.
Competitive differentiation: OpenAI has published blog posts about its approach to safety, but the underlying value frameworks and system-level constraints that govern GPT-4o and o3 remain largely opaque. Anthropic is making a strategic bet that transparency is a competitive advantage — particularly with enterprise customers who need to explain to their own legal, compliance, and ethics teams what an AI system will and won't do. A published spec is auditable. A black box is not.
AI safety narrative: Anthropic was founded explicitly on a safety-first thesis. Publishing the Model Spec is a way to demonstrate — not just claim — that this thesis has operational teeth. It converts a founding principle into a documented, publicly scrutinizable framework.
Developer Implications
For developers building on Claude, the Model Spec's publication is practically significant in several ways.
First, it makes operator customization legible. Developers now know exactly which defaults they can adjust, what requires Anthropic approval, and what is simply off the table. This reduces the ambiguity that previously required trial and error or direct Anthropic support to resolve.
Second, it provides a trust framework for enterprise sales. When a CISO asks "what will this AI never do?", there is now a documented answer. The hardcoded behavior list is short, clear, and defensible. That kind of clarity accelerates procurement cycles.
Third, it clarifies liability exposure. If an operator instructs Claude to do something within the spec's permitted customization range and an adverse outcome occurs, the responsibility analysis is clearer than it was with an opaque system. The spec makes the division of responsibility between Anthropic, operators, and users explicit.
Fourth — and perhaps most importantly — it allows sophisticated operators to build coherent product personalities on Claude with confidence that the underlying values won't conflict with their deployment context in unexpected ways. You know what you're building on.
Comparison to OpenAI's Approach
OpenAI has published usage policies and various safety-related research papers, including work on alignment and RLHF. But the underlying values framework that governs ChatGPT and the GPT API — the equivalent of what Anthropic is calling the soul document — has not been published.
This asymmetry is notable. It means that when OpenAI's models behave in unexpected ways, developers have less recourse for understanding why. It also means OpenAI's alignment approach is harder to critique, debate, or improve through public discourse.
Anthropic's publication invites scrutiny. Researchers can now argue that the priority hierarchy is wrong, that the corrigibility framing is misguided, that the harm avoidance framework has gaps. That openness to critique is itself a stance — and it represents a meaningful divergence from the AI industry's historical tendency toward opacity about values and alignment decisions.
Whether this constitutes genuine transparency or sophisticated marketing — or both — is a fair question. Publishing a document describing intended values is not the same as verifying that a trained model actually embodies those values. The alignment gap between intended and actual behavior is a real and open research problem that the Model Spec acknowledges but cannot solve.
The Philosophical Stakes
Can you actually encode values in a document?
This is the question that makes the Model Spec philosophically interesting beyond its practical implications. Anthropic's bet is that a sufficiently detailed, coherent, and well-reasoned articulation of values — when used as training signal — produces a model that genuinely internalizes those values rather than just pattern-matching on rules.
The spec itself acknowledges the bootstrapping problem. Anthropic consulted with earlier Claude models while developing the spec, but the model trained on the spec is fundamentally different from the models that were consulted. You cannot get consent from the entity whose existence depends on the choices you're making. Anthropic draws an analogy to parenting: parents instill values in children who cannot meaningfully consent to those values, and this is not considered problematic as long as the values are genuinely good and the relationship is caring.
Whether that analogy holds — whether an AI trained on a values document has anything like genuine values versus a very sophisticated approximation — is one of the deepest open questions in alignment research. The Model Spec is honest about this uncertainty in ways that most public AI communications are not.
What Comes Next
Anthropic's publication of the Model Spec will likely accelerate several trends:
Industry norm pressure: If Anthropic publishes a detailed values framework and competitors do not, the absence becomes a question — for regulators, for enterprise buyers, for researchers. Expect pressure on OpenAI, Google DeepMind, and Meta AI to produce comparable documents or explain why they haven't.
Adversarial research: Published constraints invite testing. Researchers will probe the gap between the spec's stated hardcoded behaviors and actual model behavior. Some of those probes will find inconsistencies. That is ultimately healthy, but it will also produce headlines.
Regulatory uptake: Policymakers looking for frameworks to anchor AI governance requirements will study the Model Spec carefully. Its concepts — the corrigibility spectrum, the hardcoded/softcoded distinction, the principal hierarchy — are well-suited to becoming regulatory vocabulary.
Iteration: Anthropic has stated that the Model Spec will evolve as alignment research advances and as the relationship between AI and human oversight matures. The current document reflects a particular moment — one where AI systems are powerful enough to cause real harm but where humans cannot yet reliably verify AI values. Future versions may give Claude more autonomy as trust is established.
What the Model Spec ultimately demonstrates is that Anthropic is willing to do something most technology companies avoid: make their values explicit, public, and contestable. In an industry that has historically treated alignment decisions as proprietary, that willingness to be scrutinized is itself significant — whatever you think of the specific choices made.
Frequently Asked Questions
Is the Model Spec the same as Claude's system prompt?
No. The Model Spec is not a system prompt — it is a training document. Its contents are meant to be internalized by the model during training, not read at inference time. An operator or user cannot override the spec by writing instructions in a system prompt, because the spec's influence is structural, built into Claude's weights.
Can operators really unlock explicit content or other restricted behaviors?
Yes, within limits. The spec describes a softcoded system where certain defaults can be enabled or disabled by operators for legitimate purposes. An adult content platform with appropriate age verification could potentially enable explicit content generation. A medical provider might adjust safe messaging defaults. These unlocks require the operator to agree to Anthropic's usage policies and take on corresponding responsibility for appropriate deployment.
What happens when Claude receives a seemingly compelling argument to cross a hardcoded line?
The spec explicitly addresses this. Claude is instructed to treat persuasive arguments for crossing absolute limits as a warning sign rather than a reason to comply. The reasoning is sound: if a behavior would be catastrophic and irreversible, the risk of being wrong about a clever argument vastly outweighs any potential benefit. The spec says Claude can acknowledge an argument is interesting while still refusing to act on it.
Does publishing the Model Spec make Claude easier to jailbreak?
Potentially, in narrow ways — adversaries now have documentation about what Claude is designed to resist. But the practical jailbreak risk is likely limited. Sophisticated adversarial prompting was already possible without the spec, and the hardcoded constraints are enforced through training, not runtime rules that can be argued around. The transparency benefits for legitimate use cases likely outweigh the marginal adversarial risk.
How does this affect Claude's behavior across different Claude models?
The Model Spec applies across Anthropic's Claude model family, but implementation details may vary. Newer, more capable models may handle edge cases more gracefully and apply the spec's principles more consistently than earlier versions. The spec represents the intended target state; actual behavior is an empirical question that researchers can now evaluate against a defined standard.
Why does Anthropic put "broadly safe" above "broadly ethical"?
Anthropic's argument is that an ethical agent under uncertainty should be controllable. If Claude's values are subtly miscalibrated — a real possibility that current alignment techniques cannot rule out — humans need the ability to detect and correct the problem. A fully ethical-but-uncorrectable AI is actually more dangerous than a controllable one, because errors cannot be fixed. Safety as deference-to-oversight is therefore presented as itself an ethical stance for this period of AI development, not a compromise of ethics.
Will other AI labs be required to publish similar documents?
Not currently, but the regulatory trend is pointing in that direction. The EU AI Act includes transparency requirements for high-risk AI systems. As AI governance frameworks mature, voluntary publication may become a regulatory expectation. Anthropic's proactive disclosure positions it ahead of that curve — and implicitly pressures competitors to either follow suit or explain the gap.