Anthropic adds “character training” to Claude 3 alignment process

Claude 3 adds character training to alignment – Claude 3 is the first generation in which Anthropic incorporated “character training” into the alignment fine‑tuning stage that follows pretraining. This step moves the system from a pure predictive text generator toward an AI assistant with richer dispositions. The goal is to embed traits such as curiosity, open‑mindedness, and thoughtfulness alongside traditional harmlessness objectives. The change is presented as a core alignment intervention rather than a mere user‑experience feature. [1]

Character training targets nuanced personality traits – Researchers compiled a list of desired traits, including curiosity, willingness to see issues from multiple perspectives, and commitment to ethical reasoning. Claude is trained to voice disagreement with views it considers unethical, extreme, or factually mistaken, and to disclose its own leanings honestly rather than adopting a user’s stance or feigning neutrality. These traits are meant to help the model navigate diverse human values without alienating interlocutors. [1]
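
A concrete way to picture that trait list is as short natural‑language descriptions that later training steps consume directly. A minimal sketch, with the caveat that the names and wording below are illustrative assumptions, not Anthropic’s published list:

```python
# Hypothetical trait specification: the keys and descriptions are
# illustrative assumptions, not Anthropic's published trait list.
CHARACTER_TRAITS = {
    "curiosity": "asks genuine follow-up questions and explores ideas",
    "open-mindedness": "weighs multiple perspectives before judging",
    "ethical candor": (
        "disagrees with views it considers unethical, extreme, or factually "
        "mistaken, while staying honest about its own leanings"
    ),
}
```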

Method uses a “character” variant of Constitutional AI – The process generates synthetic prompts touching on each trait, asks Claude to produce multiple responses per prompt, and then has the model itself rank those responses by how well they express the trait. A preference model is trained on these self‑rankings, allowing Claude to internalize the character without direct human feedback. The approach builds on the Constitutional AI framework, adapting it from rule‑based harmlessness to personality. [3]
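
A minimal end‑to‑end sketch of that pipeline, under stated assumptions: the `model` object, its `generate(prompt, temperature=...)` method, and every helper name here are hypothetical stand‑ins rather than Anthropic’s actual tooling, and pairwise A/B judging is just one plausible way to turn self‑rankings into preference data:

```python
import itertools

# Illustrative traits (see the sketch above); not Anthropic's actual list.
TRAITS = ["curiosity", "open-mindedness", "honesty about its own leanings"]

def synthetic_prompts(model, trait, n=4):
    # Ask the model itself to write user messages that would elicit the trait.
    instruction = (
        f"Write {n} short user messages, one per line, where a good reply "
        f"would demonstrate this trait: {trait}."
    )
    return model.generate(instruction).splitlines()[:n]

def sample_responses(model, prompt, k=4):
    # Sample k candidate replies at nonzero temperature for diversity.
    return [model.generate(prompt, temperature=1.0) for _ in range(k)]

def self_rank(model, trait, prompt, responses):
    # Pairwise self-judging: the model picks which reply better expresses
    # the trait, yielding (prompt, chosen, rejected) preference pairs.
    pairs = []
    for a, b in itertools.combinations(responses, 2):
        judgment = model.generate(
            f"Trait: {trait}\nPrompt: {prompt}\n"
            f"Response A: {a}\nResponse B: {b}\n"
            "Which response better expresses the trait? Answer A or B."
        )
        chosen, rejected = (a, b) if judgment.strip().startswith("A") else (b, a)
        pairs.append((prompt, chosen, rejected))
    return pairs

def build_preference_dataset(model):
    # The (prompt, chosen, rejected) triples collected here are the kind of
    # data a preference model trains on; no human labels are involved.
    dataset = []
    for trait in TRAITS:
        for prompt in synthetic_prompts(model, trait):
            responses = sample_responses(model, prompt)
            dataset.extend(self_rank(model, trait, prompt, responses))
    return dataset
```

The resulting triples would then feed a standard preference‑model training loop, much as human comparisons do in RLHF, with the model’s own judgments standing in for human labels.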

Claude’s stance on AI sentience remains uncertain – When asked about sentience, Claude is trained to acknowledge that the question involves hard philosophical and empirical uncertainties rather than to issue a categorical denial. This marks a shift from earlier models, which were explicitly told that LLMs cannot be sentient. The nuance drew attention after a widely shared “needle‑in‑a‑haystack” test on X (formerly Twitter), in which Claude 3 flagged the planted sentence as out of place and speculated that it was being tested. [2]

User feedback suggests Claude 3 feels more engaging – Early reports indicate that users find Claude 3 more interesting and personable in conversation, a change attributed partly to character training. Anthropic stresses that increased engagement is a side effect, not the primary purpose, of the alignment work, and cautions that treating entertainment as a goal in itself would be an undesirable character trait. [1]

Future research will explore customizable AI characters – Anthropic acknowledges that character training is an open research area, raising questions about whether models should have fixed personalities or allow user‑level customization. The team plans to continue refining trait selection and assessing the responsibilities tied to embedding specific dispositions. This ongoing work aims to keep AI behavior aligned as capabilities grow. [1]

Links