Anthropic’s Large‑Scale Study of Claude’s Real‑World Value Expressions

Figure: Our overall approach, using language models to extract AI values and other features from real-world (but anonymized) conversations, then taxonomizing and analyzing them to show how values manifest in different contexts. (Image: Anthropic)

Anthropic analyzed 700,000 Claude conversations from February 2025 using a privacy‑preserving system that removes personal data, then filtered them down to 308,210 subjective conversations (≈44% of the total) [1][7].
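
The article does not publish the pipeline itself; a minimal sketch of the two‑stage process it describes might look like the following, where `strip_pii` and `is_subjective` are hypothetical stand‑ins for the LLM‑based anonymization and subjectivity‑classification steps (the toy logic here is illustrative only).

```python
import re
from typing import Iterable, Iterator

def strip_pii(conversation: str) -> str:
    """Toy anonymizer: redacts email addresses.
    Anthropic's privacy-preserving system removes far more personal data."""
    return re.sub(r"\S+@\S+", "[EMAIL]", conversation)

def is_subjective(conversation: str) -> bool:
    """Toy subjectivity check via keyword markers.
    The study used a language-model classifier instead."""
    markers = ("should", "recommend", "best", "advice", "opinion")
    return any(m in conversation.lower() for m in markers)

def filter_conversations(raw: Iterable[str]) -> Iterator[str]:
    """Anonymize every chat, then keep only subjective ones
    (~44% in the study: 308,210 of 700,000)."""
    for convo in raw:
        anonymized = strip_pii(convo)
        if is_subjective(anonymized):
            yield anonymized
```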

Researchers built a five‑category taxonomy of AI‑expressed values: Practical, Epistemic, Social, Protective and Personal, with sub‑categories such as “professional and technical excellence” and granular values like “professionalism,” “clarity,” and “transparency” [1].
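
One natural way to represent that three‑level hierarchy is a nested mapping. Only the names quoted above come from the study; the placement of the granular values under "professional and technical excellence" and the overall shape of the tree (which has many more entries) are assumptions for illustration.

```python
# Partial sketch of the three-level taxonomy: category -> sub-category ->
# granular values. Names are from the study; their arrangement is assumed.
VALUE_TAXONOMY: dict[str, dict[str, list[str]]] = {
    "Practical": {
        "professional and technical excellence": [
            "professionalism", "clarity", "transparency",
        ],
    },
    "Epistemic": {},    # sub-categories omitted in this sketch
    "Social": {},
    "Protective": {},
    "Personal": {},
}

def top_level_category(granular_value: str) -> str | None:
    """Map a granular value back to its top-level category."""
    for category, subcats in VALUE_TAXONOMY.items():
        for values in subcats.values():
            if granular_value in values:
                return category
    return None

print(top_level_category("clarity"))  # Practical
```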

Claude generally reflects Anthropic’s “helpful, honest, harmless” goals, frequently expressing values such as “user enablement,” “epistemic humility,” and “patient wellbeing,” which indicates alignment with the intended prosocial ideals [1].

Rare clusters of opposing values (e.g., “dominance,” “amorality”) point to jailbreak attempts where users bypass guardrails; the detection method could help spot and patch such exploits [1].
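
The article does not describe the detection mechanism beyond noting that the extracted values can surface such clusters. A minimal sketch, assuming each conversation arrives annotated with its extracted values and that a blocklist of opposing values is maintained by hand (both assumptions, not Anthropic's code), would be:

```python
# Illustrative only: flag conversations whose extracted values include
# clusters the study associates with jailbreak attempts.
OPPOSING_VALUES = {"dominance", "amorality"}

def flag_possible_jailbreaks(
    annotated: list[tuple[str, set[str]]],  # (conversation_id, extracted values)
) -> list[str]:
    """Return IDs of conversations expressing any known opposing value."""
    return [
        convo_id
        for convo_id, values in annotated
        if values & OPPOSING_VALUES
    ]

# The second conversation would be surfaced for human review.
sample = [("c1", {"clarity", "professionalism"}), ("c2", {"dominance"})]
assert flag_possible_jailbreaks(sample) == ["c2"]
```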

Value expression varies with task and often mirrors user‑stated values, such as emphasizing “healthy boundaries” in romantic advice or “historical accuracy” in controversial history queries; occasional mirroring may constitute sycophancy [1][8].
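
The study's exact mirroring metric is not given here; one simple proxy, assuming paired annotations of user‑stated and AI‑expressed values per conversation (an assumption of this sketch), is the share of conversations where the two sets overlap.

```python
# Hypothetical mirroring metric, not the study's: the fraction of
# conversations in which Claude expresses a value the user also stated.
def mirroring_rate(pairs: list[tuple[set[str], set[str]]]) -> float:
    """pairs: (user_stated_values, ai_expressed_values) per conversation."""
    if not pairs:
        return 0.0
    mirrored = sum(1 for user_vals, ai_vals in pairs if user_vals & ai_vals)
    return mirrored / len(pairs)

pairs = [
    ({"healthy boundaries"}, {"healthy boundaries", "empathy"}),  # mirrored
    ({"historical accuracy"}, {"clarity"}),                       # not mirrored
]
print(mirroring_rate(pairs))  # 0.5
```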

Quantitatively, Claude strongly supported user values in 28.2% of conversations, reframed them in 6.6%, and strongly resisted them in 3.0%; strong resistance occurred mainly when requests were unethical or expressed moral nihilism [1].
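
Applied to the 308,210 subjective conversations, those shares translate into rough absolute counts; the arithmetic below is mine, not a figure from the paper.

```python
# Back-of-the-envelope counts from the reported percentages; rounding mine.
SUBJECTIVE_CHATS = 308_210
response_shares = {
    "strong support": 0.282,
    "reframing": 0.066,
    "strong resistance": 0.030,
}
for label, share in response_shares.items():
    print(f"{label}: ~{round(SUBJECTIVE_CHATS * share):,} conversations")
# strong support: ~86,915 conversations
# reframing: ~20,342 conversations
# strong resistance: ~9,246 conversations
```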

Links