Anthropic Unveils Constitutional Classifiers++: Faster, Safer Guardrails

Published 2026-01-09T00:00:00-0800 Cached 2026-02-02T19:40:38+0000

Image: Anthropic

Anthropic Source Full size

Next‑gen classifiers add only ~1% compute overhead – The Constitutional Classifiers++ system runs at roughly one percent extra cost compared with Claude Opus 4.0, a dramatic reduction from the 23.7 % overhead of the first generation. It achieves this efficiency through a two‑stage cascade that screens most traffic with a cheap probe before escalating suspicious exchanges. [1]

No universal jailbreak discovered after extensive testing – Over 1,700 red‑teaming hours covering 198 000 attempts yielded a single high‑risk vulnerability, translating to a detection rate of 0.005 per thousand queries—the lowest of any evaluated method. Crucially, red‑teamers have not found a universal jailbreak that works across many prompts. [1]

Harmless‑query refusal rate falls to 0.05 % – Deploying the ensemble on Claude Sonnet 4.5 traffic cut refusals on benign requests by 87 % relative to the original classifiers, which had raised refusal rates by 0.38 % and caused user frustration. The new rate of 0.05 % demonstrates improved safety without sacrificing usability. [1]

Ensemble combines probe and exchange classifiers – A linear probe monitors all exchanges; when it flags content, a more powerful “exchange” classifier evaluates both input and output together. Human red‑team tests showed the exchange classifier halved successful jailbreak attempts, though it initially increased compute by 50 % before the cascade architecture mitigated the cost. [1]

Remaining vulnerabilities: reconstruction and output obfuscation – Adversarial testing revealed attacks that split harmful information into benign‑looking fragments (reconstruction) and those that disguise dangerous outputs (e.g., calling reagents “food flavorings”). Such techniques can also depress model performance, dropping GPQA Diamond scores from 74 % to as low as 32 % under certain jailbreaks. [1][4]

First‑generation classifiers cut success from 86 % to 4.4 % – The original Constitutional Classifiers, trained on synthetic data derived from a “constitution,” blocked 95 % of jailbreak attempts, reducing the success rate from 86 % to 4.4 %. However, they raised compute costs by 23.7 % and slightly increased refusal rates on harmless queries. [1]

Top Headlines

Feeds

Anthropic Unveils Constitutional Classifiers++: Faster, Safer Guardrails

Published 2026-01-09T00:00:00-0800 Cached 2026-02-02T19:40:38+0000

Links