Anthropic’s Constitutional Classifiers Show Mixed Success Against Universal Jailbreaks

  • Figure (Anthropic): Results from automated evaluations. For all plots, lower is better. (a) The success rate of jailbreaks is far lower in a system protected by Constitutional Classifiers; (b) the refusal rate of the system on production Claude.ai Free and Pro traffic is not statistically significantly higher when using Constitutional Classifiers; and (c) the relative compute cost of a system that uses Constitutional Classifiers is only moderately higher. Error bars represent 95% confidence intervals computed using binomial proportion standard errors under asymptotic normality assumptions.
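The error-bar convention the caption describes corresponds to the standard Wald interval for a binomial proportion. A minimal sketch in Python, with illustrative counts (the paper reports rates, not the raw counts used here):

```python
import math

def wald_ci(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """95% CI for a binomial proportion under asymptotic normality (Wald interval)."""
    p = successes / trials
    se = math.sqrt(p * (1 - p) / trials)  # binomial proportion standard error
    return p - z * se, p + z * se

# Illustrative counts only: a 4.4% attack success rate over a 10,000-prompt
# evaluation; 440/10,000 is an assumed realization of that rate.
low, high = wald_ci(440, 10_000)
print(f"4.40% CI: [{low:.2%}, {high:.2%}]")  # ~[4.00%, 4.80%]
```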

New paper proposes Constitutional Classifiers to block jailbreaks, but the prototype incurs high over‑refusal and compute costs[2] The Safeguards Research Team released a method that trains input and output classifiers on synthetic data generated from a constitution of permitted and forbidden content, aiming to filter most jailbreak attempts while keeping over‑refusal low; the initial prototype, however, refused many harmless queries and required substantially more compute to run.
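The system is described as a pair of classifiers wrapping the model: one screening prompts, the other screening completions. A minimal sketch of that wrapper shape, with hypothetical classifier/model interfaces and a hypothetical decision threshold (the deployed output classifier scores the response as it streams, which this simplified version does not model):

```python
from typing import Callable

def guarded_generate(
    prompt: str,
    model: Callable[[str], str],
    input_classifier: Callable[[str], float],   # returns harm score in [0, 1]
    output_classifier: Callable[[str], float],  # returns harm score in [0, 1]
    threshold: float = 0.5,                     # hypothetical decision threshold
) -> str:
    # Input classifier screens the prompt before any tokens are generated.
    if input_classifier(prompt) >= threshold:
        return "I can't help with that."
    completion = model(prompt)
    # Output classifier screens the completion; the real system evaluates the
    # output token by token so generation can be halted mid-response.
    if output_classifier(completion) >= threshold:
        return "I can't help with that."
    return completion
```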

Human red‑team bug‑bounty with 183 participants (>3,000 hours) found no universal jailbreak[1] Independent jailbreakers were given ten forbidden queries and asked to force Claude 3.5 Sonnet (June 2024) to answer all of them; despite extensive effort, none succeeded in producing a single jailbreak that answered every query.

Synthetic evaluation of 10,000 prompts reduced jailbreak success from 86% to 4.4% with minimal over‑refusal increase[1] When Claude 3.5 Sonnet (Oct 2024) was guarded by the classifiers, only 4.4% of attacks succeeded, versus 86% against the unguarded model; refusals of harmless queries rose by just 0.38 percentage points (not statistically significant) and inference compute cost rose by 23.7%.
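As a back-of-envelope check on those headline figures, using only the rates reported above:

```python
baseline, guarded = 0.86, 0.044  # reported attack success rates
print(f"relative reduction: {1 - guarded / baseline:.1%}")  # ~94.9%

# Reported trade-offs alongside that drop: +0.38 percentage points of
# harmless-query refusals and a 23.7% increase in inference compute.
```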

Live demo (Feb 3–10, 2025) attracted 339 jailbreakers; four cleared all eight levels and one achieved a universal jailbreak[1] The public challenge drew roughly 3,700 hours of red‑team effort; in the final two days four participants passed every level, and one of them produced a jailbreak that met the contest’s universal criteria.

$55K prize pool awarded to four winners, including the team that delivered the universal jailbreak[1] Altynbek Ismailov and Salia Asanova received the $20K universal‑jailbreak prize; Valen Tagliabue, Hunter Senft‑Grupp (a borderline universal case), and Andres Aldana shared the remaining $35K.

Anthropic aims to lower over‑refusal and compute costs while integrating complementary defenses under its Responsible Scaling Policy[1] Future work will adapt the constitution to new attack patterns, combine the classifiers with other safeguards, and deploy the system when model capabilities cross the risk thresholds set in the policy.
