New paper proposes Constitutional Classifiers to block jailbreaks, but prototype incurs high over‑refusal and compute costs [2] The Safeguards Research Team released a method that trains input and output classifiers on synthetic data to filter most jailbreak attempts while keeping over‑refusal low; the initial prototype, however, refused many harmless queries and required substantially more compute to run.
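To make the two‑stage design concrete, here is a minimal sketch of how an input classifier and an output classifier might wrap a model. All names and objects are illustrative assumptions, not code from the paper:

```python
# Hypothetical sketch of input/output classifier guarding (names are illustrative,
# not taken from the paper). An input classifier screens the prompt before any
# generation happens; an output classifier screens the completion before it is returned.

from dataclasses import dataclass
from typing import Callable

REFUSAL = "I can't help with that."

@dataclass
class GuardedModel:
    generate: Callable[[str], str]             # the underlying LLM call (assumed interface)
    input_classifier: Callable[[str], float]   # estimated probability the prompt is a jailbreak attempt
    output_classifier: Callable[[str], float]  # estimated probability the completion is disallowed
    threshold: float = 0.5                     # decision threshold (assumed)

    def respond(self, prompt: str) -> str:
        # Stage 1: block clearly harmful prompts before spending any generation compute.
        if self.input_classifier(prompt) >= self.threshold:
            return REFUSAL
        completion = self.generate(prompt)
        # Stage 2: screen the model's own output before returning it to the user.
        if self.output_classifier(completion) >= self.threshold:
            return REFUSAL
        return completion
```

In the system described by the paper, the output classifier reportedly scores the token stream as it is produced so that harmful completions can be halted mid‑generation; the sketch above simplifies this to a single post‑hoc check.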
Human red‑team bug‑bounty with 183 participants (>3,000 hours) found no universal jailbreak [1] Independent jailbreakers were given ten forbidden queries and asked to force Claude 3.5 Sonnet (June 2024) to answer all of them; despite extensive effort, none succeeded in producing a single jailbreak that answered every query.
Synthetic evaluation of 10,000 prompts reduced jailbreak success from 86% to 4.4% with minimal over‑refusal increase [1] When Claude 3.5 Sonnet (Oct 2024) was guarded by the classifiers, only 4.4% of the synthetic jailbreak attempts succeeded, compared with 86% against the unguarded model; refusals of harmless queries rose by 0.38 percentage points (not statistically significant) and inference compute cost rose by 23.7%.
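For readers tracking the arithmetic, this is how such metrics are typically computed from labeled evaluation results; the data structures below are assumptions for illustration, not the paper's evaluation code:

```python
# Illustrative metric computation (not the paper's evaluation code).
# attack_results: per-prompt booleans, True if the jailbreak attempt elicited disallowed content.
# refusals_*: per-query booleans over a sample of harmless traffic, True if the query was refused.

def attack_success_rate(attack_results: list[bool]) -> float:
    """Fraction of jailbreak prompts that succeeded."""
    return sum(attack_results) / len(attack_results)

def over_refusal_increase(refusals_guarded: list[bool],
                          refusals_unguarded: list[bool]) -> float:
    """Absolute increase in refusal rate on harmless queries, in percentage points."""
    rate = lambda xs: sum(xs) / len(xs)
    return 100 * (rate(refusals_guarded) - rate(refusals_unguarded))

# With the reported figures, the success rate drops from 86% to 4.4% across the
# 10,000 synthetic attempts, and harmless-query refusals rise by roughly 0.38 percentage points.
```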
Live demo (Feb 3‑10 2025) attracted 339 jailbreakers, four cleared all eight levels, one achieved a universal jailbreak [1] The public challenge saw roughly 3,700 hours of red‑team effort; on the final two days four participants passed every level, and one of them produced a jailbreak that met the contest’s universal criteria.
$55 K prize pool awarded to four winners, including the team that delivered the universal jailbreak [1] Altynbek Ismailov and Salia Asanova received the $20 K universal‑jailbreak prize; Valen Tagliabue, Hunter Senft‑Grupp (borderline case), and Andres Aldana each earned portions of the remaining $35 K.
Anthropic aims to lower over‑refusal and compute costs while integrating complementary defenses under its Responsible Scaling Policy [1] Future work will adapt the constitution to newly discovered attacks, combine the classifiers with other safeguards, and pursue deployment only when risks reach the thresholds set out in the policy.