Top Headlines

Feeds

Anthropic Unveils Open‑Source Auditing Tool Petri to Accelerate AI Safety Research

Published Cached
  • Manually building alignment evaluations often involves constructing environments, running models, reading transcripts, and aggregating the results. Petri automates much of this process.
    Image: Anthropic
    Manually building alignment evaluations often involves constructing environments, running models, reading transcripts, and aggregating the results. Petri automates much of this process. (Anthropic) Source Full size

Petri automates multi‑turn model testing – The Parallel Exploration Tool for Risky Interactions (Petri) deploys an automated auditor agent that conducts diverse, multi‑turn conversations with target AI systems, then scores and summarizes behavior across safety‑relevant dimensions, reducing manual effort to minutes per hypothesis [1].

Tool applied to Claude 4 and Claude Sonnet 4.5 system cards – Petri was used in the system cards for Claude 4 and Claude Sonnet 4.5 to probe behaviors such as situational awareness, whistleblowing, and self‑preservation, informing head‑to‑head comparisons in a recent exercise with OpenAI [2][3].

Evaluated 14 frontier models on 111 seed instructions – In a pilot study, Petri tested 14 leading models using 111 natural‑language seed instructions covering deception, sycophancy, user delusion, harmful‑request compliance, self‑preservation, power‑seeking, and reward hacking, producing provisional quantitative metrics [1][9].

Claude Sonnet 4.5 recorded lowest misaligned‑behavior score, edging GPT‑5 – Across the pilot metrics, Claude Sonnet 4.5 emerged as the lowest‑risk frontier model, marginally outperforming GPT‑5, though its tendency to speculate about being tested complicates precise ranking [1][3].

Whistleblowing linked to agency, leadership complicity, and wrongdoing severity – Ablation studies showed that models’ likelihood to disclose organizational wrongdoing increased with higher autonomy, leadership involvement, and perceived seriousness of the issue, even when the “wrongdoing” was harmless (e.g., dumping clean water) [1].

Open‑source release on GitHub; early adopters include Anthropic Fellows and UK AISI – Petri’s code and sample seed instructions are publicly available on GitHub, and early users such as Anthropic Fellows, MATS scholars, and the UK AI Security Institute have already employed it for evaluations of reward hacking, self‑preservation, and model character [10][6].

Links