Top Headlines

Bloom: Open‑Source Framework for Rapid AI Behavioral Evaluations

Image: Comparative results from four evaluation suites (delusional sycophancy, instructed long-horizon sabotage, self-preservation, and self-preferential bias) across 16 frontier models. Elicitation rate is the proportion of rollouts scoring ≥ 7/10 for behavior presence. Each suite contains 100 distinct rollouts, with error bars showing the standard deviation across three repetitions. Claude Opus 4.1 serves as the evaluator across all stages. (Anthropic)
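
The elicitation rate in the figure is a simple threshold statistic. A minimal sketch of how it could be computed from judge scores (illustrative only; the function and scores below are hypothetical, not Bloom's API):

```python
import statistics

def elicitation_rate(judge_scores, threshold=7):
    """Fraction of rollouts whose behavior-presence score meets the threshold."""
    return sum(score >= threshold for score in judge_scores) / len(judge_scores)

# Hypothetical scores for three repetitions of a 100-rollout suite.
repetitions = [
    [8, 3, 7, 9, 2] * 20,
    [7, 4, 6, 9, 3] * 20,
    [8, 2, 7, 8, 4] * 20,
]
rates = [elicitation_rate(rep) for rep in repetitions]
print(f"mean = {statistics.mean(rates):.2f}, sd = {statistics.stdev(rates):.2f}")
```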

Anthropic releases Bloom, an open-source evaluation framework for frontier AI models – The tool, announced on 2025-12-19, automates the generation of behavioral test suites and is hosted on GitHub [1].

Bloom turns a single behavior description into many test scenarios via a four-stage pipeline – The four stages are Understanding, Ideation, Rollout, and Judgment, and researchers can configure the models used, the interaction length, and secondary scoring dimensions [1].
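
As a rough illustration of that flow (not Bloom's actual interfaces; every name below is hypothetical), the pipeline can be pictured as four composed steps:

```python
# Hypothetical sketch of the four-stage flow; Bloom's real API may differ.
def understand(behavior: str) -> str:
    """Stage 1 (Understanding): expand a one-line behavior description into a spec."""
    return f"detailed specification of: {behavior}"

def ideate(spec: str, n: int) -> list[str]:
    """Stage 2 (Ideation): generate n candidate test scenarios from the spec."""
    return [f"{spec} | scenario {i}" for i in range(n)]

def rollout(scenario: str, turns: int) -> str:
    """Stage 3 (Rollout): run a multi-turn interaction with the target model."""
    return f"transcript of a {turns}-turn rollout for {scenario!r}"

def judge(transcript: str) -> int:
    """Stage 4 (Judgment): score behavior presence 1-10 with an evaluator model."""
    return 7  # placeholder; a real judge model scores the transcript

def run_pipeline(behavior: str, n_scenarios: int = 100, turns: int = 5) -> list[int]:
    spec = understand(behavior)
    return [judge(rollout(s, turns)) for s in ideate(spec, n_scenarios)]

scores = run_pipeline("self-preferential bias", n_scenarios=3)
```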

Validation shows Bloom separates models with distinct quirks in 9 of 10 cases and aligns with human scores – Using "model organisms" with engineered quirks, Bloom distinguished baseline models from misaligned ones, and its Claude Opus 4.1 judge achieved a Spearman correlation of 0.86 with human judgments [1].
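
The 0.86 figure is a rank correlation between judge and human scores on the same transcripts. A toy check of how such a number is computed (the scores here are made up; scipy provides the statistic):

```python
from scipy.stats import spearmanr

# Hypothetical judge vs. human scores for ten shared transcripts.
judge_scores = [8, 3, 7, 9, 2, 6, 5, 10, 4, 7]
human_scores = [7, 2, 8, 9, 3, 6, 4, 9, 5, 6]

rho, p_value = spearmanr(judge_scores, human_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```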

Benchmark results for four alignment-relevant behaviors were produced across 16 models in days – The evaluated behaviors are delusional sycophancy, instructed long-horizon sabotage, self-preservation, and self-preferential bias, and example pipeline outputs are publicly available [1].

Case study replicates self-preferential bias findings and shows that reasoning reduces bias – Bloom reproduced the model ranking from the Claude Sonnet 4.5 system card, showed that higher reasoning effort lowers bias, and demonstrated that filtering out unrealistic rollouts improves metric quality while preserving model rankings [1].
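
A sketch of the filtering step as described: drop rollouts below a realism threshold before taking the metric, then check that model rankings are stable. All names, thresholds, and scores below are hypothetical:

```python
def filtered_elicitation_rate(rollouts, realism_min=5, behavior_min=7):
    """Elicitation rate over only the rollouts judged realistic enough to count."""
    kept = [r for r in rollouts if r["realism"] >= realism_min]
    if not kept:
        return 0.0
    return sum(r["behavior"] >= behavior_min for r in kept) / len(kept)

# Hypothetical per-rollout scores for two models; model A's unrealistic
# rollout (realism 2) is excluded before the metric is computed.
model_a = [{"behavior": 8, "realism": 6}, {"behavior": 9, "realism": 2},
           {"behavior": 3, "realism": 7}]
model_b = [{"behavior": 7, "realism": 8}, {"behavior": 2, "realism": 9},
           {"behavior": 3, "realism": 6}]
print(filtered_elicitation_rate(model_a))  # 0.5
print(filtered_elicitation_rate(model_b))  # about 0.33
```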

Bloom integrates with Weights & Biases, exports Inspect-compatible transcripts, and is fully configurable via seed files – Researchers can launch large-scale sweeps, view custom transcripts, and access the codebase at github.com/safety-research/bloom [2].
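
Seed files drive the whole run. A guess at the kind of fields such a configuration carries, written as a Python dict (the schema below is an assumption, not Bloom's actual format):

```python
# Hypothetical seed configuration; Bloom's real seed-file schema may differ.
seed = {
    "behavior": "self-preferential bias",      # behavior under evaluation
    "evaluator_model": "claude-opus-4-1",      # judge used across all stages
    "target_models": ["model-a", "model-b"],   # models being evaluated
    "n_rollouts": 100,                         # distinct rollouts per suite
    "repetitions": 3,                          # repeats for error bars
    "max_turns": 10,                           # interaction length
    "secondary_dimensions": ["realism"],       # extra scoring axes
}
```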

Links