CorpGen Architecture Boosts Multi‑Horizon Agent Completion to 15 %
Multi‑Horizon Tasks Require Dozens of Interleaved Long‑Horizon Goals
The paper defines Multi‑Horizon Task Environments (MHTEs) as problem instances demanding coherent execution of more than 45 tasks, each spanning 500–1500+ steps within persistent contexts that run for hours, mirroring real‑world organizational work [1].
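To make the definition concrete, here is a minimal Python sketch of an MHTE as a data structure. The class and field names are illustrative assumptions rather than interfaces from the paper; only the numeric bounds (more than 45 tasks, 500–1500+ steps each) come from the summary above.

```python
from dataclasses import dataclass, field


@dataclass
class LongHorizonTask:
    # One interleaved task inside an MHTE; all names here are hypothetical.
    task_id: str
    min_steps: int = 500       # tasks span at least ~500 steps ...
    max_steps: int = 1500      # ... and may run to 1500+ steps
    depends_on: list = field(default_factory=list)   # ids of prerequisite tasks


@dataclass
class MultiHorizonTaskEnvironment:
    # A problem instance demanding coherent execution of 45+ tasks inside one
    # persistent context that runs for hours.
    tasks: list
    persistent_context: dict = field(default_factory=dict)

    def __post_init__(self):
        if len(self.tasks) <= 45:
            raise ValueError("An MHTE requires more than 45 interleaved tasks")
```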
Baseline Agents Halve Completion Rates Under Full Load
When task load rises from 25 % to 100 % of capacity, baseline computer‑use agents (CUAs) see completion drop from 16.7 % to 8.7 %, caused by context saturation, memory interference, dependency complexity, and reprioritization overhead; this pattern repeats across three independent implementations [1].
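The degradation pattern can be pictured with a toy load-sweep harness. The `completion_rate` helper and the stub runner below are assumptions for illustration only; `run_task` stands in for one end-to-end task execution, and nothing here reproduces the paper's evaluation code.

```python
from typing import Callable, Sequence


def completion_rate(tasks: Sequence, run_task: Callable[[object], bool],
                    load_fraction: float) -> float:
    """Fraction of assigned tasks completed when `load_fraction` of capacity
    is assigned (0.25 = quarter load, 1.0 = full load)."""
    n_assigned = max(1, int(len(tasks) * load_fraction))
    assigned = tasks[:n_assigned]
    return sum(1 for t in assigned if run_task(t)) / n_assigned


def stub_runner(task) -> bool:
    # Completes only the first eight tasks, mimicking (not reproducing) the
    # saturation effect where absolute throughput stops scaling with load.
    return task < 8


tasks = list(range(48))
for load in (0.25, 0.5, 0.75, 1.0):
    print(f"load={load:.0%}  completion={completion_rate(tasks, stub_runner, load):.1%}")
```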
CorpGen Introduces Hierarchical Planning and Tiered Memory
CorpGen adds architecture‑agnostic mechanisms: hierarchical planning for goal alignment, sub‑agent isolation to prevent cross‑task contamination, and a tiered memory system (working, structured, semantic) with adaptive summarization, all designed to mitigate the identified failure modes [1].
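A rough sketch of the tiered-memory idea follows. The three tiers (working, structured, semantic) are named in the summary above; everything else, including the capacity-based summarization trigger and the `summarize` callable, is an assumption made for illustration, not CorpGen's actual interface.

```python
from collections import deque


class TieredMemory:
    """Illustrative three-tier store: working (recent raw events), structured
    (per-task state), and semantic (long-lived compressed knowledge)."""

    def __init__(self, working_capacity: int = 50):
        self.working = deque(maxlen=working_capacity)   # recent raw observations/actions
        self.structured = {}                            # per-task state keyed by task_id
        self.semantic = []                              # summaries that persist across tasks

    def record(self, task_id: str, event: str, summarize=None) -> None:
        self.working.append((task_id, event))
        self.structured.setdefault(task_id, []).append(event)
        # Adaptive summarization: when working memory fills, compress it into a
        # semantic entry instead of silently evicting old events.
        if summarize is not None and len(self.working) == self.working.maxlen:
            self.semantic.append(summarize(list(self.working)))
            self.working.clear()
```

Under this reading, sub-agent isolation could simply mean giving each sub-agent its own `TieredMemory` instance so that one task's context cannot leak into another's.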
Empirical Results Show Up to 3.5× Improvement
Tests across three CUA backends—UFO2, OpenAI CUA, and a hierarchical model—in the OSWorld Office environment demonstrate CorpGen achieving 15.2 % task completion versus 4.3 % for baselines, maintaining stable performance as load increases [1].
Ablation Study Highlights Experiential Learning as Key Driver
Removing the experiential learning component sharply reduces CorpGen’s advantage, indicating it contributes the majority of observed performance gains [1].
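The ablation logic can be pictured as a sweep that re-evaluates the system with each component disabled. Component names are taken from the summaries above; `evaluate` is a hypothetical stand-in for the OSWorld Office evaluation loop, and the stub numbers only echo the reported direction of the effect.

```python
COMPONENTS = ("hierarchical_planning", "subagent_isolation",
              "tiered_memory", "experiential_learning")


def ablation_sweep(evaluate, components=COMPONENTS):
    """Completion rate for the full system and for each single-component removal."""
    results = {"full": evaluate(set(components))}
    for comp in components:
        results[f"minus_{comp}"] = evaluate(set(components) - {comp})
    return results


def stub_evaluate(enabled) -> float:
    # Mirrors (does not reproduce) the finding that dropping experiential
    # learning erases most of the gain over the ~4.3% baseline.
    return 0.152 if "experiential_learning" in enabled else 0.06


print(ablation_sweep(stub_evaluate))
```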
Timeline
Dec 19, 2025 – Anthropic launches Bloom, an open‑source framework that automatically generates behavioral evaluation suites from a researcher‑specified behavior, speeding up assessments and avoiding the obsolescence that affects traditional static tests [1].
Dec 19, 2025 – Bloom validates its metrics by distinguishing baseline models from “model organisms” in nine of ten cases and achieving a Spearman correlation of 0.86 with human judgments for Claude Opus 4.1 (versus 0.75 for Claude Sonnet 4.5) [1].
Dec 19, 2025 – The system reproduces and extends prior self‑preferential bias findings, showing that higher reasoning effort reduces bias in Sonnet 4 and that rankings stay stable across configuration changes [1].
Feb 2, 2026 – The AgentRx project releases a benchmark of 115 failed AI‑agent trajectories spanning structured API workflows, incident‑management processes, and open‑ended web/file tasks, each manually annotated with the exact failure step [3].
Feb 2, 2026 – AgentRx introduces a grounded‑theory‑derived taxonomy that categorizes failures across domains, enabling consistent labeling and cross‑domain analysis [3].
Feb 2, 2026 – Using constraint‑violation logs and an LLM‑based judge, AgentRx automatically pinpoints critical failure steps, outperforming prior baselines in both step localization and failure‑type attribution (a sketch of this judge‑based localization follows the timeline) [3].
Feb 15, 2026 – CorpGen defines Multi‑Horizon Task Environments (MHTEs), problem instances that require coherent execution of 45+ interleaved long‑horizon tasks, each lasting 500–1500+ steps, to mirror real‑world organizational work [2].
Feb 15, 2026 – Researchers identify four failure modes (context saturation, memory interference, dependency complexity, reprioritization overhead) that cut baseline autonomous‑agent completion rates from 16.7 % to 8.7 % as task load rises from 25 % to 100 % [2].
Feb 15, 2026 – CorpGen introduces architecture‑agnostic mechanisms—hierarchical planning, sub‑agent isolation, tiered memory, and adaptive summarization—to counter those failure modes and improve task coherence [2].
Feb 15, 2026 – Empirical tests on the OSWorld Office environment show CorpGen delivering up to 3.5× performance gains, achieving 15.2 % task completion versus 4.3 % for baselines, with stable results under increasing load [2].
Feb 15, 2026 – Ablation studies reveal that experiential learning is the primary driver of CorpGen’s advantage; removing it sharply reduces performance gains [2].
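Below is a minimal sketch of the judge-based failure-step localization referenced in the Feb 2 AgentRx entries. The trajectory format, the `llm_judge` callable, and the two-stage rule are assumptions for illustration; AgentRx's real pipeline and prompts are not reproduced here.

```python
from typing import Callable


def locate_failure_step(trajectory: list,
                        llm_judge: Callable[[list, int], float]) -> int:
    """Return the index of the most likely critical failure step.

    Each step is assumed to be a dict such as {"action": ..., "observation": ...,
    "constraint_violations": [...]}.
    """
    # 1) Cheap signal: the earliest step with a logged constraint violation.
    for i, step in enumerate(trajectory):
        if step.get("constraint_violations"):
            return i
    # 2) Fallback: score every step with the LLM judge and take the argmax.
    scores = [llm_judge(trajectory, i) for i in range(len(trajectory))]
    return max(range(len(trajectory)), key=scores.__getitem__)


# Toy usage with a stub judge that favors later steps (illustration only).
toy_trajectory = [{"action": "open_file"},
                  {"action": "edit_cell", "constraint_violations": ["schema mismatch"]}]
print(locate_failure_step(toy_trajectory, lambda traj, i: float(i)))
```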
External resources (5 links)
- https://alignment.anthropic.com/2025/bloom-auto-evals/ (cited 1 time)
- https://claude.ai/redirect/website.v1.c4e491c3-b6b3-4cde-87d0-eb399505f2dd/public/artifacts/cbfddf51-ab0d-45a9-913b-163ae2dd4126 (cited 1 time)
- https://github.com/safety-research/bloom (cited 1 time)
- https://github.com/safety-research/bloom/ (cited 1 time)
- https://inspect.aisi.org.uk (cited 1 time)