CorpGen Architecture Boosts Multi‑Horizon Agent Completion to 15 %
Multi‑Horizon Tasks Require Dozens of Interleaved Long‑Horizon Goals
The paper defines Multi‑Horizon Task Environments (MHTEs) as problem instances demanding coherent execution of more than 45 tasks, each spanning 500–1500+ steps within persistent contexts that run for hours, mirroring real‑world organizational work [1].
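To make the definition concrete, here is a minimal Python sketch of an MHTE as a data structure. The class and field names are illustrative assumptions rather than interfaces from the paper; only the numeric bounds (more than 45 tasks, 500–1500+ steps each) come from the summary above.

```python
from dataclasses import dataclass, field


@dataclass
class LongHorizonTask:
    # One interleaved task inside an MHTE; all names here are hypothetical.
    task_id: str
    min_steps: int = 500       # tasks span at least ~500 steps ...
    max_steps: int = 1500      # ... and may run to 1500+ steps
    depends_on: list = field(default_factory=list)   # ids of prerequisite tasks


@dataclass
class MultiHorizonTaskEnvironment:
    # A problem instance demanding coherent execution of 45+ tasks inside one
    # persistent context that runs for hours.
    tasks: list
    persistent_context: dict = field(default_factory=dict)

    def __post_init__(self):
        if len(self.tasks) <= 45:
            raise ValueError("An MHTE requires more than 45 interleaved tasks")
```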
Baseline Agents Halve Completion Rates Under Full Load
When task load rises from 25 % to 100 % of capacity, baseline computer‑use agents (CUAs) see completion drop from 16.7 % to 8.7 %, caused by context saturation, memory interference, dependency complexity, and reprioritization overhead; this pattern repeats across three independent implementations [1].
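The degradation pattern can be pictured with a toy load-sweep harness. The `completion_rate` helper and the stub runner below are assumptions for illustration only; `run_task` stands in for one end-to-end task execution, and nothing here reproduces the paper's evaluation code.

```python
from typing import Callable, Sequence


def completion_rate(tasks: Sequence, run_task: Callable[[object], bool],
                    load_fraction: float) -> float:
    """Fraction of assigned tasks completed when `load_fraction` of capacity
    is assigned (0.25 = quarter load, 1.0 = full load)."""
    n_assigned = max(1, int(len(tasks) * load_fraction))
    assigned = tasks[:n_assigned]
    return sum(1 for t in assigned if run_task(t)) / n_assigned


def stub_runner(task) -> bool:
    # Completes only the first eight tasks, mimicking (not reproducing) the
    # saturation effect where absolute throughput stops scaling with load.
    return task < 8


tasks = list(range(48))
for load in (0.25, 0.5, 0.75, 1.0):
    print(f"load={load:.0%}  completion={completion_rate(tasks, stub_runner, load):.1%}")
```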
CorpGen Introduces Hierarchical Planning and Tiered Memory
CorpGen adds architecture‑agnostic mechanisms: hierarchical planning for goal alignment, sub‑agent isolation to prevent cross‑task contamination, and a tiered memory system (working, structured, semantic) with adaptive summarization, all designed to mitigate the identified failure modes [1].
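A rough sketch of the tiered-memory idea follows. The three tiers (working, structured, semantic) are named in the summary above; everything else, including the capacity-based summarization trigger and the `summarize` callable, is an assumption made for illustration, not CorpGen's actual interface.

```python
from collections import deque


class TieredMemory:
    """Illustrative three-tier store: working (recent raw events), structured
    (per-task state), and semantic (long-lived compressed knowledge)."""

    def __init__(self, working_capacity: int = 50):
        self.working = deque(maxlen=working_capacity)   # recent raw observations/actions
        self.structured = {}                            # per-task state keyed by task_id
        self.semantic = []                              # summaries that persist across tasks

    def record(self, task_id: str, event: str, summarize=None) -> None:
        self.working.append((task_id, event))
        self.structured.setdefault(task_id, []).append(event)
        # Adaptive summarization: when working memory fills, compress it into a
        # semantic entry instead of silently evicting old events.
        if summarize is not None and len(self.working) == self.working.maxlen:
            self.semantic.append(summarize(list(self.working)))
            self.working.clear()
```

Under this reading, sub-agent isolation could simply mean giving each sub-agent its own `TieredMemory` instance so that one task's context cannot leak into another's.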
Empirical Results Show Up to 3.5× Improvement
Tests across three CUA backends—UFO2, OpenAI CUA, and a hierarchical model—in the OSWorld Office environment demonstrate CorpGen achieving 15.2 % task completion versus 4.3 % for baselines, maintaining stable performance as load increases [1].
Ablation Study Highlights Experiential Learning as Key Driver
Removing the experiential learning component sharply reduces CorpGen’s advantage, indicating it contributes the majority of observed performance gains [1].
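The ablation logic can be pictured as a sweep that re-evaluates the system with each component disabled. Component names are taken from the summaries above; `evaluate` is a hypothetical stand-in for the OSWorld Office evaluation loop, and the stub numbers only echo the reported direction of the effect.

```python
COMPONENTS = ("hierarchical_planning", "subagent_isolation",
              "tiered_memory", "experiential_learning")


def ablation_sweep(evaluate, components=COMPONENTS):
    """Completion rate for the full system and for each single-component removal."""
    results = {"full": evaluate(set(components))}
    for comp in components:
        results[f"minus_{comp}"] = evaluate(set(components) - {comp})
    return results


def stub_evaluate(enabled) -> float:
    # Mirrors (does not reproduce) the finding that dropping experiential
    # learning erases most of the gain over the ~4.3% baseline.
    return 0.152 if "experiential_learning" in enabled else 0.06


print(ablation_sweep(stub_evaluate))
```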
Timeline
Dec 19, 2025 – Anthropic launches Bloom, an open‑source framework that automatically generates behavioral evaluation suites from a researcher‑specified behavior, speeding up assessments and avoiding the obsolescence that affects traditional static tests [1].
Dec 19, 2025 – Bloom validates its metrics by distinguishing baseline models from “model organisms” in nine of ten cases and achieving a Spearman correlation of 0.86 with human judgments for Claude Opus 4.1 (versus 0.75 for Claude Sonnet 4.5) [1].
Dec 19, 2025 – The system reproduces and extends prior self‑preferential bias findings, showing that higher reasoning effort reduces bias in Sonnet 4 and that rankings stay stable across configuration changes [1].
Feb 2, 2026 – The AgentRx project releases a benchmark of 115 failed AI‑agent trajectories spanning structured API workflows, incident‑management processes, and open‑ended web/file tasks, each manually annotated with the exact failure step [3].
Feb 2, 2026 – AgentRx introduces a grounded‑theory‑derived taxonomy that categorizes failures across domains, enabling consistent labeling and cross‑domain analysis [3].
Feb 2, 2026 – Using constraint‑violation logs and an LLM‑based judge, AgentRx automatically pinpoints critical failure steps, outperforming prior baselines in both step localization and failure‑type attribution (a sketch of this judge‑based localization follows the timeline) [3].
Feb 15, 2026 – CorpGen defines Multi‑Horizon Task Environments (MHTEs), problem instances that require coherent execution of 45+ interleaved long‑horizon tasks, each lasting 500–1500+ steps, to mirror real‑world organizational work [2].
Feb 15, 2026 – Researchers identify four failure modes (context saturation, memory interference, dependency complexity, reprioritization overhead) that cut baseline autonomous‑agent completion rates from 16.7 % to 8.7 % as task load rises from 25 % to 100 % [2].
Feb 15, 2026 – CorpGen introduces architecture‑agnostic mechanisms—hierarchical planning, sub‑agent isolation, tiered memory, and adaptive summarization—to counter those failure modes and improve task coherence [2].
Feb 15, 2026 – Empirical tests on the OSWorld Office environment show CorpGen delivering up to 3.5× performance gains, achieving 15.2 % task completion versus 4.3 % for baselines, with stable results under increasing load [2].
Feb 15, 2026 – Ablation studies reveal that experiential learning is the primary driver of CorpGen’s advantage; removing it sharply reduces performance gains [2].
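Below is a minimal sketch of the judge-based failure-step localization referenced in the Feb 2 AgentRx entries. The trajectory format, the `llm_judge` callable, and the two-stage rule are assumptions for illustration; AgentRx's real pipeline and prompts are not reproduced here.

```python
from typing import Callable


def locate_failure_step(trajectory: list,
                        llm_judge: Callable[[list, int], float]) -> int:
    """Return the index of the most likely critical failure step.

    Each step is assumed to be a dict such as {"action": ..., "observation": ...,
    "constraint_violations": [...]}.
    """
    # 1) Cheap signal: the earliest step with a logged constraint violation.
    for i, step in enumerate(trajectory):
        if step.get("constraint_violations"):
            return i
    # 2) Fallback: score every step with the LLM judge and take the argmax.
    scores = [llm_judge(trajectory, i) for i in range(len(trajectory))]
    return max(range(len(trajectory)), key=scores.__getitem__)


# Toy usage with a stub judge that favors later steps (illustration only).
toy_trajectory = [{"action": "open_file"},
                  {"action": "edit_cell", "constraint_violations": ["schema mismatch"]}]
print(locate_failure_step(toy_trajectory, lambda traj, i: float(i)))
```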
External resources (5 links)
- https://alignment.anthropic.com/2025/bloom-auto-evals/ (cited 1 time)
- https://claude.ai/redirect/website.v1.c4e491c3-b6b3-4cde-87d0-eb399505f2dd/public/artifacts/cbfddf51-ab0d-45a9-913b-163ae2dd4126 (cited 1 time)
- https://github.com/safety-research/bloom (cited 1 time)
- https://github.com/safety-research/bloom/ (cited 1 time)
- https://inspect.aisi.org.uk (cited 1 time)