Microsoft Research Unveils CorpGen Framework Boosting Multi‑Horizon Agent Performance

New Multi‑Horizon Task Environments Challenge Existing Agents

The study defines Multi‑Horizon Task Environments (MHTEs) as problem instances that require coherent execution of more than 45 interleaved tasks, each spanning 500–1500+ steps, within persistent contexts that last for hours, mirroring real‑world organizational workflows. The paper emphasizes that such environments expose the limitations of current autonomous agents under sustained, complex workloads [1].

Baseline Completion Rates Collapse Under Full Load

Baseline computer‑using agents (CUAs) achieve a 16.7 % task‑completion rate at 25 % load, but only 8.7 % at 100 % load. Four failure modes (context saturation, memory interference, dependency complexity, and reprioritization overhead) cut performance roughly in half. The pattern repeats across three independent CUA implementations, which points to a systemic scalability problem rather than a flaw in any single implementation [1].

CorpGen Introduces Hierarchical Planning and Tiered Memory

CorpGen adds four architecture‑agnostic mechanisms: hierarchical planning for goal alignment, sub‑agent isolation to prevent cross‑task contamination, tiered memory (working, structured, semantic) to manage information flow, and adaptive summarization to condense past experience and reduce memory load. These components directly target the identified failure modes and aim to sustain performance as task density rises [1].
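As a rough illustration of how tiered memory, adaptive summarization, and sub‑agent isolation might fit together, here is a minimal Python sketch. It is not CorpGen's implementation: the class `TieredMemory`, the function `run_subtask`, the `working_limit` parameter, and the truncate-and-join summarizer are all hypothetical names and simplifications chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass
class TieredMemory:
    """Hypothetical three-tier store: working, structured, semantic."""

    working_limit: int = 4                           # assumed cap on raw entries
    working: list = field(default_factory=list)      # recent raw observations
    structured: dict = field(default_factory=dict)   # keyed task facts
    semantic: list = field(default_factory=list)     # condensed summaries

    def observe(self, note: str) -> None:
        """Add a raw observation; condense the oldest ones on overflow."""
        self.working.append(note)
        if len(self.working) > self.working_limit:
            self._summarize()

    def _summarize(self) -> None:
        # Stand-in for adaptive summarization. A real system would likely
        # call a language model here; this sketch only truncates and joins.
        half = len(self.working) // 2
        old, self.working = self.working[:half], self.working[half:]
        self.semantic.append(" | ".join(note[:20] for note in old))


def run_subtask(task_id: str, steps: list[str]) -> TieredMemory:
    """Give each sub-task its own memory so tasks cannot contaminate each other."""
    memory = TieredMemory()
    memory.structured["task_id"] = task_id
    for step in steps:
        memory.observe(step)
    return memory
```

Calling `run_subtask` once per task yields independent `TieredMemory` objects, so nothing written while working on one task is visible to another. That is the isolation property in its simplest form; the working tier stays bounded because older notes are folded into the semantic tier.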

Empirical Results Show Up to 3.5‑Fold Gains

Across three CUA backends (UFO2, OpenAI CUA, and a hierarchical model), CorpGen reaches a 15.2 % task‑completion rate versus 4.3 % for baselines, an improvement of up to 3.5×. Performance remains stable as load increases, which the paper presents as evidence of robustness. Ablation studies show that removing experiential learning sharply reduces the advantage, indicating that it drives most of the observed gains [1].
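The headline ratios can be checked directly from the quoted percentages. The short Python snippet below uses only the figures reported in this summary; the variable names are illustrative.

```python
# Task-completion rates quoted in the summary, in percent.
baseline_low_load = 16.7    # baseline CUAs at 25 % load
baseline_full_load = 8.7    # baseline CUAs at 100 % load
corpgen = 15.2              # CorpGen
baseline = 4.3              # baselines in the same comparison

retained = baseline_full_load / baseline_low_load   # share kept at full load
speedup = corpgen / baseline                         # relative improvement

print(f"retained at full load: {retained:.2f}")   # prints 0.52
print(f"CorpGen vs. baseline: {speedup:.2f}x")    # prints 3.53x
```

A retained share of about 0.52 matches the "roughly in half" description of the baseline collapse, and 15.2 / 4.3 ≈ 3.53 matches the reported 3.5× gain.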

Sources

Timeline

2025 – Researchers compile a benchmark of 115 failed AI‑agent trajectories covering structured API workflows, incident‑management processes, and open‑ended web/file tasks, manually annotating each run with the exact failure step and a taxonomy category [2].

2025 – Using grounded‑theory analysis, the team derives a cross‑domain failure taxonomy that captures common error patterns across diverse agent applications, enabling consistent labeling [2].

2025 – The AgentRx framework automatically generates constraints for each trajectory step, evaluates them sequentially, and logs any violations, producing an auditable trace of where the agent went wrong [2].

2025 – An LLM‑based judge consumes the constraint‑violation log, matches evidence to taxonomy categories, and pinpoints the critical failure step, dramatically reducing manual analysis [2].

2025 – Experiments across three distinct domains show AgentRx “outperforms existing baselines in step localization and attribution,” achieving higher accuracy in identifying failure steps and categories [2].

2025 – Researchers identify four failure modes—context saturation, memory interference, dependency complexity, and reprioritization overhead—that cut baseline CUA completion rates from 16.7 % to 8.7 % as task load rises from 25 % to 100 % [1].

Feb 2, 2026 – The AgentRx framework is publicly released, providing the community with a domain‑agnostic diagnostic tool and the 115‑trajectory benchmark dataset [2].

Feb 15, 2026 – CorpGen introduces architecture‑agnostic mechanisms—hierarchical planning for goal alignment, sub‑agent isolation, tiered memory (working, structured, semantic), and adaptive summarization—to counter the identified failure modes in Multi‑Horizon Task Environments [1].

Feb 15, 2026 – Empirical tests on the OSWorld Office environment with three CUA backends (UFO2, OpenAI CUA, hierarchical model) show CorpGen reaching “15.2 % task completion versus 4.3 % for baselines,” a 3.5× performance gain that remains stable as load increases [1].

Feb 15, 2026 – Ablation studies pinpoint experiential learning as the biggest contributor; removing it sharply reduces CorpGen’s advantage, indicating it drives the majority of observed improvements [1].
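The AgentRx entries in the timeline describe a loop that checks constraints step by step and logs violations for a judge to interpret. A minimal Python sketch of that idea follows. It rests on several assumptions: the names `check_trajectory` and `first_failure_step` are hypothetical, the constraints are fixed predicates rather than generated per step, and the "judge" is a naive earliest-violation rule rather than an LLM.

```python
from typing import Callable

# Hypothetical types: a step is a dict, a constraint is a named predicate.
Step = dict
Constraint = tuple[str, Callable[[Step], bool]]


def check_trajectory(steps: list[Step], constraints: list[Constraint]) -> list[dict]:
    """Evaluate every constraint on each step, in order, and log violations."""
    violations = []
    for index, step in enumerate(steps):
        for name, holds in constraints:
            if not holds(step):
                violations.append({"step": index, "constraint": name})
    return violations


def first_failure_step(violations: list[dict]):
    """Naive stand-in for the judge: return the earliest violating step, if any."""
    return min((v["step"] for v in violations), default=None)


# Illustrative trajectory and constraints (invented for this example).
steps = [
    {"tool": "search", "ok": True},
    {"tool": "", "ok": True},
    {"tool": "write", "ok": False},
]
constraints = [
    ("has_tool", lambda s: bool(s["tool"])),
    ("call_succeeded", lambda s: s["ok"]),
]
log = check_trajectory(steps, constraints)
print(log)
print(first_failure_step(log))
```

The violation log doubles as the auditable trace: each entry records which step broke which constraint, so a downstream judge, human or model, can work from evidence rather than rereading the whole trajectory.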