Microsoft Research Unveils Near‑Optimal Bandit Algorithms for Unknown Rewards and Delayed Feedback
New Single‑Index Bandit Framework Removes Reward‑Function Assumption The team defines generalized linear bandits whose link function is unknown, calling them single‑index bandits. This removes the standard but often unrealistic requirement that the reward function be known in advance, a requirement whose violation can cause existing algorithms to fail. The formulation applies to both monotonic and arbitrary reward shapes, establishing a broader problem setting, and the new model underpins the subsequent algorithmic contributions. [1]
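Concretely, the model posits a reward $r = g(\langle x, \theta^* \rangle) + \varepsilon$ in which both the index parameter $\theta^*$ and the link $g$ are hidden from the learner. Below is a minimal sketch of such an environment; the tanh link, the dimensions, and the noise level are illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K = 5, 20                       # feature dimension, number of arms (assumed)

theta_star = rng.normal(size=d)    # hidden index parameter
theta_star /= np.linalg.norm(theta_star)

def g(z):
    # Unknown link function: the learner never observes g itself.
    # tanh is an arbitrary monotonic choice for illustration.
    return np.tanh(2.0 * z)

def pull(x):
    # Noisy reward r = g(<x, theta*>) + eps; only r is revealed.
    return g(x @ theta_star) + 0.1 * rng.normal()

arms = rng.normal(size=(K, d))     # candidate contexts
reward = pull(arms[0])             # learner sees reward, not g or theta_star
```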
STOR, ESTOR, and GSTOR Deliver Sublinear Regret Across Reward Types For monotonic unknown rewards, the authors propose STOR and ESTOR, with ESTOR achieving a near‑optimal $\tilde{O}(\sqrt{T})$ regret bound. GSTOR extends the approach to any reward shape under a Gaussian design, preserving the same regret order. All three algorithms run in polynomial time and scale to realistic data sizes. [1]
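One reason Gaussian designs make arbitrary link shapes tractable is Stein's lemma: for $x \sim N(0, I_d)$, $\mathbb{E}[r\,x] = \mathbb{E}[g'(x^\top \theta^*)]\,\theta^*$, so a reward‑weighted average of contexts points along $\theta^*$ regardless of the shape of $g$. The sketch below illustrates that identity alone; it is not the authors' GSTOR procedure, and the link function and sample sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 5, 50_000                            # dimension, samples (assumed)

theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)

g = lambda z: np.sin(3.0 * z) + z**3        # arbitrary, non-monotonic link
X = rng.normal(size=(n, d))                 # Gaussian design
r = g(X @ theta_star) + 0.1 * rng.normal(size=n)

# Stein's lemma: for Gaussian x, E[r * x] = E[g'(x . theta*)] * theta*,
# so the empirical moment recovers the direction of theta* (up to sign).
theta_hat = (r[:, None] * X).mean(axis=0)
theta_hat /= np.linalg.norm(theta_hat)

print(abs(theta_hat @ theta_star))          # approaches 1 as n grows
```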
Sparse High‑Dimensional Extension Keeps Regret Rate Intact The researchers adapt ESTOR to a sparse setting in which only a small subset of features influences rewards. By exploiting the sparsity of the index parameter, the algorithm retains the $\tilde{O}(\sqrt{T})$ regret despite thousands of irrelevant dimensions. Empirical tests on synthetic and real‑world datasets confirm that performance does not degrade with dimensionality. [1]
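In the sparse regime only $s \ll d$ coordinates of $\theta^*$ are nonzero, so a moment estimate like the one above can be hard‑thresholded to suppress noise in the irrelevant coordinates, the usual route to rates that depend on $s$ rather than $d$. A hedged sketch follows; the threshold level and all constants are assumptions, not the paper's estimator.

```python
import numpy as np

rng = np.random.default_rng(2)
d, s, n = 2000, 5, 20_000                   # ambient dim, sparsity, samples (assumed)

support = rng.choice(d, size=s, replace=False)
theta_star = np.zeros(d)
theta_star[support] = 1.0 / np.sqrt(s)      # unit-norm, s-sparse index

g = lambda z: 1.0 / (1.0 + np.exp(-z))      # unknown monotone link (illustrative)
X = rng.normal(size=(n, d))
r = g(X @ theta_star) + 0.1 * rng.normal(size=n)

moment = (r[:, None] * X).mean(axis=0)      # same Stein-type moment as before
tau = 2.0 * np.sqrt(np.log(d) / n)          # assumed threshold ~ sqrt(log d / n)
theta_hat = np.where(np.abs(moment) > tau, moment, 0.0)

recovered = np.flatnonzero(theta_hat)
print(set(recovered) == set(support))       # True: irrelevant dims zeroed out
```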
Lipschitz Bandits Incorporate Stochastic Delays Without Losing Optimality In a separate study, the authors model actions in a metric space with rewards observed after random delays, covering both bounded and unbounded delay distributions. The delay‑aware zooming algorithm matches delay‑free regret up to an additive term proportional to the maximum delay $\tau_{\max}$. For unbounded delays, a phased learning strategy attains regret within logarithmic factors of a proven lower bound. [2]
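The mechanical change from the standard setting is that the reward for the pull at time $t$ only enters the learner's statistics at time $t + \text{delay}$. Below is a minimal simulation of that loop, using a finite‑armed simplification with a generic UCB rule standing in for the paper's delay‑aware zooming over a metric space; arm means, delays, and constants are all assumptions.

```python
import heapq
import math
import random

random.seed(3)

def delayed_run(T=2000, tau_max=50):
    """Toy delayed-feedback loop: the reward for the pull at time t becomes
    visible only at time t + delay, with delay <= tau_max (bounded case)."""
    means = [0.3, 0.5, 0.7]                 # illustrative finite-armed stand-in
    pending = []                            # min-heap of (arrival_time, arm, reward)
    counts, sums = [0, 0, 0], [0.0, 0.0, 0.0]

    for t in range(T):
        # Deliver every reward whose delay has now elapsed.
        while pending and pending[0][0] <= t:
            _, a, r = heapq.heappop(pending)
            counts[a] += 1
            sums[a] += r

        # Generic UCB choice on observed feedback only.
        def ucb(a):
            if counts[a] == 0:
                return float("inf")
            return sums[a] / counts[a] + math.sqrt(2 * math.log(t + 1) / counts[a])
        arm = max(range(3), key=ucb)

        reward = means[arm] + random.gauss(0.0, 0.1)
        delay = random.randint(0, tau_max)  # stochastic, bounded delay
        heapq.heappush(pending, (t + delay, arm, reward))

    return counts

print(delayed_run())                        # most pulls go to the best arm
```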
Empirical Results Show Superior Performance Over Existing Baselines Simulations across various delay scenarios demonstrate that both the delay‑aware zooming and phased learning algorithms outperform standard bandit methods. Likewise, the single‑index bandit algorithms outperform prior approaches that assume known reward functions. The studies were presented at ICLR 2026, highlighting their relevance to the machine‑learning community. [1][2]
Sources
1. Microsoft Research: New Algorithms for Single Index Bandits with Unknown Rewards: introduces the single‑index bandit model; presents STOR, ESTOR, GSTOR, and a sparse high‑dimensional extension with theoretical regret guarantees and experimental validation.
2. Microsoft Research: New Algorithms Achieve Sublinear Regret for Lipschitz Bandits with Stochastic Delays: defines a delayed‑feedback Lipschitz bandit setting; proposes delay‑aware zooming and phased learning algorithms with near‑optimal regret bounds; reports simulation results. Presented at ICLR 2026.
Timeline
2025 – Researchers commonly assume a known link function in generalized linear bandits, an unrealistic premise that can cause algorithm failure when the true reward function is unknown [1].
Early 2026 – The “single index bandit” problem is defined, removing the known‑link assumption and allowing the reward function to be completely unknown [1].
Early 2026 – Algorithms STOR and ESTOR are introduced for monotonically increasing unknown reward functions; ESTOR attains a near‑optimal $\tilde{O}(\sqrt{T})$ regret bound [1].
Early 2026 – The methods are extended to a high‑dimensional sparse setting, preserving the $\tilde{O}(\sqrt{T})$ regret by exploiting the sparsity of the index parameter despite many irrelevant features [1].
Early 2026 – GSTOR is presented, an algorithm agnostic to the shape of the reward function, with proven regret bounds under a Gaussian design assumption, broadening applicability beyond monotonic cases [1].
Early 2026 – Empirical tests on synthetic and real‑world datasets confirm that STOR, ESTOR, and GSTOR are both computationally efficient and achieve the promised regret guarantees [1].
2025 – Prior Lipschitz bandit research focuses on immediate feedback, ignoring stochastic delays in reward observation [2].
Early 2026 – A Lipschitz bandit framework with stochastic delays is defined, covering both bounded and unbounded delay distributions [2].
Early 2026 – A delay‑aware zooming algorithm is proposed for bounded delays, matching delay‑free regret up to an additive term proportional to the maximum delay $\tau_{\max}$ [2].
Early 2026 – A phased learning strategy is introduced for unbounded delays, aggregating reliable feedback over scheduled intervals and achieving regret within logarithmic factors of a proven lower bound [2].
Early 2026 – A theoretical lower bound for the unbounded‑delay setting is established, demonstrating that the proposed methods are near‑optimal [2].
Early 2026 – Simulations across diverse delay scenarios show that both the delay‑aware zooming and phased learning algorithms outperform baseline methods, validating practical effectiveness [2].
Apr 2026 – The delayed‑feedback Lipschitz bandit results are presented at ICLR 2026 by Microsoft Research [2].