Microsoft Unveils SageServe Framework to Slash GPU Costs for LLM Inference
Scale of Office 365 LLM Serving Revealed
Microsoft examined its Office 365 LLM deployment, which handles more than 10 million requests per day across multiple data-center regions, and found a mix of latency-sensitive and latency-insensitive workloads, each subject to its own service-level agreement (SLA) [1].
Current GPU Allocation Leads to Wasted Capacity
Existing practice isolates latency-sensitive and latency-insensitive workloads in separate, statically sized GPU pools; because these fixed allocations rarely track fluctuating request loads, substantial accelerator capacity sits idle [1].
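To see why static partitioning wastes capacity, consider that each isolated pool must be provisioned for its own peak load, while a shared pool only needs to cover the peak of the combined load. The toy Python sketch below illustrates this with invented demand traces; none of the numbers come from the paper.

```python
# Toy illustration: statically partitioned GPU pools must each be
# provisioned for their own peak, while a shared pool only needs
# the peak of the combined load. All numbers are hypothetical.

# Hypothetical hourly GPU demand per workload class over one day.
demand = {
    "latency_sensitive":   [80, 95, 100, 70, 40, 30, 55, 90],
    "latency_insensitive": [10, 20, 35, 60, 65, 55, 25, 15],
}
hours = len(demand["latency_sensitive"])
total_demand = sum(sum(trace) for trace in demand.values())

# Static pools: each pool is sized for that class's peak demand.
static_capacity = sum(max(trace) for trace in demand.values())

# Shared pool: sized for the peak of the *combined* demand, which is
# lower whenever the two classes do not peak at the same time.
combined = [sum(trace[h] for trace in demand.values()) for h in range(hours)]
shared_capacity = max(combined)

for name, cap in [("static pools", static_capacity),
                  ("shared pool", shared_capacity)]:
    idle = cap * hours - total_demand
    print(f"{name}: capacity={cap} GPUs, idle={idle} GPU-hours")
```

With these invented traces the static pools need 165 GPUs and leave 475 GPU-hours idle, while a shared pool covering the same demand needs only 135 GPUs and leaves 235 GPU-hours idle; the gap is exactly the headroom that dynamic routing can reclaim.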
SageServe Employs Multi-Timescale Dynamic Routing
The proposed SageServe framework adds a multi-timescale controller: on short timescales it routes incoming requests to the most suitable data center, and over longer horizons it scales GPU virtual machines and places models, guided by traffic forecasts and an integer linear programming (ILP) optimizer [1].
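The paper's exact ILP formulation is not reproduced here, but the sketch below, built with the PuLP library, shows one plausible shape of such a long-horizon planner: choose how many GPU VMs to run in each region per planning window so that forecast demand is covered at minimum VM-hour cost. The region names, per-VM capacity, costs, quotas, and forecast values are all invented for illustration.

```python
# Sketch of a long-horizon GPU VM allocation ILP in the spirit of
# SageServe's planner, using PuLP. Regions, capacities, costs, and
# forecasts below are hypothetical placeholders, not paper values.
import pulp

regions = ["region_a", "region_b", "region_c"]
windows = range(4)                    # planning windows (e.g. hours)
vm_capacity = 50                      # requests/sec one GPU VM serves (assumed)
vm_cost = {"region_a": 10.0, "region_b": 8.0, "region_c": 9.0}  # $ per VM-window
max_vms = {"region_a": 40, "region_b": 30, "region_c": 30}      # regional quota

# Forecast aggregate demand (requests/sec) per planning window.
forecast = [1200, 2100, 1800, 900]

prob = pulp.LpProblem("gpu_vm_allocation", pulp.LpMinimize)

# Integer decision variables: VMs to run in each region and window.
vms = {
    (r, t): pulp.LpVariable(f"vms_{r}_{t}", lowBound=0,
                            upBound=max_vms[r], cat="Integer")
    for r in regions for t in windows
}

# Objective: minimize total VM cost across regions and windows.
prob += pulp.lpSum(vm_cost[r] * vms[r, t] for r in regions for t in windows)

# Capacity constraints: provisioned throughput must cover the forecast
# in every window (requests can be routed to any region).
for t in windows:
    prob += pulp.lpSum(vm_capacity * vms[r, t] for r in regions) >= forecast[t]

prob.solve(pulp.PULP_CBC_CMD(msg=False))

for t in windows:
    plan = {r: int(vms[r, t].value()) for r in regions}
    print(f"window {t}: {plan}")
```

In the real system the formulation would presumably also encode model placement across data centers, per-SLA latency classes, and VM scaling delays, per the paper's description; the short-timescale router then dispatches live traffic within the capacity this plan provisions.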
Evaluations Show Up to 25% GPU-Hour Savings
In simulations and live tests on 10 million production requests spanning three regions and four open-source models, SageServe cut GPU-hour consumption by as much as 25% versus the baseline and reduced auto-scaling waste by 80%, translating to potential monthly cost savings of up to $2.5 million while preserving tail-latency SLAs [1].