Top Headlines


AI Systems Efficiency Research

Microsoft Research Demonstrates GPU‑Accelerated SQL Analytics on Compressed Data

Updated (2 articles)

GPU Memory Limits Drive Need for Compression. GPUs deliver unmatched parallelism for SQL analytics when entire datasets reside in high‑bandwidth memory (HBM), but typical HBM capacities are far smaller than CPU main memory, forcing partitioning or hybrid CPU‑GPU execution for larger tables [1]. These workarounds introduce bandwidth bottlenecks and I/O overhead, limiting performance gains. Compressing data reduces its footprint, allowing more rows to stay within HBM and mitigating memory‑size constraints [1].

New Compression‑Aware Query Techniques Bypass Decompression. The research introduces primitives that operate directly on run‑length encoding (RLE), index encoding, bit‑width reduction, and dictionary encoding without first expanding the data [1]. The techniques support simultaneous processing of multiple RLE columns and heterogeneous encodings, preserving query semantics while avoiding costly decompression steps. These methods enable full‑SQL query execution on compressed columns inside GPU memory.
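
A minimal plain‑Python sketch of the idea behind such primitives, using run‑length encoding: a filtered aggregate is computed one run at a time, never materializing the decoded column. The function name and signature are our own illustration, not the paper's API (whose primitives are PyTorch tensor operations).

```python
# Illustrative only: computes SUM(x) WHERE x > threshold directly on an
# RLE-encoded column, without decompressing it first.

def rle_sum_where(values, run_lengths, threshold):
    """Sum all decoded elements greater than `threshold`.

    `values[i]` repeats `run_lengths[i]` times in the decoded column,
    so each qualifying run contributes value * run_length in one step.
    """
    total = 0
    for v, n in zip(values, run_lengths):
        if v > threshold:
            total += v * n  # one multiply replaces n additions
    return total

# Decoded column would be [5, 5, 5, 2, 9, 9]; values > 4 sum to 5*3 + 9*2.
print(rle_sum_where([5, 2, 9], [3, 1, 2], threshold=4))  # 33
```

Because a qualifying run of length n contributes value × n in a single step, the work scales with the number of runs rather than the number of decoded rows.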

PyTorch Enables Portable, Device‑Agnostic Engine. The implementation relies on PyTorch tensor operations, providing a hardware‑neutral code base that runs on any GPU supporting the library [1]. This approach eliminates the need for separate CUDA‑specific code paths, simplifying deployment across diverse accelerator platforms. Portability is highlighted as a key factor for broader industry adoption.

Benchmarks Show Ten‑Fold Speedup Over CPU Solutions. Experiments on a production dataset that would not fit uncompressed in GPU memory demonstrate roughly ten‑fold faster query execution compared with leading commercial CPU‑only analytics systems [1]. The results represent an order‑of‑magnitude improvement, expanding viable use cases for GPU‑accelerated analytics on real‑world workloads. The study emphasizes that compression‑aware processing is essential to achieve these gains.

Microsoft Unveils SageServe Framework to Slash GPU Costs for LLM Inference

Updated (2 articles)

Scale of Microsoft Office 365 LLM Serving Revealed. Microsoft examined its Office 365 LLM deployment handling more than 10 million daily requests across several data‑center regions, identifying a mix of latency‑sensitive and latency‑insensitive tasks and a variety of SLA requirements [1]. The analysis covered request patterns over multiple weeks, exposing peak loads that strain fast‑task GPU pools while slower tasks occupy idle capacity [1]. These findings form the empirical basis for the proposed cost‑saving system [1].

Current GPU Allocation Practices Lead to Wasted Capacity. Existing serving architectures separate fast and slow workloads into distinct GPU pools, causing substantial under‑utilization because the fixed allocations rarely match real‑time demand [1]. Idle accelerators persist during off‑peak periods, inflating operational expenses without improving performance [1]. The study quantifies this inefficiency as a major target for optimization [1].

SageServe Introduces Dynamic Multi‑Timescale Resource Management. The new framework routes incoming requests to the most appropriate data center in the short term while simultaneously scaling GPU virtual machines and repositioning models over longer horizons [1]. It relies on traffic forecasts and an Integer Linear Programming optimizer to balance cost and latency objectives [1]. This multi‑timescale control enables rapid adaptation to workload fluctuations [1].
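
As a toy illustration of only the short‑term routing step (the paper couples this with ILP‑driven VM scaling, which is omitted here), requests can be assigned to regions greedily in ascending cost order under capacity limits. Region names, costs, and capacities below are invented for the example and are not from the paper.

```python
# Hypothetical sketch: greedy min-cost routing of request units to regions.
# A real ILP would optimize routing and scaling jointly; this is the
# simplest capacity-respecting approximation of the routing decision.

def route_requests(demand, capacity, cost):
    """Send `demand` request units to regions in ascending cost order,
    respecting each region's remaining GPU capacity."""
    plan = {}
    remaining = demand
    for region in sorted(cost, key=cost.get):  # cheapest region first
        if remaining == 0:
            break
        sent = min(remaining, capacity[region])
        if sent:
            plan[region] = sent
            remaining -= sent
    if remaining:
        raise RuntimeError("insufficient capacity across regions")
    return plan

plan = route_requests(
    demand=120,
    capacity={"us-east": 50, "eu-west": 100, "ap-south": 40},
    cost={"us-east": 1.0, "eu-west": 1.4, "ap-south": 0.8},
)
print(plan)  # {'ap-south': 40, 'us-east': 50, 'eu-west': 30}
```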

Evaluation Demonstrates Substantial GPU‑Hour Reductions. Simulations and live trials on 10 million production requests across three regions and four open‑source models achieved up to 25% fewer GPU‑hours compared with the baseline deployment [1]. The results maintained tail‑latency SLAs, confirming that cost cuts did not compromise service quality [1]. The evaluation validates SageServe’s potential for large‑scale cloud operators [1].

Auto‑Scaling Optimization Cuts Waste and Saves Millions. By eliminating inefficient auto‑scaling behavior, SageServe reduced GPU‑hour waste by 80%, translating into an estimated $2.5 million monthly cost reduction [1]. The framework preserves performance guarantees while dramatically lowering excess capacity [1]. These savings illustrate the financial impact of smarter resource orchestration [1].

Study Provides Rare Public Insight Into Internet‑Scale LLM Workloads. This research represents one of the first publicly available characterizations of Internet‑scale LLM serving, offering data that cloud providers worldwide can leverage for their own optimizations [1]. The authors emphasize the broader relevance of their methodology beyond Microsoft’s internal environment [1]. The paper sets a benchmark for future academic and industry analyses of large‑scale AI inference [1].

Microsoft’s DroidSpeak Cuts Multi‑LLM Inference Latency Up to Threefold

Updated (2 articles)

Redundant Context Processing Slows Multi‑LLM Pipelines. Large language model pipelines increasingly chain several fine‑tuned variants derived from a common base, but each model recomputes the full context during the prefill stage, creating significant latency and throughput bottlenecks [1]. The duplicated work grows linearly with the number of variants, limiting real‑time applications that rely on rapid multi‑LLM responses [1]. Researchers identified this inefficiency as the primary motivation for a new sharing framework [1].

DroidSpeak Reuses KV‑Cache Across Related Models. The system inspects the key‑value (KV) cache of the foundational model and isolates layers whose activations remain useful for downstream fine‑tuned versions [1]. For each variant, only the identified layers are recomputed, while the rest of the cache is retained, eliminating redundant computation [1]. This selective reuse targets models that share the same architecture and base weights, enabling seamless integration into existing serving stacks [1].
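
The reuse pattern can be sketched in a few lines. Everything here (function names, the per‑layer cache shape, how recompute layers are chosen) is our own illustration of the idea described above, not DroidSpeak's actual interface; in practice the set of layers to recompute would come from profiling where the variant's activations diverge from the base model's.

```python
# Hypothetical sketch of selective KV-cache reuse between a base model and a
# fine-tuned variant that shares its architecture and base weights.

def build_variant_cache(base_cache, recompute_layers, recompute_fn):
    """Return a per-layer KV cache for the variant model.

    Layers listed in `recompute_layers` are recomputed with the variant's
    weights; every other layer reuses the base model's cached entry,
    skipping that portion of the prefill work entirely.
    """
    variant_cache = {}
    for layer, kv in base_cache.items():
        if layer in recompute_layers:
            variant_cache[layer] = recompute_fn(layer)  # variant-specific
        else:
            variant_cache[layer] = kv                   # reused as-is
    return variant_cache

# Strings stand in for real key/value tensors.
base = {0: "kv0", 1: "kv1", 2: "kv2", 3: "kv3"}
cache = build_variant_cache(base, recompute_layers={1, 3},
                            recompute_fn=lambda layer: f"kv{layer}-variant")
print(cache)  # {0: 'kv0', 1: 'kv1-variant', 2: 'kv2', 3: 'kv3-variant'}
```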

Selective Layer Recalculation Preserves Accuracy. Experiments on diverse datasets show that the layer‑wise caching strategy incurs a deviation of only a few percentage points from baseline task performance [1]. Accuracy metrics remain within acceptable margins, confirming that speed gains do not come at the cost of significant quality loss [1]. The authors report that the trade‑off is consistent across multiple model pairs and tasks [1].

Benchmarks Show Up to Threefold Throughput Gains. On benchmark workloads, DroidSpeak delivers up to a 3× increase in overall inference throughput compared with full recomputation [1]. Prefill latency improves on average by a factor of 2.6, accelerating the initial token generation phase that typically dominates response time [1]. The paper, authored by Shan Lu, Madan Musuvathi, and Esha Choukse, was published in Microsoft Research’s archive on May 1, 2026 [1].

New RL Techniques Slash Rare‑Token Gradient Dominance, Boost Logic Puzzle Scores

Updated (2 articles)

RL Training Skews Toward Rare Tokens. Reinforcement learning for large language models (LLMs) assigns outsized gradients to tokens the model predicts with low probability, because those tokens generate unusually large advantage signals. This disproportionate influence drowns out the smaller, essential gradients from high‑probability tokens, limiting overall reasoning performance. The effect has been identified as a core inefficiency in current RL‑based fine‑tuning pipelines [1].

Advantage Reweighting and Lopti Rebalance Updates. The researchers introduce Advantage Reweighting, which rescales token‑level advantages to temper the impact of rare tokens, and Low‑Probability Token Isolation (Lopti), which isolates and reduces gradients originating from low‑probability predictions. Both methods operate during the policy‑gradient step, preserving the learning signal from common tokens while still allowing rare tokens to contribute meaningfully. Experiments show the combined approach restores a more uniform gradient distribution across token probabilities [1].
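
A minimal sketch of the two ideas as summarized above. The scaling rule, the probability floor, and the choice to zero out isolated tokens are illustrative stand‑ins, not the paper's exact formulas: the reweighting scales each token's advantage toward its predicted probability so low‑probability tokens contribute smaller gradients, and the Lopti‑style step separates out tokens below a probability floor.

```python
# Hypothetical sketch: rebalancing token-level advantages before the
# policy-gradient update. alpha blends uniform and probability-proportional
# weighting; tokens under low_prob_floor are isolated (here simply masked).

def reweight_advantages(advantages, probs, alpha=0.5, low_prob_floor=0.05):
    out = []
    for adv, p in zip(advantages, probs):
        if p < low_prob_floor:
            out.append(0.0)  # Lopti-style isolation of rare tokens
        else:
            out.append(adv * (alpha * p + (1 - alpha)))  # temper by probability
    return out

advs = [4.0, 1.0, 2.0]    # raw token advantages
probs = [0.02, 0.9, 0.5]  # model's predicted token probabilities
print(reweight_advantages(advs, probs))
```

The rare token (p = 0.02) is masked out entirely, while the common tokens keep most of their signal, flattening the gradient distribution across token probabilities.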

GRPO Models Achieve Up to 46.2% Improvement. Applying the two techniques to Group Relative Policy Optimization (GRPO)‑trained LLMs yields dramatic gains on the K&K Logic Puzzle benchmark, with performance increases as high as 46.2% compared with baseline GRPO. The boost is most pronounced on puzzles requiring multi‑step logical inference, indicating that balanced token updates enhance higher‑order reasoning. These results suggest that mitigating low‑probability token dominance can unlock the full potential of RL‑based LLM training [1].

Open‑Source Release Facilitates Community Validation. The implementation of Advantage Reweighting and Lopti has been released publicly on GitHub, complete with training scripts and evaluation pipelines. This enables other research groups to reproduce the reported gains and explore extensions to other RL algorithms or model families. The authors encourage collaborative benchmarking to assess the generality of the methods across diverse tasks [1].

Microsoft Research Unveils PUNT Sampler, Boosting Parallel Text Generation Accuracy by Up to 16%

Updated (2 articles)

PUNT Sampler Introduced to Balance Independence and Confidence. The new PUNT sampler identifies token dependencies within masked diffusion models and removes lower‑confidence tokens from conflicting groups, ensuring that selected unmasking indices satisfy approximate conditional independence while prioritizing high‑confidence predictions [1]. This design directly addresses the trade‑off that has limited parallel sampling in prior approaches [1]. By structuring token groups this way, PUNT maintains coherence across simultaneously generated tokens [1].
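
A toy sketch of the selection step in the spirit of PUNT: among masked positions, drop the lower‑confidence member of each conflicting pair and unmask the survivors together. The conflict rule used here (adjacent positions stand in for dependent tokens) is invented purely for illustration; the actual sampler identifies dependencies within the masked diffusion model itself.

```python
# Illustrative only: confidence-driven selection of positions to unmask
# in one parallel step. Adjacency is a stand-in dependency test.

def select_unmask_positions(confidences, masked):
    """Pick masked positions to unmask together, keeping at most the
    higher-confidence position of any adjacent (dependent) pair."""
    chosen = set(masked)
    for i in sorted(masked):
        if i in chosen and i + 1 in chosen:  # adjacent positions conflict
            drop = i if confidences[i] < confidences[i + 1] else i + 1
            chosen.discard(drop)  # remove the lower-confidence token
    return sorted(chosen)

conf = {0: 0.9, 1: 0.4, 2: 0.8, 5: 0.7}
print(select_unmask_positions(conf, masked={0, 1, 2, 5}))  # [0, 2, 5]
```

Position 1 is dropped because it conflicts with the more confident position 0; the surviving positions are approximately independent under the toy rule and can be unmasked simultaneously.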

Parallel Unmasking Achieves Faster Inference Without Accuracy Loss. Enforcing conditional independence lets PUNT update many tokens at once, delivering inference speeds markedly higher than traditional left‑to‑right autoregressive generation [1]. Experiments show that this parallel unmasking does not sacrifice generation quality, matching or exceeding sequential baselines on standard metrics [1]. The speed advantage becomes more pronounced for longer sequences, where sequential models suffer latency bottlenecks [1].

Benchmark Results Show Up to 16% Accuracy Gain on IFEval. On the IFEval benchmark, PUNT outperforms strong training‑free baselines, delivering up to a 16% increase in accuracy [1]. The improvement holds even when compared to one‑by‑one sequential generation for extended texts [1]. These results indicate that parallel generation can be both faster and more accurate when guided by PUNT’s confidence‑driven selection [1].

Robustness Reduces Hyperparameter Tuning and Reveals Hierarchical Planning. Performance gains persist across a wide range of hyperparameter settings, suggesting that PUNT lessens reliance on brittle tuning required by earlier methods [1]. Observations reveal an emergent hierarchical generation pattern: the sampler first establishes high‑level paragraph structure before refining local details, resembling a planning process [1]. This behavior contributes to the model’s strong alignment and consistency across generated content [1].

Microsoft Research Unveils Near‑Optimal Bandit Algorithms for Unknown Rewards and Delayed Feedback

Updated (2 articles)

New Single‑Index Bandit Framework Removes Reward‑Function Assumption. The team defines generalized linear bandits with unknown link functions, calling them single‑index bandits, thereby eliminating the unrealistic requirement that the reward function be known in advance; a mis‑specified reward function can cause existing algorithms to fail [1]. This formulation applies to both monotonic and arbitrary reward shapes, establishing a broader problem setting. The new model underpins the subsequent algorithmic contributions.

STOR, ESTOR, and GSTOR Deliver Sublinear Regret Across Reward Types. For monotonic unknown rewards, the authors propose STOR and ESTOR, with ESTOR achieving a near‑optimal Õ(√T) regret bound [1]. GSTOR extends the approach to any reward shape under a Gaussian design, preserving the same regret order. All three algorithms run in polynomial time and scale to realistic data sizes.
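
For context, the cumulative regret that these bounds refer to can be written, under the usual single‑index convention (our notation, not necessarily the paper's), as:

```latex
% Regret over T rounds: \mu is the unknown link function, \theta_* the
% hidden parameter, x_* the best arm, and x_t the arm pulled at round t.
R(T) = \sum_{t=1}^{T} \Big[ \mu\big(x_*^{\top}\theta_*\big) - \mu\big(x_t^{\top}\theta_*\big) \Big]
```

An Õ(√T) bound on R(T) means the per‑round gap to the best arm vanishes as T grows, up to logarithmic factors.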

Sparse High‑Dimensional Extension Keeps Regret Rate Intact. The researchers adapt ESTOR to a sparse setting where only a small subset of features influences rewards [1]. By leveraging the sparsity index, the algorithm retains the Õ(√T) regret despite thousands of irrelevant dimensions. Empirical tests on synthetic and real‑world datasets confirm that performance does not degrade with dimensionality.

Lipschitz Bandits Incorporate Stochastic Delays Without Losing Optimality. In a separate study, the authors model actions in a metric space with rewards observed after random delays, covering both bounded and unbounded delay distributions [2]. The delay‑aware zooming algorithm matches delay‑free regret up to an additive term proportional to the maximum delay τ_max. For unbounded delays, a phased learning strategy attains regret within logarithmic factors of a proven lower bound.

Empirical Results Show Superior Performance Over Existing Baselines. Simulations across various delay scenarios demonstrate that both the delay‑aware zooming and phased learning algorithms outperform standard bandit methods [2]. Likewise, the single‑index bandit algorithms outperform prior approaches that assume known reward functions [1]. The studies were presented at ICLR 2026, highlighting their relevance to the machine‑learning community [1][2].


Generative AI Human Interaction

Microsoft Research Releases Framework Highlighting Reporting Gaps in Generative AI Deployments

Updated (3 articles)

Generative AI Has Shifted to General‑Purpose Functionality. Modern generative AI models now perform a wide array of tasks, unlike earlier predictive AI that focused on narrow predictions, making it difficult to form a reliable picture of how they are employed across sectors [1].

Current Industry Reports Contain Fragmented and Incomplete Usage Data. Academic, policy, and provider studies on generative AI usage are appearing with increasing frequency, yet the data remain incomplete, ambiguous, and often lacking in methodological detail, limiting their analytical value [1].

Integrative Review Produces Multi‑Dimensional Reporting Framework. Researchers conducted an integrative review to construct a framework that specifies which information about generative AI use should be reported and how, aiming to standardize disclosures and enhance analytical utility [1].

Application to Over 110 Documents Reveals Systematic Omission Patterns. Applying the framework to more than 110 industry reports uncovered recurring gaps, indicating that existing reporting fails to capture many deployment aspects; the authors call for standardized, methodologically specific reporting to prevent skewed narratives [1].

Microsoft Unveils Interaction‑Augmented Instruction Model to Boost GenAI Prompt‑Action Synergy

Updated (2 articles)

IAI Model Formalizes Prompt‑Interaction Relationship. The Interaction‑Augmented Instruction (IAI) model was introduced by Microsoft Research on April 13, 2026, as a compact entity‑relation graph that captures how text prompts combined with GUI actions such as brushing and clicking influence generative AI behavior [1]. It treats prompt‑action pairs as structured nodes, enabling systematic analysis of human‑AI communication [1]. The model is positioned as a foundational framework for future GenAI tool design [1].
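
One way to picture the entity‑relation structure is as a tiny graph in which prompts and GUI actions are nodes and edges record how an action modifies the prompt's meaning. The node kinds, relation label, and payloads below are our own illustration, not the paper's schema.

```python
# Hypothetical sketch of an IAI-style entity-relation graph.
from dataclasses import dataclass, field

@dataclass
class IAIGraph:
    nodes: dict = field(default_factory=dict)   # id -> (kind, payload)
    edges: list = field(default_factory=list)   # (src, relation, dst)

    def add_node(self, node_id, kind, payload):
        self.nodes[node_id] = (kind, payload)

    def relate(self, src, relation, dst):
        self.edges.append((src, relation, dst))

g = IAIGraph()
g.add_node("p1", "prompt", "make this region brighter")
g.add_node("a1", "action", {"type": "brush", "region": [40, 40, 120, 90]})
# The brush action pins down what "this region" in the prompt refers to.
g.relate("a1", "grounds-referent-of", "p1")
print(len(g.nodes), len(g.edges))  # 2 1
```

The edge makes explicit what a text‑only prompt leaves ambiguous: the spatial referent of "this region" is supplied by the GUI action rather than by more text.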

Twelve Atomic Interaction Paradigms Identified Across Tools. Researchers examined prior human‑GenAI interfaces and extracted twelve recurring atomic interaction patterns that are composable and reusable [1]. These paradigms include actions like selection, drag‑and‑drop, and multi‑modal annotation, each mapped to specific prompt modifications [1]. The taxonomy allows designers to compare and evaluate interaction choices across platforms [1].

Four Demonstration Scenarios Show Practical Utility. The paper presents four distinct scenarios (application refinement, workflow automation, creative brainstorming, and educational tutoring) where the IAI model guides the selection or invention of interaction paradigms [1]. In each case, the model predicts how specific GUI actions will alter AI output, demonstrating descriptive, discriminative, and generative capabilities [1]. These examples illustrate how the framework can accelerate prototype development and user testing [1].

Model Addresses Limitations of Text‑Only Prompts. The authors argue that pure text prompts often fail to convey fine‑grained or referential intent, leading to ambiguous AI responses [1]. By integrating precise GUI actions, the IAI model enables users to specify spatial, relational, and iterative constraints that text alone cannot express [1]. This hybrid approach is expected to foster richer, more controllable human‑AI collaboration [1].

Study Shows Professional Screenwriters Actively Shape Generative AI Workflow

Updated (3 articles)

Two‑Week Field Study Captures Real‑World Scriptwriting Practices. The research tracked nineteen professional screenwriters over a continuous two‑week period, observing how they incorporated generative AI into daily script development tasks. Unlike prior snapshot studies, this longitudinal design revealed evolving strategies as writers interacted with the tools across multiple drafts. Participants reported using AI for idea generation, dialogue refinement, and structural brainstorming, providing a comprehensive view of real‑world adoption [1].

Screenwriters Demonstrate Deliberate Planning and Reactive Use of AI. Writers entered each session with explicit goals, often outlining prompts and expected outputs before engaging the AI, indicating purposeful integration rather than passive reliance. When AI suggestions diverged from their vision, they quickly adjusted prompts or discarded content, showing a reactive feedback loop that maintained creative control. This behavior counters narratives that AI dominates the writing process, highlighting sustained human agency throughout [1].

Reflective Practice Generates New Co‑Creation Paradigms. Throughout the study, participants engaged in reflective practice, documenting how AI altered their cognition, workflow, and collaborative dynamics. The data uncovered emerging paradigms such as “prompt‑iteration cycles” and “AI‑augmented brainstorming,” which reshaped traditional scriptwriting stages. Researchers framed these shifts using Bandura’s theory of human agency, emphasizing that writers actively mobilize, regulate, and evaluate AI assistance [1].

Design Recommendations Emphasize Agency and Future Tool Alignment. The paper concludes with actionable guidance for tool developers, urging features that better align AI outputs with writers’ creative intent and support iterative control. Recommendations include customizable prompt libraries, transparent model reasoning, and interfaces that surface AI confidence levels. By prioritizing human‑centered design, the study aims to sustain collaborative co‑creation rather than supplant the writer [1].

Microsoft Research Launches Community Library Creator to Redefine AI Image Representation for Disabled People

Updated (2 articles)

Microsoft Research Partners with Global Disability Groups. In April 2026, Microsoft Research began a three‑month collaboration with three disability organizations from the Global North and South to co‑design AI image‑representation standards, directly confronting historic media misrepresentation of disabled people. The partnership’s goal was to embed community voices into dataset creation and expose the lack of collectively negotiated representation guidelines in current AI models [1]. Researchers used the joint effort to map existing biases and outline a human‑centric roadmap for future work.

Community Library Creator Tool Enables User‑Driven Dataset Curation. The collaboration produced the Community Library Creator, a platform that supplies design scaffolds allowing communities to define “good” representation and curate their own image datasets. The tool also facilitates the creation of community‑specific evaluation metrics and supports future model adjustments based on curated data. By handing technical control to disability groups, the project aims to prevent stereotype perpetuation in AI‑generated visuals [1].

Qualitative Findings Highlight Technical and Practical Challenges. Interviews with participants revealed tensions between nuanced human insights and the rigid requirements of AI pipelines, complicating the translation of community values into dataset specifications. Logistical constraints such as limited resources and varying technical expertise across regions further strained the curation process. Nonetheless, the study emphasized the empowerment potential of community‑led data practices for producing more accurate visual depictions [1].

Research Emphasizes Opportunity to Correct Biases Through AI. The authors argue that the proliferation of AI‑generated visual media offers a unique chance to rectify longstanding biases against disabled people in mainstream imagery. Proactive standards and direct community involvement are presented as essential to ensure AI models generate respectful, diverse representations. The paper calls for broader adoption of similar human‑centric approaches throughout the AI industry [1].

Microsoft Research Publishes AI Relationship Advice Study and Calls for Personhood‑Focused Research

Updated (3 articles)

Study Reveals Diverse Roles Users Assign to Relationship‑Advice AI. Researchers surveyed 25 participants who regularly use chatbots for sex, dating, and relationship guidance, collecting 90 distinct prompts to map usage patterns [1]. Interviewees described AI as a sounding board, strategic planner, and emotional confidant, illustrating varied expectations for AI‑mediated intimacy [1]. Follow‑up interviews with 17 participants deepened insights into how users balance AI’s informational gaps with relational goals [1]. The authors propose design and safety guidelines to foster healthier digital intimacies [1].

Participants Actively Counteract AI Bias and Overreliance. Survey respondents reported deliberately checking AI suggestions against personal judgment to avoid flattering bias and dependence [1]. They expressed concerns that unchecked reliance could exacerbate loneliness or self‑harm, prompting self‑regulation strategies [1]. Participants highlighted the need for alternative viewpoints to mitigate sycophancy [1]. These mitigation tactics underscore emerging user‑driven safety practices in AI‑assisted intimacy [1].

Folk Theories Drive Structured Prompting Strategies. Users developed informal beliefs about AI behavior, shaping how they phrase questions to elicit desired responses [1]. Common tactics included framing prompts neutrally and explicitly requesting alternative perspectives to overcome perceived limitations [1]. Such “folk theories” reflect users’ attempts to predict and steer AI output in relational contexts [1]. The study recommends further research into these prompting heuristics to improve AI design [1].

CHI Meet‑up Calls for Personhood‑Centered AI Research. The upcoming CHI conference will host a meet‑up dedicated to designing AI that supports personhood, defined as recognizing individuals as whole people with histories and relationships rather than solely by health status [2]. Organized by Anja Thieme, the event invites submissions that explore AI’s role in upholding relational personhood [2]. Organizers emphasize collaborative inquiry to place personhood at the core of future HCI research [2]. The call highlights a shift toward inclusive technology design that respects lived experience [2].

Existing HCI Work Lays Foundation Yet Gaps Remain. Prior HCI studies have examined identity, values, and lived experience in contexts such as stroke, bereavement, and dementia, providing a basis for investigating AI’s impact on personal identity [2]. Despite this foundation, the mediation of personhood by AI remains under‑researched, prompting the meet‑up’s focus [2]. Researchers are urged to address this gap by developing frameworks that integrate AI into holistic understandings of self [2]. The initiative seeks to expand scholarly attention beyond health‑centric applications toward broader relational dimensions [2].