
New RL Techniques Slash Rare‑Token Gradient Dominance, Boost Logic Puzzle Scores


RL Training Skews Toward Rare Tokens

Reinforcement learning for large language models (LLMs) assigns outsized gradients to tokens the model predicts with low probability: the policy‑gradient contribution of a sampled token grows as its predicted probability shrinks. This disproportionate influence drowns out the smaller but essential gradients from high‑probability tokens, limiting overall reasoning performance. The effect has been identified as a core inefficiency in current RL‑based fine‑tuning pipelines [1].
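To make the imbalance concrete, the toy PyTorch sketch below compares the gradient norm that a single sampled token contributes to a REINFORCE‑style per‑token loss when its probability is high versus low. It is purely illustrative and is not the paper's code; the vocabulary size and the fixed advantage value are arbitrary choices.

```python
import torch
import torch.nn.functional as F

# Toy illustration (not the paper's code): with the same advantage, a sampled
# low-probability token produces a larger gradient on the logits than a
# high-probability one, so rare tokens dominate the aggregate update.

torch.manual_seed(0)
advantage = 1.0          # held fixed to isolate the probability effect
logits = torch.randn(8)  # tiny 8-token vocabulary for readability

def grad_norm(sampled_token: int) -> float:
    """Norm of d[-advantage * log pi(token)] / d logits for one sampled token."""
    z = logits.clone().requires_grad_(True)
    loss = -advantage * F.log_softmax(z, dim=-1)[sampled_token]
    loss.backward()
    return z.grad.norm().item()

probs = F.softmax(logits, dim=-1)
likely, rare = int(probs.argmax()), int(probs.argmin())
print(f"p={probs[likely]:.3f} -> grad norm {grad_norm(likely):.3f}")
print(f"p={probs[rare]:.3f} -> grad norm {grad_norm(rare):.3f}")
```

Running this prints a larger norm for the rare token even though both tokens receive the same advantage, which is the per‑token imbalance the methods below try to correct.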

Advantage Reweighting and Lopti Rebalance Updates

The researchers introduce Advantage Reweighting, which rescales token‑level advantages to temper the impact of rare tokens, and Low‑Probability Token Isolation (Lopti), which isolates and reduces gradients originating from low‑probability predictions. Both methods operate during the policy‑gradient step, preserving the learning signal from common tokens while still allowing rare tokens to contribute meaningfully. Experiments show the combined approach restores a more uniform gradient distribution across token probabilities [1].
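The sketch below is a hedged reading of those two ideas, not the authors' implementation: it assumes Advantage Reweighting scales each token's advantage by a factor that grows with the token's probability, and that Lopti partitions tokens by a probability threshold so the two groups can be updated in separate gradient steps. The weighting form and the `alpha` and `eta` values are placeholders; the released GitHub code is the authoritative reference.

```python
import torch

def reweight_advantages(advantages: torch.Tensor,
                        token_probs: torch.Tensor,
                        alpha: float = 0.3) -> torch.Tensor:
    """Assumed linear reweighting: damp advantages of low-probability tokens."""
    weight = alpha * token_probs + (1.0 - alpha)   # lies in [1 - alpha, 1]
    return weight * advantages

def split_by_probability(token_probs: torch.Tensor, eta: float = 0.5):
    """Lopti-style isolation: masks for low- and high-probability token groups."""
    low_mask = token_probs < eta
    return low_mask, ~low_mask

# Dummy per-token data from one rollout.
probs = torch.tensor([0.02, 0.65, 0.10, 0.90])
advs  = torch.tensor([1.00, 1.00, -1.00, 1.00])

advs_rw = reweight_advantages(advs, probs)
low_mask, high_mask = split_by_probability(probs)
print(advs_rw)                 # rare tokens now carry smaller advantages
print(low_mask, high_mask)     # a trainer would step on each group separately
```

A Lopti‑style trainer would then take one optimizer step on the loss restricted to `low_mask` tokens and a second step on `high_mask` tokens, so neither group's gradients swamp the other's.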

GRPO Models Achieve Up to 46.2% Improvement

Applying the two techniques to Group Relative Policy Optimization (GRPO)‑trained LLMs yields dramatic gains on the K&K Logic Puzzle benchmark, with performance increases as high as 46.2% compared to baseline GRPO. The boost is most pronounced on puzzles requiring multi‑step logical inference, indicating that balanced token updates enhance higher‑order reasoning. These results suggest that mitigating low‑probability token dominance can unlock the full potential of RL‑based LLM training [1].
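For context, GRPO derives each completion's advantage from the other completions sampled for the same prompt rather than from a learned value function. The snippet below shows that group‑relative normalisation in its standard, simplified form (clipping, KL penalty, and token‑level bookkeeping omitted); it is where a probability‑aware reweighting such as the one sketched above would plug in.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """rewards: (num_prompts, group_size) scalar rewards, one row per prompt.
    Each completion's advantage is its reward normalised by the group's
    mean and standard deviation."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: two prompts, four sampled answers each (1 = verified correct, 0 = not).
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [1.0, 1.0, 1.0, 0.0]])
print(grpo_advantages(rewards))
# The scalar advantage of a completion is broadcast to all of its tokens,
# which is exactly the token-level signal the reweighting methods adjust.
```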

Open‑Source Release Facilitates Community Validation

The implementation of Advantage Reweighting and Lopti has been released publicly on GitHub, complete with training scripts and evaluation pipelines. This enables other research groups to reproduce the reported gains and explore extensions to other RL algorithms or model families. The authors encourage collaborative benchmarking to assess the generality of the methods across diverse tasks [1].


Timeline

Before 2026 RLVR (reinforcement learning with verifiable rewards) “emphasizes final answer correctness but lacks robustness guarantees,” focusing on rewarding correct endpoints while leaving the underlying reasoning process unchecked [2].

Feb 8, 2026 RLTR (Reinforcement Learning with Transferable Reward) tests whether a reasoning prefix from one LLM can guide another, improving sampling consistency and raising MATH500 Maj@64 by 3.6 percentage points, while matching RLVR accuracy with roughly 2.5× fewer training steps [2] (a sketch of the Maj@K majority‑vote metric appears after this timeline).

Apr 23, 2026 Researchers find that “low‑probability tokens dominate RL gradient updates,” inflating gradients from rare tokens and drowning out the smaller, essential gradients of high‑probability tokens, which suppresses overall reasoning performance [1].

Apr 23, 2026 They introduce Advantage Reweighting and Low‑Probability Token Isolation (Lopti) to rescale token advantages and isolate low‑probability gradients, rebalancing updates across token probabilities [1].

Apr 23, 2026 Applying these methods to Group Relative Policy Optimization (GRPO) models yields up to a 46.2% gain on the K&K Logic Puzzle benchmark, demonstrating a substantial boost in logical reasoning capability [1].

Apr 23, 2026 The code for Advantage Reweighting and Lopti is released publicly on GitHub, enabling other researchers to reproduce the results and extend the techniques to additional RL‑trained LLMs [1].
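The Maj@64 number cited in the Feb 8 entry is a majority‑voting metric: a problem counts as solved when the most frequent final answer among 64 sampled completions matches the reference. The sketch below shows that computation, assuming answers have already been extracted and normalised; it is illustrative, not the evaluation code from either paper.

```python
from collections import Counter

def maj_at_k(sampled_answers: list[list[str]], references: list[str]) -> float:
    """sampled_answers[i] holds the K extracted answers for problem i."""
    correct = 0
    for answers, ref in zip(sampled_answers, references):
        majority_answer, _ = Counter(answers).most_common(1)[0]
        correct += int(majority_answer == ref)
    return correct / len(references)

# Toy usage with K=3 instead of 64: the first problem's majority answer matches
# the reference, the second does not, so Maj@K = 0.5.
print(maj_at_k([["42", "42", "7"], ["3", "5", "5"]], ["42", "3"]))
```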
