New RL Method BandPO Solves LLM Entropy Collapse
- Researchers from Fudan University introduce BandPO to fix stability and exploration bottlenecks in reinforcement learning.
- Dynamic bounds replace fixed clipping to prevent entropy collapse and improve complex mathematical reasoning performance.
- Benchmarks show BandPO consistently outperforms standard GRPO across Qwen and DeepSeek model families.
Reinforcement learning is the secret sauce behind the reasoning capabilities of modern models, yet standard methods suffer from a hidden flaw. Current techniques use a fixed "clipping" mechanism to keep training stable. However, researchers have found that these rigid boundaries suppress rare but highly effective strategies, leading to a phenomenon known as entropy collapse, in which the model loses its ability to explore diverse solutions.
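To see why a fixed clip hurts rare strategies, consider the standard PPO-style clipped surrogate that GRPO inherits. The sketch below is a minimal illustration of that well-known objective, not code from the paper; the function name and values are illustrative.

```python
def clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """PPO-style clipped objective: the probability ratio pi_new/pi_old
    is clamped to a fixed band [1 - eps, 1 + eps] regardless of how
    likely the action was under the old policy."""
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    # Take the pessimistic (lower) of the unclipped and clipped terms.
    return min(ratio * advantage, clipped * advantage)

# A rare but correct token: the new policy would raise its probability 3x,
# but the fixed band caps the effective update at 1 + eps = 1.2,
# discarding most of the learning signal for this "tail strategy".
print(clipped_surrogate(ratio=3.0, advantage=1.0))  # 1.2, not 3.0
```

Because the cap is the same for every token, a large positive update on a low-probability action is treated exactly like an unstable update on a common one, which is the bottleneck BandPO targets.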
To bridge this gap, a team from Fudan University has developed Band-constrained Policy Optimization (BandPO). This approach swaps static limits for a dynamic "Band" operator whose bounds adjust based on the probability of an action. By deriving the bounds from projections under f-divergences, the system can expand or contract its training band in real time. This flexibility allows the model to learn from "tail strategies"—uncommon but correct paths—without sacrificing the stability required for large-scale training.
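The idea of probability-dependent bounds can be sketched as follows. This is only a hedged illustration under assumed behavior: the linear widening rule, the `alpha` parameter, and the function names are hypothetical stand-ins, since the paper derives its actual bounds from f-divergence projections.

```python
def band_bounds(p_old: float, eps: float = 0.2, alpha: float = 0.5):
    """Hypothetical band operator: widen the clipping band for
    low-probability (tail) actions so their gradients survive.
    This linear schedule is an illustrative assumption, not BandPO's
    actual f-divergence-derived rule."""
    width = eps * (1.0 + alpha * (1.0 - p_old))  # wider when p_old is small
    return 1.0 - width, 1.0 + width

def band_surrogate(ratio: float, advantage: float, p_old: float) -> float:
    """Clipped surrogate, but with bounds that depend on the action's
    probability under the old policy."""
    lo, hi = band_bounds(p_old)
    clipped = max(min(ratio, hi), lo)
    return min(ratio * advantage, clipped * advantage)

# A common token gets a tight band; a rare token gets a looser one,
# so more of its update signal is preserved.
print(band_bounds(0.9))   # near the standard fixed clip
print(band_bounds(0.01))  # noticeably wider band
```

The design intent mirrors the article: instead of one fixed ceiling, the permissible update scales with how rare the action is, preserving exploration gradients for tail strategies while keeping common actions tightly constrained.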
The results on complex reasoning tasks are impressive. In tests using Qwen and DeepSeek models on rigorous math benchmarks, BandPO significantly outperformed the popular GRPO framework. By preserving exploration gradients, the model maintains a healthy level of diversity in its reasoning process. The work offers the open-source community a robust framework for fine-tuning high-performance reasoning models.