NVIDIA Researchers Fix Multi-Reward RL Optimization with the GDPO Algorithm
- NVIDIA researchers identified a normalization collapse issue in traditional GRPO when handling multiple reward signals simultaneously.
- The new GDPO method decouples reward normalization to preserve the distinct resolution of competing training objectives.
- GDPO serves as a seamless replacement for GRPO, significantly improving performance in math, coding, and tool-calling tasks.
Reinforcement Learning (RL) pipelines often utilize multiple reward signals to align language models with complex preferences like accuracy and brevity. However, NVIDIA researchers found that Group Relative Policy Optimization (GRPO) suffers from "normalization collapse" in these environments. When distinct reward signals are combined before normalization, they converge into nearly identical values, reducing training signal precision. This loss of resolution often leads to model failure or suboptimal performance during alignment.
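The collapse can be seen with a few lines of arithmetic. The sketch below (illustrative only; the reward values and group size are made up) sums two reward signals of very different scales before applying GRPO-style group normalization, so the smaller signal is nearly erased from the resulting advantages:

```python
import statistics

def group_normalize(rewards):
    """GRPO-style group normalization: (r - mean) / std over the sampled group."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mu) / sigma for r in rewards]

# Hypothetical group of 4 sampled responses with two reward signals:
# a large-scale accuracy reward and a small-scale brevity reward.
accuracy = [1.0, 1.0, 0.0, 0.0]
brevity  = [0.00, 0.05, 0.05, 0.00]

# Sum-then-normalize: the combined values cluster by accuracy alone,
# so the brevity distinction barely moves the final advantages.
combined = [a + b for a, b in zip(accuracy, brevity)]
advantages = group_normalize(combined)
```

Here responses 0 and 1 have identical accuracy but different brevity, yet their advantages differ by only about 0.1 standard deviations: the brevity objective has effectively vanished from the training signal.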
To solve this, the team introduced Group reward-Decoupled Normalization Policy Optimization (GDPO). This method shifts the order of operations by normalizing individual rewards independently before aggregation. By decoupling the normalization process, GDPO preserves the relative differences between specific preferences, ensuring the model receives high-fidelity feedback even when balancing conflicting objectives. This change is vital for maintaining training stability during complex multi-objective reasoning tasks.
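Continuing the toy example above, the decoupled scheme normalizes each reward signal across the group first and only then aggregates. This is a minimal sketch of the idea as described in the article; the exact GDPO aggregation (per-objective weights, clipping, etc.) may differ:

```python
import statistics

def group_normalize(rewards):
    """Group normalization: (r - mean) / std over the sampled group."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mu) / sigma for r in rewards]

# Same hypothetical group as before: two reward signals on different scales.
accuracy = [1.0, 1.0, 0.0, 0.0]
brevity  = [0.00, 0.05, 0.05, 0.00]

# Normalize-then-sum: each objective is scaled to unit variance before
# aggregation, so neither signal can dominate or wash out the other.
adv_acc = group_normalize(accuracy)
adv_brev = group_normalize(brevity)
decoupled = [a + b for a, b in zip(adv_acc, adv_brev)]
```

With decoupling, the response that is both accurate and brief now receives a clearly higher advantage (about 2.0) than the accurate-but-verbose one (about 0.0), restoring the resolution between the competing objectives.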
Benchmarks in mathematical reasoning and tool calling show that GDPO consistently outperforms the standard GRPO framework. GDPO is designed for easy adoption as a drop-in replacement compatible with major RL libraries such as verl, TRL, and NVIDIA's NeMo-RL. The researchers also released a Slurm-free implementation, allowing practitioners to validate the method on standard hardware. This development provides a more robust path for aligning large language models with sophisticated, multi-dimensional human values.