FIPO Algorithm Boosts LLM Mathematical Reasoning
- FIPO improves credit assignment by weighting tokens based on their influence on the subsequent reasoning path
- Reasoning chain length increased from 4,000 to over 10,000 tokens on the Qwen2.5-32B model
- Model achieved 58% on AIME 2024, outperforming specialized competitors like o1-mini and DeepSeek-R1-Zero-Math
Current reinforcement learning methods for language models often struggle with "credit assignment," the difficult task of determining which specific words or steps in a long explanation actually led to the correct answer. Traditional systems treat an entire sequence of reasoning as a single unit, rewarding every word equally regardless of its logical importance.
Researchers have introduced FIPO (Future-KL Influenced Policy Optimization) to solve this by identifying "critical logical pivots" within a model's output. By calculating the real-time influence of every token on the subsequent reasoning path, FIPO can precisely reinforce smart logical moves while suppressing repetitive fillers. This creates a granular feedback system, allowing the model to understand the value of its individual thoughts rather than just the final outcome.
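The core idea can be sketched in a few lines. The snippet below is an illustrative approximation, not the paper's exact formulation: it assumes each token already has a "future-KL influence" score (how much that token shifts the distribution over the subsequent reasoning path) and shows how a single sequence-level reward could be redistributed into per-token advantages in proportion to those scores, so logical pivots receive more credit than filler.

```python
# Hypothetical sketch of influence-weighted credit assignment.
# Function names and the normalization scheme are illustrative assumptions,
# not the FIPO paper's exact method.

def influence_weights(future_kls):
    """Normalize per-token future-KL influence scores into weights.

    Tokens whose presence strongly shifts the distribution over the
    subsequent reasoning path (high KL) get a larger share of credit;
    repetitive fillers (near-zero KL) get a smaller share.
    """
    total = sum(future_kls)
    if total == 0:
        # Degenerate case: fall back to uniform credit.
        n = len(future_kls)
        return [1.0 / n] * n
    return [k / total for k in future_kls]

def per_token_advantages(sequence_reward, future_kls):
    """Distribute one sequence-level reward across tokens by influence.

    Scaling by the sequence length keeps the mean per-token advantage
    equal to what uniform (sequence-level) credit assignment would give.
    """
    weights = influence_weights(future_kls)
    n = len(future_kls)
    return [sequence_reward * w * n for w in weights]

# Example: one pivotal token (KL = 2.0) among repetitive fillers (KL = 0.1).
kls = [0.1, 2.0, 0.1, 0.1]
advantages = per_token_advantages(1.0, kls)
# The pivotal token receives far more credit than any filler token.
```

In a full training loop, these per-token advantages would replace the uniform sequence-level advantage inside the policy-gradient update, which is what lets the model reinforce critical logical pivots while suppressing filler.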
When applied to the Qwen2.5-32B model, FIPO extended the average length of the model's "chain-of-thought"—the step-by-step internal reasoning used to solve problems—from 4,000 to over 10,000 tokens. The model achieved a peak accuracy of 58% on the challenging AIME 2024 mathematics benchmark, outperforming specialized models like o1-mini and DeepSeek-R1-Zero-Math-32B at similar scales.