HopChain Framework Boosts Multi-Hop Vision-Language Reasoning
- HopChain framework synthesizes multi-hop data to reduce compounding errors in long-chain visual reasoning.
- Qwen3.5 models trained with HopChain improved performance on 20 of 24 diverse benchmarks.
- Multi-hop data yields 50-point accuracy gains in ultra-long Chain-of-Thought reasoning scenarios.
Vision-Language Models (VLMs) often falter on tasks that require multiple steps of visual evidence: a single mistake in perception or logic cascades into complete failure, a phenomenon known as compounding errors. To address this, researchers from the Qwen team and Tsinghua LeapLab introduced HopChain, a framework designed to synthesize complex, multi-hop reasoning data. By forcing models to navigate logically dependent "hops"—where each step requires fresh visual grounding—the system systematically strengthens the foundational mechanics of visual reasoning.
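The arithmetic behind compounding errors is worth making concrete. A back-of-the-envelope sketch (illustrative only, not figures from the paper): if each hop succeeds independently with probability p, a k-hop chain succeeds with probability p^k, so even a strong per-hop accuracy collapses over long chains.

```python
# Illustrative compounding-error arithmetic: per-hop success probability p,
# chain of `hops` logically dependent steps. Assumes independent hop errors,
# which is a simplification for intuition only.
def chain_success(p: float, hops: int) -> float:
    """Probability that every hop in a k-hop chain succeeds."""
    return p ** hops

# A 90%-accurate single hop degrades quickly over longer chains.
print(round(chain_success(0.9, 1), 3))   # 0.9
print(round(chain_success(0.9, 5), 3))   # 0.59
print(round(chain_success(0.9, 10), 3))  # 0.349
```

This is why training data that exercises full dependent chains, rather than isolated single-step queries, targets the failure mode directly.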
The beauty of HopChain lies in its focus on out-of-distribution proxy tasks rather than benchmark-specific data. Each query culminates in a verifiable numerical answer, making it an ideal source for Reinforcement Learning with Verifiable Rewards (RLVR). When integrated into the training of Qwen3.5-35B and 397B models, HopChain demonstrated remarkable generalizability, boosting accuracy on 20 of 24 benchmarks spanning STEM, document understanding, and video analysis.
The results emphasize the critical nature of full reasoning chains. Replacing multi-hop queries with simpler variants led to significant performance drops, while the gains in ultra-long reasoning regimes exceeded 50 accuracy points. This research suggests that the path to truly capable multimodal AI involves training on the structural logic of vision-based problem solving, moving beyond the limitations of text-heavy reasoning patterns.