StepFun Unveils STEP3-VL-10B: A 10B Model Rivaling 235B Giants
- StepFun releases STEP3-VL-10B, an open-source model rivaling proprietary giants 20 times its size.
- The model achieves 92.2% on MMBench using a unified 1.2T-token pre-training strategy and a Qwen3-8B decoder.
- Parallel Coordinated Reasoning (PaCoRe) enables significant test-time compute scaling for complex visual math.
StepFun has unveiled STEP3-VL-10B, a compact yet formidable multimodal foundation model that punches far above its weight class. Despite its 10-billion-parameter footprint, the model matches or exceeds the performance of "frontier" models like Gemini 2.5 Pro and much larger open-source rivals like the 235B Qwen3-VL. This efficiency is achieved through a "unified" pre-training strategy in which the perception encoder and the language decoder are fully unfrozen and trained together on 1.2 trillion tokens, so the vision and language components learn to work in lockstep.

The other breakthrough lies in how the model thinks during inference. With Parallel Coordinated Reasoning (PaCoRe), StepFun lets the model scale its test-time compute, essentially giving it more "thinking time" to explore and then synthesize different visual hypotheses before delivering a final answer (a minimal sketch of this pattern appears at the end of the article). The approach translates into stellar reasoning capabilities, evidenced by a 94.43% score on the AIME2025 benchmark, and it is a clear signal that architectural cleverness and smart scaling strategies can often stand in for raw parameter count.

Beyond the base architecture, the model underwent an intensive post-training phase involving over 1,000 reinforcement-learning iterations to refine its accuracy and alignment. This iterative process helps it handle intricate visual-math benchmarks such as MathVision, where it reaches 75.95% accuracy. By releasing the full model suite as open source, StepFun provides a high-efficiency baseline showing that smaller models can indeed deliver state-of-the-art multimodal intelligence when paired with the right perceptual reasoning techniques.
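StepFun's release does not spell out PaCoRe's internals here, but the behaviour described above, sampling several independent visual reasoning paths and then synthesizing them into one answer, maps onto a familiar parallel test-time-compute pattern. The sketch below illustrates that pattern under those assumptions only; `model.generate`, `Hypothesis`, and every other name are hypothetical and not taken from StepFun's actual implementation.

```python
# Minimal sketch of parallel, coordinated test-time reasoning in the spirit of PaCoRe.
# The model interface (model.generate, trace.text, trace.final_answer) is assumed,
# not StepFun's real API; the actual PaCoRe algorithm may differ substantially.

from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    reasoning: str  # one candidate chain of visual reasoning
    answer: str     # the answer that chain arrives at


def generate_hypothesis(model, image, question: str, temperature: float) -> Hypothesis:
    """Sample one independent reasoning trace (hypothetical model API)."""
    trace = model.generate(image=image, prompt=question, temperature=temperature)
    return Hypothesis(reasoning=trace.text, answer=trace.final_answer)


def synthesize_answer(model, question: str, hypotheses: List[Hypothesis]) -> str:
    """Feed all candidate traces back to the model and ask it to reconcile them."""
    summary = "\n\n".join(
        f"Candidate {i + 1}:\n{h.reasoning}\nAnswer: {h.answer}"
        for i, h in enumerate(hypotheses)
    )
    prompt = (
        f"Question: {question}\n\n"
        f"Here are {len(hypotheses)} independent reasoning attempts:\n{summary}\n\n"
        "Compare them, resolve any disagreements, and give the best final answer."
    )
    return model.generate(prompt=prompt).final_answer


def parallel_coordinated_reasoning(model, image, question: str, n_paths: int = 8) -> str:
    # Raising n_paths spends more test-time compute on the same query
    # without changing the model's weights.
    hypotheses = [
        generate_hypothesis(model, image, question, temperature=0.8)
        for _ in range(n_paths)
    ]
    return synthesize_answer(model, question, hypotheses)
```

The design choice this illustrates is the one the article emphasizes: extra compute is spent at inference time, by widening the search over hypotheses rather than by adding parameters, which is how a 10B model can compete on hard visual-math reasoning.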