Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models
- New multimodal research paradigm enables dozens of reasoning steps and hundreds of search engine interactions.
- Vision-DeepResearch outperforms proprietary systems like GPT-5 and Claude-4-Sonnet on fact-centric VQA benchmarks.
- Model achieves state-of-the-art results through internalization of research capabilities via reinforcement learning.
Vision-DeepResearch introduces a significant shift in how AI handles complex information gathering by enabling models to conduct deep, multi-turn investigations across both visual and textual data. While traditional multimodal models often struggle with "visual noise" or rely on overly simplistic search queries, this new framework allows for multi-entity and multi-scale searching. This means the model doesn't just look at an image once; it interactively zooms in on details and executes dozens of reasoning steps to find the most accurate evidence.
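The interactive loop described above can be pictured as an agent that alternates between inspecting image regions and querying a search engine. The sketch below is purely illustrative: the tool names (`zoom`, `text_search`), the toy grid-of-labels "image", and the scripted policy are assumptions for demonstration, not Vision-DeepResearch's actual interface.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a multi-turn visual research loop.
# Tool names and the fixed action sequence are illustrative assumptions.

@dataclass
class ResearchAgent:
    trace: list = field(default_factory=list)  # log of reasoning/tool steps

    def zoom(self, image, region):
        """Crop a sub-region of the (toy) image to inspect fine detail."""
        r0, r1, c0, c1 = region
        return [row[c0:c1] for row in image[r0:r1]]

    def text_search(self, query, corpus):
        """Return corpus entries mentioning the query (stand-in for a search engine)."""
        return [doc for doc in corpus if query.lower() in doc.lower()]

    def run(self, image, corpus):
        # Step 1: coarse look at the whole image (here: a grid of labels).
        self.trace.append("observe_full_image")
        # Step 2: zoom into a region where a candidate entity was spotted.
        patch = self.zoom(image, (0, 1, 1, 3))
        entity = patch[0][0]  # first label visible in the crop
        self.trace.append(f"zoom -> {entity}")
        # Step 3: issue a search for the recovered entity.
        hits = self.text_search(entity, corpus)
        self.trace.append(f"search('{entity}') -> {len(hits)} hits")
        # Step 4: answer from retrieved evidence.
        return hits[0] if hits else "no evidence found"

agent = ResearchAgent()
image = [["sky", "Eiffel", "cloud"], ["street", "car", "tree"]]
corpus = ["The Eiffel Tower is in Paris.", "Mount Fuji is in Japan."]
answer = agent.run(image, corpus)
```

In the real system this loop would run for dozens of steps, with the model itself choosing when to zoom, what to search, and when to stop; here the sequence is hard-coded just to show the observe-zoom-search-answer shape.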
The researchers successfully "internalized" these sophisticated research habits into the model itself. Instead of relying on external scripts to orchestrate the search, the model learns the research process through a combination of cold-start supervision (supervised fine-tuning on a curated set of high-quality starting trajectories) and reinforcement learning. By practicing these workflows, the 8B and 30B-A3B parameter models can navigate hundreds of search-engine interactions autonomously.
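One common way such an RL stage scores a trajectory is an outcome-level reward: credit for a correct final answer, with a small cost per tool call to discourage aimless searching. The function below is a guess at that general shape, not the paper's actual objective; the exact-match check and the `step_cost` value are assumptions.

```python
# Hedged sketch of an outcome reward for agentic RL.
# The reward shape and step penalty are illustrative assumptions,
# not Vision-DeepResearch's published objective.

def trajectory_reward(answer: str, gold: str, num_tool_calls: int,
                      step_cost: float = 0.01) -> float:
    """Return 1.0 for a correct final answer (case-insensitive exact match),
    minus a small penalty per tool call used along the way."""
    correct = 1.0 if answer.strip().lower() == gold.strip().lower() else 0.0
    return correct - step_cost * num_tool_calls
```

Under a reward like this, the cold-start supervised model provides trajectories good enough for RL to refine: longer searches are tolerated only when they flip the answer from wrong to right.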
In head-to-head testing, Vision-DeepResearch demonstrated superior performance over closed-source giants like GPT-5 and Gemini-2.5-Pro on six major fact-centric benchmarks. This suggests that specialized training for long-horizon tasks—those requiring many steps over a long period—can allow smaller, open-source models to punch well above their weight class against the world's most powerful general-purpose foundation models.