Gen-Searcher Adds Search-Augmented Reasoning to Image Generation
- Gen-Searcher uses multi-hop search to gather external knowledge for high-accuracy image generation
- New training method combines supervised fine-tuning with reinforcement learning using dual text-image rewards
- Model outperforms Qwen-Image by 16 points on the new KnowGen search-grounded benchmark
Current image generation models often struggle with knowledge-intensive tasks because they rely solely on frozen internal knowledge: the static data captured during training. If a user asks for a specific recent event or a niche scientific concept, these models might guess or invent inaccurate details. Gen-Searcher addresses this limitation by acting as an agent that can browse the web and retrieve reference images before it begins the creative process.
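The agentic retrieve-then-generate loop can be pictured as follows. This is a minimal sketch of the control flow only; the function names (`multi_hop_retrieve`, `search_fn`, `follow_up_fn`) and the stopping rule are hypothetical stand-ins, not Gen-Searcher's actual interface.

```python
def multi_hop_retrieve(query, search_fn, follow_up_fn, max_hops=3):
    """Iteratively search, letting each result suggest the next query."""
    evidence = []
    current = query
    for _ in range(max_hops):
        result = search_fn(current)              # e.g. a web or image search
        evidence.append(result)
        current = follow_up_fn(current, result)  # agent proposes the next hop
        if current is None:                      # agent decides it has enough
            break
    return evidence                              # context handed to the generator

# Toy stand-ins to illustrate the chaining behaviour.
def toy_search(q):
    return f"result-for:{q}"

def toy_follow_up(q, r):
    return None if "hop2" in q else q + "-hop2"

print(multi_hop_retrieve("event", toy_search, toy_follow_up))
```

The key point the sketch captures is that each retrieved result can reshape the next query, which is what distinguishes multi-hop retrieval from a single lookup.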
The system employs multi-hop reasoning: rather than performing a single search, it follows a chain of information to find exactly what it needs for grounded generation. To train this behavior, the researchers combined supervised fine-tuning with reinforcement learning in which the model receives feedback on both textual accuracy and how well the final image matches the retrieved visual references. This dual reward keeps the output faithful to real-world data.
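A dual reward of this kind is often a weighted combination of the two signals. The sketch below is an assumption about the general shape, not Gen-Searcher's published formulation: the equal default weights and the scoring functions are hypothetical.

```python
def dual_reward(text_score, image_score, w_text=0.5, w_image=0.5):
    """Combine a textual-accuracy score with an image-reference
    similarity score into a single RL reward.

    Both scores are assumed to lie in [0, 1]; the weights are
    illustrative, not the paper's values.
    """
    return w_text * text_score + w_image * image_score

# A generation that is textually accurate but visually off-reference
# still receives only a moderate reward, pushing the policy toward
# satisfying both signals at once.
print(dual_reward(0.9, 0.2))
```

Weighting the two terms separately lets trainers tune how much the policy prioritizes factual text adherence versus visual fidelity to the retrieved references.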
The results are significant: a 16-point improvement over Qwen-Image on the KnowGen benchmark. Because the researchers have open-sourced the 8B model and their training datasets, the project provides a foundational framework for future search agents, tools that let AI understand and visualize the world in real time and bridge the gap between static training data and an evolving world of information.