DeepGen 1.0 Outperforms Rivals 16x Its Size
- DeepGen 1.0 is a 5B-parameter unified model for image generation and editing tasks.
- A new 'Stacked Channel Bridging' framework and learnable 'think tokens' enable the 5B model to beat 80B-parameter rivals.
- The model achieves superior performance using only 50 million training samples, and its weights are open-sourced.
DeepGen 1.0 represents a significant shift toward efficiency in multimodal AI, proving that massive parameter counts are not always necessary for high-quality results. Developed by the Shanghai Innovation Institute, this 5B parameter model handles both image generation and editing with a precision that often eludes models five to sixteen times its size.
The architecture utilizes a novel framework called Stacked Channel Bridging (SCB), which pulls rich, layered information from multiple levels of a vision language model. By combining this data with learnable 'think tokens'—special placeholders that help the model process reasoning-rich guidance—DeepGen 1.0 provides the generative backbone with a more structured understanding of complex prompts. This approach bridges the gap between seeing an image and understanding the intricate logic required to modify it.
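The layered-feature idea can be sketched in a few lines. The snippet below is a minimal, illustrative mock-up of the concept only, not the paper's implementation: all shapes, names, and the random projection are invented. It shows hidden states from several VLM layers being stacked along the channel axis, "think token" embeddings appended to the sequence, and the result projected into a conditioning space for a generative backbone.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes -- purely illustrative, not the model's real dimensions.
seq_len, vlm_dim, n_layers = 16, 64, 3
backbone_dim = 32
n_think_tokens = 4

# Hidden states from multiple VLM layers (e.g. shallow, middle, deep).
layer_feats = [rng.normal(size=(seq_len, vlm_dim)) for _ in range(n_layers)]

# 1) Stack features along the channel dimension: (seq_len, n_layers * vlm_dim).
stacked = np.concatenate(layer_feats, axis=-1)

# 2) Append learnable "think tokens" (in a real model, trainable parameters).
think_tokens = rng.normal(size=(n_think_tokens, n_layers * vlm_dim))
conditioned = np.concatenate([stacked, think_tokens], axis=0)

# 3) Project into the generative backbone's conditioning space.
proj = rng.normal(size=(n_layers * vlm_dim, backbone_dim)) / np.sqrt(n_layers * vlm_dim)
condition = conditioned @ proj

print(condition.shape)  # (seq_len + n_think_tokens, backbone_dim) -> (20, 32)
```

The key point of the sketch is that the backbone conditions on a sequence that is both channel-wise richer (stacked layers) and longer (extra think-token slots) than a single VLM layer's output.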
The researchers employed a three-stage training strategy, culminating in reinforcement learning with GRPO (Group Relative Policy Optimization). This stage uses a mixture of reward functions to fine-tune the model toward human preferences, keeping the output high-fidelity and free of common visual glitches. By open-sourcing the weights and code, the team aims to democratize access to high-performance multimodal tools, allowing researchers to build advanced image tools without needing industrial-scale computing power.
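The reward-mixture idea behind GRPO-style fine-tuning can be illustrated with a small sketch. Everything below is hypothetical: the reward names, weights, and scores are invented, and the snippet only shows the two generic steps, combining several reward signals into one scalar per sample, then normalizing rewards within a group of samples for the same prompt to get relative advantages, as GRPO does.

```python
import statistics

def mixed_reward(scores, weights):
    """Weighted sum of individual reward signals for one generated sample."""
    return sum(weights[name] * value for name, value in scores.items())

def grpo_advantages(rewards):
    """Group-relative advantages: standardize rewards within the group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Hypothetical reward components for 4 samples of the same prompt.
weights = {"aesthetics": 0.4, "prompt_fidelity": 0.5, "artifact_penalty": 0.1}
group = [
    {"aesthetics": 0.8, "prompt_fidelity": 0.9, "artifact_penalty": 0.2},
    {"aesthetics": 0.6, "prompt_fidelity": 0.7, "artifact_penalty": 0.9},
    {"aesthetics": 0.9, "prompt_fidelity": 0.5, "artifact_penalty": 0.4},
    {"aesthetics": 0.4, "prompt_fidelity": 0.8, "artifact_penalty": 0.6},
]

rewards = [mixed_reward(s, weights) for s in group]
advs = grpo_advantages(rewards)
print([round(a, 2) for a in advs])
```

Because advantages are computed relative to the group mean, no separate value network is needed; samples that beat their siblings for the same prompt are reinforced, which is what makes GRPO attractive for preference-style rewards.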