Video Diffusion Models Grant 3D Spatial Awareness to MLLMs
- VEGA-3D repurposes pre-trained video diffusion models as latent world simulators for multimodal AI.
- Framework extracts 3D structural priors from noise levels to fix spatial blindness in MLLMs.
- Method sets new benchmarks in 3D scene understanding and robotic embodied manipulation tasks.
While modern Multimodal Large Language Models (MLLMs) can describe images with remarkable clarity, they often suffer from "spatial blindness," failing to grasp the underlying three-dimensional geometry and physical laws of the world they observe. This limitation hinders their use in tasks like robotic navigation or fine-grained spatial reasoning, where depth and volume matter.
To bridge this gap, researchers introduced VEGA-3D (Video Extracted Generative Awareness), a framework that treats video diffusion models as "latent world simulators." The core insight is simple: to generate a realistic, coherent video, a model must implicitly understand how objects move and persist in 3D space. By tapping into these hidden physical priors, the system gives MLLMs a "sense of space" without requiring expensive 3D-labeled data.
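The idea of tapping a diffusion model's priors at a chosen noise level can be sketched with the standard forward-diffusion equation, x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps. The toy denoiser and shapes below are illustrative stand-ins (the paper's actual network and hook points are not specified here); real use would hook an intermediate layer of a pre-trained video diffusion U-Net:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear beta schedule: alpha_bar[t] shrinks from ~1 (clean) to ~0 (pure noise).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def add_noise(x0, t):
    """Forward diffusion q(x_t | x_0): scale the clean latent, mix in Gaussian noise."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

def toy_denoiser_features(x_t):
    """Hypothetical stand-in for a video diffusion backbone: returns an
    intermediate activation map instead of a predicted noise residual."""
    W = rng.standard_normal((x_t.shape[-1], 64)) / np.sqrt(x_t.shape[-1])
    return np.tanh(x_t @ W)  # (frames, tokens, 64) spatiotemporal features

# Clean video latent: 8 frames x 16 spatial tokens x 32 channels (made-up sizes).
x0 = rng.standard_normal((8, 16, 32))

# Tap features at a mid-range noise level, where coarse structure
# (scene layout, geometry) dominates over fine texture.
t_structural = 600
features = toy_denoiser_features(add_noise(x0, t_structural))
print(features.shape)  # (8, 16, 64)
```

The key design choice the sketch illustrates: the noise level `t` acts as a dial between structure and texture, so structural priors are read off at intermediate timesteps rather than from the fully denoised output.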
The system works by extracting spatiotemporal features from the intermediate stages of the video generation process—specifically from the noise levels where structural details emerge. These geometric cues are then integrated into the language model through an adaptive gated fusion mechanism, which decides how much spatial information to inject for a given task.
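A common way to realize such adaptive gated fusion is a learned sigmoid gate over a residual injection. The sketch below is a minimal NumPy version under assumed shapes and names (`W_proj`, `W_gate`, the dimensions, and the gating form are all illustrative, not the paper's exact architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GatedFusion:
    """Minimal sketch: a per-token, per-dimension gate in (0, 1) decides how much
    projected spatial (diffusion) feature to add to the language-model state."""

    def __init__(self, d_text, d_spatial, rng):
        # Project spatial features into the text hidden dimension.
        self.W_proj = rng.standard_normal((d_spatial, d_text)) / np.sqrt(d_spatial)
        # Gate conditioned on concatenated text state + projected spatial feature.
        self.W_gate = rng.standard_normal((2 * d_text, d_text)) / np.sqrt(2 * d_text)

    def __call__(self, h_text, f_spatial):
        s = f_spatial @ self.W_proj                                  # (tokens, d_text)
        g = sigmoid(np.concatenate([h_text, s], axis=-1) @ self.W_gate)
        return h_text + g * s                                        # gated residual

fusion = GatedFusion(d_text=128, d_spatial=64, rng=rng)
h = rng.standard_normal((16, 128))   # MLLM token hidden states
f = rng.standard_normal((16, 64))    # diffusion-derived spatial features
out = fusion(h, f)
print(out.shape)  # (16, 128)
```

Because the gate is bounded in (0, 1), the fused state can never be perturbed by more than the full projected spatial feature, which is what lets the model down-weight spatial cues on tasks that do not need them.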
Experimental results show that this approach significantly outperforms existing baselines in 3D scene understanding and embodied manipulation. By turning generative models into spatial teachers, VEGA-3D offers a scalable path toward AI that truly understands the physical dimensions of the environments it inhabits.