Everything in Its Place: Benchmarking Spatial Intelligence of Text-to-Image Models
- Alibaba introduces the SpatialGenEval benchmark with 1,230 prompts to test spatial reasoning in text-to-image models.
- Evaluation of 21 leading models reveals significant failures in higher-order tasks like occlusion and causality.
- Fine-tuning models on the new SpatialT2I dataset yielded consistent performance gains of up to 5.7%.
Current text-to-image models often produce breathtaking visuals that crumble under close scrutiny of their spatial logic. While they can render a 'cat on a mat,' they frequently struggle with 'a cat hiding behind a box where only its ears are visible,' failing to grasp the nuances of depth and physical interaction. To bridge this gap, researchers from Alibaba have developed SpatialGenEval, a rigorous benchmark (a standardized test used to compare performance) designed to probe the spatial intelligence of these systems across 1,230 information-dense prompts.
The benchmark moves beyond simple object placement, challenging models with complex scenarios involving occlusion (objects blocking one another) and causality. After evaluating 21 state-of-the-art (SOTA) models, the researchers confirmed a significant bottleneck: even the most advanced systems struggle with higher-order spatial reasoning. The study suggests that current training data lacks the descriptive depth models need to learn how physical objects actually occupy three-dimensional space.
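To make the evaluation protocol concrete, the minimal sketch below shows one way per-category scoring of such a benchmark could be wired up. The `generate_image` and `judge_alignment` helpers and the category labels are hypothetical placeholders, not the paper's actual harness.

```python
# Minimal sketch of per-category benchmark scoring, assuming prompts are
# tagged with a spatial category (e.g. "occlusion", "causality").
# generate_image() and judge_alignment() are hypothetical placeholders,
# not APIs from SpatialGenEval itself.
from collections import defaultdict
from statistics import mean

def generate_image(model, prompt):
    """Placeholder: invoke the text-to-image model under test."""
    raise NotImplementedError

def judge_alignment(image, prompt) -> float:
    """Placeholder: return a 0-1 score for how well the image satisfies the
    spatial constraints in the prompt (e.g. via a VQA judge or human rating)."""
    raise NotImplementedError

def evaluate(model, prompts):
    """prompts: iterable of (category, prompt_text) pairs."""
    scores = defaultdict(list)
    for category, text in prompts:
        image = generate_image(model, text)
        scores[category].append(judge_alignment(image, text))
    # Per-category averages expose weak spots such as occlusion or causality.
    return {category: mean(values) for category, values in scores.items()}
```

Reporting per-category averages rather than a single aggregate score is what surfaces the higher-order failures described above.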
To address this, the team also released SpatialT2I, a dataset of 15,400 high-quality text-image pairs. By fine-tuning foundation models (pre-trained models used as a starting point for specialized tasks) such as Stable Diffusion-XL, part of the broader class of multimodal AI systems, they achieved measurable improvements in spatial accuracy. This data-centric approach indicates that spatial intelligence is not just an architectural challenge but also a matter of giving models more precise, spatially aware descriptions during training.
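As a rough illustration of what such data-centric fine-tuning involves, the sketch below shows a standard latent-diffusion training step over text-image pairs with spatially detailed captions. It follows the common epsilon-prediction recipe used in Hugging Face diffusers text-to-image examples rather than the paper's exact setup; the `unet`, `text_encoder`, `vae`, and `noise_scheduler` objects are assumed to be supplied by the caller, and Stable Diffusion-XL specifically would need additional conditioning inputs.

```python
# Schematic fine-tuning loop on spatially detailed text-image pairs
# (e.g. a dataset like SpatialT2I). Assumes Stable Diffusion-style components;
# not the paper's exact training recipe.
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader

def fine_tune(unet, text_encoder, vae, noise_scheduler, dataset,
              steps=1000, lr=1e-5, batch_size=4, device="cuda"):
    unet.train()
    optimizer = torch.optim.AdamW(unet.parameters(), lr=lr)
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    step = 0
    for batch in loader:  # batch: {"pixel_values": images, "input_ids": tokenized captions}
        with torch.no_grad():
            # Encode images to latents and captions to conditioning embeddings.
            latents = vae.encode(batch["pixel_values"].to(device)).latent_dist.sample() * 0.18215
            cond = text_encoder(batch["input_ids"].to(device))[0]
        # Standard epsilon-prediction objective: add noise, predict it back.
        noise = torch.randn_like(latents)
        timesteps = torch.randint(0, noise_scheduler.config.num_train_timesteps,
                                  (latents.shape[0],), device=device)
        noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)
        pred = unet(noisy_latents, timesteps, encoder_hidden_states=cond).sample
        loss = F.mse_loss(pred, noise)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        step += 1
        if step >= steps:
            break
```

The only thing that changes relative to a vanilla fine-tune is the data: captions that spell out occlusion, relative depth, and contact relationships give the model a denser training signal for spatial layout.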