MLLMs Struggle on Basic Visual Reasoning Benchmark
- The BabyVision benchmark reveals that current MLLMs struggle with basic visual tasks easily solved by three-year-old children.
- Top models like Gemini Pro significantly underperform compared to human benchmarks, scoring less than half of the adult average.
- Researchers released an open-source evaluation toolkit and BabyVision-Gen to help bridge the visual reasoning gap in AI.
A team led by Liang Chen, an AI researcher at UniPat-AI, has introduced BabyVision, a benchmark designed to expose fundamental limitations in Multimodal Large Language Models (MLLMs). While humans develop core visual skills before they learn to speak, modern AI models often lean on linguistic patterns that mask a fragile grasp of visual data. The benchmark evaluates performance across 22 subclasses and 388 items, specifically testing abilities that are independent of text-based knowledge.
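As a rough illustration of how results on a benchmark organized into subclasses are typically aggregated, the sketch below computes per-subclass accuracy and a macro average. The subclass names, item results, and `subclass_accuracies` helper are hypothetical placeholders for illustration only; they are not taken from the BabyVision release or its evaluation toolkit.

```python
from collections import defaultdict

# Hypothetical per-item results: (subclass, model_answer_correct).
# BabyVision groups 388 items into 22 subclasses; the names below are
# illustrative placeholders, not the benchmark's actual categories.
results = [
    ("occlusion", True),
    ("occlusion", False),
    ("mental_rotation", True),
    ("counting", False),
    ("counting", True),
]

def subclass_accuracies(items):
    """Return accuracy per subclass and the macro average across subclasses."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for subclass, is_correct in items:
        total[subclass] += 1
        correct[subclass] += int(is_correct)
    per_class = {s: correct[s] / total[s] for s in total}
    macro = sum(per_class.values()) / len(per_class)
    return per_class, macro

per_class, macro = subclass_accuracies(results)
print(per_class)                     # e.g. {'occlusion': 0.5, 'mental_rotation': 1.0, 'counting': 0.5}
print(f"macro accuracy: {macro:.2f}")
```

A macro average of this kind weights each subclass equally regardless of how many items it contains, which is one common convention when subclass sizes differ.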
The results demonstrate a substantial performance gap between AI and humans, with even advanced models like Gemini Pro-Preview failing to match the visual intuition of a young child. While these models excel at tasks requiring vast amounts of stored knowledge, they lack the visual primitives necessary for genuine perception and spatial logic. The study suggests that current training methods do not adequately foster the basic cognitive building blocks found in early human development.
To address these shortcomings, the research team also released BabyVision-Gen, a generative approach aimed at solving complex visual puzzles. Alongside this tool, they provided an open-source evaluation toolkit to encourage further development in the field. This research highlights the urgent need for a fundamental shift in how multimodal systems are trained to achieve human-level perception and reasoning.
By focusing on visual-only reasoning, the BabyVision benchmark serves as a critical diagnostic tool for the next generation of AI development. It shifts the focus from simple pattern matching to more robust, human-like interpretation of the physical world. This work underscores that true multimodal intelligence requires more than just scaling up data; it demands a deeper integration of visual and spatial logic.