VLMs Master the Shell Game via Trajectory Reasoning
- •VET-Bench reveals that state-of-the-art Vision-Language Models fail basic visual object tracking tasks.
- •Researchers prove fixed-depth transformers cannot track identical objects without explicit intermediate reasoning steps.
- •New SGCoT method lifts accuracy to over 90% by having models generate explicit object trajectories.
Vision-Language Models (VLMs) have long struggled with the "shell game"—the cognitive ability to track specific objects as they move and swap positions. While humans do this instinctively, current AI models often rely on visual shortcuts, identifying objects by unique colors or textures rather than following their actual paths. When faced with identical objects that require continuous tracking through space and time, even the most advanced models typically perform no better than random guessing.
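The shell-game setup described above can be reduced to a tiny simulation. The function and swap encoding below are illustrative, not part of VET-Bench: the point is that with indistinguishable cups, the only route to the answer is following the swap sequence itself—no appearance-based shortcut exists.

```python
# Minimal shell-game simulation: three identical cups, one hidden ball.
# Because the cups look the same, answering "where is the ball?" requires
# tracking state through every swap, not recognizing a visual feature.
# (All names here are illustrative, not from the paper.)

def track_ball(start: int, swaps: list[tuple[int, int]]) -> int:
    """Return the ball's final cup index after a sequence of swaps."""
    pos = start
    for a, b in swaps:
        if pos == a:
            pos = b
        elif pos == b:
            pos = a
    return pos

swaps = [(0, 1), (1, 2), (0, 2)]  # each tuple swaps two cup positions
print(track_ball(0, swaps))  # ball starts under cup 0 -> prints 0
```

Skipping or misordering even one swap changes the answer, which is why a model that cannot carry state across the sequence falls back to guessing.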
To address this gap, researchers from the National University of Singapore introduced VET-Bench, a diagnostic tool designed to test spatiotemporal continuity. Their analysis provides a mathematical proof that standard transformer architectures have fundamental limits when tracking indistinguishable entities. Because these models process information through a fixed number of layers, they struggle to maintain a persistent record of an object's position across a sequence of video frames without a structured way to write those movements down.
The solution, dubbed Spatiotemporal Grounded Chain-of-Thought (SGCoT), mimics human logic by forcing the AI to narrate the object's trajectory. By generating coordinates and movement descriptions as intermediate reasoning steps, the model creates a logical trail of the object’s location. When applied to the Molmo2 model, this technique boosted accuracy from near-zero to over 90%, proving that explicit reasoning can overcome architectural bottlenecks in visual tracking.
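A minimal sketch of the idea behind SGCoT: instead of answering in one shot, the model emits one grounded reasoning step per frame—coordinates included—before committing to an answer. The frame data, nearest-neighbor continuity rule, and step format below are assumptions for illustration; the paper's actual prompt and grounding mechanism are not reproduced here.

```python
# Illustrative trajectory narration: the cups are identical, so each frame
# yields only an unlabeled list of (x, y) coordinates. The trace links the
# target across frames by nearest-neighbor continuity and records one
# intermediate reasoning step per frame before stating the final answer.
# (Hypothetical data and format, not the paper's actual schema.)

def nearest(p, candidates):
    """Pick the candidate coordinate closest to p (squared distance)."""
    return min(candidates, key=lambda q: (q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)

def sgcot_trace(frames, start):
    """Emit one intermediate step per frame, then the grounded answer."""
    pos = start
    steps = [f"t=0: target at {pos}"]
    for t, coords in enumerate(frames[1:], start=1):
        pos = nearest(pos, coords)
        steps.append(f"t={t}: target moved to {pos}")
    steps.append(f"Answer: target ends at {pos}")
    return steps

# Three cups; the first two swap places over frames t=0..t=3.
frames = [
    [(10, 50), (60, 50), (110, 50)],
    [(30, 38), (40, 62), (110, 50)],
    [(48, 40), (18, 60), (110, 50)],
    [(60, 50), (10, 50), (110, 50)],
]
for line in sgcot_trace(frames, start=(10, 50)):
    print(line)  # final line: "Answer: target ends at (60, 50)"
```

The intermediate steps are exactly the "logical trail" the paragraph describes: each one pins the object to a location, so the final answer follows from the recorded trajectory rather than a single-shot guess.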