TOPReward Uses Model Probabilities for Better Robotic Training
- TOPReward uses internal model token probabilities to estimate robotic task progress with high accuracy.
- The system achieves a 0.947 correlation with actual task progress on Qwen3-VL, significantly outperforming existing zero-shot reward baselines.
- The framework generalizes across 130 tasks and multiple robot platforms, including Franka and SO-100.
Training robots to perform complex tasks often requires "rewards"—mathematical signals that tell the machine when it is doing a good job. Traditionally, these rewards are difficult to design by hand and often fail when the robot encounters a new environment. Researchers from the Allen Institute for AI have introduced TOPReward, a system that turns the "hidden" knowledge inside large Vision-Language Models (VLMs) into precise guidance for robots without needing extra training.
Most current methods ask an AI to describe a robot's progress in words, but this often leads to errors in numerical reasoning. Instead, TOPReward looks directly at the model's internal "logits," which are the raw mathematical scores the AI assigns to different possible words before it speaks. By analyzing the probability of specific tokens, the system creates a smooth "temporal value function"—a map that tracks how close a robot is to finishing a task over time.
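The core idea can be sketched in a few lines. This is an illustrative example rather than the authors' code: the logits below are made up, and in a real pipeline they would come from the VLM's output layer when it is prompted to rate task progress on a discrete scale (here, digits 0 through 9). Instead of taking the single most likely answer, the estimate is a probability-weighted average over all candidate answers, which yields a smooth progress score.

```python
import numpy as np

def softmax(x):
    """Convert raw logits into a probability distribution."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Hypothetical logits a VLM might assign to the answer tokens "0".."9"
# when asked how far along a task is. In practice these are read from
# the model's internal scores rather than its generated text.
digit_logits = np.array([0.1, 0.3, 0.5, 0.9, 1.4, 2.2, 3.1, 2.0, 1.1, 0.4])

probs = softmax(digit_logits)      # probability over the ten candidate answers
scores = np.arange(10) / 9.0       # map tokens "0".."9" onto progress in [0, 1]
reward = float(probs @ scores)     # expected progress: a smooth scalar reward
```

Because the reward is an expectation over the whole distribution rather than a single sampled word, small changes in the scene shift it gradually, which is what makes it usable as a temporal value function.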
The results are striking, showing a 0.947 correlation with actual task progress on the Qwen3-VL model. This allows robots to gauge success across over 130 different real-world scenarios, from folding laundry to picking up objects, without any task-specific fine-tuning. The approach narrows the gap between massive AI models and physical hardware, making it easier for robots to learn through trial and error in the real world.