Green-VLA: Staged Vision-Language-Action Model for Generalist Robots
- Sber Robotics Center unveils Green-VLA, a five-stage framework for generalist robot control across diverse embodiments
- Unified R64 action space and 3,000 hours of data enable control of humanoids and stationary manipulators
- Staged reinforcement learning alignment doubles success rates in real-world bimanual cleaning tasks vs. baselines
Developing a "brain" that can control any robot body remains a holy grail in robotics. Sber Robotics Center has introduced Green-VLA, a Vision-Language-Action (VLA) framework that achieves this versatility through a five-stage training curriculum. By progressing from foundational vision models to multi-embodiment pretraining and reinforcement learning alignment, the system learns to generalize physical intelligence across different hardware configurations.
A core innovation is the unified R64 action space, which acts as a "universal language" for robot movement. This interface lets a single policy command diverse robots by masking the joints (degrees of freedom) a given embodiment lacks, so commands never target hardware that isn't there. To handle messy data sources, the researchers employed optical-flow-based temporal resampling, a technique that normalizes motion speed across 3,000 hours of demonstrations, keeping behavior consistent no matter how fast or slow each one was recorded.
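To make the masking concrete, here is a minimal sketch in Python, assuming "R64" denotes a fixed 64-dimensional action vector shared across robots; the embodiment names, mask layouts, and loss are illustrative, not Green-VLA's published design.

```python
# Minimal sketch of unified action-space masking over a shared 64-dim vector.
# All names and dimension assignments below are hypothetical.
import numpy as np

ACTION_DIM = 64

# Hypothetical per-embodiment masks: True where the robot actually has a joint.
EMBODIMENT_MASKS = {
    "humanoid": np.arange(ACTION_DIM) < 40,       # e.g. 40 controllable DoF
    "stationary_arm": np.arange(ACTION_DIM) < 7,  # e.g. a 7-DoF manipulator
}

def masked_action(policy_output: np.ndarray, embodiment: str) -> np.ndarray:
    """Zero out the dimensions the target robot lacks, so one 64-dim policy
    head can drive any embodiment without cross-joint interference."""
    mask = EMBODIMENT_MASKS[embodiment]
    return np.where(mask, policy_output, 0.0)

def masked_imitation_loss(pred: np.ndarray, target: np.ndarray, embodiment: str) -> float:
    """MSE computed only over the robot's real DoF, so training gradients
    never flow through unused action dimensions."""
    mask = EMBODIMENT_MASKS[embodiment]
    err = (pred - target)[mask]
    return float(np.mean(err ** 2))
```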
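The resampling step can be sketched similarly. The snippet below re-indexes a demonstration so frames are evenly spaced in cumulative motion rather than wall-clock time; the Farneback flow estimator and its parameters are stand-ins, since the article does not specify what Green-VLA uses.

```python
# Sketch of optical-flow-based temporal resampling: pick frames evenly spaced
# in accumulated motion so fast and slow recordings yield comparable sequences.
import cv2
import numpy as np

def flow_magnitudes(frames: list[np.ndarray]) -> np.ndarray:
    """Mean dense optical-flow magnitude between consecutive grayscale frames."""
    mags = []
    for prev, nxt in zip(frames, frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(
            prev, nxt, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        mags.append(np.linalg.norm(flow, axis=-1).mean())
    return np.asarray(mags)

def resample_indices(frames: list[np.ndarray], n_out: int) -> np.ndarray:
    """Return n_out frame indices uniformly spaced in cumulative motion,
    normalizing away differences in recording speed."""
    cum = np.concatenate([[0.0], np.cumsum(flow_magnitudes(frames))])
    targets = np.linspace(0.0, cum[-1], n_out)
    return np.searchsorted(cum, targets).clip(0, len(frames) - 1)
```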
The framework also addresses safety and precision with an episode-progress prediction head. This curbs "post-success fidgeting": the tendency of robots to keep moving after a task is complete, often causing accidental failures. In real-world bimanual cleaning tests, Green-VLA nearly doubled the success rate of baseline models while operating twice as fast, marking a leap for embodied AI.
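A rough sketch of how such a head and its stopping rule might look, assuming progress is regressed to [0, 1]; the latent size, layer widths, and threshold below are placeholders rather than reported values.

```python
# Sketch of an episode-progress head with a halting rule that stops the policy
# from acting once the task is predicted complete. Sizes are illustrative.
import torch
import torch.nn as nn

class ProgressHead(nn.Module):
    """Auxiliary head mapping the policy's latent state to predicted progress."""
    def __init__(self, latent_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, 1), nn.Sigmoid(),  # progress confined to [0, 1]
        )

    def forward(self, latent: torch.Tensor) -> torch.Tensor:
        return self.net(latent).squeeze(-1)

def should_halt(progress: torch.Tensor, threshold: float = 0.98) -> bool:
    """Stop issuing actions once predicted progress crosses the threshold,
    avoiding "post-success fidgeting" after the task is already done."""
    return bool(progress.item() >= threshold)
```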