Green-VLA: Staged Vision-Language-Action Model for Generalist Robots
- Sber Robotics Center unveils Green-VLA, a five-stage framework for generalist robot control across diverse embodiments
- Unified R64 action space and 3,000 hours of data enable control of humanoids and stationary manipulators
- Staged reinforcement learning alignment doubles success rates in real-world bimanual cleaning tasks vs. baselines
Developing a "brain" that can control any robot body remains a holy grail in robotics. Sber Robotics Center has introduced Green-VLA, a Vision-Language-Action (VLA) framework that achieves this versatility through a five-stage training curriculum. By progressing from foundational vision models to multi-embodiment pretraining and reinforcement learning alignment, the system learns to generalize physical intelligence across different hardware configurations.
A core innovation is the unified R64 action space, which acts as a "universal language" for robot movement. This interface lets a single policy command diverse robots by masking the joints (degrees of freedom) a given embodiment lacks, so commands never target hardware that isn't there. To handle messy data sources, the researchers employed optical-flow-based temporal resampling, a technique that normalizes motion speed across 3,000 hours of demonstrations, keeping behavior consistent no matter how fast or slow each one was recorded.
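To make the masking concrete, here is a minimal sketch in Python, assuming "R64" denotes a fixed 64-dimensional action vector shared across robots; the embodiment names, mask layouts, and loss are illustrative, not Green-VLA's published design.

```python
# Minimal sketch of unified action-space masking over a shared 64-dim vector.
# All names and dimension assignments below are hypothetical.
import numpy as np

ACTION_DIM = 64

# Hypothetical per-embodiment masks: True where the robot actually has a joint.
EMBODIMENT_MASKS = {
    "humanoid": np.arange(ACTION_DIM) < 40,       # e.g. 40 controllable DoF
    "stationary_arm": np.arange(ACTION_DIM) < 7,  # e.g. a 7-DoF manipulator
}

def masked_action(policy_output: np.ndarray, embodiment: str) -> np.ndarray:
    """Zero out the dimensions the target robot lacks, so one 64-dim policy
    head can drive any embodiment without cross-joint interference."""
    mask = EMBODIMENT_MASKS[embodiment]
    return np.where(mask, policy_output, 0.0)

def masked_imitation_loss(pred: np.ndarray, target: np.ndarray, embodiment: str) -> float:
    """MSE computed only over the robot's real DoF, so training gradients
    never flow through unused action dimensions."""
    mask = EMBODIMENT_MASKS[embodiment]
    err = (pred - target)[mask]
    return float(np.mean(err ** 2))
```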
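The resampling step can be sketched similarly. The snippet below re-indexes a demonstration so frames are evenly spaced in cumulative motion rather than wall-clock time; the Farneback flow estimator and its parameters are stand-ins, since the article does not specify what Green-VLA uses.

```python
# Sketch of optical-flow-based temporal resampling: pick frames evenly spaced
# in accumulated motion so fast and slow recordings yield comparable sequences.
import cv2
import numpy as np

def flow_magnitudes(frames: list[np.ndarray]) -> np.ndarray:
    """Mean dense optical-flow magnitude between consecutive grayscale frames."""
    mags = []
    for prev, nxt in zip(frames, frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(
            prev, nxt, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        mags.append(np.linalg.norm(flow, axis=-1).mean())
    return np.asarray(mags)

def resample_indices(frames: list[np.ndarray], n_out: int) -> np.ndarray:
    """Return n_out frame indices uniformly spaced in cumulative motion,
    normalizing away differences in recording speed."""
    cum = np.concatenate([[0.0], np.cumsum(flow_magnitudes(frames))])
    targets = np.linspace(0.0, cum[-1], n_out)
    return np.searchsorted(cum, targets).clip(0, len(frames) - 1)
```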
The framework also addresses safety and precision with an episode-progress prediction head. This curbs "post-success fidgeting": the tendency of robots to keep moving after a task is complete, often causing accidental failures. In real-world bimanual cleaning tests, Green-VLA nearly doubled the success rate of baseline models while operating twice as fast, marking a leap for embodied AI.
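A rough sketch of how such a head and its stopping rule might look, assuming progress is regressed to [0, 1]; the latent size, layer widths, and threshold below are placeholders rather than reported values.

```python
# Sketch of an episode-progress head with a halting rule that stops the policy
# from acting once the task is predicted complete. Sizes are illustrative.
import torch
import torch.nn as nn

class ProgressHead(nn.Module):
    """Auxiliary head mapping the policy's latent state to predicted progress."""
    def __init__(self, latent_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, 1), nn.Sigmoid(),  # progress confined to [0, 1]
        )

    def forward(self, latent: torch.Tensor) -> torch.Tensor:
        return self.net(latent).squeeze(-1)

def should_halt(progress: torch.Tensor, threshold: float = 0.98) -> bool:
    """Stop issuing actions once predicted progress crosses the threshold,
    avoiding "post-success fidgeting" after the task is already done."""
    return bool(progress.item() >= threshold)
```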