Being-H0.5: Scaling Human-Centric Robot Learning for Cross-Embodiment Generalization
- BeingBeyond introduces Being-H0.5, a Vision-Language-Action foundation model for diverse robotic platforms.
- The UniHand-2.0 dataset provides 35,000+ hours of multimodal data across 30 distinct robotic embodiments.
- The model achieves state-of-the-art performance on the LIBERO (98.9%) and RoboCasa (53.9%) benchmarks.
The field of robotics has long struggled with the "embodiment gap": the difficulty of transferring skills learned by one robot to another with a different shape or control system. BeingBeyond researchers tackle this head-on with Being-H0.5, a Vision-Language-Action (VLA) model that treats human interaction as a universal "mother tongue." By taking human movement as the baseline for physical interaction, the model bridges the gap between human demonstrations and a wide range of robotic hardware, from multi-fingered hands to industrial arms.

At the heart of the system is the UniHand-2.0 dataset, an unprecedented collection of over 35,000 hours of multimodal data covering 30 different robotic embodiments. This scale lets the model generalize, that is, apply what it has learned to new, unseen scenarios. To handle such diversity, the team built a Mixture-of-Transformers architecture around a Mixture-of-Flow (MoF) framework, which separates shared motor skills from specialized modules tailored to specific robot bodies.

The results are strong: Being-H0.5 sets new records on the LIBERO (98.9%) and RoboCasa (53.9%) benchmarks. Beyond the scores, the model introduces a Unified Action Space, which maps the controls of different robots into semantically aligned slots. This lets "low-resource" robots, those with very little training data, bootstrap complex skills by borrowing from more data-rich platforms. Together, these pieces represent a significant step toward a single foundation model capable of perceiving and acting across any physical form.
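To make the "shared skills plus specialized modules" idea concrete, here is a minimal PyTorch sketch of the general pattern: a shared transformer trunk that learns embodiment-agnostic motor behavior, with a separate expert head per robot body. Everything in it, the module names, the dimensions, and the hard routing by embodiment ID, is an illustrative assumption, not the paper's actual Mixture-of-Flow architecture.

```python
# Sketch of a shared trunk + per-embodiment expert heads, a simplified
# stand-in for the Mixture-of-Flow design described above. Names, sizes,
# and the routing scheme are illustrative assumptions only.
import torch
import torch.nn as nn

class MoFPolicy(nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_shared_layers=6,
                 embodiments=("human_hand", "gripper_arm", "dex_hand")):
        super().__init__()
        # Shared trunk: motor skills common to all embodiments.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.shared = nn.TransformerEncoder(layer, num_layers=n_shared_layers)
        # One specialized expert per embodiment, mapping the shared
        # representation to that platform's native action dimension.
        action_dims = {"human_hand": 48, "gripper_arm": 7, "dex_hand": 22}
        self.experts = nn.ModuleDict({
            name: nn.Sequential(
                nn.Linear(d_model, d_model), nn.GELU(),
                nn.Linear(d_model, action_dims[name]),
            )
            for name in embodiments
        })

    def forward(self, obs_tokens, embodiment):
        # obs_tokens: (batch, seq_len, d_model) fused vision-language tokens.
        h = self.shared(obs_tokens)
        # Route the pooled representation to the matching expert head.
        return self.experts[embodiment](h.mean(dim=1))

policy = MoFPolicy()
tokens = torch.randn(2, 64, 512)
print(policy(tokens, "gripper_arm").shape)  # torch.Size([2, 7])
```

The design intuition is that gradients from every embodiment flow through the shared trunk, so skills learned on one platform are available to all of them, while each expert head only has to learn its own body's control mapping.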
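The Unified Action Space can likewise be pictured as a fixed-width vector whose positions carry consistent meaning across robots, so that, for example, a gripper command lands in the same slot whether it comes from a human hand recording or a factory arm. The slot layout, widths, and helper below are hypothetical, a sketch of the representational idea rather than the released format.

```python
# Hedged sketch of a unified action space: a fixed-width vector whose
# slots have consistent semantics, into which each robot's native action
# is scattered. Slot names and widths are illustrative assumptions.
import numpy as np

# Semantic slot layout shared by all embodiments (total width 32).
SLOTS = {"ee_pose": slice(0, 6), "gripper": slice(6, 7),
         "fingers": slice(7, 27), "base": slice(27, 32)}

def to_unified(native_action, mapping, width=32):
    """Scatter a robot's native action into the shared slot layout.

    mapping: {slot_name: slice into the native action vector} for the
    slots this robot actually uses. Returns the unified vector plus a
    mask marking live slots, so unactuated slots can be ignored.
    """
    unified = np.zeros(width, dtype=np.float32)
    mask = np.zeros(width, dtype=bool)
    for slot_name, native_slice in mapping.items():
        s = SLOTS[slot_name]
        unified[s] = native_action[native_slice]
        mask[s] = True
    return unified, mask

# A 7-DoF arm with a parallel gripper: 6-DoF end-effector delta + 1 gripper.
arm_action = np.random.randn(7).astype(np.float32)
u, m = to_unified(arm_action, {"ee_pose": slice(0, 6), "gripper": slice(6, 7)})
print(u.shape, m.sum())  # (32,) 7
```

Because every embodiment writes into the same semantic slots, a low-resource robot's few demonstrations land in regions of the action space already shaped by data-rich platforms, which is what makes the bootstrapping described above plausible.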