AI Model Creates Spatially Aware Digital Humans for VR
- SARAH system enables real-time, spatially aware full-body motion for digital humans in virtual reality.
- Architecture combines a causal transformer-based VAE with flow matching for 300 FPS performance.
- Adjustable gaze scoring mechanism allows users to control eye contact intensity during live interactions.
Digital humans in virtual reality often feel "robotic" because they fail to react to a user’s physical presence or maintain natural eye contact. A new research project titled SARAH (Spatially Aware Real-time Agentic Humans) addresses this by introducing a fully causal system that generates full-body motion aligned with both speech and spatial context. Unlike previous models that merely sync gestures to audio, SARAH allows an agent to turn toward a user and respond dynamically to their movements in a 3D space.
The technical backbone of SARAH is a causal transformer-based variational autoencoder (VAE) combined with a flow matching model. Because the system is causal, meaning it conditions only on past and present data rather than peeking at future frames, it runs at 300 frames per second, three times faster than previous non-causal baselines, while maintaining high motion quality. That speed is critical for streaming to VR headsets, where any lag can break immersion or cause motion sickness.
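The causality constraint is typically enforced with a lower-triangular attention mask, so each motion frame can attend only to itself and earlier frames. The paper's actual architecture is not shown here; the sketch below is a generic, minimal illustration of causal attention in NumPy, with hypothetical function names:

```python
import numpy as np

def causal_mask(t):
    # Lower-triangular mask: frame i may attend only to frames <= i,
    # which is what makes streaming (no future lookahead) possible.
    return np.tril(np.ones((t, t), dtype=bool))

def causal_attention(q, k, v):
    # Scaled dot-product attention over a motion sequence.
    # q, k, v: (t, d) arrays of per-frame features.
    t, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    # Disallow attention to future frames by masking with -inf.
    scores = np.where(causal_mask(t), scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Because the first frame can only attend to itself, its output equals its own value vector, and no output ever depends on frames that have not yet arrived.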
Perhaps most impressively, the researchers introduced a gaze scoring mechanism with classifier-free guidance. This lets developers tune how much eye contact an agent maintains without retraining the model. Whether a character is meant to be shy or assertive, the system learns natural spatial alignment from data while giving users precise control over social dynamics during real-time deployment.
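Classifier-free guidance works by blending an unconditional prediction with a condition-aware one, with a scalar weight chosen at inference time. The gaze-specific details of SARAH are not reproduced here; this is a minimal sketch of the general CFG blending rule, with hypothetical names like `gaze_scale`:

```python
import numpy as np

def cfg_blend(uncond, cond, gaze_scale):
    # Classifier-free guidance: extrapolate from the unconditional
    # prediction toward the gaze-conditioned one. A larger
    # gaze_scale means stronger eye contact; no retraining needed.
    return uncond + gaze_scale * (cond - uncond)
```

Setting the scale to 0 recovers the unconditional output, 1 recovers the fully conditioned output, and values above 1 exaggerate the conditioned behavior, which is how a single trained model can span "shy" to "assertive" characters.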