OpenMOSS Releases MOVA: 32B Open-Source Video-Audio Model
- OpenMOSS debuts MOVA, an open-source 32B parameter model for synchronized video and audio generation.
- MOVA uses a Mixture-of-Experts architecture, activating 18B parameters during inference for efficient performance.
- The system supports image-to-video-audio tasks, delivering lip-sync, sound effects, and content-aligned music.
Generating high-quality video is difficult, but syncing it with realistic audio is even harder. Traditionally, AI systems relied on cascaded pipelines, where one model creates video and another adds sound, often leading to timing issues and accumulated errors. The OpenMOSS Team has addressed this with MOVA (MOSS Video and Audio), a massive 32-billion-parameter model designed to generate both modalities simultaneously. This joint modeling ensures that the sound of a crashing wave or the movement of a speaker's lips matches the visual frames with high precision.
The model utilizes a Mixture-of-Experts architecture. Think of this as a team of specialized sub-models where only the most relevant "experts" are called upon for a specific task. While the model contains 32 billion parameters in total, it only activates 18 billion during inference (the phase where the AI actually generates content), significantly reducing the computational power needed without sacrificing output quality. This efficiency allows MOVA to handle complex tasks where a single image and a text prompt are transformed into a full, cinematic audio-visual experience.
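The routing idea behind Mixture-of-Experts can be illustrated with a minimal sketch. This is not MOVA's actual implementation; the function names, expert count, and top-k value are assumptions chosen for clarity. The key point is that a router scores all experts but only the few highest-scoring ones actually run:

```python
import numpy as np

def moe_layer(x, experts_w, gate_w, top_k=2):
    """Minimal top-k Mixture-of-Experts layer (illustrative sketch).

    x:         (d,) input vector
    experts_w: (n_experts, d, d) one weight matrix per expert
    gate_w:    (n_experts, d) router weights
    """
    logits = gate_w @ x                    # router score for every expert
    top = np.argsort(logits)[-top_k:]      # keep only the k best experts
    weights = np.exp(logits[top])
    weights /= weights.sum()               # softmax over the chosen experts
    # Only the selected experts compute; the rest stay idle, which is why
    # a 32B-parameter MoE model can activate just a subset per token.
    return sum(w * (experts_w[i] @ x) for i, w in zip(top, weights))

rng = np.random.default_rng(0)
d, n_experts = 8, 4
out = moe_layer(rng.normal(size=d),
                rng.normal(size=(n_experts, d, d)),
                rng.normal(size=(n_experts, d)))
print(out.shape)  # (8,)
```

With 4 experts and top_k=2, half the expert weights are untouched on each forward pass, which mirrors how MOVA activates 18B of its 32B parameters during inference.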
By releasing MOVA as an open-source project, the researchers aim to provide a transparent alternative to closed-source systems. The release includes model weights and a codebase supporting LoRA fine-tuning, which allows creators to adapt the AI to specific styles or voices with minimal data. From realistic lip-synced speech to environment-aware sound effects, MOVA provides a versatile, accessible toolset for the next generation of digital storytellers and AI researchers alike.
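Why does LoRA let creators adapt a 32B model "with minimal data"? Instead of updating a full weight matrix, LoRA freezes it and trains a small low-rank correction. A rough sketch of the forward pass, with hypothetical dimensions and a rank chosen only for illustration (not MOVA's actual fine-tuning code):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16):
    """y = W x + (alpha / r) * B (A x).

    W is the frozen pretrained weight; only the small matrices
    A (r x d_in) and B (d_out x r) are trained.
    """
    r = A.shape[0]
    return W @ x + (alpha / r) * (B @ (A @ x))

rng = np.random.default_rng(0)
d_in, d_out, r = 64, 64, 4
W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                # trainable up-projection, zero-init
x = rng.normal(size=d_in)
y = lora_forward(x, W, A, B)            # zero-init B => y == W @ x at start
```

Here the full matrix has 64 x 64 = 4096 parameters, while the LoRA update trains only r x (d_in + d_out) = 512, an eightfold reduction that grows far more dramatic at the scale of a 32B model. Zero-initializing B also means fine-tuning starts exactly at the pretrained model's behavior.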