ERNIE 5.0 Technical Report
- ERNIE 5.0 debuts as a trillion-parameter unified model for multimodal understanding and generation.
- Its sparse Mixture-of-Experts (MoE) architecture enables efficient routing across text, image, video, and audio.
- A novel elastic training paradigm generates a family of sub-models for diverse deployment constraints.
ERNIE 5.0 marks a significant milestone in the evolution of foundation models, emerging as a trillion-parameter system that bridges the gap between understanding and generation. Developed as a natively autoregressive model, it treats different types of data (text, images, video, and audio) as part of a unified "next-group-of-tokens" prediction task. This approach allows the model to process varied inputs through a single, cohesive framework rather than relying on separate modules for each modality.
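The unified framing above can be illustrated with a minimal sketch: give each modality a disjoint ID range inside one shared vocabulary, interleave the resulting tokens into a single sequence, and train with ordinary next-token prediction. This is not ERNIE's actual tokenization; the vocabulary sizes and helper names below are illustrative assumptions.

```python
import numpy as np

# Hypothetical per-modality vocabulary sizes (assumptions, not ERNIE's).
VOCAB_SIZES = {"text": 50_000, "image": 8_192, "video": 8_192, "audio": 4_096}

def build_offsets(vocab_sizes):
    """Assign each modality a disjoint ID range in one unified vocabulary."""
    offsets, start = {}, 0
    for name, size in vocab_sizes.items():
        offsets[name] = start
        start += size
    return offsets, start  # `start` ends up as the total unified vocab size

OFFSETS, UNIFIED_VOCAB = build_offsets(VOCAB_SIZES)

def to_unified(modality, local_ids):
    """Map modality-local token IDs into the shared vocabulary."""
    return [OFFSETS[modality] + i for i in local_ids]

# Interleave tokens from different modalities into one sequence, then form
# standard autoregressive (input, target) pairs by shifting one position.
sequence = (to_unified("text", [5, 17])
            + to_unified("image", [3, 3, 9])
            + to_unified("text", [42]))
inputs, targets = sequence[:-1], sequence[1:]
```

Because every modality lives in the same output distribution, one decoder can both "understand" (condition on) and "generate" (emit) tokens of any type, which is the property the report's unified framing relies on.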
At its core lies an ultra-sparse Mixture-of-Experts (MoE) architecture. This design uses specialized "experts" for different tasks, but unlike traditional models, its expert routing is modality-agnostic. This means the model dynamically chooses which internal pathways to use based on the complexity of the data rather than just its format. To handle the immense costs of large-scale deployment, the researchers introduced an elastic training paradigm. This allows a single pre-training session to produce multiple "sub-models" of varying sizes and speeds, offering flexibility for devices with limited memory or processing power.
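Modality-agnostic routing, as described above, means the router scores each token by its hidden state alone; text, image, and audio tokens all compete for the same pool of experts. The following top-k gating sketch shows the idea under stated assumptions (the dimensions, expert count, and random weights are illustrative, not ERNIE's configuration).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: hidden width, expert count, experts activated per token.
D, N_EXPERTS, TOP_K = 16, 8, 2

# One router shared across all modalities: no modality label is ever consulted.
W_router = rng.standard_normal((D, N_EXPERTS))

def route(hidden):
    """Return (expert indices, normalized gate weights) for each token."""
    logits = hidden @ W_router                      # (tokens, experts)
    top = np.argsort(logits, axis=-1)[:, -TOP_K:]   # indices of the k best experts
    gates = np.take_along_axis(logits, top, axis=-1)
    gates = np.exp(gates - gates.max(axis=-1, keepdims=True))
    gates /= gates.sum(axis=-1, keepdims=True)      # softmax over the selected k
    return top, gates

# Four tokens; the router never sees which modality produced each one.
tokens = rng.standard_normal((4, D))
experts, gates = route(tokens)
```

Because only k of the experts run per token, compute scales with k rather than with the total parameter count; an elastic sub-model can then, in principle, be carved out by retaining a subset of experts, trading capacity for memory and latency.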
Scaling reinforcement learning to such a massive, multimodal MoE system presents unique stability challenges. The technical report details how the team overcame these hurdles to ensure consistent performance. By successfully integrating these diverse components into a production-scale model, ERNIE 5.0 sets a new benchmark for how unified AI systems can be trained from scratch to handle the full spectrum of human communication.