Meituan Unveils LongCat-Next Native Multimodal Model
- Meituan introduces LongCat-Next, a model using a pure discrete autoregressive architecture for native multimodality.
- The DiNA framework unifies text, vision, and audio into a shared discrete token space.
- LongCat-Next bridges the performance gap between visual understanding and image generation tasks.
Researchers from Meituan's LongCat team have introduced LongCat-Next, a foundation model that challenges the standard way AI systems handle different types of data. Traditionally, models have been language-centric, often treating images or audio as secondary attachments bolted onto a text-based core. LongCat-Next shifts this paradigm with the Discrete Native Autoregressive (DiNA) framework, which treats every modality (a word, a pixel, a sound wave) as a discrete token within a single, shared token space. Because the same autoregressive core processes every input in that space, the model achieves a genuinely "native" form of multimodality.
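To make the idea concrete, here is a minimal Python sketch of a shared discrete token space: each modality's codes are shifted into a disjoint range of one vocabulary, so a single autoregressive model can predict all of them with the same next-token objective. The vocabulary sizes, offsets, and the `to_shared_space` helper are illustrative assumptions, not details of LongCat-Next's actual configuration.

```python
# Hypothetical sketch of a shared discrete token space. All sizes and
# offsets below are assumptions for illustration, not LongCat-Next's
# real configuration.

TEXT_VOCAB = 32_000    # assumed text tokenizer vocabulary size
IMAGE_VOCAB = 8_192    # assumed visual codebook size
AUDIO_VOCAB = 4_096    # assumed audio codebook size

# Each modality's codes are shifted into a disjoint range of one shared
# vocabulary, so one autoregressive model predicts them all uniformly.
IMAGE_OFFSET = TEXT_VOCAB
AUDIO_OFFSET = TEXT_VOCAB + IMAGE_VOCAB

def to_shared_space(text_ids, image_codes, audio_codes):
    """Interleave per-modality codes into one flat discrete sequence."""
    sequence = list(text_ids)                            # text ids pass through
    sequence += [IMAGE_OFFSET + c for c in image_codes]  # shift image codes
    sequence += [AUDIO_OFFSET + c for c in audio_codes]  # shift audio codes
    return sequence

# Example: 3 text tokens, 2 image codes, 2 audio codes -> one sequence.
print(to_shared_space([101, 57, 902], [12, 4000], [7, 300]))
# -> [101, 57, 902, 32012, 36000, 40199, 40492]
```

The appeal of this kind of design is that understanding and generation collapse into the same prediction problem over one vocabulary, with no modality-specific heads or fusion modules.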
The breakthrough is powered by a new component called the Discrete Native Any-resolution Visual Transformer (dNaViT). This allows the system to break down visual signals into hierarchical tokens regardless of the image resolution, effectively bridging the gap between seeing (understanding) and painting (generating). Unlike previous models that often struggled to excel at both tasks simultaneously, LongCat-Next maintains high performance across a wide variety of benchmarks. It represents a significant step toward artificial intelligence that perceives the world more holistically, much like the human brain integrates different senses.
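The following sketch illustrates the any-resolution part of this idea under simplified assumptions: an image of arbitrary size is split into fixed-size patches, and each patch is mapped to its nearest entry in a learned codebook (a basic vector-quantization step). The patch size, the random codebook, and the single-level tokenization are all hypothetical; dNaViT's hierarchical scheme is not reproduced here.

```python
# Illustrative sketch of any-resolution discrete visual tokenization in
# the spirit of dNaViT. The patch size and codebook are placeholders; a
# real tokenizer would use learned weights and a hierarchical codebook.
import numpy as np

PATCH = 16  # assumed patch side length in pixels
# Stand-in codebook: 8192 entries, one per possible visual token.
CODEBOOK = np.random.default_rng(0).normal(size=(8192, PATCH * PATCH * 3))

def tokenize_image(image: np.ndarray) -> list[int]:
    """Map an HxWx3 image of arbitrary resolution to discrete codes."""
    h, w, _ = image.shape
    # Crop to a whole number of patches so any resolution is accepted.
    h, w = (h // PATCH) * PATCH, (w // PATCH) * PATCH
    codes = []
    for y in range(0, h, PATCH):
        for x in range(0, w, PATCH):
            patch = image[y:y + PATCH, x:x + PATCH].reshape(-1)
            # Nearest codebook entry becomes this patch's discrete token.
            codes.append(int(np.argmin(np.linalg.norm(CODEBOOK - patch, axis=1))))
    return codes

# Works for any resolution: a 64x80 image yields a 4x5 grid of codes.
img = np.random.default_rng(1).normal(size=(64, 80, 3))
print(len(tokenize_image(img)))  # 20 tokens
```

Because the output is just a variable-length stream of integer codes, the same sequence model that reads these tokens for understanding can also emit them for generation, which is the gap the paragraph above describes.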
To support the broader AI community, Meituan has open-sourced the model and its specialized tokenizers. This move allows developers and researchers to explore a truly unified architecture that minimizes the need for modality-specific hacks or complex patchwork designs. By simplifying the structural complexity of multimodal systems, LongCat-Next paves the way for more efficient and capable AI agents that can talk, listen, and visualize within a single, streamlined framework.
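For readers who want to experiment with the open-sourced release, a Hugging Face-style loading pattern would look roughly like the sketch below. The repository id is a guess and the exact loading interface may differ; consult Meituan's official release notes for the real instructions.

```python
# Hypothetical usage sketch, assuming a Hugging Face-style release.
# The repo id "meituan-longcat/LongCat-Next" is an assumption, not a
# confirmed name; check the official release before running this.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "meituan-longcat/LongCat-Next"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True)

inputs = tokenizer("Describe this scene:", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```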