NVIDIA Launches Nemotron 3 Super with SGLang Support
- NVIDIA releases Nemotron 3 Super, a 120B-parameter hybrid MoE model for multi-agent systems.
- Model features a 1M-token context window and a hybrid Transformer-Mamba architecture for 5x higher throughput.
- SGLang provides Day-0 inference support for optimized deployment on NVIDIA H200 and B200 GPUs.
NVIDIA has unveiled Nemotron 3 Super, a 120B-parameter model engineered to anchor complex multi-agent ecosystems. Unlike dense, monolithic models, it uses a Mixture of Experts (MoE) design that activates only 12B parameters per forward pass. This architectural choice delivers strong reasoning performance at a fraction of the usual computational cost, making it well suited to the high-volume token generation that agentic workflows demand.
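To make the sparse-activation idea concrete, here is a minimal top-k MoE layer in PyTorch. It is a sketch of the general technique, not NVIDIA's implementation: the dimensions and expert counts are illustrative, and production routers add refinements such as load-balancing losses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal top-k mixture-of-experts layer: a router scores every
    expert per token, but only the top-k experts actually run, so the
    active parameter count stays a small fraction of the total."""
    def __init__(self, d_model=512, d_ff=2048, num_experts=16, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                      # x: (num_tokens, d_model)
        scores = self.router(x)                # (num_tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # normalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in idx[:, slot].unique():    # run each selected expert once
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot].unsqueeze(-1) * \
                             self.experts[int(e)](x[mask])
        return out

moe = TopKMoE()
tokens = torch.randn(8, 512)
print(moe(tokens).shape)                       # torch.Size([8, 512])
```

The payoff is visible in the loop: each token passes through only two of the sixteen expert networks, so compute per token is governed by the active parameters rather than the total.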
The model introduces a hybrid architecture combining standard Transformer blocks with Mamba, a state-space sequence model whose compute scales linearly with sequence length. To further boost performance, NVIDIA integrated Multi-Token Prediction (MTP), allowing the system to predict several future tokens per step. This reduces latency during generation, while a 1M-token context window ensures agents maintain coherence across extensive, multi-step planning tasks without losing track of previous interactions.
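The structural pattern behind such hybrids can be sketched as an interleaving of many linear-time blocks with occasional full-attention blocks. The sketch below is illustrative only: the SSM block is a simplified gated linear recurrence standing in for Mamba's selective state-space dynamics, and the interleaving ratio is an assumption, not NVIDIA's published configuration.

```python
import torch
import torch.nn as nn

class AttentionBlock(nn.Module):
    """Standard pre-norm self-attention block (quadratic in sequence length)."""
    def __init__(self, d_model, n_heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        h = self.norm(x)
        out, _ = self.attn(h, h, h, need_weights=False)
        return x + out

class SSMBlock(nn.Module):
    """Stand-in for a Mamba block: a gated linear recurrence. Real Mamba
    uses input-dependent (selective) dynamics, but shares the key
    property shown here: O(L) time and a fixed-size recurrent state."""
    def __init__(self, d_model):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.in_proj = nn.Linear(d_model, 2 * d_model)
        self.decay = nn.Parameter(torch.full((d_model,), 0.9))
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):                      # x: (batch, seq, d_model)
        u, gate = self.in_proj(self.norm(x)).chunk(2, dim=-1)
        state = torch.zeros_like(u[:, 0])
        ys = []
        for t in range(u.shape[1]):            # linear scan over the sequence
            state = self.decay * state + u[:, t]
            ys.append(state)
        y = torch.stack(ys, dim=1) * torch.sigmoid(gate)
        return x + self.out_proj(y)

class HybridStack(nn.Module):
    """Interleaves linear-time SSM blocks with occasional attention
    blocks; the 5:1 ratio here is an illustrative assumption."""
    def __init__(self, d_model=256, n_layers=12, attn_every=6):
        super().__init__()
        self.layers = nn.ModuleList(
            AttentionBlock(d_model) if (i + 1) % attn_every == 0
            else SSMBlock(d_model)
            for i in range(n_layers)
        )

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

model = HybridStack()
print(model(torch.randn(2, 128, 256)).shape)   # torch.Size([2, 128, 256])
```

Because most layers carry only a fixed-size state rather than a growing key-value cache, memory and compute stay manageable even as the context stretches toward a million tokens; the sparse attention layers restore global token mixing where it matters.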
SGLang has announced immediate "Day-0" support for the model, offering a streamlined path to deployment on hardware like the H200 and B200 GPUs. By providing open weights and recipes, NVIDIA is positioning Nemotron 3 Super as a transparent alternative to closed-source giants. This openness, paired with adjustable reasoning depth via a "Thinking Budget," signals a shift toward flexible infrastructure for autonomous AI systems.
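For teams ready to try it, the typical SGLang deployment path looks like the sketch below. The Hugging Face model identifier is a guess for illustration (NVIDIA's official checkpoint name may differ), and tensor-parallel size depends on your GPU count; SGLang does expose an OpenAI-compatible endpoint, so a standard client works once the server is up.

```python
# Launch the server first (shell); the model path here is hypothetical:
#   python -m sglang.launch_server --model-path nvidia/Nemotron-3-Super \
#       --tp 8 --port 30000
#
# Then query the OpenAI-compatible endpoint from Python:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="nvidia/Nemotron-3-Super",   # must match --model-path above
    messages=[{"role": "user",
               "content": "Plan a three-step research task."}],
    max_tokens=256,
)
print(resp.choices[0].message.content)
```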