NVIDIA Releases Nemotron 3 Super for High-Throughput Agentic AI
- NVIDIA launches Nemotron 3 Super, a 120-billion-parameter hybrid model achieving 5x higher throughput for autonomous agents.
- New hybrid architecture combines Mamba and Transformer layers with Latent MoE to significantly reduce computational costs.
- Model features a 1-million-token context window and open weights, optimized for the NVIDIA Blackwell hardware platform.
NVIDIA has unveiled Nemotron 3 Super, a 120-billion-parameter open model specifically engineered to power the next generation of autonomous AI agents. As businesses transition from simple chatbots to complex multi-agent systems, they often face "context explosion," where the massive volume of data exchanged between agents slows down performance. Nemotron 3 Super addresses this with a 1-million-token context window, which lets the model hold vast amounts of information, roughly the length of several thick novels, without losing track of its original goal.
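A quick back-of-the-envelope check of the "several thick novels" claim. The conversion ratios below are common rules of thumb, not figures from NVIDIA:

```python
# Rough sanity check: how many novels fit in a 1M-token window?
# Assumed ratios (illustrative, not from NVIDIA):
#   ~0.75 English words per token, ~120,000 words per long novel.
CONTEXT_TOKENS = 1_000_000
WORDS_PER_TOKEN = 0.75
WORDS_PER_NOVEL = 120_000

words = CONTEXT_TOKENS * WORDS_PER_TOKEN      # 750,000 words
novels = words / WORDS_PER_NOVEL
print(f"~{novels:.1f} novels fit in the context window")  # ~6.2 novels
```

With slightly different assumptions about tokenization or novel length, the answer shifts, but it lands in the "handful of novels" range either way.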
The model’s efficiency stems from a sophisticated hybrid architecture that blends two distinct neural network designs. It incorporates Mamba layers, which are highly efficient at processing long sequences of data, alongside traditional Transformer layers that provide the deep reasoning capabilities needed for complex tasks. On top of this hybrid backbone, the Latent Mixture-of-Experts (MoE) design means that while the model has 120 billion total parameters, only about 12 billion are active for any given token, drastically reducing the energy and computing power required for each response.
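The sparsity idea behind mixture-of-experts can be shown with a toy sketch: a small router scores a set of expert weight matrices and sends each token only through the top-scoring ones. The sizes and routing rule here are illustrative, not Nemotron's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy MoE layer: a router picks the top-k experts per token, so only a
# fraction of the layer's weights are touched for any one input.
# Sizes are illustrative, not Nemotron's real configuration.
D, E, TOP_K = 64, 10, 1          # hidden size, expert count, experts per token
experts = [rng.standard_normal((D, D)) for _ in range(E)]
router = rng.standard_normal((D, E))

def moe_forward(x):
    """Route a single token vector x through its top-k experts."""
    scores = x @ router                        # (E,) routing logits
    top = np.argsort(scores)[-TOP_K:]          # indices of selected experts
    weights = np.exp(scores[top]) / np.exp(scores[top]).sum()
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

total_params = E * D * D
active_params = TOP_K * D * D
print(f"active fraction: {active_params / total_params:.0%}")  # 10%
```

Note that activating 1 of 10 equally sized experts gives the same 10% ratio as Nemotron's reported 12 billion active out of 120 billion total parameters.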
To further push the boundaries of speed, NVIDIA introduced "multi-token prediction," a technique that allows the AI to guess several future words simultaneously rather than one by one. This, combined with optimization for the new Blackwell hardware, results in inference speeds up to four times faster than previous generations. By releasing the model weights openly, NVIDIA is enabling developers to build specialized agents for fields like cybersecurity and financial analysis, ensuring high-accuracy tool usage without the "thinking tax" usually associated with large-scale reasoning models.
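The accept-or-fall-back logic of multi-token prediction can be sketched in the spirit of speculative decoding: a cheap draft step proposes several future tokens at once, and a verifier keeps the prefix that matches. Both "models" below are dummy stand-in functions, not NVIDIA's implementation:

```python
# Toy multi-token prediction loop: propose k tokens per step, accept the
# matching prefix, fall back to one verified token on a mismatch.
# draft_next_tokens and verify_token are hypothetical stand-ins.

def draft_next_tokens(prefix, k=4):
    """Hypothetical draft head: proposes k future tokens at once."""
    return [(prefix[-1] + i + 1) % 100 for i in range(k)]   # dummy rule

def verify_token(prefix):
    """Hypothetical full model: the 'ground truth' next token."""
    return (prefix[-1] + 1) % 100                           # same dummy rule

def generate(prefix, n_tokens, k=4):
    out = list(prefix)
    while len(out) - len(prefix) < n_tokens:
        for tok in draft_next_tokens(out, k):
            if tok == verify_token(out):     # draft agrees: accept for free
                out.append(tok)
            else:                            # draft wrong: take verified token
                out.append(verify_token(out))
                break
    return out[:len(prefix) + n_tokens]

print(generate([0], 5))  # [0, 1, 2, 3, 4, 5]
```

When the draft head agrees with the full model, several tokens are accepted per step instead of one, which is where the speedup comes from; the worst case degrades gracefully to ordinary one-token-at-a-time decoding.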