NVIDIA and SGLang Boost AI Inference by 25x
- SGLang achieves a 25x inference speedup on NVIDIA GB300 NVL72 compared with previous-generation Hopper GPUs.
- New optimizations leverage Blackwell Ultra’s NVFP4 precision and upgraded memory to slash latency for reasoning models.
- Collaboration with NVIDIA delivers an 8x performance boost on GB200 systems through software and kernel refinements.
NVIDIA and the SGLang development team have unveiled a major leap in AI inference performance: a 25x speedup for complex reasoning models. By running the DeepSeek R1 model on the new GB300 NVL72 system, the team demonstrated how tightly integrated software and hardware can drastically reduce the cost of running advanced AI. The breakthrough centers on the Blackwell Ultra architecture, which pairs its compute with HBM3e, a high-bandwidth memory technology that supplies the capacity and throughput large reasoning models demand.
One of the core innovations is a new 4-bit data format called NVFP4. It shrinks the AI's 'weights' (the internal parameters the model uses to make decisions) without sacrificing accuracy. Halving the amount of data moving through the system, relative to 8-bit formats, lets the hardware handle much larger batches of requests simultaneously. This is particularly effective for 'Mixture of Experts' (MoE) models, which activate only the parts of their network relevant to each task to save energy and time.
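The idea behind low-bit weight formats can be illustrated with a minimal sketch: store each weight as a 4-bit code plus a per-block scale factor. The block size and scaling scheme below are invented for illustration and are not NVIDIA's actual NVFP4 specification, but they show why 4-bit codes halve weight traffic versus 8-bit storage.

```python
# Toy 4-bit block quantization: each weight becomes a code in 0..15,
# and one floating-point scale is shared per block. This is an
# illustrative scheme, not the real NVFP4 format.

def quantize_block(weights, half=7):
    """Map a block of floats onto 4-bit codes (0..15) plus one scale."""
    scale = max(abs(w) for w in weights) or 1.0
    codes = [round(w / scale * half) + half for w in weights]
    return scale, codes  # each code fits in 4 bits

def dequantize_block(scale, codes, half=7):
    """Recover approximate floats from the 4-bit codes and the scale."""
    return [(c - half) / half * scale for c in codes]

block = [0.82, -0.31, 0.05, -0.77]
scale, codes = quantize_block(block)
approx = dequantize_block(scale, codes)
# Storage cost: 4 bits per weight plus one shared scale,
# versus 8+ bits per weight in higher-precision formats.
```

The shared scale keeps the coarse 4-bit grid anchored to each block's dynamic range, which is why small blocks lose little accuracy.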
The system also introduces computation-communication overlap, a technique that lets the GPU perform math operations while data is sent to other chips in the network at the same time. Instead of waiting for one task to finish before starting the next, the system works like a high-speed assembly line. These efficiency gains mean developers can now deploy frontier-level models with much lower latency, making AI interactions feel more instantaneous and significantly cheaper to operate at scale.
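The overlap idea can be sketched with ordinary Python threads: while one chunk's "transfer" runs in the background, the next chunk's math proceeds on the main thread. The chunk contents, timings, and function names here are invented for the demo; in a real system the overlapped pieces are CUDA kernels and NVLink/NCCL transfers.

```python
# Toy computation-communication overlap: send the previous result
# in a background thread while computing the next chunk.
import time
from concurrent.futures import ThreadPoolExecutor

def compute(chunk):
    time.sleep(0.05)            # stand-in for a matrix-multiply kernel
    return [x * 2 for x in chunk]

def send(result):
    time.sleep(0.05)            # stand-in for an inter-GPU transfer
    return result

chunks = [[1, 2], [3, 4], [5, 6]]

start = time.time()
sent = []
with ThreadPoolExecutor(max_workers=1) as io:
    pending = None
    for chunk in chunks:
        result = compute(chunk)            # compute the current chunk...
        if pending is not None:
            sent.append(pending.result())
        pending = io.submit(send, result)  # ...while the previous one ships
    sent.append(pending.result())
overlapped = time.time() - start
# Fully serial execution would take ~6 * 0.05 s; overlapping hides
# most of the transfer time behind the compute steps.
```

Because each send runs concurrently with the next compute, total time approaches the compute time alone rather than the sum of compute and communication, the same pipelining effect the GB300 software exploits at GPU scale.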