NVIDIA Blackwell Ultra Boosts DeepSeek Long-Context Performance
- NVIDIA GB300 NVL72 achieves a 1.53x throughput gain over GB200 for long-context DeepSeek inference
- Expanded 288GB HBM3e memory enables 1.6x higher decode batch sizes for 128K-token sequences
- Hardware-accelerated Softmax and optimized kernels reduce initial prompt processing latency by 23%
NVIDIA and the SGLang team have unveiled a major performance leap for long-context AI by deploying the DeepSeek R1 model on the new Blackwell Ultra (GB300) platform. By optimizing how the system handles massive blocks of text—up to 128,000 tokens at once—the team achieved a 53% throughput increase compared to the previous generation. This is a critical development for applications like legal document analysis or complex coding tasks where the AI must "remember" vast amounts of information simultaneously without slowing down.
The secret to this speed lies in the GB300's expanded memory and specialized hardware components. The new chips feature 288GB of high-speed memory (HBM3e), allowing the system to keep more data ready for immediate use. This prevents the "memory bottleneck" that usually occurs when AI models try to predict the next word in a very long conversation. Furthermore, the team utilized a technique called Multi-token Prediction (MTP), which lets the model draft several tokens at once rather than one at a time; the main model then verifies those drafts in a single pass, so correct guesses come nearly free and output quality is unchanged. This nearly doubles the speed for individual users without sacrificing overall system capacity.
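The draft-then-verify idea behind MTP can be illustrated with a toy sketch. Everything below is a simplified assumption for illustration: the "models" are trivial stand-in functions, and the real MTP implementation in DeepSeek/SGLang differs in detail.

```python
# Toy sketch of multi-token (speculative) prediction: a cheap draft head
# proposes k tokens ahead, the target model verifies them in one pass,
# and the longest matching prefix is kept. All names here are hypothetical.

def target_next(seq):
    # Stand-in for the full model's greedy next token: here, last token + 1.
    return seq[-1] + 1

def draft_next(seq):
    # Stand-in for the cheap draft head: right most of the time,
    # but wrong whenever the last token is a multiple of 5.
    t = seq[-1] + 1
    return t if seq[-1] % 5 != 0 else t + 100

def speculative_step(seq, k=4):
    """Draft k tokens, verify against the target; return (new_seq, accepted)."""
    drafts, ctx = [], list(seq)
    for _ in range(k):
        d = draft_next(ctx)
        drafts.append(d)
        ctx.append(d)
    # Verification: keep drafts only while they match the target model.
    accepted = 0
    for d in drafts:
        if d == target_next(seq):
            seq = seq + [d]
            accepted += 1
        else:
            break
    # Always emit at least one token from the target model, so every
    # step makes progress even when all drafts are rejected.
    if accepted < k:
        seq = seq + [target_next(seq)]
    return seq, accepted

seq = [0]
while len(seq) < 12:
    seq, acc = speculative_step(seq)
print(seq)  # → [0, 1, 2, ..., 11], identical to one-token-at-a-time decoding
```

The key property is visible in the final output: because every draft is checked against the target model, the result is exactly what plain decoding would have produced, only reached in fewer target-model steps.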
To manage the immense workload, the engineers split the process into two stages: "prefill" (reading the prompt) and "decode" (generating the answer). They used a control system called NVIDIA Dynamo to coordinate these tasks across multiple GPUs efficiently. With improved units inside the chip that handle complex math, specifically the Special Function Units that accelerate operations like Softmax, the system can now process the initial prompt up to 23% faster, positioning Blackwell Ultra as the most formidable infrastructure yet for the next generation of deep-reasoning AI models.
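The two-stage split can be sketched as a minimal pipeline. This is an illustrative assumption, not the real Dynamo API: the workers, the toy "KV cache" handoff, and the router function are all hypothetical stand-ins for the compute-bound prefill pool and bandwidth-bound decode pool that Dynamo actually coordinates.

```python
# Minimal sketch of prefill/decode disaggregation, with a toy router
# standing in for NVIDIA Dynamo. All names here are hypothetical.

def prefill_worker(prompt_tokens):
    # Compute-bound stage: process the whole prompt once and return a
    # stand-in "KV cache" (here, just the token list).
    return {"kv": list(prompt_tokens)}

def decode_worker(kv_cache, max_new_tokens):
    # Memory-bandwidth-bound stage: generate tokens one step at a time,
    # reusing the cache instead of reprocessing the prompt.
    out = []
    last = kv_cache["kv"][-1]
    for _ in range(max_new_tokens):
        last = last + 1          # stand-in for the model's next-token step
        kv_cache["kv"].append(last)
        out.append(last)
    return out

def route_request(prompt_tokens, max_new_tokens=4):
    # The router assigns each stage to the GPU pool suited to it; here it
    # simply runs them in order and hands the cache from one to the other.
    kv = prefill_worker(prompt_tokens)
    return decode_worker(kv, max_new_tokens)

print(route_request([1, 2, 3]))  # → [4, 5, 6, 7]
```

Separating the stages matters because their bottlenecks differ: prefill saturates compute on a long prompt, while decode is limited by how fast the cache can be read, which is exactly where the GB300's larger, faster HBM3e pays off.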