Qwen3 Latency Slashed on AMD MI300X Hardware
- Qwen3-235B achieves over 2x faster token generation using AMD Instinct MI300X accelerators.
- New PTPC quantization scheme improves efficiency by 15-30% compared to standard block scaling.
- Multimodal Qwen3-VL sees 7x speedup in image decoding via GPU-accelerated rocJPEG integration.
Alibaba Cloud’s Qwen team, in collaboration with AMD, has revealed massive performance gains for their flagship Qwen3 models on the MI300X series GPUs. By leveraging the SGLang framework, the teams achieved a 2.12x reduction in time per output token (TPOT) for the massive 235-billion parameter Qwen3-235B model. These breakthroughs make large-scale AI deployment significantly more cost-effective for interactive applications where speed is the primary bottleneck.
The optimization suite introduces several sophisticated techniques, most notably a new quantization method called PTPC (Per-Token Activation, Per-Channel Weight). This approach shrinks model weights to 8-bit floats (FP8) without the usual accuracy loss, aligning perfectly with the hardware's native processing units. By ensuring the math units don't sit idle while waiting for data, PTPC outperforms conventional scaling methods by up to 30%.
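The scale bookkeeping behind PTPC can be sketched in a few lines of NumPy. This is an illustrative simplification, not the production kernel: it computes one scale per token (activation row) and one per output channel (weight column), and it omits the actual cast to the FP8 E4M3 format, so only the scaling arithmetic is shown.

```python
import numpy as np

FP8_MAX = 448.0  # largest finite value in the FP8 E4M3 format

def ptpc_matmul(acts, weights):
    """Sketch of Per-Token (activation) / Per-Channel (weight) scaled matmul."""
    # One scale per token: row-wise max of the activation matrix
    t_scale = np.abs(acts).max(axis=1, keepdims=True) / FP8_MAX      # (tokens, 1)
    # One scale per output channel: column-wise max of the weight matrix
    c_scale = np.abs(weights).max(axis=0, keepdims=True) / FP8_MAX   # (1, channels)
    # Bring both operands into FP8 range (a real kernel would cast to fp8 here)
    q_acts = np.clip(acts / t_scale, -FP8_MAX, FP8_MAX)
    q_w = np.clip(weights / c_scale, -FP8_MAX, FP8_MAX)
    # Low-precision matmul, then fold both scale factors back into the output
    return (q_acts @ q_w) * t_scale * c_scale

rng = np.random.default_rng(0)
a = rng.normal(size=(4, 64))
w = rng.normal(size=(64, 32))
max_err = np.abs(ptpc_matmul(a, w) - a @ w).max()
```

Because each token and each channel gets its own scale, one outlier activation no longer forces a coarse scale onto the whole tensor, which is where the accuracy advantage over blockwise scaling comes from.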
For multimodal tasks, the Qwen3-VL variant now handles high-resolution images with far less friction. By offloading image decoding—the process of turning compressed image files into usable data—to the GPU using the rocJPEG library, the team reduced latency for a single image from 27ms to just 4ms. This shift, combined with parallelizing the vision processing across multiple GPUs, ensures that complex visual inputs don't stall the overall model response.
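The decode-then-encode pipeline can be sketched as below. The two worker functions are hypothetical placeholders (the real system would call rocJPEG for decoding and the Qwen3-VL vision tower for encoding); the sketch only shows the structural idea of decoding images concurrently so the encoder is never left waiting on decode latency.

```python
from concurrent.futures import ThreadPoolExecutor

def decode_image(raw):
    # Placeholder: in the real pipeline this is GPU decoding via rocJPEG
    return f"pixels({raw})"

def run_vision_encoder(pixels):
    # Placeholder: in the real pipeline this is the Qwen3-VL vision encoder
    return f"features({pixels})"

def process_batch(raw_images):
    # Decode all images concurrently instead of one-by-one on the CPU,
    # then feed the decoded tensors to the vision encoder
    with ThreadPoolExecutor() as pool:
        decoded = list(pool.map(decode_image, raw_images))
    return [run_vision_encoder(p) for p in decoded]

out = process_batch(["img0.jpg", "img1.jpg"])
```

With decode latency dropping from 27ms to 4ms per image and decoding overlapped across inputs, image preprocessing stops being the serial bottleneck ahead of the language model.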