SGLang-Diffusion: Two Months In
- SGLang-Diffusion achieves a 1.5x speedup and state-of-the-art inference speeds on NVIDIA and AMD GPUs.
- A new Layerwise Offload technique overlaps weight loading with computation, reducing the VRAM footprint for high-resolution content.
- Full integration with Cache-DiT and ComfyUI boosts generation speeds by up to 169% for visual models.
LMSYS has released a major update to SGLang-Diffusion, a specialized inference framework for generating images and videos. Just two months after its debut, the system is now 1.5 times faster, achieving state-of-the-art performance that outperforms rival solutions by as much as fivefold on NVIDIA hardware. This update turns the tool into an industrial-grade engine capable of handling the most demanding visual generation tasks.

A key technical highlight is the "Layerwise Offload" system. This technique allows the GPU to prefetch the weights for the next layer of a Diffusion Transformer, the architecture used in models like Flux.2, while it is still computing the current one. By overlapping these tasks, the system avoids processing stalls and reduces the memory footprint, enabling the creation of higher-resolution content on consumer-grade hardware.

The update also adds extensive support for LoRA, a method for fine-tuning models on specific styles without retraining the entire system. Users can now merge or swap these stylistic "adapters" via a simple interface. Combined with Cache-DiT, a caching method that speeds up generation by up to 169%, and a new integration for ComfyUI, SGLang-Diffusion offers a highly flexible and efficient environment for both developers and creative professionals.
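The core idea behind Layerwise Offload, fetching the next layer's weights while the current layer is still computing, can be sketched as a producer/consumer pipeline. This is a toy illustration, not SGLang-Diffusion's actual implementation: the function names are hypothetical, and the "loading" and "compute" steps are stand-ins for host-to-GPU transfers and kernel execution.

```python
import threading
from queue import Queue

# Hypothetical stand-ins: in a real system these would be a host-to-GPU
# weight transfer and a GPU layer forward pass, respectively.
def load_weights(layer_idx):
    return f"weights[{layer_idx}]"

def run_layer(activations, weights):
    return activations + [weights]

def forward_with_layerwise_offload(num_layers):
    """Overlap loading of layer i+1's weights with computation of layer i."""
    prefetched = Queue(maxsize=1)  # only one layer's weights resident at once

    def prefetcher():
        for i in range(num_layers):
            # Blocks until the consumer frees the slot, bounding memory use.
            prefetched.put(load_weights(i))

    t = threading.Thread(target=prefetcher)
    t.start()
    acts = []
    for i in range(num_layers):
        w = prefetched.get()  # fetched in background while prior layer ran
        acts = run_layer(acts, w)
    t.join()
    return acts

print(forward_with_layerwise_offload(4))
```

Because the queue is bounded, at most one layer's weights beyond the active one ever occupy memory, which is what lets the full model exceed available VRAM while the overlap hides the transfer latency.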