EPD Disaggregation: Scaling Vision-Language Models in SGLang
- LMSYS introduces the EPD architecture in SGLang to separate vision encoding from language processing
- Horizontal scaling of vision encoders reduces Time To First Token by 6-8x in multimodal tasks
- The system implements vision embedding caching and RDMA-based transfer backends to optimize throughput
- **Vision Language Model (VLM):** AI models capable of processing and understanding both visual information and natural language text simultaneously.
- **Inference Framework:** Software systems designed to deploy and optimize machine learning models for efficient execution in production environments.
LMSYS Org, in collaboration with engineers from Alibaba Cloud and AntGroup SCT, has launched Encoder-Prefill-Decode (EPD) disaggregation within the SGLang framework. This architecture addresses a critical bottleneck in Vision-Language Models (VLMs) by decoupling the vision encoding phase from the language prefill and decoding stages. Traditionally, scaling these components together under tensor parallelism led to diminishing returns: communication overhead is high, and vision encoders carry a small parameter count relative to the core language model, so they gain little from being sharded alongside it.

By allowing vision encoders to scale horizontally as independent units, EPD enables large performance gains in image-heavy scenarios such as multi-image reasoning. The system supports advanced features like vision embedding caching to eliminate redundant computation and high-bandwidth transfer mechanisms like Mooncake for low-latency communication. Benchmarks show that this three-tier approach can reduce Time To First Token (TTFT) by up to 8x and double request throughput compared to standard colocated deployments. This advancement marks a significant shift toward modular infrastructure for multimodal AI, ensuring that compute-intensive vision tasks no longer stall language generation pipelines.
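The embedding-caching idea described above can be sketched in a few lines: the encoder tier keys cached embeddings by a content hash of the raw image bytes, so a repeated image skips the expensive vision forward pass. This is a minimal illustration only; the class and function names are hypothetical and do not reflect SGLang's actual API.

```python
import hashlib

class VisionEmbeddingCache:
    """Hypothetical sketch: cache vision embeddings keyed by image content hash."""

    def __init__(self):
        self._cache = {}
        self.hits = 0
        self.misses = 0

    def _key(self, image_bytes: bytes) -> str:
        # Content hash, so identical images dedupe regardless of request origin.
        return hashlib.sha256(image_bytes).hexdigest()

    def get_or_encode(self, image_bytes: bytes, encode_fn):
        key = self._key(image_bytes)
        if key in self._cache:
            self.hits += 1          # reuse cached embedding, skip the encoder
        else:
            self.misses += 1
            self._cache[key] = encode_fn(image_bytes)  # one encoder forward pass
        return self._cache[key]

# Toy stand-in for a real vision-transformer forward pass.
def toy_encoder(image_bytes: bytes):
    return [len(image_bytes)] * 4

cache = VisionEmbeddingCache()
emb_first = cache.get_or_encode(b"image-A", toy_encoder)   # miss: encodes
emb_second = cache.get_or_encode(b"image-A", toy_encoder)  # hit: cached
print(cache.hits, cache.misses)  # -> 1 1
```

In the disaggregated setting, a hit means the prefill tier can fetch the stored embedding directly over the transfer backend instead of waiting on the encoder tier at all.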