Enterprise AI Analysis
Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving
Authors: Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Heyi Tang, Feng Ren, Teng Ma, Shangming Cai, Yineng Zhang, Mingxing Zhang, Yongwei Wu, Weimin Zheng, Xinran Xu
Abstract: MOONCAKE is the serving platform for Kimi, an LLM chatbot service developed by Moonshot AI. This platform features a KVCache-centric disaggregated architecture that not only separates prefill and decoding clusters but also efficiently utilizes the underexploited CPU, DRAM, SSD and NIC resources of the GPU cluster to establish a disaggregated KVCache. At the core of MOONCAKE is its KVCache-centric global cache and a scheduler designed to maximize throughput while adhering to stringent latency-related Service Level Objectives (SLOs). Our experiments demonstrate that MOONCAKE excels in scenarios involving long-context inputs. In tests using real traces, MOONCAKE increases the effective request capacity by 59%~498% when compared to baseline methods, all while complying with SLOs. Currently, MOONCAKE is operational across thousands of nodes, processing over 100 billion tokens daily. In practical deployments, MOONCAKE's innovative architecture enables Kimi to handle 115% and 107% more requests on NVIDIA A800 and H800 clusters, respectively, compared to previous systems.
Executive Impact: Tangible Business Outcomes
Mooncake's KVCache-centric disaggregated architecture delivers significant performance improvements, translating directly into enhanced efficiency and reduced operational costs for large-scale LLM serving.
Deep Analysis & Enterprise Applications
Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.
Enterprise Process Flow: KVCache-centric Disaggregated Workflow
MOONCAKE employs a disaggregated architecture with a KVCache-centric global scheduler (Conductor). The inference workflow is divided into four key stages to maximize efficiency and resource utilization.
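The four-stage flow can be sketched as a toy Python walk-through. This is a minimal sketch, assuming the stages are KVCache reuse, incremental prefill, KVCache transfer, and decoding; the in-memory `cache` dict and token-count bookkeeping are purely illustrative, not MOONCAKE's actual data structures.

```python
def longest_prefix_hit(cache, tokens):
    """Length of the longest cached prefix of `tokens` (KVCache reuse stage)."""
    for n in range(len(tokens), 0, -1):
        if tuple(tokens[:n]) in cache:
            return n
    return 0

def serve_request(tokens, cache):
    """Toy walk through the four stages; returns per-stage token counts."""
    # Stage 1: KVCache reuse -- load the longest matching cached prefix.
    reused = longest_prefix_hit(cache, tokens)
    # Stage 2: incremental prefill -- compute KV only for the uncached tail.
    prefilled = len(tokens) - reused
    # Stage 3: KVCache transfer -- ship the full cache to a decoding node,
    # and store it in the pool for future reuse.
    cache[tuple(tokens)] = len(tokens)
    transferred = len(tokens)
    # Stage 4: decoding -- autoregressive generation on the decoding node.
    return {"reused": reused, "prefilled": prefilled, "transferred": transferred}

cache = {}
first = serve_request([1, 2, 3, 4], cache)          # cold: everything prefilled
second = serve_request([1, 2, 3, 4, 5, 6], cache)   # warm: 4-token prefix reused
```

The second request illustrates the payoff: only the two uncached tokens hit the prefill cluster, while the shared prefix is served from the disaggregated cache.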
Case Study: Kimi and Open-Source LLM Frameworks
MOONCAKE's innovative architecture has been successfully integrated into popular open-source LLM inference frameworks, extending their capabilities for distributed serving.
MOONCAKE serves as the core serving platform for Kimi, Moonshot AI's LLM chatbot service, processing over 100 billion tokens daily. Its architecture has also been integrated into prominent open-source LLM inference frameworks such as SGLang and vLLM to accelerate P/D disaggregation and KVCache reuse. These integrations leverage MOONCAKE's high-efficiency RDMA communication layer, providing robust technical support for large-scale distributed inference tasks. This demonstrates MOONCAKE's role as a pluggable efficiency booster for LLM systems.
Key Takeaways:
- Enhanced performance on NVIDIA A800 and H800 clusters (115% and 107% more requests respectively compared to previous systems).
- Supports P/D disaggregation in SGLang and vLLM.
- Leverages high-efficiency RDMA for communication.
- Modular APIs enable tailored integration with different framework designs.
KVCache-centric scheduling drastically reduces the Time To First Token (TTFT) by optimizing request routing and resource allocation. Our KVCache-centric global scheduling algorithm reduces the average TTFT by an additional 14% compared to local cache-aware scheduling, and approximately 84% compared to random scheduling (from 19.65ms to 3.07ms).
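The core trade-off behind this scheduling gain can be sketched in a few lines of Python: route each request to the prefill instance with the lowest *estimated* TTFT, balancing cached-prefix reuse against queue load. This is an illustrative sketch, not MOONCAKE's actual algorithm; the `prefill_ms_per_token` constant and the instance/hit data are hypothetical.

```python
def estimate_ttft(instance, prompt_len, hit_len, prefill_ms_per_token=0.05):
    """Estimated TTFT = queued work + prefill cost of the uncached suffix."""
    return instance["queue_ms"] + (prompt_len - hit_len) * prefill_ms_per_token

def pick_prefill_instance(instances, prompt_len, hits):
    """Route to the instance minimizing estimated TTFT: a busy node holding
    a long cached prefix can beat an idle node with a cold cache."""
    return min(instances,
               key=lambda inst: estimate_ttft(inst, prompt_len, hits[inst["id"]]))

instances = [
    {"id": "A", "queue_ms": 40.0},  # busy, but holds a long cached prefix
    {"id": "B", "queue_ms": 5.0},   # idle, cold cache
]
hits = {"A": 7000, "B": 0}          # cached-prefix length per instance
best = pick_prefill_instance(instances, prompt_len=8000, hits=hits)
```

Here instance A wins despite its deeper queue (estimated 90 ms vs. 405 ms), which is exactly the behavior that purely load-based or random scheduling misses.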
MOONCAKE's prediction-based early rejection strategy significantly reduces unnecessary prefill computations during overload scenarios. Compared to a baseline strategy (4183 rejections), prediction-based early rejection reduced rejected requests to 3589, a 14.2% improvement, by predicting decoding load and avoiding wasted prefill computation.
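The idea of rejecting *before* prefill, based on predicted decoding load, can be sketched as follows. This is a simplified illustration, assuming a fixed decode speed; the `horizon_s`, `tokens_per_s`, and `capacity` values are hypothetical, not MOONCAKE's tuned parameters.

```python
def predict_decode_load(active_requests, horizon_s=30, tokens_per_s=40):
    """Predict how many requests will still occupy decoding slots after
    `horizon_s` seconds, assuming each decodes at `tokens_per_s`."""
    return sum(1 for r in active_requests
               if r["remaining_tokens"] / tokens_per_s > horizon_s)

def admit(active_requests, capacity):
    """Admit only if the *predicted* (not current) decoding load leaves
    room, so prefill compute is not wasted on a request that would
    stall waiting for a decoding slot."""
    return predict_decode_load(active_requests) < capacity

active = [{"remaining_tokens": 2000},  # long generation: still busy at +30s
          {"remaining_tokens": 100}]   # short generation: done well before
```

With this prediction, the short request's slot is counted as free even though it is occupied right now, avoiding both needless rejection and wasted prefill.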
MOONCAKE significantly boosts effective request capacity and improves cache hit rates, leading to substantial computational savings. In tests using real traces, MOONCAKE increases the effective request capacity by 59%~498% compared to baseline methods, all while complying with SLOs. It also achieves up to 93% relative improvement in cache hit rate over LRU baseline with flexible storage.
MOONCAKE's transfer engine delivers high bandwidth, drastically outperforming existing solutions for inter-node KVCache transfers. The transfer engine achieves bandwidths of up to 190 GB/s (8x400 Gbps network), approximately 2.4x to 4.6x faster than TCP-based transfers, minimizing latency for large-scale deployments.
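A quick back-of-the-envelope calculation shows what those bandwidth figures mean for transfer latency. The 10 GB KVCache size below is an illustrative assumption (not a figure from the paper); only the 190 GB/s peak and the 4.6x TCP gap come from the numbers above.

```python
def transfer_ms(cache_gb, bandwidth_gb_per_s):
    """Milliseconds to move a KVCache of `cache_gb` gigabytes."""
    return cache_gb / bandwidth_gb_per_s * 1000.0

KV_GB = 10.0                                # illustrative long-context KVCache
rdma_ms = transfer_ms(KV_GB, 190.0)         # transfer engine peak from above
tcp_ms = transfer_ms(KV_GB, 190.0 / 4.6)    # TCP at the 4.6x-slower end
```

At these rates the transfer engine moves the cache in roughly 53 ms versus roughly 242 ms over TCP, which is why inter-node KVCache movement stops being the bottleneck in disaggregated serving.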
| Feature/System | MOONCAKE | vLLM (Baseline) |
|---|---|---|
| Effective Request Capacity (Long Context) | +59%~498% vs. baseline, within SLOs | Baseline |
| Global KVCache Pool | Yes (disaggregated CPU/DRAM/SSD pool) | No (per-instance cache) |
| Cache Hit Rate Improvement | Up to 93% over LRU with flexible storage | LRU baseline |
| Prefill GPU Time Savings | Reduced via KVCache reuse and prediction-based early rejection | — |
| P/D Disaggregation | Native (separate prefill and decoding clusters) | Colocated prefill/decoding |
| Inter-node KVCache Transfer | RDMA Transfer Engine, up to 190 GB/s | TCP-based, ~2.4x-4.6x slower |
Calculate Your Potential ROI
Estimate the efficiency gains and cost savings your enterprise could achieve by adopting KVCache-centric disaggregated LLM serving.
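As a starting point, a deliberately simple model: if the same hardware serves (1 + gain) times the requests, then at constant traffic the cost avoided is the fraction of spend no longer needed. The gain range comes from the trace tests cited above; the annual GPU spend and the linear cost model are assumptions to adjust for your environment.

```python
def projected_annual_savings(annual_gpu_cost, capacity_gain):
    """Cost avoided at equal traffic when capacity rises by `capacity_gain`
    (fractional: 0.59 to 4.98 per the real-trace results cited above).
    Simplistic linear model -- ignores migration and operational costs."""
    return annual_gpu_cost * (capacity_gain / (1.0 + capacity_gain))

SPEND = 1_000_000  # illustrative annual GPU spend in dollars
low = projected_annual_savings(SPEND, 0.59)   # conservative end of the range
high = projected_annual_savings(SPEND, 4.98)  # optimistic end of the range
```

Under these assumptions the savings band runs from roughly 37% to 83% of current spend; treat it as a bounding estimate, not a forecast.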
Projected Annual Savings
Implementation Roadmap
Our phased approach ensures a smooth transition and optimal integration of Mooncake's architecture into your existing infrastructure.
Phase 01: Initial Assessment & Design
Comprehensive analysis of existing LLM serving infrastructure, workload patterns, and SLO requirements. Design of a tailored Mooncake deployment strategy including KVCache sizing, network configuration, and P/D ratio optimization.
Phase 02: Core Architecture Deployment
Deployment of Mooncake Store (distributed KVCache pool) and Conductor (global scheduler). Integration of Transfer Engine for high-speed inter-node KVCache transfers, leveraging RDMA capabilities.
Phase 03: Prefill & Decoding Cluster Integration
Configuration of disaggregated prefill and decoding instances. Implementation of KVCache-centric scheduling algorithms with dynamic PD ratio adjustment and early rejection policies for overload management.
Phase 04: Performance Tuning & Monitoring
Fine-tuning of cache policies (e.g., flexible KVCache storage, hotspot replication) and scheduling parameters. Setup of robust monitoring and fault tolerance mechanisms for continuous high availability and performance.
Phase 05: Scalability & Future Enhancements
Ongoing optimization for evolving workloads, including integration with advanced techniques like KVCache compression and attention architectures. Planning for future scaling and resource management strategies.
Ready to Transform Your LLM Serving?
Unlock unprecedented efficiency, scalability, and cost savings for your enterprise-grade LLM applications. Our experts are ready to guide you.