Enterprise AI Analysis
Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving
Authors: Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Heyi Tang, Feng Ren, Teng Ma, Shangming Cai, Yineng Zhang, Mingxing Zhang, Yongwei Wu, Weimin Zheng, Xinran Xu
Abstract: MOONCAKE is the serving platform for Kimi, an LLM chatbot service developed by Moonshot AI. This platform features a KVCache-centric disaggregated architecture that not only separates prefill and decoding clusters but also efficiently utilizes the underexploited CPU, DRAM, SSD and NIC resources of the GPU cluster to establish a disaggregated KVCache. At the core of MOONCAKE is its KVCache-centric global cache and a scheduler designed to maximize throughput while adhering to stringent latency-related Service Level Objectives (SLOs). Our experiments demonstrate that MOONCAKE excels in scenarios involving long-context inputs. In tests using real traces, MOONCAKE increases the effective request capacity by 59%~498% when compared to baseline methods, all while complying with SLOs. Currently, MOONCAKE is operational across thousands of nodes, processing over 100 billion tokens daily. In practical deployments, MOONCAKE's innovative architecture enables Kimi to handle 115% and 107% more requests on NVIDIA A800 and H800 clusters, respectively, compared to previous systems.
Executive Impact: Tangible Business Outcomes
Mooncake's KVCache-centric disaggregated architecture delivers significant performance improvements, translating directly into enhanced efficiency and reduced operational costs for large-scale LLM serving.
Deep Analysis & Enterprise Applications
Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.
Enterprise Process Flow: KVCache-centric Disaggregated Workflow
MOONCAKE employs a disaggregated architecture with a KVCache-centric global scheduler (Conductor). The inference workflow is divided into four key stages to maximize efficiency and resource utilization.
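The four-stage flow can be sketched as a toy Python walk-through. This is a minimal sketch, assuming the stages are KVCache reuse, incremental prefill, KVCache transfer, and decoding; the in-memory `cache` dict and token-count bookkeeping are purely illustrative, not MOONCAKE's actual data structures.

```python
def longest_prefix_hit(cache, tokens):
    """Length of the longest cached prefix of `tokens` (KVCache reuse stage)."""
    for n in range(len(tokens), 0, -1):
        if tuple(tokens[:n]) in cache:
            return n
    return 0

def serve_request(tokens, cache):
    """Toy walk through the four stages; returns per-stage token counts."""
    # Stage 1: KVCache reuse -- load the longest matching cached prefix.
    reused = longest_prefix_hit(cache, tokens)
    # Stage 2: incremental prefill -- compute KV only for the uncached tail.
    prefilled = len(tokens) - reused
    # Stage 3: KVCache transfer -- ship the full cache to a decoding node,
    # and store it in the pool for future reuse.
    cache[tuple(tokens)] = len(tokens)
    transferred = len(tokens)
    # Stage 4: decoding -- autoregressive generation on the decoding node.
    return {"reused": reused, "prefilled": prefilled, "transferred": transferred}

cache = {}
first = serve_request([1, 2, 3, 4], cache)          # cold: everything prefilled
second = serve_request([1, 2, 3, 4, 5, 6], cache)   # warm: 4-token prefix reused
```

The second request illustrates the payoff: only the two uncached tokens hit the prefill cluster, while the shared prefix is served from the disaggregated cache.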
Case Study: Kimi and Open-Source LLM Frameworks
MOONCAKE's innovative architecture has been successfully integrated into popular open-source LLM inference frameworks, extending their capabilities for distributed serving.
MOONCAKE serves as the core serving platform for Kimi, Moonshot AI's LLM chatbot service, processing over 100 billion tokens daily. Its architecture has also been integrated into prominent open-source LLM inference frameworks such as SGLang and vLLM to accelerate P/D disaggregation and KVCache reuse. These integrations leverage MOONCAKE's high-efficiency RDMA communication layer, providing robust technical support for large-scale distributed inference tasks. This demonstrates MOONCAKE's role as a pluggable efficiency booster for LLM systems.
Key Takeaways:
- Enhanced performance on NVIDIA A800 and H800 clusters (115% and 107% more requests respectively compared to previous systems).
- Supports P/D disaggregation in SGLang and vLLM.
- Leverages high-efficiency RDMA for communication.
- Modular APIs enable tailored integration with different framework designs.
KVCache-centric scheduling drastically reduces the Time To First Token (TTFT) by optimizing request routing and resource allocation. Our KVCache-centric global scheduling algorithm reduces the average TTFT by an additional 14% compared to local cache-aware scheduling, and approximately 84% compared to random scheduling (from 19.65ms to 3.07ms).
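The core trade-off behind this scheduling gain can be sketched in a few lines of Python: route each request to the prefill instance with the lowest *estimated* TTFT, balancing cached-prefix reuse against queue load. This is an illustrative sketch, not MOONCAKE's actual algorithm; the `prefill_ms_per_token` constant and the instance/hit data are hypothetical.

```python
def estimate_ttft(instance, prompt_len, hit_len, prefill_ms_per_token=0.05):
    """Estimated TTFT = queued work + prefill cost of the uncached suffix."""
    return instance["queue_ms"] + (prompt_len - hit_len) * prefill_ms_per_token

def pick_prefill_instance(instances, prompt_len, hits):
    """Route to the instance minimizing estimated TTFT: a busy node holding
    a long cached prefix can beat an idle node with a cold cache."""
    return min(instances,
               key=lambda inst: estimate_ttft(inst, prompt_len, hits[inst["id"]]))

instances = [
    {"id": "A", "queue_ms": 40.0},  # busy, but holds a long cached prefix
    {"id": "B", "queue_ms": 5.0},   # idle, cold cache
]
hits = {"A": 7000, "B": 0}          # cached-prefix length per instance
best = pick_prefill_instance(instances, prompt_len=8000, hits=hits)
```

Here instance A wins despite its deeper queue (estimated 90 ms vs. 405 ms), which is exactly the behavior that purely load-based or random scheduling misses.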
MOONCAKE's prediction-based early rejection strategy significantly reduces unnecessary prefill computations during overload scenarios. Compared to a baseline strategy (4183 rejections), prediction-based early rejection reduced rejected requests to 3589, a 14.2% improvement, by predicting decoding load and avoiding wasted prefill computation.
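The idea of rejecting *before* prefill, based on predicted decoding load, can be sketched as follows. This is a simplified illustration, assuming a fixed decode speed; the `horizon_s`, `tokens_per_s`, and `capacity` values are hypothetical, not MOONCAKE's tuned parameters.

```python
def predict_decode_load(active_requests, horizon_s=30, tokens_per_s=40):
    """Predict how many requests will still occupy decoding slots after
    `horizon_s` seconds, assuming each decodes at `tokens_per_s`."""
    return sum(1 for r in active_requests
               if r["remaining_tokens"] / tokens_per_s > horizon_s)

def admit(active_requests, capacity):
    """Admit only if the *predicted* (not current) decoding load leaves
    room, so prefill compute is not wasted on a request that would
    stall waiting for a decoding slot."""
    return predict_decode_load(active_requests) < capacity

active = [{"remaining_tokens": 2000},  # long generation: still busy at +30s
          {"remaining_tokens": 100}]   # short generation: done well before
```

With this prediction, the short request's slot is counted as free even though it is occupied right now, avoiding both needless rejection and wasted prefill.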
MOONCAKE significantly boosts effective request capacity and improves cache hit rates, leading to substantial computational savings. In tests using real traces, MOONCAKE increases the effective request capacity by 59%~498% compared to baseline methods, all while complying with SLOs. It also achieves up to 93% relative improvement in cache hit rate over LRU baseline with flexible storage.
MOONCAKE's transfer engine delivers high bandwidth, drastically outperforming existing solutions for inter-node KVCache transfers. The transfer engine achieves bandwidths of up to 190 GB/s (8x400 Gbps network), approximately 2.4x to 4.6x faster than TCP-based transfers, minimizing latency for large-scale deployments.
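A quick back-of-the-envelope calculation shows what those bandwidth figures mean for transfer latency. The 10 GB KVCache size below is an illustrative assumption (not a figure from the paper); only the 190 GB/s peak and the 4.6x TCP gap come from the numbers above.

```python
def transfer_ms(cache_gb, bandwidth_gb_per_s):
    """Milliseconds to move a KVCache of `cache_gb` gigabytes."""
    return cache_gb / bandwidth_gb_per_s * 1000.0

KV_GB = 10.0                                # illustrative long-context KVCache
rdma_ms = transfer_ms(KV_GB, 190.0)         # transfer engine peak from above
tcp_ms = transfer_ms(KV_GB, 190.0 / 4.6)    # TCP at the 4.6x-slower end
```

At these rates the transfer engine moves the cache in roughly 53 ms versus roughly 242 ms over TCP, which is why inter-node KVCache movement stops being the bottleneck in disaggregated serving.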
| Feature/System | MOONCAKE | vLLM (Baseline) |
|---|---|---|
| Effective Request Capacity (Long Context) | +59%~498% vs. baseline, within SLOs | Baseline |
| Global KVCache Pool | Yes (disaggregated CPU/DRAM/SSD pool) | No (per-instance cache) |
| Cache Hit Rate Improvement | Up to 93% over LRU with flexible storage | LRU baseline |
| Prefill GPU Time Savings | Reduced via KVCache reuse and prediction-based early rejection | — |
| P/D Disaggregation | Native (separate prefill and decoding clusters) | Colocated prefill/decoding |
| Inter-node KVCache Transfer | RDMA Transfer Engine, up to 190 GB/s | TCP-based, ~2.4x-4.6x slower |
Calculate Your Potential ROI
Estimate the efficiency gains and cost savings your enterprise could achieve by adopting KVCache-centric disaggregated LLM serving.
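As a starting point, a deliberately simple model: if the same hardware serves (1 + gain) times the requests, then at constant traffic the cost avoided is the fraction of spend no longer needed. The gain range comes from the trace tests cited above; the annual GPU spend and the linear cost model are assumptions to adjust for your environment.

```python
def projected_annual_savings(annual_gpu_cost, capacity_gain):
    """Cost avoided at equal traffic when capacity rises by `capacity_gain`
    (fractional: 0.59 to 4.98 per the real-trace results cited above).
    Simplistic linear model -- ignores migration and operational costs."""
    return annual_gpu_cost * (capacity_gain / (1.0 + capacity_gain))

SPEND = 1_000_000  # illustrative annual GPU spend in dollars
low = projected_annual_savings(SPEND, 0.59)   # conservative end of the range
high = projected_annual_savings(SPEND, 4.98)  # optimistic end of the range
```

Under these assumptions the savings band runs from roughly 37% to 83% of current spend; treat it as a bounding estimate, not a forecast.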
Projected Annual Savings
Implementation Roadmap
Our phased approach ensures a smooth transition and optimal integration of Mooncake's architecture into your existing infrastructure.
Phase 01: Initial Assessment & Design
Comprehensive analysis of existing LLM serving infrastructure, workload patterns, and SLO requirements. Design of a tailored Mooncake deployment strategy including KVCache sizing, network configuration, and P/D ratio optimization.
Phase 02: Core Architecture Deployment
Deployment of Mooncake Store (distributed KVCache pool) and Conductor (global scheduler). Integration of Transfer Engine for high-speed inter-node KVCache transfers, leveraging RDMA capabilities.
Phase 03: Prefill & Decoding Cluster Integration
Configuration of disaggregated prefill and decoding instances. Implementation of KVCache-centric scheduling algorithms with dynamic PD ratio adjustment and early rejection policies for overload management.
Phase 04: Performance Tuning & Monitoring
Fine-tuning of cache policies (e.g., flexible KVCache storage, hotspot replication) and scheduling parameters. Setup of robust monitoring and fault tolerance mechanisms for continuous high availability and performance.
Phase 05: Scalability & Future Enhancements
Ongoing optimization for evolving workloads, including integration with advanced techniques like KVCache compression and attention architectures. Planning for future scaling and resource management strategies.
Ready to Transform Your LLM Serving?
Unlock unprecedented efficiency, scalability, and cost savings for your enterprise-grade LLM applications. Our experts are ready to guide you.