
Enterprise AI Analysis

Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving

Authors: Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Heyi Tang, Feng Ren, Teng Ma, Shangming Cai, Yineng Zhang, Mingxing Zhang, Yongwei Wu, Weimin Zheng, Xinran Xu

Abstract: MOONCAKE is the serving platform for Kimi, an LLM chatbot service developed by Moonshot AI. This platform features a KVCache-centric disaggregated architecture that not only separates prefill and decoding clusters but also efficiently utilizes the underexploited CPU, DRAM, SSD, and NIC resources of the GPU cluster to establish a disaggregated KVCache. At the core of MOONCAKE is its KVCache-centric global cache and a scheduler designed to maximize throughput while adhering to stringent latency-related Service Level Objectives (SLOs). Our experiments demonstrate that MOONCAKE excels in scenarios involving long-context inputs. In tests using real traces, MOONCAKE increases the effective request capacity by 59% to 498% compared to baseline methods, all while complying with SLOs. Currently, MOONCAKE is operational across thousands of nodes, processing over 100 billion tokens daily. In practical deployments, MOONCAKE's innovative architecture enables Kimi to handle 115% and 107% more requests on NVIDIA A800 and H800 clusters, respectively, compared to previous systems.

Executive Impact: Tangible Business Outcomes

Mooncake's KVCache-centric disaggregated architecture delivers significant performance improvements, translating directly into enhanced efficiency and reduced operational costs for large-scale LLM serving.

498% Boost in Request Capacity
115% More Requests Handled (GPU)
93% KVCache Hit Rate Improvement
4.6x KVCache Transfer Speedup

Deep Analysis & Enterprise Applications

Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.

Enterprise Process Flow: KVCache-centric Disaggregated Workflow

MOONCAKE employs a disaggregated architecture with a KVCache-centric global scheduler (Conductor). The inference workflow is divided into four key stages to maximize efficiency and resource utilization.

1. KVCache Reuse (load the cached prefix from remote CPU memory to the GPU)
2. Incremental Prefill (complete the prefill, storing the new KVCache to CPU memory, chunked if long)
3. KVCache Transfer (asynchronously stream the KVCache to the decoding node's CPU memory)
4. Decoding (the request joins continuous batching and generates outputs)
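The four stages above can be sketched as a minimal in-process simulation. The `cpu_cache`, `decode_cache`, and helper names below are illustrative stand-ins for MOONCAKE's disaggregated pools and nodes, not its actual APIs.

```python
def longest_prefix(cache: dict, tokens: tuple) -> tuple:
    """Return the longest cached prefix of `tokens` (Stage 1: KVCache reuse)."""
    for n in range(len(tokens), 0, -1):
        if tokens[:n] in cache:
            return tokens[:n]
    return ()

def serve(tokens: tuple, cpu_cache: dict, decode_cache: dict) -> str:
    # Stage 1: KVCache reuse -- load the cached prefix from CPU memory.
    prefix = longest_prefix(cpu_cache, tokens)

    # Stage 2: incremental prefill -- compute KVCache only for the uncached
    # suffix, then store the full result back to the CPU-memory pool.
    suffix = tokens[len(prefix):]
    kv = cpu_cache.get(prefix, []) + [f"kv({t})" for t in suffix]
    cpu_cache[tokens] = kv

    # Stage 3: KVCache transfer -- stream the KVCache to the decoding
    # node's CPU memory (here: a plain dict copy).
    decode_cache[tokens] = list(kv)

    # Stage 4: decoding -- the request joins continuous batching and
    # generates outputs from the transferred KVCache.
    return f"decoded {len(decode_cache[tokens])} KV entries"

cpu_cache: dict = {}
decode_cache: dict = {}
print(serve(("a", "b", "c"), cpu_cache, decode_cache))       # full prefill
print(serve(("a", "b", "c", "d"), cpu_cache, decode_cache))  # reuses the 3-token prefix
```

The second call prefills only one new token because the first call's KVCache is found in the shared pool, which is the core saving the disaggregated cache provides.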

Case Study: Kimi and Open-Source LLM Frameworks

MOONCAKE's innovative architecture has been successfully integrated into popular open-source LLM inference frameworks, extending their capabilities for distributed serving.

MOONCAKE serves as the core serving platform for Kimi, Moonshot AI's LLM chatbot service, processing over 100 billion tokens daily. Its architecture has also been integrated into prominent open-source LLM inference frameworks such as SGLang and vLLM to accelerate P/D disaggregation and KVCache reuse. These integrations leverage MOONCAKE's high-efficiency RDMA communication layer, providing robust technical support for large-scale distributed inference tasks. This demonstrates MOONCAKE's role as a pluggable efficiency booster for LLM systems.

Key Takeaways:

  • Enhanced performance on NVIDIA A800 and H800 clusters (115% and 107% more requests respectively compared to previous systems).
  • Supports P/D disaggregation in SGLang and vLLM.
  • Leverages high-efficiency RDMA for communication.
  • Modular APIs enable tailored integration with different framework designs.
84% Reduction in Average Time To First Token (TTFT)

KVCache-centric scheduling drastically reduces the Time To First Token (TTFT) by optimizing request routing and resource allocation. Our KVCache-centric global scheduling algorithm reduces the average TTFT by an additional 14% compared to local cache-aware scheduling, and approximately 84% compared to random scheduling (from 19.65ms to 3.07ms).
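The scheduling idea above can be sketched as a cost model: for each prefill instance, estimate TTFT as queuing delay plus the cost of transferring a longer remote prefix (if one exists) plus prefill compute for the uncached suffix, then route to the minimum. The cost constants and instance fields below are illustrative assumptions, not values from the paper.

```python
from dataclasses import dataclass

PREFILL_MS_PER_TOKEN = 0.05   # assumed per-token prefill compute cost
TRANSFER_MS_PER_TOKEN = 0.01  # assumed per-token remote-cache transfer cost

@dataclass
class PrefillInstance:
    name: str
    queue_ms: float          # current queuing delay on this instance
    cached_prefix_len: int   # tokens of this request already cached locally

def estimate_ttft(inst: PrefillInstance, prompt_len: int, remote_prefix_len: int) -> float:
    """TTFT estimate = queue + (optional remote-prefix transfer) + suffix prefill."""
    if remote_prefix_len > inst.cached_prefix_len:
        transfer = remote_prefix_len * TRANSFER_MS_PER_TOKEN
        reused = remote_prefix_len
    else:
        transfer = 0.0
        reused = inst.cached_prefix_len
    prefill = (prompt_len - reused) * PREFILL_MS_PER_TOKEN
    return inst.queue_ms + transfer + prefill

def schedule(instances, prompt_len, remote_prefix_len):
    return min(instances, key=lambda i: estimate_ttft(i, prompt_len, remote_prefix_len))

instances = [
    PrefillInstance("p0", queue_ms=5.0, cached_prefix_len=0),
    PrefillInstance("p1", queue_ms=2.0, cached_prefix_len=7000),
]
best = schedule(instances, prompt_len=8000, remote_prefix_len=6000)
print(best.name)  # p1: its local 7000-token prefix beats transferring the remote one
```

A purely random or queue-only policy would ignore `cached_prefix_len` entirely, which is why cache-aware routing cuts TTFT so sharply on prefix-heavy workloads.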

14.3% Reduction in Rejected Requests During Overload

MOONCAKE's early rejection based on prediction strategy significantly reduces unnecessary prefill computations during overload scenarios. Compared to a baseline strategy (4183 rejections), Early Rejection based on Prediction reduced rejected requests to 3589, a 14.3% improvement, by predicting decoding load and avoiding wasted prefill computation.
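A minimal sketch of the idea: before admitting a request to prefill, predict the decoding cluster's load at the moment this request would arrive there, and reject up front if it would exceed capacity, avoiding a prefill pass whose output would be dropped anyway. The load predictor and thresholds below are illustrative assumptions, not MOONCAKE's actual policy.

```python
def predict_decode_load(current_load: int, finishing: int, incoming: int) -> int:
    """Predicted concurrent decode requests by the time this request's prefill completes."""
    return current_load - finishing + incoming

def admit(current_load: int, finishing: int, incoming: int, capacity: int) -> bool:
    """Admit only if the *predicted* decode load stays under capacity."""
    return predict_decode_load(current_load, finishing, incoming) < capacity

# A naive policy rejecting on *current* load would refuse this request
# (100 is already at capacity), but 30 requests will finish decoding before
# its prefill completes, so the predictive policy admits it.
print(admit(current_load=100, finishing=30, incoming=10, capacity=100))  # True
```

The symmetric case matters too: prediction also rejects requests the naive policy would admit, sparing prefill GPUs from work the decoding cluster will never consume.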

498% Increase in Effective Request Capacity

MOONCAKE significantly boosts effective request capacity and improves cache hit rates, leading to substantial computational savings. In tests using real traces, MOONCAKE increases the effective request capacity by 59% to 498% compared to baseline methods, all while complying with SLOs. It also achieves up to a 93% relative improvement in cache hit rate over an LRU baseline with flexible storage.

4.6x Faster KVCache Transfer Speed

MOONCAKE's transfer engine delivers high bandwidth, drastically outperforming existing solutions for inter-node KVCache transfers. The transfer engine achieves bandwidths of up to 190 GB/s (8x400 Gbps network), approximately 2.4x to 4.6x faster than TCP-based transfer, minimizing latency for large-scale deployments.
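A back-of-envelope check of what 190 GB/s means for a long-context transfer. The cache-size formula (2 x layers x KV heads x head dim x bytes, per token) is the standard transformer KV-cache estimate; the model shape below is an assumed 70B-class configuration, not a figure from the paper.

```python
def kvcache_bytes(tokens: int, layers: int = 80, kv_heads: int = 8,
                  head_dim: int = 128, dtype_bytes: int = 2) -> int:
    """KV cache size in bytes: K and V (factor 2) per layer, per token."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens

size_gb = kvcache_bytes(128_000) / 1e9   # ~42 GB for a 128k-token context
t_rdma = size_gb / 190.0                 # seconds at the transfer engine's peak
t_tcp = size_gb / (190.0 / 4.6)          # seconds at a ~4.6x-slower TCP baseline
print(f"{size_gb:.1f} GB, RDMA {t_rdma*1e3:.0f} ms vs TCP {t_tcp*1e3:.0f} ms")
```

At these assumed sizes the gap is roughly 220 ms versus a full second per transfer, which is why transfer bandwidth directly bounds TTFT once prefill and decoding live on different nodes.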

Comparison: Mooncake vs. vLLM Performance

Feature/System | MOONCAKE | vLLM (Baseline)
Effective Request Capacity (Long Context) | Up to +498% vs. vLLM | Suboptimal, significant TBT fluctuations
Global KVCache Pool | Yes (CPU, DRAM, SSD, RDMA) | No (local HBM/DRAM only)
Cache Hit Rate Improvement | Up to 2.36x higher than local cache (93% relative improvement) | Limited by local HBM capacity (~50% theoretical hit rate)
Prefill GPU Time Savings | Up to 64% reduction across workloads | Higher GPU time due to local cache limitations and chunked prefill overhead
P/D Disaggregation | Yes (fully disaggregated) | Coupled prefill/decoding stages
Inter-node KVCache Transfer | High-speed RDMA (up to 190 GB/s), topology-aware | PyNCCL for multi-node P/D (poor performance/failure domain)

Calculate Your Potential ROI

Estimate the efficiency gains and cost savings your enterprise could achieve by adopting KVCache-centric disaggregated LLM serving.


Implementation Roadmap

Our phased approach ensures a smooth transition and optimal integration of Mooncake's architecture into your existing infrastructure.

Phase 01: Initial Assessment & Design

Comprehensive analysis of existing LLM serving infrastructure, workload patterns, and SLO requirements. Design of a tailored Mooncake deployment strategy including KVCache sizing, network configuration, and P/D ratio optimization.

Phase 02: Core Architecture Deployment

Deployment of Mooncake Store (distributed KVCache pool) and Conductor (global scheduler). Integration of Transfer Engine for high-speed inter-node KVCache transfers, leveraging RDMA capabilities.

Phase 03: Prefill & Decoding Cluster Integration

Configuration of disaggregated prefill and decoding instances. Implementation of KVCache-centric scheduling algorithms with dynamic P/D ratio adjustment and early rejection policies for overload management.
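Dynamic P/D ratio adjustment can be sketched as a simple rebalancing loop: measure utilization of the prefill and decoding pools and shift whole instances toward the saturated side. The rule and thresholds below are illustrative assumptions for this roadmap phase, not MOONCAKE's actual policy.

```python
def rebalance(prefill_nodes: int, decode_nodes: int,
              prefill_util: float, decode_util: float,
              high: float = 0.9, low: float = 0.5) -> tuple:
    """Return an adjusted (prefill, decode) node split.

    Move one node toward whichever pool is saturated (> high) while the
    other pool is underused (< low); otherwise leave the split alone.
    """
    if prefill_util > high and decode_util < low and decode_nodes > 1:
        return prefill_nodes + 1, decode_nodes - 1
    if decode_util > high and prefill_util < low and prefill_nodes > 1:
        return prefill_nodes - 1, decode_nodes + 1
    return prefill_nodes, decode_nodes

print(rebalance(4, 4, prefill_util=0.95, decode_util=0.40))  # (5, 3)
```

Hysteresis (the gap between `high` and `low`) keeps the split from oscillating when both pools run at moderate load.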

Phase 04: Performance Tuning & Monitoring

Fine-tuning of cache policies (e.g., flexible KVCache storage, hotspot replication) and scheduling parameters. Setup of robust monitoring and fault tolerance mechanisms for continuous high availability and performance.

Phase 05: Scalability & Future Enhancements

Ongoing optimization for evolving workloads, including integration with advanced techniques like KVCache compression and attention architectures. Planning for future scaling and resource management strategies.

Ready to Transform Your LLM Serving?

Unlock unprecedented efficiency, scalability, and cost savings for your enterprise-grade LLM applications. Our experts are ready to guide you.

Book Your Free Consultation