
Enterprise AI Analysis

WAGES: Workload-Aware GPU Sharing System for Energy-Efficient Serverless LLM Serving

This analysis explores WAGES, a novel system designed to optimize large language model (LLM) serving for enterprise environments. It addresses critical inefficiencies in current serverless platforms by intelligently managing GPU resources and power consumption without compromising performance.

Key Enterprise Impact

WAGES delivers substantial improvements in critical operational metrics, translating directly into reduced costs and enhanced service delivery for LLM-powered applications.

Up to 4% SLO Attainment Improvement
Up to 26% Energy Consumption Reduction
Enhanced GPU Utilization
Lowered Hardware Costs

Deep Analysis & Enterprise Applications

Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.

Addressing LLM Serving Challenges

Large Language Models (LLMs) are central to modern applications, but their serving demands fluctuate, requiring dynamic resource management to meet strict Service-Level Objectives (SLOs) without wasting GPU capacity. Traditional server-centric platforms often lead to over-provisioning. Serverless inference aims to simplify this by automatically managing resources, offering pay-as-you-go economic benefits.

Existing serverless systems often focus on reducing cold-start latency but overlook two key inefficiencies: (i) static, exclusive GPU allocation leading to underutilization and inflated hardware costs, and (ii) reliance on hardware-controlled clock frequencies that waste energy. WAGES addresses these by enabling GPU multiplexing and dynamic clock scaling.

Core Concepts for LLM Serving

LLM Inference Phases

LLM inference typically involves two phases. In the Prefill phase, the input prompt is processed to build the Key-Value (KV) cache and generate the first token; this phase is compute-bound, heavily utilizing the GPU's Streaming Multiprocessors (SMs). In the Decode phase, subsequent tokens are generated one at a time, reusing the KV cache; this phase is memory-bound. The corresponding performance metrics are Time-to-First-Token (TTFT) for prefill and Time-Between-Tokens (TBT) for decode.
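The two metrics above can be computed directly from per-token timestamps. A minimal sketch (the helper name and example timings are illustrative, not from the paper):

```python
from statistics import mean

def ttft_tbt(request_start: float, token_times: list) -> tuple:
    """Compute Time-to-First-Token and mean Time-Between-Tokens.

    request_start: wall-clock time the request arrived.
    token_times: wall-clock emission time of each generated token, in order.
    """
    ttft = token_times[0] - request_start           # prefill latency
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    tbt = mean(gaps) if gaps else 0.0               # per-token decode latency
    return ttft, tbt

# Example: request at t=0.0 s, first token at 0.25 s, then one token every 40 ms.
ttft, tbt = ttft_tbt(0.0, [0.25, 0.29, 0.33, 0.37])
print(ttft, round(tbt, 3))  # 0.25 0.04
```

TTFT captures how quickly the prefill phase responds; TBT captures how smoothly the decode phase streams tokens. SLOs are typically set on both.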

Serverless LLM Serving Paradigms

Serverless LLM serving platforms offer elastic scaling and pay-as-you-go pricing, making them cost-effective for bursty workloads. However, many still allocate GPUs statically and exclusively, leading to underutilization and energy waste for varied workloads.

GPU Sharing Mechanisms

Modern GPUs offer sharing mechanisms like NVIDIA Multi-Process Service (MPS) and Multi-Instance GPU (MIG). MPS partitions compute resources (SMs) while sharing memory, making it ideal for memory-hungry LLMs and supporting runtime SM repartitioning. MIG provides physical isolation but with fixed partition sizes and costly reconfiguration. WAGES leverages MPS for its flexibility and runtime adaptability.
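In practice, an MPS client's SM share is capped through the `CUDA_MPS_ACTIVE_THREAD_PERCENTAGE` environment variable, which is how MPS-based systems like WAGES can assign each co-located LLM a fraction of the SMs while memory stays shared. A minimal sketch (the worker command is a hypothetical placeholder):

```python
import os

def mps_env(sm_percent: int) -> dict:
    """Build the environment for an MPS client capped at sm_percent of the SMs.

    CUDA_MPS_ACTIVE_THREAD_PERCENTAGE limits the fraction of SMs this
    client's kernels may occupy; GPU memory remains shared among clients.
    """
    env = dict(os.environ)
    env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = str(sm_percent)
    return env

# e.g. launch an inference worker holding 40% of the SMs (command illustrative):
# import subprocess
# subprocess.Popen(["python", "serve_llm.py", "--model", "small"], env=mps_env(40))
```

Because the cap is per-client, repartitioning at runtime amounts to restarting or resizing clients with new percentages, which is the flexibility WAGES exploits.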

Identifying Key Inefficiencies in Current LLM Serving

Up to 26% Potential Energy Savings identified from Dynamic Clock Scaling

GPU Inefficiency from Overprovisioning

Measurements show that allocating an entire GPU often provides far more Streaming Multiprocessors (SMs) than needed to meet TTFT and TBT SLOs; for example, 40% of SMs were sufficient for three different LLMs across various input lengths. This points to a significant opportunity for GPU multiplexing: sharing a single GPU among multiple LLMs boosts device utilization and reduces hardware costs by obviating the need for extra GPUs.

GPU Inefficiency from Static Allocation

The speedup from allocating more SMs is non-linear and varies significantly with input length and LLM size. A static SM partition, optimized for the most demanding case, becomes sub-optimal as workloads fluctuate. For instance, increasing SM allocation from 60% to 80% yields only modest TTFT/TBT improvements (e.g., 13% and 10% for 8192 tokens). This highlights the need for a runtime system that can continually re-partition SMs dynamically to adapt to changing workloads and maintain optimal performance.
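The implication is a simple selection rule: given a latency profile for the current workload, pick the smallest SM share that still meets the SLO. A minimal sketch, with illustrative profile numbers (not measurements from the paper):

```python
def min_sm_share(profile: dict, slo_ms: float):
    """Return the smallest SM share (%) whose profiled latency meets the SLO.

    profile maps SM share (%) -> measured TTFT (ms) for the current
    input-length bucket. Returns None if even 100% of SMs cannot meet it.
    """
    for share in sorted(profile):
        if profile[share] <= slo_ms:
            return share
    return None

# Illustrative profile: latency improves sub-linearly with more SMs.
profile = {20: 410.0, 40: 195.0, 60: 150.0, 80: 131.0, 100: 125.0}
print(min_sm_share(profile, slo_ms=200.0))  # 40 -> a full GPU is unnecessary
```

Because the profile shifts with input length and model size, this lookup must be re-run as the workload changes, which is exactly why a static partition falls short.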

Energy Inefficiency from Static GPU Clock

Profiling energy consumption reveals that running GPUs at maximum clock frequency rarely leads to the least energy use. Lowering the GPU clock can save 4-6% energy for prefill and decode phases without violating SLOs, with decode benefiting more due to its memory-bound nature. As workload intensity decreases (e.g., shorter input lengths or smaller LLMs), even greater energy savings (over 15%) are possible. Crucially, the energy-optimal clock shifts across different phases and input lengths, demonstrating that dynamic GPU clock adjustment is essential for energy-efficient LLM serving.
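The same profile-driven selection applies to clocks: among the frequencies that keep latency within the SLO, choose the one with the lowest measured energy. A minimal sketch with illustrative numbers (not the paper's measurements):

```python
def best_clock(profile: list, slo_ms: float):
    """Pick the clock (MHz) minimizing energy among SLO-feasible settings.

    profile rows: (clock_mhz, latency_ms, energy_joules), measured per
    phase and input-length bucket. Returns None if no clock meets the SLO.
    """
    feasible = [(energy, clock) for clock, lat, energy in profile if lat <= slo_ms]
    return min(feasible)[1] if feasible else None

# Illustrative decode-phase profile: the max clock is not energy-optimal.
profile = [(1980, 38.0, 120.0), (1600, 41.0, 104.0), (1200, 55.0, 112.0)]
print(best_clock(profile, slo_ms=45.0))  # 1600
```

Note how tightening the SLO (say, to 39 ms) would force the maximum clock, while a looser SLO admits the lower-energy setting; this is why the energy-optimal clock shifts across phases and input lengths.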

LLM Serving Strategy Comparison
Feature               | Existing Static Allocation   | WAGES Dynamic Optimization
GPU Utilization       | Low (over-provisioning)      | High (dynamic SM partitioning)
Energy Efficiency     | Low (static max clock)       | High (dynamic clock scaling)
SLO Attainment        | Variable, risk of violations | Consistent, improved attainment
Workload Adaptability | Poor (fixed resources)       | Excellent (runtime tuning)
Hardware Costs        | Higher (GPU waste)           | Lower (GPU multiplexing)

WAGES: A Two-Tier Workload-Aware GPU Sharing System

WAGES is built on NVIDIA MPS and aims to meet SLO targets while cutting energy by dynamically partitioning SMs and tuning GPU clock speeds. It adopts a two-tier scheduler: a Global Scheduler (GS) and multiple Local Schedulers (LS) on each GPU.

Enterprise Process Flow

Client Requests
Global Scheduler (GS)
Local Scheduler (LS)
Runtime SM & Clock Adjustment
GPU Consolidation & Reconfiguration
Energy-Efficient LLM Serving

Global Scheduler (GS) Responsibilities

The GS generates LLM performance-energy profiles for various input lengths, clocks, and SM shares. It dispatches client requests to the optimal local scheduler based on estimated completion time, SLOs, and memory availability. If no existing GPU can meet the SLO, it triggers auto-scaling to provision a new GPU.
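The GS dispatch decision can be sketched as picking the GPU with the earliest SLO-feasible completion time, falling back to auto-scaling when none qualifies. The data structure and numbers below are illustrative assumptions, not the paper's implementation:

```python
from dataclasses import dataclass

@dataclass
class GpuState:
    gpu_id: int
    est_completion_ms: float   # estimated finish time for the new request
    free_mem_gb: float         # memory left for weights + KV cache

def dispatch(request_mem_gb: float, slo_ms: float, gpus: list):
    """Pick the GPU with the earliest SLO-feasible completion time.

    Returns the chosen gpu_id, or None to signal auto-scaling
    (no existing GPU can satisfy both the SLO and the memory demand).
    """
    feasible = [g for g in gpus
                if g.est_completion_ms <= slo_ms and g.free_mem_gb >= request_mem_gb]
    if not feasible:
        return None
    return min(feasible, key=lambda g: g.est_completion_ms).gpu_id

gpus = [GpuState(0, 180.0, 2.0), GpuState(1, 120.0, 8.0)]
print(dispatch(request_mem_gb=4.0, slo_ms=200.0, gpus=gpus))  # 1
```

GPU 0 meets the SLO but lacks memory, so the request goes to GPU 1; if neither qualified, the `None` result would trigger provisioning of a new GPU.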

Local Scheduler (LS) Operations

The LS on each GPU decides which tasks to run next, and their SM partitions and GPU clocks. It employs out-of-order execution combined with dynamic SM partitioning to balance compute and memory usage, reduce head-of-line blocking, and improve utilization. A priority score prevents starvation of longer tasks. After batching tasks, the LS allocates SMs and sweeps through GPU clock frequencies to select the configuration that satisfies SLOs and minimizes energy consumption.
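One way to realize the anti-starvation priority score is to favor short tasks while "aging" waiters so long tasks are eventually promoted. The scoring formula and weights below are an illustrative assumption, not the paper's exact policy:

```python
import heapq

def priority(wait_ms: float, est_run_ms: float, age_weight: float = 0.5) -> float:
    """Lower score runs first: favor short tasks, but aging
    (wait_ms * age_weight) eventually promotes long-waiting ones."""
    return est_run_ms - age_weight * wait_ms

def next_batch(tasks, batch_size: int = 2):
    """tasks: list of (name, wait_ms, est_run_ms). Pick the batch_size
    tasks with the best (lowest) priority scores for out-of-order execution."""
    scored = [(priority(w, r), name) for name, w, r in tasks]
    return [name for _, name in heapq.nsmallest(batch_size, scored)]

tasks = [("short", 10, 30), ("long_waiting", 400, 200), ("long_new", 5, 200)]
print(next_batch(tasks))  # ['long_waiting', 'short']
```

A newly arrived long task yields to the short one, but a long task that has waited 400 ms jumps ahead; after a batch is formed, the LS then applies the SM-share and clock selection described above to it.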

LLM Placement Reconfiguration

WAGES periodically reconfigures LLM placement across GPUs to improve utilization and reduce energy use, especially with fluctuating workloads. It uses a bipartite graph matching algorithm to determine an optimal GPU mapping that minimizes transmission overhead for LLM weights and KV caches. It also mitigates transmission overheads by overlapping data transfer with computation and intelligently deciding whether to transfer KV caches or recompute them locally.
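The matching step can be sketched as a min-cost assignment where the cost of mapping an LLM group to a GPU is the data (weights plus KV cache) that would have to move. The brute-force search below is only for a handful of GPUs; the cost matrix is illustrative, and a real implementation would use the Hungarian algorithm (e.g. `scipy.optimize.linear_sum_assignment`):

```python
from itertools import permutations

def min_cost_placement(cost: list):
    """Brute-force min-cost bipartite matching.

    cost[i][j]: GBs that must be transferred if LLM group i is placed on
    GPU j. Returns (assignment, total_cost), where assignment[i] is the
    GPU chosen for group i.
    """
    n = len(cost)
    best = min(permutations(range(n)),
               key=lambda p: sum(cost[i][p[i]] for i in range(n)))
    return list(best), sum(cost[i][best[i]] for i in range(n))

# Diagonal = keep current placement (zero transfer); off-diagonal = GBs moved.
cost = [[0, 12, 30], [14, 0, 9], [25, 7, 0]]
print(min_cost_placement(cost))  # ([0, 1, 2], 0)
```

Here the identity mapping is optimal because moving nothing costs nothing; when consolidation frees a GPU, the off-diagonal entries encode which migrations are cheapest, and the transfer-versus-recompute decision for KV caches adjusts those costs further.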

WAGES Performance & Energy Efficiency

Case Study: Workload W4 - High Demand Scenario

Workload W4 represents a high-demand scenario with 4 small, 4 medium, and 3 large LLM models, pushing computation intensity. In this scenario, WAGES demonstrates its robust capabilities under heavy load.

Compared to MuxServe, WAGES achieves higher SLO attainment even at high request rates by dynamically adapting SM partitions and clock speeds. For energy consumption, WAGES shows a 14% reduction versus MuxServe under W4 at scale 16. This highlights WAGES's ability to maintain efficiency and meet performance targets even when GPUs are fully loaded: runtime SM repartitioning and clock scaling cut execution time and improve overall efficiency.

This case study illustrates how WAGES's fine-grained, workload-aware sharing strategy translates into tangible benefits in demanding enterprise settings.

Experimental Setup

Experiments were conducted on four NVIDIA H100 GPUs with NVLink, leveraging MPS for SM partitioning. Key metrics included SLO attainment (%) and overall energy consumption (Joules). Real-world Azure-Chat traces were used, scaled to create four workloads (W1-W4) with increasing LLM sizes and computation intensity. WAGES was compared against MuxServe (state-of-the-art MPS-based GPU sharing) and MuxServerless (a modified MuxServe with serverless management).
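The SLO attainment metric can be stated precisely: the percentage of requests whose TTFT and TBT both stayed within their targets. A minimal sketch with illustrative numbers:

```python
def slo_attainment(results: list, ttft_slo_ms: float, tbt_slo_ms: float) -> float:
    """Percentage of requests whose TTFT and TBT both met their SLOs.

    results: list of (ttft_ms, tbt_ms) pairs, one per completed request.
    """
    ok = sum(1 for ttft, tbt in results
             if ttft <= ttft_slo_ms and tbt <= tbt_slo_ms)
    return 100.0 * ok / len(results)

# Illustrative: only the first request meets both SLOs.
results = [(180, 35), (240, 35), (150, 60)]
print(round(slo_attainment(results, ttft_slo_ms=200, tbt_slo_ms=50), 1))
```

Energy, the second metric, is simply the total Joules the GPUs consume over the trace, so a system can trade a little latency headroom for large energy savings without moving this metric.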

End-to-End Results

WAGES consistently delivers higher SLO attainment than MuxServe and MuxServerless. At low computation demands (e.g., W1 at scales 1 and 4), all systems achieved 100% SLO. However, as LLM sizes grow or request rates increase, the baselines showed a clear drop, while WAGES maintained higher attainment due to its dynamic placement reconfiguration and adaptive SM partitioning. WAGES improves SLO attainment by up to 4%.

For energy consumption, WAGES significantly lowers energy use by up to 26% compared to MuxServe and MuxServerless. Savings come from dynamic LLM placement (merging lightly loaded GPUs) and on-the-fly GPU clock scaling. Even in high-intensity workloads (e.g., W4 at scale 16), WAGES achieved a 14% reduction versus MuxServe by cutting execution time and improving efficiency without hurting SLOs.

Conclusion & Future Directions

WAGES offers an innovative approach to energy-efficient serverless LLM serving. By combining dynamic SM repartitioning and GPU clock adjustment built on NVIDIA MPS, WAGES continuously adapts SM allocations and tunes clock frequencies to meet SLOs while minimizing energy consumption. Its periodic LLM placement reconfiguration consolidates workloads onto fewer GPUs, improving utilization and reducing hardware costs. Overlapping model/KV migration with execution further reduces reconfiguration overheads.

Experimental results confirm that WAGES not only outperforms state-of-the-art GPU-sharing-based serving systems in SLO attainment (by up to 4%) but also achieves substantial energy savings (up to 26%). This demonstrates that fine-grained, workload-aware sharing is crucial for optimizing both performance and energy efficiency in enterprise LLM deployments.

Discussion Points

  • Cold-start optimization: WAGES keeps LLM weights and inference engines alive in CPU memory and can integrate techniques like DRAM preloading or approaches from ServerlessLLM and ENOVA for further improvements.
  • GPU-sharing techniques: While WAGES leverages NVIDIA MPS, further research on advanced GPU sharing strategies focusing specifically on decoder-based LLMs' unique compute and memory access patterns could yield additional benefits, beyond existing work on computer vision models.

Calculate Your Potential ROI with WAGES

Estimate the operational savings and reclaimed engineering hours by implementing advanced LLM serving optimizations in your enterprise.


Your WAGES Implementation Roadmap

A structured approach to integrating WAGES into your existing LLM serving infrastructure.

Phase 1: Discovery & Assessment (Weeks 1-2)

Understand current LLM workloads, existing GPU infrastructure, and specific SLOs. Profile key models and deployment patterns to establish a baseline for optimization. Identify integration points and potential challenges.

Phase 2: PoC & Customization (Weeks 3-6)

Deploy WAGES in a proof-of-concept environment with a subset of your LLM services. Customize SM partitioning, clock tuning strategies, and placement reconfiguration algorithms to align with your unique workload characteristics. Validate initial performance and energy savings.

Phase 3: Pilot Deployment (Weeks 7-10)

Scale WAGES to a pilot group of users or non-critical LLM applications. Monitor SLO attainment, energy consumption, and GPU utilization in a live environment. Refine scheduling policies and data migration strategies based on real-world feedback.

Phase 4: Full Production Rollout & Optimization (Ongoing)

Integrate WAGES across your entire LLM serving infrastructure. Establish continuous monitoring and automated feedback loops for ongoing optimization. Leverage WAGES's adaptive capabilities to ensure long-term efficiency and cost-effectiveness.

Ready to Transform Your LLM Serving?

Unlock unparalleled efficiency and cost savings for your enterprise LLM deployments with WAGES. Our experts are ready to guide you.

Book Your Free Consultation.