Enterprise AI Analysis: Plexus: Taming Billion-edge Graphs with 3D Parallel Full-graph GNN Training

AI FOR HIGH-PERFORMANCE COMPUTING


This paper introduces Plexus, a pioneering 3D parallel framework designed to overcome the memory and communication challenges of training Graph Neural Networks on massive, billion-edge datasets. It leverages a novel 3D tensor parallel algorithm, coupled with advanced optimizations, to achieve unprecedented scalability and speed on modern supercomputers.

Executive Impact & Performance Benchmarks

Plexus sets new standards for large-scale GNN training, delivering superior performance and efficiency compared to state-of-the-art solutions.

Strong scaling to 1024 GPUs on Perlmutter and 2048 GCDs on Frontier
Up to 9x faster than state-of-the-art frameworks
Reduced time-to-solution on billion-edge graphs

Deep Analysis & Enterprise Applications

Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.

GNN Architecture & Challenges
3D Parallelism & Data Distribution
Performance Optimizations
Scaling & Results

The Dilemma of Large-Scale GNN Training

Graph Neural Networks (GNNs) are powerful, but training them on real-world, billion-edge graphs poses significant challenges. The sheer size often exceeds single-GPU memory, forcing reliance on mini-batch sampling, which introduces accuracy degradation and CPU-GPU transfer bottlenecks. Distributed full-graph training, while ideal for accuracy, faces high communication overheads and load imbalance due to irregular graph structures.

A typical GCN layer involves three key steps:

  • Aggregation: Each node collects feature embeddings from its neighbors, capturing local graph structure. This involves Sparse Matrix-Matrix Multiplication (SpMM), which is computationally intensive and suffers from poor performance on GPUs due to irregular memory access patterns.
  • Combination: Aggregated features are transformed into a new low-dimensional space using a weight matrix.
  • Activation: A non-linear function (e.g., ReLU) is applied to the combined features, forming the input for the next layer.
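The three steps above can be sketched in a few lines of NumPy/SciPy. This is a minimal single-process illustration, not Plexus code; the matrix names (adjacency `A`, features `F`, weights `W`) follow the conventions used in this analysis, and the shapes are toy values chosen for the example:

```python
import numpy as np
import scipy.sparse as sp

def gcn_layer(A, F, W):
    """One GCN layer: aggregation (SpMM), combination (dense GEMM), activation."""
    H = A @ F                  # aggregation: sparse n x n adjacency times dense n x d_in features
    Q = H @ W                  # combination: project into d_out dimensions
    return np.maximum(Q, 0.0)  # activation: ReLU

# Toy 4-node graph with random features and weights.
rng = np.random.default_rng(0)
A = sp.random(4, 4, density=0.5, format="csr", random_state=0)
F = rng.standard_normal((4, 8))   # 8 input features per node
W = rng.standard_normal((8, 3))   # project to 3 output features
out = gcn_layer(A, F, W)
print(out.shape)  # (4, 3)
```

The SpMM in the first line is exactly the kernel whose irregular memory access pattern makes aggregation the expensive step at scale.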

Full-Graph vs. Mini-Batch Training

Feature | Full-Graph Training | Mini-Batch Training
Accuracy | High; no approximations or bias | Potentially lower due to sampling approximations and bias
Memory Footprint | Very high (entire graph in memory) | Lower (subset of nodes per iteration)
Communication | High overheads in distributed settings | CPU-GPU data transfers can dominate training time
Load Imbalance | Significant issue in distributed settings due to irregular graphs | Less prominent, but sampling adds overhead
Complexity | Scalable distributed solutions are challenging | Requires careful selection of sampling strategies and hyperparameters

The Power of 3D Tensor Parallelism

Plexus addresses the challenges of distributed GNN training by adopting a three-dimensional (3D) tensor parallel algorithm, inspired by Agarwal et al.'s work on parallel matrix multiplication. This approach distributes all matrices (adjacency, features, weights) and parallelizes computation across a virtual 3D grid of GPUs.

In a 3D grid of Gx × Gy × Gz GPUs, matrices are sharded across different planes to ensure compatibility for local computations and minimize communication. For example, the adjacency matrix `A` is sharded across the ZX-plane and replicated across the Y-plane, while input features `F` are sharded across XY and Z planes. This strategic distribution ensures that each GPU only holds a portion of the data necessary for its local calculations, significantly reducing memory burden.

The forward pass involves critical communication steps:

  • An all-gather of feature matrix shards across the Z-parallel group.
  • An all-reduce of aggregation output across the X-parallel group.
  • An all-gather of weight matrix shards across the Z-parallel group.
  • An all-reduce of combination output across the Y-parallel group.

Enterprise Process Flow: 3D Parallel Aggregation

Shard Input Features (F)
All-gather F on Z-plane
Perform SpMM (A, F)
All-reduce H on X-plane
Shard Weight Matrix (W)
All-gather W on Z-plane
Perform SGEMM (H, W)
All-reduce Q on Y-plane
Apply Activation (σ)

This systematic approach ensures local computation compatibility and optimizes communication, making it possible to scale GNN training to unprecedented graph sizes.
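The core of this decomposition is the Agarwal-style 3D parallel matrix multiplication it builds on. The sketch below emulates that decomposition in plain NumPy on a virtual Gx × Gy × Gz grid, with a Python sum standing in for the all-reduce along a z fiber; it deliberately ignores the per-matrix sharding details described above (ZX-plane sharding of A, replication across Y, and so on) and the sparse aggregation step, and the grid sizes are arbitrary toy values:

```python
import numpy as np

def matmul_3d(A, B, Gx, Gy, Gz):
    """Emulate 3D-parallel C = A @ B on a virtual Gx x Gy x Gz grid.
    Virtual rank (x, y, z) multiplies block (x, z) of A by block (z, y) of B;
    summing the Gz partial products stands in for the all-reduce along z."""
    m, k = A.shape
    _, n = B.shape
    assert m % Gx == 0 and k % Gz == 0 and n % Gy == 0
    Ab = A.reshape(Gx, m // Gx, Gz, k // Gz)  # A sharded over (x, z)
    Bb = B.reshape(Gz, k // Gz, Gy, n // Gy)  # B sharded over (z, y)
    C = np.zeros((m, n))
    for x in range(Gx):
        for y in range(Gy):
            partials = [Ab[x, :, z, :] @ Bb[z, :, y, :] for z in range(Gz)]
            # "all-reduce" along the z fiber: sum the Gz partial products
            C[x*(m//Gx):(x+1)*(m//Gx), y*(n//Gy):(y+1)*(n//Gy)] = sum(partials)
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 12))   # stand-in for one operand, e.g. aggregation output
F = rng.standard_normal((12, 6))   # stand-in for the other operand
C = matmul_3d(A, F, Gx=2, Gy=3, Gz=2)
```

Each virtual rank touches only one block of each operand, which is what keeps per-GPU memory bounded as the grid grows.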

Strategic Optimizations for Peak Performance

Plexus incorporates several key optimizations to enhance performance and mitigate common issues in distributed GNN training:

  • Double Permutation for Load Balancing: The sparse and uneven distribution of nonzeros in adjacency matrices can cause significant load imbalance across GPUs. Plexus employs a double permutation scheme (separate permutations for rows and columns) that ensures a near-perfect even distribution of nonzeros, effectively eliminating computational stragglers and improving overall training stability. This contrasts with graph partitioners that focus on edge cuts, which are less suitable for dense aggregation outputs.
  • Blocked Aggregation: To reduce performance variability observed in SpMM, especially on larger datasets and modest GPU counts, Plexus blocks the sparse adjacency matrix into smaller row-blocks. After each block's SpMM, an all-reduce is performed, and blocks are concatenated. This mitigates performance fluctuations and reduces communication overheads.
  • Dense Matrix Multiplication Tuning: Although dense matrix multiplications consume a small portion of the overall time, they can still impact scaling on high GPU counts. Plexus optimizes these kernels by reversing the multiplication order (e.g., for gradient computations), significantly reducing their execution time.
  • Parallel Data Loading: Traditional methods often load entire datasets into CPU memory before transfer, which is unsustainable for massive graphs. Plexus implements a parallel data loader that shards processed data into 2D files offline. Each GPU then loads, merges, and extracts only the shards it needs, drastically cutting CPU memory usage and data loading time.
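The blocked-aggregation idea from the list above can be shown in miniature. This is a sketch rather than Plexus code: the sparse matrix is split into row blocks, SpMM runs per block, and the blocks are concatenated; in Plexus, an all-reduce across the X group happens after each block's SpMM, which this single-process version has no need for:

```python
import numpy as np
import scipy.sparse as sp

def blocked_spmm(A, F, num_blocks):
    """Row-blocked SpMM: split A into row blocks, multiply each by F, concatenate.
    The result is identical to one monolithic SpMM, but the smaller kernels
    reduce the performance variability seen in large SpMMs."""
    bs = A.shape[0] // num_blocks
    return np.vstack([A[b*bs:(b+1)*bs] @ F for b in range(num_blocks)])

A = sp.random(64, 64, density=0.1, format="csr", random_state=1)
F = np.random.default_rng(1).standard_normal((64, 8))
blocked = blocked_spmm(A, F, num_blocks=4)
```

Because blocking changes only the schedule, not the arithmetic, the blocked result matches the unblocked SpMM exactly.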

Impact of Double Permutation

Max/mean non-zero ratio ≈ 1.0 across shards (europe_osm)

This metric indicates near-perfect load balance achieved with double permutation, distributing non-zeros evenly across all shards.
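The effect behind this metric is easy to reproduce on a synthetic skewed graph. The sketch below (illustrative only; the shard counts and graph are invented, not the paper's europe_osm setup) applies independent random row and column permutations before 2D sharding and compares the max/mean non-zero ratio before and after:

```python
import numpy as np
import scipy.sparse as sp

def shard_imbalance(A, Gr, Gc):
    """Max/mean non-zero count over a Gr x Gc grid of 2D shards."""
    rb, cb = A.shape[0] // Gr, A.shape[1] // Gc
    counts = [A[i*rb:(i+1)*rb, j*cb:(j+1)*cb].nnz
              for i in range(Gr) for j in range(Gc)]
    return max(counts) / (sum(counts) / len(counts))

rng = np.random.default_rng(0)
n = 512
# Skewed synthetic graph: 16 "hub" rows hold most of the edges.
rows = np.concatenate([rng.integers(0, 16, 4000), rng.integers(0, n, 1000)])
cols = rng.integers(0, n, rows.size)
A = sp.csr_matrix((np.ones(rows.size), (rows, cols)), shape=(n, n))

before = shard_imbalance(A, 4, 4)
# Double permutation: independent random row and column permutations.
pr, pc = rng.permutation(n), rng.permutation(n)
after = shard_imbalance(A[pr][:, pc], 4, 4)
print(before, after)  # the ratio drops sharply after permutation
```

The permutations scatter hub rows and columns across all shards, pushing the max/mean ratio toward 1.0 without any graph-structure-aware partitioning.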

Unprecedented Scaling on Modern Supercomputers

Plexus demonstrates groundbreaking strong scaling capabilities across a variety of graph datasets on both Perlmutter (NERSC) and Frontier (OLCF). Our experiments show consistent performance improvements as GPU counts increase, distinguishing Plexus from state-of-the-art frameworks like SA and BNS-GCN, which often struggle with scalability beyond a limited number of GPUs.

  • Perlmutter Performance: Plexus scales effectively up to 128 GPUs on Reddit, achieving 9x speedup over SA and 6x over BNS-GCN on 32 GPUs. On larger datasets like Isolate-3-8M and products-14M, Plexus maintains strong scaling up to 1024 GPUs, significantly outperforming BNS-GCN by 3.8x at 256 GPUs and SA by 2.3x at 128 GPUs, respectively.
  • Frontier Performance: On Frontier, Plexus shows even better trends, with superior SpMM times on AMD GPUs contributing to improved scaling. It demonstrates impressive performance for ogbn-papers100M, the largest graph dataset tested, scaling up to 2048 GCDs (Graphics Compute Dies).
  • Communication Efficiency: Unlike BNS-GCN, which relies on all-to-all collectives that become inefficient at scale, Plexus's ring-based collectives communicate only with neighbors, leading to lower latency and better scalability, especially at higher GPU counts.
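The dense-kernel tuning cited under the optimizations above (reversing multiplication order) comes down to matrix-chain associativity: (M1 · M2) · M3 and M1 · (M2 · M3) give the same result at very different FLOP costs. A quick count on hypothetical tall-skinny GNN shapes (the dimensions here are invented for illustration) shows the gap:

```python
def chain_flops(a, b, c, d):
    """FLOPs for (M1 @ M2) @ M3 vs M1 @ (M2 @ M3),
    with M1: a x b, M2: b x c, M3: c x d."""
    left = 2 * a * b * c + 2 * a * c * d   # (M1 @ M2) @ M3
    right = 2 * b * c * d + 2 * a * b * d  # M1 @ (M2 @ M3)
    return left, right

# Hypothetical shapes: a million nodes, 64 hidden features, 16 output features.
left, right = chain_flops(1_000_000, 64, 64, 16)
print(left / right)  # the left-to-right order costs ~5x more FLOPs here
```

When one dimension is tiny relative to the node count, choosing the cheaper order is a near-free optimization, which is why it matters at high GPU counts even though dense kernels are a small fraction of total time.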

Case Study: Scaling Billion-Edge Graphs with ogbn-papers100M

The ogbn-papers100M dataset, with over 111 million nodes and 1.6 billion edges, represents one of the largest real-world graphs for GNN training. Plexus demonstrates its most impressive scaling capabilities on this dataset, pushing the boundaries of what is possible in full-graph GNN training.

On Frontier, Plexus scales efficiently up to 2048 GCDs for ogbn-papers100M, showcasing its ability to handle immense graph sizes. This is the largest reported scale for full-graph GNN training to date. The combination of 3D tensor parallelism and strategic optimizations allows Plexus to maintain strong scaling even as the computation cost becomes marginal, ensuring efficient utilization of massively parallel hardware.

This performance on such a massive dataset highlights Plexus's potential for enabling new breakthroughs in fields requiring analysis of extremely large and complex graph structures, from scientific simulations to social network analysis.

Maximize Your GNN Training ROI

Estimate the potential annual savings and reclaimed GPU hours by adopting optimized distributed GNN training solutions like Plexus.


Your Enterprise AI Adoption Roadmap

A strategic overview of the phased approach to integrating Plexus into your distributed computing environment.

Phase 1: Initial Assessment & Setup (1-2 Weeks)

Evaluate current GNN workloads, infrastructure, and identify key scaling bottlenecks. Prepare the environment for Plexus deployment, including GPU cluster configuration and dependency installation.

Phase 2: Data Integration & Permutation (2-4 Weeks)

Implement parallel data loading for your specific datasets. Apply Plexus's double permutation scheme to optimize adjacency matrix distribution for balanced load across GPUs.

Phase 3: Model Adaptation & Optimization (4-8 Weeks)

Adapt existing GNN models to leverage Plexus's 3D tensor parallelism. Integrate blocked aggregation and SpMM tuning for specific workloads to maximize computational efficiency.

Phase 4: Scalability Testing & Refinement (2-4 Weeks)

Conduct extensive strong scaling tests on target supercomputing platforms. Utilize the performance model to identify and fine-tune optimal 3D configurations for maximum throughput and efficiency.

Phase 5: Production Deployment & Monitoring (Ongoing)

Deploy Plexus-optimized GNN training into production workflows. Implement continuous monitoring to ensure sustained high performance and address any emerging challenges.

Ready to Scale Your GNN Training?

Partner with our experts to harness the power of 3D parallel GNN training and achieve unprecedented performance on your most demanding graph datasets.
