Enterprise AI Analysis
Aurora Acceptance: A Collaborative Exascale Test Harness
The Aurora exascale system, deployed at the Argonne Leadership Computing Facility (ALCF), underwent a rigorous acceptance testing process. This collaborative effort among ALCF, Intel, and HPE culminated in successful acceptance in December 2024. The testing mimicked real-world utilization, stressed the system and its components, tracked regressions, and extended an open-source test harness. Key capabilities include collaborative test creation, support for diverse workloads, automated test execution, unique workspace creation, functional and performance validation, artifact retention, failure notification, and collaborative root cause analysis. The process leveraged tools such as ReFrame, Jenkins, Slack, GitLab, a Data Warehouse, a Data Lake, and StabilityDB to manage complexity and ensure the system's readiness for scientific discovery.
Deep Analysis & Enterprise Applications
Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.
Aurora is a complex leadership-class system comprising 166 HPE Cray EX4000 cabinets and 10,624 nodes. Each node has 2 Intel Xeon CPU Max Series processors with HBM2e and DDR5, and 6 Intel Data Center GPU Max Series (Ponte Vecchio) GPUs with HBM2e. The system interconnect is Slingshot 11 in a Dragonfly topology. Additionally, it includes 1,024 DAOS nodes with NVRAM and NVMe SSDs, providing a total raw capacity of 250 PiB. Managed by HPE HPCM, the nodes run SUSE Linux with the HPE Cray Programming Environment and the Intel oneAPI HPC Toolkit. This architecture underpins the rigorous demands of exascale computing.
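To make the scale concrete, here is a quick back-of-the-envelope computation from the per-node figures above (a trivial sketch; only the node count and per-node counts come from the text):

```python
# System-wide totals derived from the per-node figures quoted above.
NODES = 10_624
CPUS_PER_NODE = 2   # Intel Xeon CPU Max Series
GPUS_PER_NODE = 6   # Intel Data Center GPU Max Series (Ponte Vecchio)

total_cpus = NODES * CPUS_PER_NODE   # 21,248 CPUs
total_gpus = NODES * GPUS_PER_NODE   # 63,744 GPUs
print(f"{total_cpus:,} CPUs, {total_gpus:,} GPUs")
```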
The Aurora acceptance test harness extends the Polaris system's harness, which used ReFrame, Jenkins, and Slack. To support Aurora's scale and requirements, the team added GitLab for version control, a Data Warehouse and Data Lake for data management, and StabilityDB for error tracking. The framework facilitates collaborative test creation, automated execution, and comprehensive failure analysis.
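For readers unfamiliar with ReFrame, the sketch below shows what a harness-style regression test looks like. It is a minimal illustration, not an actual Aurora test: the benchmark, source file, and output patterns are assumptions.

```python
import reframe as rfm
import reframe.utility.sanity as sn


@rfm.simple_test
class TriadCheck(rfm.RegressionTest):
    """Hypothetical per-node bandwidth smoke test in the style of a harness check."""
    valid_systems = ['*']
    valid_prog_environs = ['*']
    build_system = 'SingleSource'
    sourcepath = 'stream.cpp'   # hypothetical benchmark source
    num_tasks = 1

    @sanity_function
    def output_validates(self):
        # Functional validation: the benchmark must report a valid solution.
        return sn.assert_found(r'Solution Validates', self.stdout)

    @performance_function('GB/s')
    def triad_bandwidth(self):
        # Performance validation: extract the Triad rate for regression tracking.
        return sn.extractsingle(r'Triad:\s+(\S+)', self.stdout, 1, float)
```

ReFrame separates functional validation (sanity) from performance extraction, which is what allows a harness to track regressions across repeated automated runs.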
Close collaboration among ALCF, Intel, and HPE was crucial. RAS events and other system metrics are streamed via Kafka and forwarded to a central Data Warehouse, while GitLab's issue system tracks the resulting root cause analyses. Intel-developed software monitors failed jobs and pulls BMC logs. The Data Lake, implemented with Delta Lake, stores and analyzes massive volumes of system log data, overcoming traditional database limitations. It ingests hourly DNS zone transfers and uses a 'Location to Mask' mapping for efficient bitwise filtering. This robust analysis framework enables efficient identification and resolution of systemic issues.
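The 'Location to Mask' idea can be illustrated with a small sketch: assign each node location a bit position, so a set of nodes collapses into a single integer and overlap queries become one bitwise operation instead of many string comparisons. The location format and helper names below are hypothetical; only the bitwise-filtering concept comes from the text.

```python
# Illustrative sketch: map node locations to bit positions, then filter with ANDs.
locations = [f"x4000c{c}s{s}b0n0" for c in range(4) for s in range(8)]  # hypothetical IDs
bit_of = {loc: i for i, loc in enumerate(locations)}

def mask_of(locs):
    """Fold a collection of locations into one integer bitmask."""
    m = 0
    for loc in locs:
        m |= 1 << bit_of[loc]
    return m

failed = mask_of(["x4000c1s3b0n0", "x4000c2s5b0n0"])
suspect = mask_of(["x4000c2s5b0n0"])
print(bool(failed & suspect))  # True: overlap detected with a single AND
```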
The harness provides collaborative test creation, support for multiple workflow and application patterns, hands-off automated test execution, unique workspace creation, optional compilation, automated job submission, functional and performance validation, artifact retention, failure notification, collaborative root cause analysis, and stakeholder-accessible dashboards. These capabilities ensure system stability, accuracy, and performance for scientific discovery.
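As one example of these capabilities, unique workspace creation and artifact retention might look like the sketch below. The helper names and directory layout are assumptions, not the harness's actual implementation.

```python
import shutil
import tempfile
from datetime import datetime, timezone
from pathlib import Path

def make_workspace(root: Path, test_name: str) -> Path:
    """Create a unique per-run workspace so concurrent tests never collide."""
    root.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return Path(tempfile.mkdtemp(prefix=f"{test_name}-{stamp}-", dir=root))

def retain_artifacts(workspace: Path, archive_root: Path) -> Path:
    """Copy run outputs aside so failures can be analyzed after cleanup."""
    archive_root.mkdir(parents=True, exist_ok=True)
    dest = archive_root / workspace.name
    shutil.copytree(workspace, dest)
    return dest
```

Per-run directories keep concurrent tests from clobbering one another, and archiving outputs preserves the evidence needed for collaborative root cause analysis.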
Acceptance Test Process Flow
| Capability | Polaris (Legacy) | Aurora (Enhanced) |
|---|---|---|
| Test Creation | ReFrame tests run via Jenkins | Collaborative creation, version-controlled in GitLab |
| Failure Analysis | Slack failure notifications | StabilityDB error tracking with Data Warehouse/Data Lake analysis and GitLab issue tracking |
| Hardware Targeting | | Tests can be directed at specific hardware |
| Software Stack Testing | | Flexible testing across software stacks |
Collaborative Acceptance Testing
Client: Argonne Leadership Computing Facility (ALCF)
Challenge: Deploying the Aurora exascale system, with 10,624 nodes and a complex software and hardware stack, requiring robust and collaborative acceptance testing.
Solution: Extended open-source test harness (ReFrame, Jenkins, Slack) with GitLab, Data Warehouse, Data Lake, and StabilityDB. Implemented collaborative test creation and automated root cause analysis workflows. Enabled hardware targeting and flexible software stack testing.
Results: Successful acceptance in December 2024. Executed 131,470 jobs across hundreds of configurations over 28 days, roughly 4,700 jobs per day. Enhanced transparency, reproducibility, and knowledge sharing. Accelerated troubleshooting and system validation.
Estimate Your Potential AI Efficiency Gains
Quantify the impact of advanced AI integration on your operational efficiency and cost savings. Adjust the parameters below to see estimated annual savings and reclaimed employee hours based on industry benchmarks.
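Under the hood, such an estimate reduces to simple arithmetic. The sketch below shows one plausible formula; every parameter name and default value is an illustrative assumption, not a published benchmark.

```python
def efficiency_gains(employees: int,
                     hours_saved_per_week: float,
                     hourly_cost: float,
                     work_weeks_per_year: int = 48) -> tuple[float, float]:
    """Hypothetical calculator logic: reclaimed hours and their annual value.

    All parameters are illustrative placeholders, not industry benchmarks.
    """
    reclaimed_hours = employees * hours_saved_per_week * work_weeks_per_year
    return reclaimed_hours, reclaimed_hours * hourly_cost

hours, savings = efficiency_gains(employees=200, hours_saved_per_week=2.5, hourly_cost=75.0)
print(f"{hours:,.0f} hours reclaimed ≈ ${savings:,.0f} per year")
```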
Implementation Roadmap
A phased approach ensures successful integration and maximum ROI. Our experts guide you through each step, from initial strategy to continuous optimization.
Phase 1: Strategic Alignment & Pilot
Define key objectives, identify pilot projects, and establish success metrics. Integrate core AI components into a small-scale environment for initial validation.
Phase 2: Scaled Integration & Optimization
Expand AI solutions to broader organizational functions, optimize performance based on pilot results, and ensure seamless integration with existing systems.
Phase 3: Continuous Innovation & Governance
Establish ongoing monitoring, refine AI models for continuous improvement, and implement robust governance frameworks for ethical and effective AI deployment.
Ready to Transform Your Enterprise?
Begin your AI transformation journey today. Schedule a personalized consultation to explore how our tailored solutions can drive unparalleled efficiency and innovation for your business.