Enterprise AI Analysis

HIGRAPH: A Large-Scale Hierarchical Graph Dataset for Malware Analysis

The advancement of graph-based malware analysis is critically limited by the absence of large-scale datasets that capture the inherent hierarchical structure of software. Existing methods often oversimplify programs into single-level graphs, failing to model the crucial semantic relationship between high-level functional interactions and low-level instruction logic. To bridge this gap, we introduce HIGRAPH, the largest public hierarchical graph dataset for malware analysis, comprising over 200M Control Flow Graphs (CFGs) nested within 595K Function Call Graphs (FCGs). This two-level representation preserves structural semantics essential for building robust detectors resilient to code obfuscation and malware evolution. We demonstrate HIGRAPH's utility through a large-scale analysis that reveals distinct structural properties of benign and malicious software, establishing it as a foundational benchmark for the community. The dataset and tools are publicly available at https://higraph.org.

Authors: Han Chen, Hanchen Wang, Hongmei Chen, Ying Zhang, Lu Qin, Wenjie Zhang

Quantifiable Impact & Core Innovations

HIGRAPH offers unprecedented scale and structural depth, providing a robust foundation for next-generation malware analysis.

Deep Analysis & Enterprise Applications

The Critical Need for Hierarchical Models

Understanding the limitations of traditional 'flat' graph models and the critical need for a hierarchical representation to capture complex software behaviors, as exemplified by ransomware evolution.

CryptoLocker Evolution: A Case for Hierarchical Graphs

The evolution of ransomware like CryptoLocker from simple XOR to hybrid AES+RSA encryption (Figure 1) demonstrates that while implementation details change, its core malicious behavior (file discovery → encryption → notification) remains consistent. Traditional flat graph representations often miss this, but HIGRAPH's hierarchical model captures these persistent structural patterns, enabling detection that transcends superficial code modifications.

Over 200M CFGs nested within 595K FCGs (Hierarchical Scale)
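To make the two-level idea concrete, here is a minimal sketch of a hierarchical graph in which every FCG node owns a nested CFG. The class names (`CFG`, `HierarchicalGraph`), the function names, and the two toy "variants" are hypothetical illustrations, not HIGRAPH's actual schema or real CryptoLocker code. The sketch only shows how a change inside the encryption routine can leave the call-level structure untouched.

```python
from dataclasses import dataclass, field


@dataclass
class CFG:
    """Control flow graph of one function: basic blocks plus directed edges."""
    blocks: list[str]
    edges: list[tuple[int, int]] = field(default_factory=list)


@dataclass
class HierarchicalGraph:
    """Function call graph whose nodes each own a nested CFG (hypothetical layout)."""
    cfgs: dict[str, CFG] = field(default_factory=dict)
    calls: set[tuple[str, str]] = field(default_factory=set)

    def add_function(self, name: str, cfg: CFG) -> None:
        self.cfgs[name] = cfg

    def add_call(self, caller: str, callee: str) -> None:
        if caller not in self.cfgs or callee not in self.cfgs:
            raise KeyError("both endpoints must be registered functions")
        self.calls.add((caller, callee))


def build_sample(encrypt_cfg: CFG) -> HierarchicalGraph:
    """Toy ransomware-like app: file discovery -> encryption -> notification."""
    g = HierarchicalGraph()
    g.add_function(
        "discover_files",
        CFG(["entry", "walk_dir", "exit"], [(0, 1), (1, 1), (1, 2)]),
    )
    g.add_function("encrypt", encrypt_cfg)
    g.add_function("notify", CFG(["entry", "show_note"], [(0, 1)]))
    g.add_call("discover_files", "encrypt")
    g.add_call("encrypt", "notify")
    return g


# Two invented variants that differ only inside the encryption routine.
xor_variant = build_sample(
    CFG(["entry", "xor_loop", "exit"], [(0, 1), (1, 1), (1, 2)])
)
hybrid_variant = build_sample(
    CFG(
        ["entry", "gen_aes_key", "aes_loop", "rsa_wrap_key", "exit"],
        [(0, 1), (1, 2), (2, 2), (2, 3), (3, 4)],
    )
)

same_fcg = xor_variant.calls == hybrid_variant.calls
same_encrypt_cfg = xor_variant.cfgs["encrypt"] == hybrid_variant.cfgs["encrypt"]
print(same_fcg, same_encrypt_cfg)  # prints "True False"
```

In this toy case the FCG edges are identical across variants while the nested CFG of `encrypt` differs, which is the kind of stable high-level pattern a flat, single-level graph would blur together with the low-level change.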

Building a Foundational Dataset

Explore HIGRAPH's meticulous methodology for collecting, curating, and extracting hierarchical graph representations (FCGs and CFGs) from a massive dataset of Android applications.

Enterprise Process Flow

Collect Android Apps (AndroZoo) → Download VT Reports (Labeling) → Decompile & Unpack (Androguard) → Extract Hierarchical Graph Structure → Feature Engineering (CFG/FCG)

HIGRAPH vs. Existing Malware Datasets
Dataset	Year	Size	Format	Scale	Temporal Robustness	Spatial Robustness
AndroZoo[1]	2016	11M+	Raw APKs	Low/None	Low/None	Low/None
Drebin[4]	2014	5.5K	Features	Medium/Partial	Low/None	Low/None
AMD[30]	2017	NA	Features	Medium/Partial	Low/None	Low/None
APIGraph[32]	2020	322K	Features	High/Full	Low/None	Low/None
Malnet[16]	2021	1.2M	Single-level	High/Full	Medium/Partial	Medium/Partial
HIGRAPH (Ours)	2025	595K	Hierarchical	High/Full	High/Full	High/Full

Key Structural Differences: Benign vs. Malicious

Uncover distinct structural properties of benign and malicious software, revealing key patterns in FCG and CFG complexity, centrality, and temporal evolution.

Our analysis (Figure 4) shows that malicious applications have higher average and maximum PageRank values in their Function Call Graphs (FCGs), which indicates a few highly influential functions and a more centralized architecture. At the Control Flow Graph (CFG) level, malware shows higher node degrees and higher cyclomatic complexity, pointing to intricate conditional logic that is often introduced by obfuscation. Over time, benign FCGs grow at an accelerating rate and become more modular, whereas malware FCGs tend to shrink after 2015 while becoming denser, consistent with functionality being concentrated in fewer functions (Figure 8).
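The structural measures named above can be computed with a few lines of code. The following is a minimal sketch using textbook definitions (power-iteration PageRank, McCabe's cyclomatic complexity M = E - N + 2P, and directed graph density). The damping factor, iteration count, and the two toy call graphs are assumptions chosen for illustration; the source does not state which implementation or parameters were used for the analysis.

```python
def pagerank(nodes, edges, damping=0.85, iters=100):
    """Power-iteration PageRank; rank of dangling nodes is spread uniformly."""
    n = len(nodes)
    out = {v: [] for v in nodes}
    for src, dst in edges:
        out[src].append(dst)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        dangling = sum(rank[v] for v in nodes if not out[v])
        nxt = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for w in out[v]:
                nxt[w] += damping * rank[v] / len(out[v])
        rank = nxt
    return rank


def cyclomatic_complexity(num_nodes, num_edges, components=1):
    """McCabe's M = E - N + 2P for a control flow graph."""
    return num_edges - num_nodes + 2 * components


def density(num_nodes, num_edges):
    """Directed graph density: E / (N * (N - 1))."""
    if num_nodes < 2:
        return 0.0
    return num_edges / (num_nodes * (num_nodes - 1))


# Toy FCGs (invented): a centralized "hub" design versus a simple call chain.
funcs = ["a", "b", "c", "d"]
hub = pagerank(funcs, [("b", "a"), ("c", "a"), ("d", "a")])
chain = pagerank(funcs, [("a", "b"), ("b", "c"), ("c", "d")])
print(max(hub.values()) > max(chain.values()))

# Toy CFG (invented): one if/else with blocks cond, then, else, join.
print(cyclomatic_complexity(num_nodes=4, num_edges=4))
print(density(num_nodes=4, num_edges=3))
```

In the hub graph every caller points at the same function, so its maximum PageRank exceeds that of the chain. That is the centralization signal the paragraph above attributes to malicious FCGs.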

Superior Detection & Temporal Robustness

Evaluate HIGRAPH's effectiveness in malware detection and classification, highlighting the superior performance and temporal robustness of hierarchical graph neural networks.

Malware Detection Performance (Macro F1)
Model	Binary Classification (Macro F1)	Multi-class Classification (Macro F1)	2012-2013 Temporal AUT(F1)	2012-2016 Temporal AUT(F1)
GCN	0.640 ±0.022	0.401 ±0.024	0.604 ±0.021	0.489 ±0.018
GAT	0.719 ±0.021	0.412 ±0.021	0.691 ±0.018	0.483 ±0.014
GIN	0.690 ±0.021	0.401 ±0.024	0.743 ±0.023	0.520 ±0.021
GraphSAGE	0.678 ±0.017	0.392 ±0.023	0.612 ±0.019	0.489 ±0.016
Hi-GNN	0.734 ±0.210	0.435 ±0.019	0.755 ±0.017	0.715 ±0.019
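For readers unfamiliar with the two metrics in the table, the sketch below computes Macro F1 (the unweighted mean of per-class F1 scores) and AUT(F1). The source does not define AUT, so the code assumes the commonly used "Area Under Time" definition: the trapezoidal area under the per-period F1 curve, normalized by the number of intervals. The labels and per-period scores are made-up illustrative values, not results from the paper.

```python
def f1_score(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        scores.append(f1_score(tp, fp, fn))
    return sum(scores) / len(scores)


def aut(scores):
    """Assumed AUT: trapezoidal area under the metric curve over N test
    periods, divided by N - 1 so that a constant score s yields AUT = s."""
    if len(scores) < 2:
        raise ValueError("need at least two test periods")
    area = sum((a + b) / 2 for a, b in zip(scores, scores[1:]))
    return area / (len(scores) - 1)


# Illustrative inputs only.
print(macro_f1([0, 0, 1, 1], [0, 1, 1, 1]))
print(aut([0.9, 0.7, 0.5]))
```

Under this definition a model whose F1 decays quickly over successive test periods gets a lower AUT than one that holds steady, which is why the metric is used to quantify model aging.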

0.734 Hi-GNN Binary Classification Performance (Macro F1)

0.755 Hi-GNN Temporal Robustness (2012-2013 AUT(F1))

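Our Hi-GNN model outperforms the single-level GNNs on both binary and multi-class malware classification (Table 6): it reaches a Macro F1 of 0.734 on the binary task, against 0.719 for the strongest single-level baseline (GAT), and 0.435 on the multi-class task, against 0.412 for GAT. Hi-GNN is also more robust over time, mitigating the 'model aging' effect (Table 7). Its 2012-2013 AUT(F1) of 0.755 is slightly above the best baseline (GIN, 0.743), and the gap widens over the longer 2012-2016 window, where Hi-GNN scores 0.715 against 0.520 for GIN. This resilience stems from capturing stable semantic features at the CFG level and adaptive architectural patterns at the FCG level, which points to the hierarchical structure as a key factor in mitigating concept drift.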
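The idea of combining CFG-level and FCG-level information can be illustrated with a deliberately simplified two-level aggregation. This is not the Hi-GNN architecture, which the source does not describe: there are no learned weights, and the function `embed_app`, its input layout, and the feature values are all hypothetical. The sketch only shows the order of operations: pool within each CFG, propagate along call edges, then read out one vector per app.

```python
def mean_vec(vectors):
    """Element-wise mean of equally sized feature vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def embed_app(block_features, call_edges):
    """Two-level aggregation over a hierarchical graph (illustrative only).

    block_features: {function name: [feature vector per basic block]}
    call_edges:     [(caller, callee), ...]
    """
    # Level 1 (CFG): pool basic-block features into one vector per function.
    func_emb = {f: mean_vec(blocks) for f, blocks in block_features.items()}

    # Level 2 (FCG): each function averages itself with the functions it calls.
    neighbours = {f: [f] for f in func_emb}
    for caller, callee in call_edges:
        neighbours[caller].append(callee)
    propagated = {
        f: mean_vec([func_emb[n] for n in ns]) for f, ns in neighbours.items()
    }

    # Readout: pool the function vectors into a single app-level vector.
    return mean_vec(list(propagated.values()))


# Invented two-function app with 2-dimensional block features.
features = {
    "main": [[1.0, 0.0], [3.0, 0.0]],
    "helper": [[0.0, 2.0]],
}
app_vector = embed_app(features, [("main", "helper")])
print(app_vector)  # prints [0.5, 1.5]
```

A trained hierarchical GNN would replace each mean with learned message-passing and pooling layers, but the data flow from basic blocks to functions to the whole application stays the same.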

Our AI Implementation Roadmap

A structured approach to integrating cutting-edge hierarchical graph analysis into your existing security infrastructure.

Phase 1: Discovery & Assessment

Comprehensive analysis of existing systems, data architecture, and specific threat detection challenges to tailor HIGRAPH's hierarchical models to your unique needs.

Phase 2: Data Integration & Model Training

Secure integration of your proprietary binary data with the HIGRAPH framework, followed by bespoke model training and fine-tuning for optimal performance.

Phase 3: Deployment & Continuous Learning

Seamless deployment of hierarchical GNNs into your operational environment, with ongoing monitoring, threat intelligence updates, and model retraining to adapt to evolving malware.

Phase 4: Performance Monitoring & Optimization

Establish key performance indicators (KPIs) for detection accuracy, false positive rates, and operational efficiency, continuously optimizing the solution for maximum impact.

Ready to Elevate Your Malware Analysis?

Leverage the power of hierarchical graph neural networks to build more robust, resilient, and adaptive threat detection systems. Connect with our experts today.
