Enterprise AI Analysis: Is All the Information in the Price? LLM Embeddings versus the EMH in Stock Clustering

This paper investigates whether artificial intelligence (AI), specifically large language model (LLM) embeddings, can improve stock clustering compared to traditional methods like price-based correlations and human-defined industry classifications (GICS). Examining the semi-strong Efficient Markets Hypothesis (EMH), the study finds that price-based clustering consistently outperforms both GICS and LLM embeddings in predicting short-horizon stock returns, suggesting that relevant public information is largely already reflected in market prices.

Executive Impact: Quantified Advantage

The analysis finds that price-based clustering predicts short-horizon returns more accurately than GICS or LLM embeddings, a result consistent with the semi-strong Efficient Markets Hypothesis and directly relevant to enterprise financial strategies.

15.9% RMSE Reduction vs. GICS
14.7% RMSE Reduction vs. LLM
Statistically Significant Difference (Paired t-test)
2022 through 2024 Backtesting Window

Deep Analysis & Enterprise Applications

The core research question is: can AI, in the form of LLM embeddings, produce better stock clusters than price-based or GICS-based approaches?

To answer this, the study rigorously compares three distinct stock clustering methodologies: (i) Price-based clusters, derived from historical return correlations; (ii) Human-informed clusters, using the Global Industry Classification Standard (GICS); and (iii) AI-driven clusters, constructed from large language model (LLM) embeddings of stock-related news headlines.

The evaluation leverages a novel methodology that transforms any clustering into a synthetic factor model, grounded in the Arbitrage Pricing Theory (APT) framework, allowing for consistent out-of-sample predictive performance assessment.

The methodology involves constructing and evaluating clusters based on different similarity measures and algorithms:

  • Similarity Measures:
    • Correlation of Historical Returns: Uses daily stock return co-movements from Compustat, computed over rolling time windows.
    • Human Insight (GICS): The de facto standard classification by S&P and MSCI, categorizing stocks into 11 sectors based on business activity.
    • LLM Embeddings of News Headlines: Utilizes OpenAI's text-embedding-3-large model on RavenPack news headlines, creating 3072-dimensional embedding vectors to capture semantic relationships.
    • Random Baseline: Stocks arbitrarily assigned to clusters as a performance floor.
  • Clustering Algorithms:
    • k-means clustering: Partitions firms into k clusters by minimizing the within-cluster sum of squares (WCSS).
    • Hierarchical clustering (Agglomerative): A bottom-up approach merging clusters based on smallest inter-cluster dissimilarity.
  • Evaluation Framework: Each clustering method generates 11 clusters. These are converted into synthetic factor models using the Arbitrage Pricing Theory (APT) framework. Daily return series are constructed for each cluster (equal-weighted average). A linear model is estimated weekly with one month of daily data. Out-of-sample prediction errors (RMSE and MAE) are then computed.
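The price-based branch of the methodology above can be illustrated with a minimal sketch. It assumes a dissimilarity of 1 minus the Pearson correlation and average linkage; the source does not state which distance or linkage the study used, and the tickers and return values are invented for illustration. It uses only the Python standard library.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length return series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def correlation_clusters(returns, n_clusters):
    """Agglomerative clustering on the distance 1 - correlation.

    `returns` maps ticker -> list of daily returns over one window.
    Average linkage is an assumption of this sketch.
    """
    tickers = sorted(returns)
    dist = {
        (a, b): 1.0 - pearson(returns[a], returns[b])
        for a in tickers for b in tickers if a != b
    }
    clusters = [[t] for t in tickers]  # start with one cluster per stock

    def linkage(c1, c2):
        # Average pairwise distance between members of the two clusters.
        return sum(dist[a, b] for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Merge the pair of clusters with the smallest dissimilarity.
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        i, j = min(pairs, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# Hypothetical one-week window: two "banks" and two "utilities".
toy_returns = {
    "BANK_A": [0.010, -0.020, 0.015, -0.005, 0.020],
    "BANK_B": [0.012, -0.018, 0.013, -0.004, 0.022],
    "UTIL_A": [-0.004, 0.006, -0.002, 0.008, -0.006],
    "UTIL_B": [-0.005, 0.007, -0.001, 0.009, -0.007],
}
print(correlation_clusters(toy_returns, n_clusters=2))
```

On this toy window the two banks and the two utilities end up in separate clusters. The study itself requests 11 clusters over S&P 500 constituents; at that scale a library routine would be the practical choice, since this loop recomputes linkages on every merge.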

The empirical results, derived from rolling-forward experiments on S&P 500 constituents from 2022 through 2024, provide clear insights into the predictive performance of different clustering methods.

  • Dominant Performance of Price-Based Clustering: Price-based clustering methods consistently outperformed both LLM embedding methods and GICS.
  • Quantified Superiority: Price-based clustering reduced the Root Mean Squared Error (RMSE) by 15.9% relative to GICS and 14.7% relative to LLM embeddings.
  • Statistical Significance: A paired Student's t-test comparing the best-performing returns-based clustering with the best-performing LLM-based clustering yielded a reported p-value of 0 (that is, indistinguishable from zero at the reported precision), indicating that the difference in performance is statistically significant.
  • Reinforcement of EMH: The findings reinforce the view that the dominant information content for equity pricing is already embedded in co-movements of returns, supporting the semi-strong Efficient Markets Hypothesis.
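The two headline statistics in the list above follow standard formulas, sketched here with the Python standard library. The RMSE inputs and error series are hypothetical and chosen only to make the arithmetic visible; they are not the study's data.

```python
import math


def rmse_reduction(rmse_candidate, rmse_benchmark):
    """Fractional RMSE reduction of the candidate relative to the benchmark."""
    return 1.0 - rmse_candidate / rmse_benchmark


def paired_t_statistic(errors_a, errors_b):
    """t statistic and degrees of freedom for a paired Student's t-test.

    Each pair holds the errors of two methods on the same evaluation period.
    """
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n), n - 1


# Hypothetical values: a candidate RMSE of 0.841 against a benchmark of 1.0
# corresponds to a 15.9% reduction.
print(round(rmse_reduction(0.841, 1.0), 3))

# Hypothetical per-period errors for two methods.
t_stat, dof = paired_t_statistic([1.0, 1.2, 0.9, 1.1], [1.3, 1.4, 1.3, 1.3])
print(t_stat, dof)
```

Turning the t statistic into a p-value requires the Student's t distribution, which the standard library does not provide; `scipy.stats.ttest_rel` is a common choice. With a long run of paired observations the statistic can grow large enough that the p-value rounds to 0, which is one plausible reading of the reported figure.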

Human judgments and sophisticated NLP extractions from news text did not translate into superior short-term predictive factors compared to market-driven price signals.

The research has significant implications for both financial practitioners and academics:

  • For Practitioners: The proposed framework offers a practical diagnostic tool for monitoring evolving sector structures. It allows for computing rolling correlation clusters, constructing equal-weighted synthetic portfolios, and tracking their incremental explanatory power relative to existing sector definitions. This can guide portfolio construction, risk modeling, and strategic asset allocation by identifying the most effective drivers of cross-sectional return variation.
  • For Academics: The methodology provides a robust framework for testing alternative hypotheses about how quickly markets absorb information and what types of information are most efficiently priced. It opens avenues for further research into market efficiency and the utility of alternative data sources in financial modeling.
  • Support for EMH: The empirical evidence strongly suggests that for short-horizon returns, the market efficiently incorporates public information, making price-based signals highly effective.
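One simple way to read "incremental explanatory power relative to existing sector definitions" is as a difference in R² between two single-factor fits of the same stock: one on its sector factor, one on its correlation-cluster factor. The sketch below takes that reading as an assumption; the source does not specify the measure, and all series are invented.

```python
def r_squared(x, y):
    """R^2 of a one-regressor OLS fit of y on x (squared Pearson correlation)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def incremental_r2(stock, sector_factor, cluster_factor):
    """R^2 gained by replacing the sector factor with the cluster factor."""
    return r_squared(cluster_factor, stock) - r_squared(sector_factor, stock)


# Hypothetical daily returns over one window.
stock = [0.010, -0.020, 0.015, -0.005, 0.020]
cluster_factor = [0.011, -0.019, 0.014, -0.004, 0.021]  # tracks the stock closely
sector_factor = [0.002, -0.006, 0.010, 0.004, 0.001]    # tracks it loosely

print(incremental_r2(stock, sector_factor, cluster_factor))
```

Tracked over rolling windows, a persistently positive gain would suggest the correlation clusters capture co-movement that the static sector labels miss, while a gain near zero would suggest the existing definitions remain adequate.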

Price-Based Clustering Outperforms AI & GICS

15.9% RMSE Reduction vs. GICS (14.7% vs. LLM)

The study benchmarked three clustering approaches: price-based (historical return correlations), human-informed (GICS), and AI-driven (LLM embeddings of news headlines). Price-based clustering consistently achieved the lowest Root Mean Squared Error (RMSE) of the three, a result consistent with the semi-strong Efficient Markets Hypothesis for short-horizon returns.

APT-Based Clustering Evaluation Flow

The study's evaluation methodology transforms any equity grouping (manual, machine, or market-driven) into a factor model that is re-estimated on a rolling basis. This allows out-of-sample predictive performance to be assessed consistently across methods, which is crucial for robust financial analysis.

Assign Stocks to 11 Clusters (Weekly)
Construct Daily Return Series for Each Cluster
Estimate Linear Model (OLS) using 1 Month Daily Data
Predict Next Period Returns Out-of-Sample
Measure RMSE & MAE Daily
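The flow above can be sketched for a single estimation window as follows. Two simplifications are assumptions of this sketch, not details confirmed by the source: each stock is regressed only on its own cluster's factor rather than on all 11 factors, and the fitted coefficients are applied to the realized factor returns in the test period. The tickers and returns are invented.

```python
import math


def equal_weighted_factor(returns, members):
    """Daily return series of a cluster: equal-weighted average of its members."""
    n_days = len(returns[members[0]])
    return [sum(returns[t][d] for t in members) / len(members) for d in range(n_days)]


def fit_ols(x, y):
    """One-regressor OLS: returns (intercept, slope) for y = a + b * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def evaluate(returns, clusters, train_days):
    """Fit on the first `train_days` days, score on the remaining days.

    Returns (rmse, mae) pooled over all stocks and test days.
    """
    errors = []
    for members in clusters:
        factor = equal_weighted_factor(returns, members)
        for ticker in members:
            series = returns[ticker]
            a, b = fit_ols(factor[:train_days], series[:train_days])
            for f, r in zip(factor[train_days:], series[train_days:]):
                errors.append(r - (a + b * f))
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    mae = sum(abs(e) for e in errors) / len(errors)
    return rmse, mae


# Hypothetical data: A1 and A2 share one driver, B1 and B2 share another.
d1 = [0.010, -0.020, 0.015, -0.005, 0.020, 0.004, -0.012, 0.008]
d2 = [-0.004, 0.006, 0.003, 0.008, -0.006, 0.010, 0.002, -0.009]
toy = {
    "A1": d1, "A2": [2.0 * x for x in d1],
    "B1": d2, "B2": [0.5 * x for x in d2],
}
good = evaluate(toy, [["A1", "A2"], ["B1", "B2"]], train_days=5)
bad = evaluate(toy, [["A1", "B1"], ["A2", "B2"]], train_days=5)
print(good, bad)
```

Grouping stocks that share a driver yields near-zero out-of-sample error on this toy data, while the mismatched grouping does not, which is the comparison the framework makes across clustering methods. In the study's setup this evaluation would be repeated weekly, each time on one month of daily data.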

Clustering Method Comparison

A direct comparison of clustering methods highlights their distinct characteristics and performance profiles, demonstrating why price-based approaches currently lead in short-term return prediction.

Price-Based (Historical Returns)
  • Basis: Daily return co-movements
  • Pros:
    • Highest predictive accuracy (lowest RMSE)
    • Dynamically adapts to market changes
  • Cons:
    • Requires significant historical data processing
AI-Driven (LLM Embeddings)
  • Basis: Semantic relationships from news headlines
  • Pros:
    • Captures non-traditional, textual information
    • Potential for forward-looking signals
  • Cons:
    • Outperformed by price-based methods
    • Computationally intensive
    • Limited by LLM knowledge cutoff date
Human-Defined (GICS)
  • Basis: Expert judgment, business activity
  • Pros:
    • Commonly understood, stable
    • Widely adopted for long-term analysis
  • Cons:
    • Lowest predictive accuracy for short-term returns
    • Slow to update
    • Does not capture dynamic market shifts

EMH Validation: Price Leads Information

This research provides robust empirical evidence reinforcing the semi-strong Efficient Markets Hypothesis. It demonstrates that for short-horizon equity returns, the information relevant for clustering stocks is largely already contained within their historical price movements. Neither sophisticated human-defined categories like GICS nor advanced AI-driven LLM embeddings of news headlines provided superior predictive power. This suggests that public information is quickly and efficiently absorbed by market prices, making price-based methods the most effective for identifying underlying systematic risk structures for trading and risk management strategies.

Your Path to Optimized Financial Modeling

A structured approach to integrating advanced clustering and factor modeling into your enterprise, maximizing efficiency and predictive accuracy.

Phase 1: Discovery & Strategy Alignment

Initial consultation to understand current clustering methodologies, data infrastructure, and strategic objectives. Define KPIs for performance improvement and tailor the AI integration roadmap.

Phase 2: Data Integration & Model Prototyping

Secure integration of historical price data, news feeds (if applicable), and existing GICS classifications. Develop and test initial price-based factor models with your data, establishing benchmarks.

Phase 3: Customization & Validation

Refine clustering algorithms and factor models based on continuous performance monitoring and backtesting against your specific portfolio and market conditions. Validate out-of-sample predictive power.

Phase 4: Deployment & Continuous Optimization

Deploy the optimized clustering and factor modeling solution into your trading or risk management systems. Provide training and ongoing support, with continuous monitoring and adaptive optimization to market shifts.

Unlock Superior Market Insights

Ready to enhance your financial modeling with empirically proven clustering strategies? Connect with our experts to tailor a solution for your enterprise.
