GradeSQL: Outcome Reward Models for Ranking SQL Queries from Large Language Models
Revolutionizing Text-to-SQL with Semantic Accuracy
Authors: Mattia Tritto*, Giuseppe Farano*, Dario Di Palma*, Gaetano Rossiello, Fedelucio Narducci, Dharmashankar Subramanian, and Tommaso Di Noia
Affiliations: Polytechnic University of Bari, IBM T.J. Watson Research Center
Text-to-SQL, the task of translating natural language questions into SQL queries, has significantly advanced with the introduction of Large Language Models (LLMs), broadening database accessibility for a wide range of users. Despite substantial progress in generating valid SQL, current LLMs still struggle with complex queries that require precise alignment between user intent and the database schema. To mitigate this, test-time strategies such as Best-of-N (BoN) and Majority Voting (Maj) are often employed, based on the assumption that LLMs can generate correct answers but may require multiple attempts. However, these methods rely on surface-level heuristics, selecting either the syntactically correct query through execution-based BoN (ex-BoN) or the most frequently generated query with Maj. Recently, Outcome Reward Models (ORMs), which assign utility scores to generated outputs based on semantic correctness, have emerged as a promising approach for better aligning model predictions with user intent. Nevertheless, their application to Text-to-SQL remains largely underexplored. In this work, we evaluate ORMs as an effective heuristic for BoN, compare them with ex-BoN and Maj, and introduce a framework for training ORMs for the Text-to-SQL task. We evaluate our ORMs on the BIRD and Spider benchmarks, fine-tuning various open-source LLMs, including the Qwen2, Granite3, and Llama3 model families. Our results show that ORMs outperform ex-BoN and Maj, achieving execution accuracy gains of +4.33% (BIRD) and +2.10% (Spider) over ex-BoN, and +2.91% (BIRD) and +0.93% (Spider) over Maj. We further demonstrate that fine-tuning models already aligned with SQL generation, such as OmniSQL, yields superior ORM performance. Additionally, we observe that ORMs achieve competitive results on simple queries and benefit more from an increased number of candidates compared to ex-BoN and Maj. All code, datasets, and trained models are publicly released to support reproducibility and encourage future research in this area.
Executive Impact: GradeSQL at a Glance
GradeSQL's innovative Outcome Reward Models (ORMs) represent a significant leap in Text-to-SQL query ranking, offering enhanced accuracy and reliability for enterprise data systems.
Deep Analysis & Enterprise Applications
Text-to-SQL systems convert natural language questions into executable SQL queries. Traditional methods struggle with complex queries, motivating test-time strategies such as Best-of-N (BoN) and Majority Voting (Maj). These strategies rely on surface-level heuristics and fail to fully capture semantic correctness. GradeSQL addresses this by introducing Outcome Reward Models (ORMs).
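For intuition, here is a minimal sketch of the two baseline heuristics, assuming candidates are executed against a SQLite database; the function names and the `frozenset`-based result comparison are illustrative simplifications, not the paper's exact implementation.

```python
import sqlite3
from collections import Counter

def execute(db_path: str, sql: str):
    """Run a query; return its result set as a hashable value, or None on error."""
    try:
        with sqlite3.connect(db_path) as conn:
            # frozenset ignores row order (and duplicates), a common simplification
            # when comparing SQL execution results.
            return frozenset(map(tuple, conn.execute(sql).fetchall()))
    except sqlite3.Error:
        return None

def ex_bon(db_path: str, candidates: list[str]):
    """Execution-based Best-of-N: return the first candidate that runs without error."""
    for sql in candidates:
        if execute(db_path, sql) is not None:
            return sql
    return None

def majority_voting(db_path: str, candidates: list[str]):
    """Majority Voting: group candidates by execution result, return one from the largest group."""
    executed = [(sql, res) for sql in candidates
                if (res := execute(db_path, sql)) is not None]
    if not executed:
        return None
    winning_result, _ = Counter(res for _, res in executed).most_common(1)[0]
    return next(sql for sql, res in executed if res == winning_result)
```

Note that neither heuristic ever asks whether the result actually answers the question; that gap is what the ORM fills.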
GradeSQL proposes a three-stage framework: Candidate Generation, Data Labeling, and Supervised Fine-Tuning (SFT). An LLM generates diverse SQL candidates, which are then labeled for semantic correctness by comparing execution results against gold queries. Finally, a separate LLM is fine-tuned as an ORM to assign probabilistic scores to candidates based on their semantic alignment.
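A sketch of the Data Labeling stage under the same assumptions (reusing `execute()` from the snippet above): a candidate is labeled semantically correct when its execution result matches that of the gold query.

```python
def label_candidate(db_path: str, candidate_sql: str, gold_sql: str) -> int:
    """Label 1 if the candidate's execution result matches the gold query's, else 0."""
    gold = execute(db_path, gold_sql)
    pred = execute(db_path, candidate_sql)
    return int(gold is not None and pred == gold)

# The labeled triples (question, candidate SQL, label) then form the corpus on
# which a separate LLM is fine-tuned as the ORM.
```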
ORMs consistently outperform execution-based Best-of-N and Majority Voting across the BIRD and Spider benchmarks, with execution accuracy gains of up to +4.33% (BIRD) and +2.10% (Spider) over ex-BoN, and +2.91% (BIRD) and +0.93% (Spider) over Maj. Fine-tuning the ORM from a model already aligned with SQL generation (e.g., OmniSQL) further enhances performance. ORMs also remain competitive on simple queries and benefit more than ex-BoN and Maj from an increased number of candidates.
GradeSQL Framework Stages: Candidate Generation → Data Labeling → Supervised Fine-Tuning (SFT)
Execution accuracy (%) by selection method; parentheses show the gain over the N=1 baseline.

| Method | BIRD dev | Spider dev | Spider test |
|---|---|---|---|
| Baseline (N=1) | 63.89 | 82.40 | 84.02 |
| Majority Voting (N=32) | 66.95 (+3.06) | 83.75 (+1.35) | 85.47 (+1.45) |
| Execution-based Best-of-N (N=32) | 66.04 (+2.15) | 82.79 (+0.39) | 85.14 (+1.12) |
| ORM-based Best-of-N (N=32) | 68.90 (+5.01) | 84.53 (+2.13) | 87.47 (+3.45) |
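At inference time, the ORM scores each candidate and the highest-scoring query is returned. Below is a minimal sketch, assuming the ORM was fine-tuned as a single-logit scorer loaded via Hugging Face Transformers; the checkpoint path and prompt layout are placeholders, not the paper's exact configuration.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

def orm_best_of_n(question: str, schema: str, candidates: list[str],
                  checkpoint: str = "path/to/orm-checkpoint") -> str:
    """Score each candidate with the ORM and return the highest-scoring query."""
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=1)
    model.eval()
    scores = []
    for sql in candidates:
        # The (schema, question, candidate) serialization below is an assumption.
        inputs = tokenizer(f"{schema}\n{question}\n{sql}",
                           return_tensors="pt", truncation=True)
        with torch.no_grad():
            logit = model(**inputs).logits.squeeze()
        scores.append(torch.sigmoid(logit).item())  # probability of semantic correctness
    return candidates[max(range(len(candidates)), key=scores.__getitem__)]
```

Because each candidate is scored independently, this step parallelizes trivially across the N generations.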
Real-world Impact: Enhanced Data Accessibility
Imagine an enterprise with vast, complex databases. With GradeSQL, non-technical users can pose natural language questions and receive highly accurate SQL queries, democratizing data access. This reduces the burden on IT teams and accelerates data-driven decision-making. Our ORMs ensure that even with diverse LLM outputs, the most semantically correct query is consistently selected, minimizing errors and maximizing efficiency.
Advanced ROI Calculator
Understand the potential time and cost savings GradeSQL can bring to your organization by improving SQL generation accuracy.
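The calculation behind such an estimate is straightforward. The sketch below is purely illustrative: every figure is a hypothetical placeholder (the accuracy values loosely mirror the BIRD dev results above) to be replaced with your organization's own data.

```python
queries_per_month = 10_000       # hypothetical: NL-to-SQL requests per month
baseline_accuracy = 0.64         # hypothetical: single-sample baseline (cf. 63.89 on BIRD dev)
orm_accuracy = 0.69              # hypothetical: ORM-based Best-of-N (cf. 68.90 on BIRD dev)
minutes_per_failed_query = 20    # hypothetical: time spent triaging a wrong query
hourly_cost = 60.0               # hypothetical: blended hourly labor cost (USD)

fewer_failures = queries_per_month * (orm_accuracy - baseline_accuracy)
monthly_savings = fewer_failures * (minutes_per_failed_query / 60) * hourly_cost
print(f"~{fewer_failures:.0f} fewer failed queries/month, saving about ${monthly_savings:,.0f}")
```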
Implementation Roadmap
A phased approach to integrate GradeSQL into your enterprise, ensuring a smooth transition and maximum impact.
Phase 1: Pilot Program & Integration
Implement GradeSQL within a controlled environment, integrating with existing LLM pipelines and database systems. Focus on a specific business unit for initial testing and feedback collection.
Phase 2: Performance Tuning & Expansion
Optimize ORM models on custom enterprise data, leveraging GradeSQL's candidate-generation and labeling pipeline to build training corpora. Expand deployment to broader user groups, offering training and support.
Phase 3: Advanced AI-Driven Data Access
Explore integration with reinforcement learning for continuous self-improvement and enable conversational SQL interfaces across the enterprise, pushing the boundaries of natural language data interaction.
Ready to Transform Your Data Strategy?
Book a free consultation with our AI experts to explore how GradeSQL can enhance your Text-to-SQL capabilities and drive significant operational efficiencies.