GradeSQL: Outcome Reward Models for Ranking SQL Queries from Large Language Models
Revolutionizing Text-to-SQL with Semantic Accuracy
Authors: Mattia Tritto*, Giuseppe Farano*, Dario Di Palma*, Gaetano Rossiello, Fedelucio Narducci, Dharmashankar Subramanian, and Tommaso Di Noia
Affiliations: Polytechnic University of Bari, IBM T.J. Watson Research Center
Text-to-SQL, the task of translating natural language questions into SQL queries, has significantly advanced with the introduction of Large Language Models (LLMs), broadening database accessibility for a wide range of users. Despite substantial progress in generating valid SQL, current LLMs still struggle with complex queries that require precise alignment between user intent and the database schema. To mitigate this, test-time strategies such as Best-of-N (BoN) and Majority Voting (Maj) are often employed, based on the assumption that LLMs can generate correct answers but may require multiple attempts. However, these methods rely on surface-level heuristics, selecting either the syntactically correct query through execution-based BoN (ex-BoN) or the most frequently generated query with Maj. Recently, Outcome Reward Models (ORMs), which assign utility scores to generated outputs based on semantic correctness, have emerged as a promising approach for better aligning model predictions with user intent. Nevertheless, their application to Text-to-SQL remains largely underexplored. In this work, we evaluate ORMs as an effective heuristic for BoN, compare them with ex-BoN and Maj, and introduce a framework for training ORMs for the Text-to-SQL task. We evaluate our ORMs on the BIRD and Spider benchmarks, fine-tuning various open-source LLMs, including the Qwen2, Granite3, and Llama3 model families. Our results show that ORMs outperform ex-BoN and Maj, achieving execution accuracy gains of +4.33% (BIRD) and +2.10% (Spider) over ex-BoN, and +2.91% (BIRD) and +0.93% (Spider) over Maj. We further demonstrate that fine-tuning models already aligned with SQL generation, such as OmniSQL, yields superior ORM performance. Additionally, we observe that ORMs achieve competitive results on simple queries and benefit more from an increased number of candidates compared to ex-BoN and Maj. All code, datasets, and trained models are publicly released to support reproducibility and encourage future research in this area.
Executive Impact: GradeSQL at a Glance
GradeSQL's innovative Outcome Reward Models (ORMs) represent a significant leap in Text-to-SQL query ranking, offering enhanced accuracy and reliability for enterprise data systems.
Deep Analysis & Enterprise Applications
Text-to-SQL systems convert natural language questions into executable SQL queries. Traditional methods struggle with complex queries, motivating test-time strategies such as Best-of-N (BoN) and Majority Voting (Maj). These strategies rely on surface-level heuristics and fail to fully capture semantic correctness. GradeSQL addresses this by introducing Outcome Reward Models (ORMs).
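For intuition, here is a minimal sketch of the two baseline heuristics, assuming candidates are executed against a SQLite database; the function names and the `frozenset`-based result comparison are illustrative simplifications, not the paper's exact implementation.

```python
import sqlite3
from collections import Counter

def execute(db_path: str, sql: str):
    """Run a query; return its result set as a hashable value, or None on error."""
    try:
        with sqlite3.connect(db_path) as conn:
            # frozenset ignores row order (and duplicates), a common simplification
            # when comparing SQL execution results.
            return frozenset(map(tuple, conn.execute(sql).fetchall()))
    except sqlite3.Error:
        return None

def ex_bon(db_path: str, candidates: list[str]):
    """Execution-based Best-of-N: return the first candidate that runs without error."""
    for sql in candidates:
        if execute(db_path, sql) is not None:
            return sql
    return None

def majority_voting(db_path: str, candidates: list[str]):
    """Majority Voting: group candidates by execution result, return one from the largest group."""
    executed = [(sql, res) for sql in candidates
                if (res := execute(db_path, sql)) is not None]
    if not executed:
        return None
    winning_result, _ = Counter(res for _, res in executed).most_common(1)[0]
    return next(sql for sql, res in executed if res == winning_result)
```

Note that neither heuristic ever asks whether the result actually answers the question; that gap is what the ORM fills.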
GradeSQL proposes a three-stage framework: Candidate Generation, Data Labeling, and Supervised Fine-Tuning (SFT). An LLM generates diverse SQL candidates, which are then labeled for semantic correctness by comparing execution results against gold queries. Finally, a separate LLM is fine-tuned as an ORM to assign probabilistic scores to candidates based on their semantic alignment.
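A sketch of the Data Labeling stage under the same assumptions (reusing `execute()` from the snippet above): a candidate is labeled semantically correct when its execution result matches that of the gold query.

```python
def label_candidate(db_path: str, candidate_sql: str, gold_sql: str) -> int:
    """Label 1 if the candidate's execution result matches the gold query's, else 0."""
    gold = execute(db_path, gold_sql)
    pred = execute(db_path, candidate_sql)
    return int(gold is not None and pred == gold)

# The labeled triples (question, candidate SQL, label) then form the corpus on
# which a separate LLM is fine-tuned as the ORM.
```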
ORMs consistently outperform execution-based Best-of-N and Majority Voting across the BIRD and Spider benchmarks, with execution accuracy gains of up to +4.33% (BIRD) and +2.10% (Spider) over ex-BoN, and +2.91% (BIRD) and +0.93% (Spider) over Maj. Fine-tuning the ORM from a model already aligned with SQL generation (e.g., OmniSQL) further enhances performance. ORMs also remain competitive on simple queries and benefit more than ex-BoN and Maj from an increased number of candidates.
GradeSQL Framework Stages: Candidate Generation → Data Labeling → Supervised Fine-Tuning (SFT)
Execution accuracy (%) by selection method; parentheses show the gain over the N=1 baseline.

| Method | BIRD dev | Spider dev | Spider test |
|---|---|---|---|
| Baseline (N=1) | 63.89 | 82.40 | 84.02 |
| Majority Voting (N=32) | 66.95 (+3.06) | 83.75 (+1.35) | 85.47 (+1.45) |
| Execution-based Best-of-N (N=32) | 66.04 (+2.15) | 82.79 (+0.39) | 85.14 (+1.12) |
| ORM-based Best-of-N (N=32) | 68.90 (+5.01) | 84.53 (+2.13) | 87.47 (+3.45) |
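At inference time, the ORM scores each candidate and the highest-scoring query is returned. Below is a minimal sketch, assuming the ORM was fine-tuned as a single-logit scorer loaded via Hugging Face Transformers; the checkpoint path and prompt layout are placeholders, not the paper's exact configuration.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

def orm_best_of_n(question: str, schema: str, candidates: list[str],
                  checkpoint: str = "path/to/orm-checkpoint") -> str:
    """Score each candidate with the ORM and return the highest-scoring query."""
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=1)
    model.eval()
    scores = []
    for sql in candidates:
        # The (schema, question, candidate) serialization below is an assumption.
        inputs = tokenizer(f"{schema}\n{question}\n{sql}",
                           return_tensors="pt", truncation=True)
        with torch.no_grad():
            logit = model(**inputs).logits.squeeze()
        scores.append(torch.sigmoid(logit).item())  # probability of semantic correctness
    return candidates[max(range(len(candidates)), key=scores.__getitem__)]
```

Because each candidate is scored independently, this step parallelizes trivially across the N generations.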
Real-world Impact: Enhanced Data Accessibility
Imagine an enterprise with vast, complex databases. With GradeSQL, non-technical users can pose natural language questions and receive highly accurate SQL queries, democratizing data access. This reduces the burden on IT teams and accelerates data-driven decision-making. Our ORMs ensure that even with diverse LLM outputs, the most semantically correct query is consistently selected, minimizing errors and maximizing efficiency.
Advanced ROI Calculator
Understand the potential time and cost savings GradeSQL can bring to your organization by improving SQL generation accuracy.
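The calculation behind such an estimate is straightforward. The sketch below is purely illustrative: every figure is a hypothetical placeholder (the accuracy values loosely mirror the BIRD dev results above) to be replaced with your organization's own data.

```python
queries_per_month = 10_000       # hypothetical: NL-to-SQL requests per month
baseline_accuracy = 0.64         # hypothetical: single-sample baseline (cf. 63.89 on BIRD dev)
orm_accuracy = 0.69              # hypothetical: ORM-based Best-of-N (cf. 68.90 on BIRD dev)
minutes_per_failed_query = 20    # hypothetical: time spent triaging a wrong query
hourly_cost = 60.0               # hypothetical: blended hourly labor cost (USD)

fewer_failures = queries_per_month * (orm_accuracy - baseline_accuracy)
monthly_savings = fewer_failures * (minutes_per_failed_query / 60) * hourly_cost
print(f"~{fewer_failures:.0f} fewer failed queries/month, saving about ${monthly_savings:,.0f}")
```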
Implementation Roadmap
A phased approach to integrate GradeSQL into your enterprise, ensuring a smooth transition and maximum impact.
Phase 1: Pilot Program & Integration
Implement GradeSQL within a controlled environment, integrating with existing LLM pipelines and database systems. Focus on a specific business unit for initial testing and feedback collection.
Phase 2: Performance Tuning & Expansion
Optimize ORM models on custom enterprise data, leveraging GradeSQL's candidate-generation and labeling pipeline to build training corpora. Expand deployment to broader user groups, offering training and support.
Phase 3: Advanced AI-Driven Data Access
Explore integration with reinforcement learning for continuous self-improvement and enable conversational SQL interfaces across the enterprise, pushing the boundaries of natural language data interaction.
Ready to Transform Your Data Strategy?
Book a free consultation with our AI experts to explore how GradeSQL can enhance your Text-to-SQL capabilities and drive significant operational efficiencies.