Article AI Analysis
SPOTIT: Evaluating Text-to-SQL Evaluation with Formal Verification
Authors: Rocky Klopfenstein, Yang He, Andrew Tremante, Yuepeng Wang, Nina Narodytska, Haoze Wu
Community-driven Text-to-SQL evaluation platforms play a pivotal role in tracking the state of the art of Text-to-SQL performance. The reliability of the evaluation process is critical for driving progress in the field. Current evaluation methods are largely test-based, which involves comparing the execution results of a generated SQL query and a human-labeled ground-truth on a static test database. Such an evaluation is optimistic, as two queries can coincidentally produce the same output on the test database while actually being different. In this work, we propose a new alternative evaluation pipeline, called SPOTIT, where a formal bounded equivalence verification engine actively searches for a database that differentiates the generated and ground-truth SQL queries. We develop techniques to extend existing verifiers to support a richer SQL subset relevant to Text-to-SQL. A performance evaluation of ten Text-to-SQL methods on the high-profile BIRD dataset suggests that test-based methods can often overlook differences between the generated query and the ground-truth. Further analysis of the verification results reveals a more complex picture of the current Text-to-SQL evaluation.
Key Executive Impact
This analysis reveals critical insights into Text-to-SQL evaluation, highlighting the need for more rigorous methods to accurately benchmark model performance and identify underlying issues in existing datasets, including ambiguous natural language queries and incorrect gold standard SQLs.
Deep Analysis & Enterprise Applications
SPOTIT Evaluation Pipeline
SPOTIT introduces a novel three-phase evaluation pipeline: first, it takes as input the natural-language question together with the gold and generated SQL queries; second, a formal bounded equivalence verification engine searches for a database on which the two queries produce different results; third, any counterexample database found is validated to rule out spurious differences.
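The counterexample search at the heart of the pipeline can be illustrated with a brute-force sketch: enumerate small databases up to a bound and compare query outputs. SPOTIT itself uses a formal verification engine rather than enumeration, and the one-column schema, value pool, and function name below are hypothetical.

```python
import itertools
import sqlite3

def differentiating_db(gold_sql, gen_sql, values, max_rows=2):
    """Bounded search for a database on which the gold and generated
    queries disagree. Returns the table contents of a counterexample
    database, or None if the queries agree on all databases tried."""
    for n in range(max_rows + 1):
        for rows in itertools.product(values, repeat=n):
            conn = sqlite3.connect(":memory:")
            conn.execute("CREATE TABLE t (x INTEGER)")
            conn.executemany("INSERT INTO t VALUES (?)", [(v,) for v in rows])
            out_gold = sorted(conn.execute(gold_sql).fetchall())
            out_gen = sorted(conn.execute(gen_sql).fetchall())
            conn.close()
            if out_gold != out_gen:
                return list(rows)  # a differentiating database
    return None  # equivalent up to the bound

# The two predicates agree on most rows but differ on the boundary
# value 100: BETWEEN is inclusive, the intended range is exclusive.
cex = differentiating_db(
    "SELECT x FROM t WHERE x BETWEEN 100 AND 400",
    "SELECT x FROM t WHERE x > 100 AND x < 400",
    values=[100, 250, 400],
)
```

A test-based evaluation with a static database that happens to contain no boundary values would never expose this pair, which is exactly the optimism the article describes.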
Current test-based Text-to-SQL evaluations often significantly overestimate model performance. SPOTIT, through its rigorous formal verification, reveals that reported accuracies can drop by a substantial margin (11.3%-14.2%) when truly differentiating databases are considered, leading to a more accurate understanding of model capabilities.
| Feature | Test-based Evaluation | SPOTIT (Formal Verification) |
|---|---|---|
| Equivalence Check | Compares execution results on a single static test database | Actively searches for a database on which the two queries diverge |
| Accuracy Assessment | Optimistic; coincidental output matches inflate reported accuracy | Stricter; reported accuracies drop by 11.3%-14.2% once differentiating databases are considered |
| Problem Identification | Overlooks incorrect gold SQLs and ambiguous questions | Surfaces counterexamples that expose annotation errors and natural-language ambiguity |
| Coverage & SQL Support | Limited to the behaviors the fixed test database happens to exercise | Extends existing verifiers to support a richer SQL subset relevant to Text-to-SQL |
Case Study: Incorrect Gold SQLs Identified by SPOTIT
Example 3.1: Incorrect WHERE Clause. SPOTIT found that the gold SQL's WHERE clause (T2.rnp != '-' OR '+-') was semantically incorrect: the bare string literal '+-' is coerced to numeric 0 (FALSE), so the second disjunct never fires and the condition silently reduces to T2.rnp != '-', rather than excluding both values as intended. This highlights how manual annotation errors can distort evaluation.
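The coercion can be reproduced directly in SQLite (a minimal sketch with a made-up one-column table, not the BIRD schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lab (rnp TEXT)")
conn.executemany("INSERT INTO lab VALUES (?)", [("-",), ("+-",), ("1.0",)])

# Gold-style condition: the bare literal '+-' is coerced to 0 (FALSE),
# so the OR branch never fires and '+-' rows slip through the filter.
buggy = conn.execute(
    "SELECT rnp FROM lab WHERE rnp != '-' OR '+-'"
).fetchall()

# Intended condition: exclude both '-' and '+-'.
fixed = conn.execute(
    "SELECT rnp FROM lab WHERE rnp != '-' AND rnp != '+-'"
).fetchall()
```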
Example B.1: Typo in Numeric Condition. SPOTIT revealed a gold SQL error where the condition T2.ua > 6.5 was used, contradicting the evidence that the normal range is less than or equal to 6.50. The generated SQL correctly used <= 6.5. This type of error, a simple comparison operator mistake, significantly alters query semantics.
Example B.2: Inclusive vs. Exclusive Date Range. The gold SQL used STRFTIME('%Y', T1.date) > '2012' for 'after January 1st, 2012'; since STRFTIME('%Y', ...) yields '2012' for every 2012 date, the string comparison drops all the remaining 2012 dates that the question should include. The generated SQL correctly used T.date > '2012-01-01'. SPOTIT caught this nuanced date handling that standard testing might miss.
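A quick SQLite check shows how the two predicates diverge on 2012 dates (tiny illustrative table, hypothetical names):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tx (date TEXT)")
conn.executemany("INSERT INTO tx VALUES (?)",
                 [("2012-01-01",), ("2012-06-15",), ("2013-03-01",)])

# Gold-style: compares only the year string, so every 2012 row is dropped,
# including dates after January 1st that should be kept.
gold = conn.execute(
    "SELECT date FROM tx WHERE STRFTIME('%Y', date) > '2012'"
).fetchall()

# Generated-style: lexicographic comparison works on ISO-8601 dates.
gen = conn.execute(
    "SELECT date FROM tx WHERE date > '2012-01-01'"
).fetchall()
```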
Example B.3: Incorrect Ordering for 'First'. For 'first paid customer', the gold SQL incorrectly used ORDER BY T1.time DESC LIMIT 1 instead of ASC, returning the latest customer rather than the earliest. SPOTIT successfully flagged this fundamental logical error.
Example B.4: BETWEEN Operator Misuse. The gold SQL used BETWEEN 100 AND 400 for an exclusive range 'PLT > 100 and PLT < 400'. The SQL BETWEEN operator is inclusive, leading to incorrect results for boundary values. SPOTIT detected this subtle semantic difference.
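The boundary behavior of BETWEEN is easy to demonstrate (illustrative one-column table, not the BIRD schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lab (plt INTEGER)")
conn.executemany("INSERT INTO lab VALUES (?)", [(100,), (250,), (400,)])

# BETWEEN is inclusive: both boundary values 100 and 400 are kept.
inclusive = conn.execute(
    "SELECT plt FROM lab WHERE plt BETWEEN 100 AND 400"
).fetchall()

# The exclusive range stated in the question drops them.
exclusive = conn.execute(
    "SELECT plt FROM lab WHERE plt > 100 AND plt < 400"
).fetchall()
```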
Case Study: Ambiguous Natural Language Questions
Example 3.2: COUNT vs. COUNT(DISTINCT). The question 'How many male patients...' was ambiguous regarding counting distinct patients versus total examinations per patient. The gold SQL used COUNT(T1.id) while the generated used COUNT(DISTINCT patient.id). Both are justifiable interpretations, highlighting NL ambiguity where a single gold SQL might unfairly penalize a valid alternative.
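The two readings diverge as soon as a patient appears in more than one examination row; a minimal sketch with an invented `exam` table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE exam (patient_id INTEGER)")
# Patient 1 examined twice, patient 2 once: 3 rows, 2 distinct patients.
conn.executemany("INSERT INTO exam VALUES (?)", [(1,), (1,), (2,)])

# Gold-style reading: count examination rows.
total = conn.execute("SELECT COUNT(patient_id) FROM exam").fetchone()[0]

# Generated-style reading: count distinct patients.
distinct = conn.execute(
    "SELECT COUNT(DISTINCT patient_id) FROM exam"
).fetchone()[0]
```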
Example B.7: Interpretation of 'members'. The term 'members' could mean any club member or those with a specific 'Member' position. The gold SQL filtered by position, but without explicit evidence, a broader interpretation is also valid. SPOTIT surfaced this ambiguity in natural language interpretation.
Example B.8: Underspecified Aggregation. For 'legal status for vintage play format', it was unclear if the intent was to return unique legal statuses or the status of every valid card. The gold SQL used DISTINCT, while the generated did not, reflecting an underspecified requirement that SPOTIT identified as a valid source of divergence.
Example B.9: Tie-breaking Rules. For 'comment with the highest score', the natural language did not specify a tie-breaking rule for tied scores. Because both queries end in LIMIT 1 without a deterministic secondary sort key, the gold and generated SQLs can each return a different yet equally valid comment. This reveals a common issue in NL-to-SQL where contextual rules are missing.
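The ambiguity can be made concrete by choosing two different (both defensible) tie-breakers over the same data; the `comments` table and its rows are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (id INTEGER, score INTEGER)")
conn.executemany("INSERT INTO comments VALUES (?, ?)",
                 [(1, 90), (2, 90), (3, 50)])

# Two equally valid readings of 'comment with the highest score':
# with tied scores and no agreed tie-breaker, either row may be returned.
a = conn.execute(
    "SELECT id FROM comments ORDER BY score DESC, id ASC LIMIT 1"
).fetchone()[0]
b = conn.execute(
    "SELECT id FROM comments ORDER BY score DESC, id DESC LIMIT 1"
).fetchone()[0]
```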
Case Study: Correctly Identified Generated SQL Errors
Example B.5: Missing JOIN Condition. The generated SQL for 'college of person with link to major...' correctly filtered by name but omitted the crucial link_to_major constraint from the WHERE clause. This led to an over-inclusive result that SPOTIT accurately identified as an error in the generated query's logic.
Example B.6: Missing Contextual Filter. For 'original diagnose when first came to hospital', the generated SQL only checked for a diagnosis and date, failing to correctly link it to the patient's actual 'first date' of hospital visit. This omission, leading to potentially later diagnoses, was precisely detected by SPOTIT.
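The missing constraint in B.6 amounts to an absent correlation with the patient's earliest visit; a common fix is a correlated MIN(date) subquery. The `visit` schema and rows below are illustrative, not BIRD's actual tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE visit (patient_id INTEGER, date TEXT, diagnosis TEXT)")
conn.executemany("INSERT INTO visit VALUES (?, ?, ?)",
                 [(1, "2020-01-05", "SLE"), (1, "2021-07-20", "RA")])

# Generated-style: any diagnosis row for the patient, regardless of when.
loose = conn.execute(
    "SELECT diagnosis FROM visit WHERE patient_id = 1").fetchall()

# Correct pattern: pin the row to the patient's first visit date.
first = conn.execute(
    """SELECT diagnosis FROM visit
       WHERE patient_id = 1
         AND date = (SELECT MIN(date) FROM visit WHERE patient_id = 1)"""
).fetchone()[0]
```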