
LLM-as-a-Judge for Scalable Test Coverage Evaluation: Accuracy, Operational Reliability, and Cost

Revolutionizing QA with AI-Powered Test Coverage Assessment

Authors: Donghao Huang, Shila Chew, Anna Dutkiewicz, Zhaoxia Wang

Publication: arXiv:2512.01232v1 [cs.SE] 1 Dec 2025

Executive Impact: Key Metrics for Enterprise AI Adoption

This study highlights critical advancements in automated test coverage evaluation using LLMs, offering significant gains in efficiency, reliability, and cost-effectiveness for modern QA pipelines.

78× Cost Reduction (vs GPT-5 high)
93.93% Optimal Model Accuracy (APS)
96.6% High Reliability (ECR@1)
$1.01 Cost per 1K Evaluations

Deep Analysis & Enterprise Applications

The modules below unpack the specific findings from the research with an enterprise focus.

Enterprise Process Flow: LAJ Workflow

Jira Ticket Creation
Gherkin Script Development
Expert Annotation (Ground Truth)
LLM-as-a-Judge Evaluation

Problem Formulation: Given a software requirement specification (e.g., a Jira ticket) R and a corresponding Gherkin acceptance test script T, the goal is to automatically assess the test coverage completeness, i.e., the degree to which the test script adequately addresses the specified requirements. An LLM-as-a-Judge model M produces an assessment A_M(R, T) ∈ [0, 100] representing the estimated coverage percentage. We evaluate M against expert-provided ground truth A*(R, T) to measure assessment quality.
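To make the evaluation concrete, the minimal sketch below computes the two headline accuracy figures from paired judge and expert scores. It assumes MAAE is the mean absolute difference between the judge score A_M and the expert score A*, and that APS is simply 100 minus MAAE (consistent with the 6.07% MAAE / 93.93% APS pairing reported below); the function names and the toy data are illustrative, not taken from the paper.

```python
from statistics import mean

def maae(judge_scores, expert_scores):
    """Mean absolute assessment error between judge and expert coverage scores (0-100 scale)."""
    return mean(abs(j - e) for j, e in zip(judge_scores, expert_scores))

def aps(judge_scores, expert_scores):
    """Assessment precision score, taken here as 100 minus MAAE (an assumption, see lead-in)."""
    return 100.0 - maae(judge_scores, expert_scores)

# Toy example with made-up coverage percentages: judge vs. expert.
judge  = [80, 65, 95, 50]
expert = [85, 70, 90, 55]
print(f"MAAE: {maae(judge, expert):.2f}pp, APS: {aps(judge, expert):.2f}%")
```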

Input
  • Ticket requirements and acceptance criteria
  • The Gherkin test script
  • A comprehensive rubric specifying coverage expectations

Process
  • Analyzes alignment between script and requirements
  • Checks breadth and depth of scenarios
  • Applies the rubric to produce a scalar coverage score with justification

Output
  • Coverage percentage (0-100)
  • Coverage analysis: scenarios covered, gaps identified, recommendations
  • Rubric-aligned flags for downstream analysis
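A minimal sketch of such a judge call is shown below, using the OpenAI Python SDK and requesting a JSON object that mirrors the output fields above. The model name, prompt wording, rubric text, and JSON field names are illustrative assumptions, not the paper's actual prompt or schema.

```python
import json
from openai import OpenAI  # assumes the official OpenAI Python SDK (>=1.0) is installed

client = OpenAI()

JUDGE_RUBRIC = "Score coverage 0-100; every acceptance criterion needs at least one scenario."  # placeholder

def judge_coverage(ticket_text: str, gherkin_script: str, model: str = "gpt-4o-mini") -> dict:
    """Ask an LLM judge how completely the Gherkin script covers the ticket's requirements."""
    prompt = (
        "You are a QA reviewer. Using the rubric, assess how completely the test "
        "script covers the requirement.\n\n"
        f"RUBRIC:\n{JUDGE_RUBRIC}\n\nREQUIREMENT (Jira ticket):\n{ticket_text}\n\n"
        f"GHERKIN SCRIPT:\n{gherkin_script}\n\n"
        'Reply as JSON: {"coverage_percentage": 0-100, "covered_scenarios": [...], '
        '"gaps": [...], "recommendations": [...], "flags": [...]}'
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # nudge the model toward well-formed JSON
    )
    return json.loads(response.choices[0].message.content)
```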
GPT-4o Mini achieves the best overall accuracy at 6.07% MAAE, demonstrating superior precision.
Accuracy Tiers (by MAAE against expert ground truth)

Elite Performance (MAAE < 7.0%)
  • Key models: GPT-4o Mini, GPT-5 (high), GPT-5 (medium)
  • Characteristics: Best accuracy and strong agreement with expert assessments.

Strong Performance (MAAE 7.0-9.0%)
  • Key models: GPT-4o, GPT-4.1 Mini, GPT-4.1 Nano, GPT-5 Mini variants, GPT-4.1, GPT-5 (low)
  • Characteristics: High close-match rates and a good balance of accuracy and cost.

Economy Tier (MAAE > 9.0%)
  • Key models: GPT-OSS 20B/120B families, GPT-5 Nano variants
  • Characteristics: Lower API costs but significantly higher error rates.

GPT-4o Mini: The Optimal Production Choice

GPT-4o Mini stands out as the optimal production model due to its best-in-class accuracy (6.07% MAAE, 93.93% APS), high reliability (96.6% ECR@1), and exceptional cost-effectiveness ($1.01/1K). It achieves a 78× cost reduction compared to GPT-5 (high reasoning) while simultaneously improving accuracy, making it ideal for practical adoption at scale.

Key Results:

  • ✓ 78× cost reduction vs. GPT-5 (high)
  • ✓ Superior accuracy (6.07% MAAE)
  • ✓ High reliability (96.6% ECR@1)
  • ✓ Cost-effective at $1.01/1K
Reliability Tiers (by ECR@1)

Perfect Reliability (ECR@1 = 100%)
  • Key models: GPT-4o, GPT-4.1, GPT-5 (low)
  • Implications: No retries needed; predictable costs and latencies.

High Reliability (ECR@1 > 95%)
  • Key models: GPT-4o Mini, GPT-5 family variants, GPT-4.1 Mini, GPT-4.1 Nano
  • Implications: Minimal retries and negligible cost increases from reliability overhead; GPT-4o Mini is optimal in this tier.

Moderate Reliability (90% < ECR@1 < 95%)
  • Key models: Open-weight models
  • Implications: Higher variance, JSON parsing errors, and schema violations; 1.07–1.19 attempts per evaluation.

Low Reliability (ECR@1 < 90%)
  • Key models: GPT-OSS 20B (high)
  • Implications: Lowest reliability (85.4%), high variance; problematic for production.
175× total cost span across all models, from $0.45 to $78.96 per 1K evaluations.

Reasoning Effort Impact: Reasoning effort affects both accuracy and cost in model-family-dependent ways. GPT-5 models generally benefit from higher reasoning and show predictable accuracy-cost trade-offs: reducing reasoning from high to low can yield up to 70% cost savings for a 1.53pp accuracy degradation. In contrast, open-weight models (GPT-OSS) degrade across all dimensions as reasoning effort increases, making higher reasoning levels counterproductive for these models.

Key Recommendations: Practitioners should always measure ECR@1 alongside accuracy to capture operational reliability, implement robust retry logic with a 5-15% overhead budget, and continuously monitor ECR@1, latency, and cost in production. Adjusted cost metrics are essential for estimating total cost of ownership. For most use cases, starting with GPT-4o Mini is recommended given its balance of accuracy, reliability, and cost-effectiveness.
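As a concrete illustration of the retry and measurement advice, the sketch below wraps any judge function with bounded retries, validates the returned fields, and tracks attempts so that ECR@1 (interpreted here as the share of evaluations that succeed on the first attempt) and an attempts-adjusted cost can be reported. The wrapper, the required field names, and the per-call cost (derived from the $1.01 per 1K figure above) are illustrative assumptions.

```python
from typing import Callable

REQUIRED_FIELDS = {"coverage_percentage", "gaps", "recommendations"}  # hypothetical output schema

def evaluate_with_retries(judge: Callable[[], dict], max_attempts: int = 3) -> tuple[dict | None, int]:
    """Call the judge, retrying on exceptions or schema violations; return (result, attempts used)."""
    for attempt in range(1, max_attempts + 1):
        try:
            result = judge()
            if REQUIRED_FIELDS.issubset(result) and 0 <= result["coverage_percentage"] <= 100:
                return result, attempt
        except Exception:
            pass  # malformed JSON, API error, etc.: fall through and retry
    return None, max_attempts

def reliability_report(attempt_counts: list[int], cost_per_call: float = 0.00101) -> dict:
    """Summarize first-attempt success rate and retry-adjusted cost per 1K evaluations."""
    ecr_at_1 = sum(a == 1 for a in attempt_counts) / len(attempt_counts)
    avg_attempts = sum(attempt_counts) / len(attempt_counts)
    return {
        "ECR@1": ecr_at_1,
        "avg_attempts_per_eval": avg_attempts,
        "adjusted_cost_per_1k": 1000 * avg_attempts * cost_per_call,
    }
```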

Limitations
  • Domain-specificity (Gherkin/RESTful APIs only)
  • Task complexity (all scripts treated as equally difficult)
  • Systematic bias (over/under-estimation patterns)
  • Temporal stability (API/model evolution)

Future Directions
  • Expanded test coverage (unit, UI, security tests)
  • Multi-domain validation (healthcare, finance, IoT)
  • Task complexity stratification
  • Systematic bias mitigation (calibration techniques)

Calculate Your Potential ROI

Estimate the efficiency gains and cost savings from implementing LLM-as-a-Judge in your QA workflow.

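As a stand-in for the interactive calculator, the back-of-the-envelope sketch below estimates reclaimed QA hours and annual savings. The evaluation volume, manual review time, and hourly rate are placeholder assumptions; only the $1.01 per 1K evaluations figure comes from the analysis above.

```python
# Back-of-the-envelope ROI sketch; all inputs except the LLM cost are placeholder assumptions.
evals_per_year = 20_000          # hypothetical number of ticket/script reviews per year
manual_minutes_per_eval = 20     # hypothetical expert review time per script
hourly_rate_usd = 60             # hypothetical fully loaded QA hourly rate
llm_cost_per_1k = 1.01           # reported cost per 1K evaluations for GPT-4o Mini

manual_hours = evals_per_year * manual_minutes_per_eval / 60
manual_cost = manual_hours * hourly_rate_usd
llm_cost = evals_per_year / 1000 * llm_cost_per_1k

print(f"QA hours reclaimed: {manual_hours:,.0f}")
print(f"Annual savings:     ${manual_cost - llm_cost:,.0f}")
```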

Your Journey to Scalable AI-Powered QA

A typical implementation timeline for integrating LLM-as-a-Judge into your enterprise QA processes.

Phase 1: Initial Assessment & Framework Setup

Integrate LAJ framework, define core rubrics, and configure initial model evaluation environment. Focus on Gherkin/BDD compatibility.

Phase 2: Pilot Deployment & Metric Baseline

Conduct pilot runs with GPT-4o Mini, establish baseline accuracy (MAAE), reliability (ECR@1), and cost metrics. Implement basic retry logic.

Phase 3: Optimization & Model Selection

Explore model configurations and reasoning efforts. Optimize for accuracy-reliability-cost trade-offs. Refine retry strategies and monitor adjusted cost metrics.

Phase 4: Scalable Integration & Continuous Monitoring

Integrate LAJ into CI/CD pipelines for automated test coverage evaluation. Implement dashboards for real-time monitoring of performance and costs. Iterate based on feedback.
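For Phase 4, a minimal CI gate might look like the sketch below: a script run in the pipeline that calls a judge for each changed ticket/script pair and fails the build when estimated coverage drops below a threshold. The threshold, field names, and the judge callable (e.g., the judge_coverage sketch shown earlier) are illustrative assumptions, not a prescribed integration.

```python
from typing import Callable

COVERAGE_THRESHOLD = 80  # hypothetical minimum acceptable coverage percentage

def coverage_gate(pairs: list[tuple[str, str]], judge: Callable[[str, str], dict]) -> int:
    """Return exit code 1 if any ticket/script pair scores below the coverage threshold."""
    failed = False
    for ticket, script in pairs:
        result = judge(ticket, script)  # e.g., the judge_coverage sketch shown earlier
        score = result["coverage_percentage"]
        if score < COVERAGE_THRESHOLD:
            failed = True
            print(f"FAIL ({score}% coverage): {ticket[:60]}... gaps: {result.get('gaps', [])}")
    return 1 if failed else 0

# In CI, wire in the real ticket/script pairs and judge, then exit with the gate's return code:
# import sys; sys.exit(coverage_gate(pairs, judge_coverage))
```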

Ready to Transform Your QA with AI?

Leverage the power of LLM-as-a-Judge to achieve unprecedented accuracy, reliability, and cost-efficiency in your software testing.

Ready to Get Started?

Book Your Free Consultation.
