Enterprise AI Analysis
Diagnostic accuracy of generative large language artificial intelligence models for the assessment of dental crowding
Background: Generative artificial intelligence (AI) models have shown potential for addressing text-based dental enquiries and answering exam questions. However, their role in diagnosis and treatment planning has not been thoroughly investigated. This study aimed to investigate the reliability of different generative AI models in classifying the severity of dental crowding.
Methods: Two experienced orthodontists categorized the severity of dental crowding in 120 intraoral occlusal images as mild, moderate, or severe (40 images per category). These images were then uploaded to three generative AI models (ChatGPT-4o mini, Microsoft Copilot, and Claude 3.5 Sonnet), which were prompted to identify the dental arch and to assess the severity of dental crowding. Response times were recorded, and outputs were compared to the orthodontists' assessments. A random image subset was re-analyzed after one week to evaluate model consistency.
Results: Claude 3.5 Sonnet correctly classified the severity of dental crowding in 50% of the images, followed by ChatGPT-4o mini (44%) and Copilot (34%). Visual recognition of the dental arches was higher with Claude and ChatGPT-4o mini (99%) than with Copilot (72%). Response generation was significantly longer for ChatGPT-4o mini than for Claude and Copilot (p < .0001), while response times were comparable between Claude and Copilot (p = .98). Repeated analyses showed improved image classification for both ChatGPT-4o mini and Copilot, while Claude 3.5 Sonnet misclassified a significant portion of the images.
Conclusions: The performance of ChatGPT-4o mini, Microsoft Copilot, and Claude 3.5 Sonnet in analyzing the severity of dental crowding often did not match the evaluations made by orthodontists. Further developments in the image processing algorithms of commercially available generative AI models are required before they can be relied on for dental crowding classification.
Keywords: ChatGPT, Microsoft, Claude, AI, Large language models
Executive Impact: Key Performance Indicators
Understand the critical metrics of AI's current capabilities in dental diagnostics and what it means for your practice.
Deep Analysis & Enterprise Applications
Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.
Context of Dental Crowding & AI in Orthodontics
Quantifying dental crowding is crucial for orthodontic treatment planning, yet traditional methods are often subjective and time-consuming. Artificial intelligence (AI), particularly deep learning (DL) and large language models (LLMs), offers new opportunities. While AI has been extensively studied in dental radiology and facial esthetics, its application to visual diagnostic tasks, such as assessing dental crowding from intraoral images, remains largely unexplored. This study aimed to fill that gap by evaluating the diagnostic accuracy of generative AI models for dental crowding classification.
Study Design and AI Model Evaluation
This comparative study evaluated ChatGPT-4o mini, Microsoft Copilot, and Claude 3.5 Sonnet. Two experienced orthodontists classified 120 intraoral occlusal images (40 mild, 40 moderate, 40 severe crowding). Images were randomized and uploaded to the AI models with a standardized prompt to classify crowding and identify the dental arch. Response times were recorded, and a subset of 30 images was re-analyzed after one week to assess consistency. Statistical analysis included Cohen's kappa for inter-rater agreement and confusion matrices for AI performance.
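Cohen's kappa, used here both for the orthodontists' inter-rater agreement and for the one-week repeatability analysis, corrects raw percent agreement for the agreement expected by chance from each rater's label frequencies. A minimal stdlib-only Python sketch of the statistic (the function name and example labels are illustrative, not taken from the study's data):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is chance agreement from the raters' marginal frequencies.
    Assumes the raters do not agree purely by chance on every item
    (p_e < 1), otherwise the denominator is zero.
    """
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of marginal label frequencies, summed.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```

For example, two raters who agree on 3 of 4 items with chance agreement of 0.5 yield kappa = 0.5; values near 0.87, as reported for the orthodontists, indicate near-perfect agreement on common interpretation scales.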
Enterprise Process Flow
AI Model Performance & Repeatability
Orthodontist inter-rater agreement was substantial (kappa = 0.87). For crowding classification, Claude 3.5 Sonnet had the highest accuracy at 50%, followed by ChatGPT-4o mini (44%) and Copilot (34%). Arch recognition was significantly better for Claude and ChatGPT-4o mini (both 99%) than for Copilot (72%). ChatGPT-4o mini had the longest response time (~11.9s) but showed the best consistency in the repeatability analysis (87% agreement, kappa = 0.74), compared to Copilot (73%, kappa = 0.41) and Claude (43%, kappa = 0.1).
| Feature | ChatGPT-4o mini | Microsoft Copilot | Claude 3.5 Sonnet |
|---|---|---|---|
| Overall Crowding Accuracy | 44.2% | 34.2% | 50.0% |
| Mild Crowding Sensitivity | 67.5% | 22.5% | 55.0% |
| Mild Crowding Specificity | 58.8% | 86.3% | 77.5% |
| Arch Recognition Accuracy | 99% | 72% | 99% |
| Average Response Time | ~11.9s | ~4.0s | ~3.9s |
| Crowding Classification Repeatability | 87% | 73% | 43% |
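The per-class sensitivity and specificity figures above come from one-vs-rest readings of a three-class confusion matrix: for a given class, sensitivity is the share of its true cases the model caught, and specificity is the share of other-class cases it correctly left out. A minimal Python sketch of that computation (the helper name and sample labels are illustrative, not the study's data):

```python
def class_metrics(y_true, y_pred, positive):
    """One-vs-rest sensitivity and specificity for a single class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # hits
    fn = sum(t == positive and p != positive for t, p in pairs)  # misses
    tn = sum(t != positive and p != positive for t, p in pairs)  # correct rejections
    fp = sum(t != positive and p == positive for t, p in pairs)  # false alarms
    sensitivity = tp / (tp + fn)  # recall on the positive class
    specificity = tn / (tn + fp)  # recall on the negative classes
    return sensitivity, specificity
```

Applied per class (mild, moderate, severe) against the orthodontists' labels, this yields one sensitivity/specificity pair per model per class, as tabulated for mild crowding above.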
Implications and Limitations of AI in Dental Crowding
The study's findings suggest that current commercially available LLMs are suboptimal for visual crowding assessment, performing well below expert orthodontists. While promising for arch identification, their varied performance in crowding classification and poor repeatability highlight the need for further development. These models are not yet ready for autonomous diagnostic decisions, but could assist in low-risk scenarios or early triage. Integrating natural language processing with computer vision remains complex, and future improvements will rely on domain-specific training data and refined prompting strategies.
Future Directions for AI in Orthodontics
ChatGPT-4o mini, Microsoft Copilot, and Claude 3.5 Sonnet showed limited and inconsistent accuracy in classifying dental crowding from intraoral occlusal images. While arch identification was somewhat better, their grading of crowding severity often misaligned with experienced orthodontists' judgment. Further developments in image processing algorithms and domain-specific training are required before reliable clinical use in dental crowding classification.
Key Takeaway: AI Still Needs Refinement for Accurate Crowding Diagnosis
The study demonstrated that commercially available generative AI models (ChatGPT-4o mini, Microsoft Copilot, and Claude 3.5 Sonnet) currently lack the diagnostic accuracy and consistency required for reliable classification of dental crowding from intraoral images. While they show potential in basic visual recognition tasks such as arch identification, their performance in grading the severity of crowding often falls short of expert orthodontists' judgment. This highlights a critical need for further advancements in image processing algorithms and domain-specific training before these AI models can be dependably integrated into orthodontic diagnostic workflows.
Key Learnings for Enterprise AI Adoption:
- Commercial LLMs are not yet clinically reliable for complex visual dental diagnoses.
- Significant development in AI image processing and specialized training data is essential.
- AI models currently serve better as supportive tools rather than autonomous diagnostic systems in orthodontics.
Quantify Your Potential AI Impact
Use our interactive calculator to estimate the efficiency gains and cost savings AI can bring to your specific enterprise operations.
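As a rough illustration of what such a calculator computes, a back-of-envelope estimate can multiply case volume, time saved per case, and staff cost. Every parameter and the formula below are hypothetical assumptions for illustration, not figures from the study:

```python
def estimated_annual_savings(cases_per_year, minutes_saved_per_case,
                             automation_fraction, hourly_cost):
    """Hypothetical savings estimate: hours saved times hourly cost.

    All inputs are illustrative assumptions supplied by the user,
    not measurements from the dental crowding study.
    """
    hours_saved = cases_per_year * minutes_saved_per_case * automation_fraction / 60
    return hours_saved * hourly_cost

# Example: 1,200 cases/year, 10 minutes each, 30% automatable, $60/hour staff cost.
savings = estimated_annual_savings(1200, 10, 0.3, 60)
```

Estimates of this kind are only as good as their inputs; the study's accuracy results suggest the automation fraction for visual diagnostic tasks should currently be set conservatively.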
Your AI Implementation Roadmap
A structured approach to integrating AI into your enterprise, ensuring maximum impact and minimal disruption.
Phase 1: Discovery & Strategy
Comprehensive assessment of current workflows, identification of AI opportunities, and development of a tailored AI strategy aligned with your business objectives.
Phase 2: Pilot & Validation
Deployment of AI solutions in a controlled environment, rigorous testing, and validation of performance against defined KPIs. Iterative refinement based on feedback.
Phase 3: Integration & Scaling
Seamless integration of validated AI solutions into your existing enterprise systems and scaling across relevant departments. Training and support for your teams.
Phase 4: Optimization & Future-Proofing
Continuous monitoring, performance optimization, and exploration of advanced AI capabilities to ensure sustained competitive advantage and long-term value.
Ready to Transform Your Enterprise with AI?
Connect with our AI specialists to discuss a bespoke strategy that drives efficiency and innovation in your organization.