AI Research Analysis
NERIF: GPT-4V for Automatic Scoring of Drawn Models
Authors: Gyeonggeon Lee, Xiaoming Zhai
Publication Date: November 19, 2025
This study proposes Notation-Enhanced Rubric Instruction for Few-Shot Learning (NERIF), a prompt engineering method that leverages GPT-4V for automatic scoring of student-drawn models. It evaluates GPT-4V's accuracy and interpretability across six science modeling tasks, finding promising results overall, though accuracy drops on complex models.
Executive Impact
This analysis of 'NERIF: GPT-4V for Automatic Scoring of Drawn Models' reveals key performance indicators for implementing AI-powered automatic scoring in educational settings.
Deep Analysis & Enterprise Applications
The study introduces NERIF, a prompt engineering approach for GPT-4V, combining instructional notes, scoring rubrics, and few-shot learning for automatic model scoring.
NERIF Process Flow
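The NERIF flow combines three ingredients into one scoring prompt: instructional notes (explaining drawing notation), the scoring rubric, and few-shot scored examples. A minimal sketch of that assembly is below; the function name, rubric text, and example content are illustrative placeholders, not the study's actual materials.

```python
# Sketch of NERIF-style prompt assembly (illustrative, not the study's
# exact prompt). The three components mirror the process described above.

def build_nerif_prompt(instructional_note, rubric, few_shot_examples, task):
    """Combine instructional notes, rubric, and few-shot examples
    into a single scoring prompt for a vision-language model."""
    parts = [
        "## Instructional note\n" + instructional_note,
        "## Scoring rubric\n" + rubric,
    ]
    # Each few-shot example pairs a described student model with its score.
    for i, (model_desc, label) in enumerate(few_shot_examples, start=1):
        parts.append(f"## Example {i}\nStudent model: {model_desc}\nScore: {label}")
    parts.append("## Task\n" + task)
    return "\n\n".join(parts)

prompt = build_nerif_prompt(
    instructional_note="Arrows denote particle motion; longer arrows mean faster motion.",
    rubric="Beginning / Developing / Proficient, based on components shown.",
    few_shot_examples=[("particles drawn closer together after cooling", "Developing")],
    task="Score the attached student drawing and justify the score against the rubric.",
)
```

In practice the assembled text would be sent alongside the student's drawing as the image input to GPT-4V.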
GPT-4V achieves an average scoring accuracy of 0.51, with higher accuracy for 'Beginning' and 'Developing' categories (0.64 and 0.62) but lower for 'Proficient' (0.26).
| Category | Accuracy |
|---|---|
| Beginning | 0.64 |
| Developing | 0.62 |
| Proficient | 0.26 |
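Assuming the reported 0.51 is an unweighted (macro) mean over the three categories, the per-category figures above are consistent with it:

```python
# Consistency check: unweighted mean of the per-category accuracies
# reported in the table above.
accuracies = {"Beginning": 0.64, "Developing": 0.62, "Proficient": 0.26}
macro_avg = sum(accuracies.values()) / len(accuracies)
print(round(macro_avg, 2))  # 0.51
```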
Interpretable Rationales
GPT-4V provides detailed, rubric-aligned justifications, explaining how it identifies model components and assigns scores, which lets human users follow its decision-making process. For example, it cited 'longer arrows after heating' as evidence of faster particle motion even when the student had actually used double lines to indicate motion, illustrating that its inferences are plausible but sometimes incorrect.
VLMs like GPT-4V offer a paradigm shift for computer vision in education, reducing technical barriers and data requirements for automatic scoring. However, accuracy for complex models needs improvement.
| Feature | VLM (GPT-4V) | Traditional ML (CNN) |
|---|---|---|
| Technical Barrier | Low (prompt-based, no model training) | High (model design and training expertise) |
| Training Data Needs | Low (few-shot examples) | High (large labeled image datasets) |
| Interpretability | High (rubric-aligned rationales) | Low (opaque predictions) |
| Flexibility | High (rubrics and notes edited in the prompt) | Low (retraining required per task) |
Advanced ROI Calculator
Estimate the potential time and cost savings by automating student model scoring in your institution.
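The core of such an estimate is simple: hours currently spent on manual scoring, multiplied by the share that automation can absorb and the cost of educator time. The sketch below shows that arithmetic; all figures are illustrative placeholders, not data from the study.

```python
# Minimal ROI sketch for automated model scoring.
# All inputs are illustrative assumptions, not study results.

def scoring_roi(models_per_year, minutes_per_model, hourly_rate,
                automation_share=0.5):
    """Estimate annual cost saved if a share of manual scoring is automated."""
    manual_hours = models_per_year * minutes_per_model / 60
    saved_hours = manual_hours * automation_share
    return saved_hours * hourly_rate

# e.g. 10,000 drawn models/year, 3 minutes each, $40/hour educator time:
savings = scoring_roi(10_000, 3, 40)
print(f"${savings:,.0f} saved per year")  # $10,000 at 50% automation
```

The `automation_share` parameter matters: given the accuracy figures above, human review of AI scores (especially at the 'Proficient' level) should be budgeted in rather than assuming full automation.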
Implementation Roadmap
Our phased implementation plan ensures a smooth transition and maximum impact for your educational AI initiatives.
Phase 1: Pilot & Validation
Conduct small-scale pilots with a subset of modeling tasks and educators to validate NERIF's effectiveness and gather feedback.
Phase 2: Integration & Customization
Integrate GPT-4V with existing learning platforms and customize rubrics and instructional notes for broader application across disciplines.
Phase 3: Scalable Deployment & Training
Roll out the automatic scoring system across the institution, providing comprehensive training for educators on prompt engineering and critical assessment of AI outputs.
Ready to Transform Your Assessment? Schedule a Consultation.
Our experts are ready to help you integrate cutting-edge AI into your educational workflows, ensuring efficient and effective assessment practices.