
AI Research Analysis

NERIF: GPT-4V for Automatic Scoring of Drawn Models

Authors: Gyeonggeon Lee, Xiaoming Zhai

Publication Date: November 19, 2025

This study proposes Notation-Enhanced Rubric Instruction for Few-Shot Learning (NERIF), a prompt engineering method to leverage GPT-4V for automatic scoring of student-drawn models. It evaluates GPT-4V's accuracy and interpretability across six science modeling tasks, finding promising potential despite challenges with complex models.

Executive Impact

This analysis of 'NERIF: GPT-4V for Automatic Scoring of Drawn Models' reveals key performance indicators for implementing AI-powered automatic scoring in educational settings.

0.51 Average Scoring Accuracy
0.64 Beginning Category Accuracy
0.62 Developing Category Accuracy
0.26 Proficient Category Accuracy

Deep Analysis & Enterprise Applications

Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.

The study introduces NERIF, a prompt engineering approach for GPT-4V, combining instructional notes, scoring rubrics, and few-shot learning for automatic model scoring.

NERIF Process Flow

1. Write Prompt
2. Validate
3. Test

Only 9 few-shot examples are included in the prompt for GPT-4V, significantly reducing the data-collection burden compared with training a dedicated model.
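As a rough sketch of how a NERIF-style prompt could be assembled for a vision-language model API call, the function below combines the three ingredients the study describes (instructional notes, a scoring rubric, and few-shot examples) ahead of the unscored student drawing. The function name, strings, URLs, and message layout are illustrative assumptions, not the paper's exact prompt.

```python
# Sketch: assemble a NERIF-style message list for a vision-language model.
# The structure (instructional note + rubric + few-shot examples + test image)
# follows the paper's description; all names and strings are illustrative.

def build_nerif_messages(instructional_note, rubric, examples, test_image_url):
    """examples: list of (image_url, score_label, rationale) tuples."""
    content = [
        {"type": "text", "text": instructional_note},
        {"type": "text", "text": "Scoring rubric:\n" + rubric},
    ]
    # Few-shot examples: each drawn model plus its expert score and rationale.
    for image_url, label, rationale in examples:
        content.append({"type": "image_url", "image_url": {"url": image_url}})
        content.append({"type": "text",
                        "text": f"Score: {label}\nRationale: {rationale}"})
    # Finally, the unscored student model to be evaluated.
    content.append({"type": "image_url", "image_url": {"url": test_image_url}})
    content.append({"type": "text", "text": "Score this drawn model."})
    return [{"role": "user", "content": content}]

messages = build_nerif_messages(
    instructional_note="Arrows denote particle motion; longer arrows mean faster motion.",
    rubric="Beginning / Developing / Proficient, per rubric criteria.",
    examples=[("https://example.com/sample1.png", "Developing",
               "Shows particles but motion change is unclear.")],
    test_image_url="https://example.com/student42.png",
)
```

The returned list can then be passed as the `messages` argument of a chat-style vision API; swapping rubrics or examples requires editing strings, not retraining a model.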

GPT-4V achieves an average scoring accuracy of 0.51, with higher accuracy for 'Beginning' and 'Developing' categories (0.64 and 0.62) but lower for 'Proficient' (0.26).

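Per-category accuracy figures like these can be reproduced from human/machine label pairs with a simple tally. The sketch below uses made-up stand-in data, not the study's dataset.

```python
from collections import defaultdict

def per_category_accuracy(pairs):
    """pairs: (human_label, machine_label) tuples; returns accuracy per human label."""
    correct, total = defaultdict(int), defaultdict(int)
    for human, machine in pairs:
        total[human] += 1
        if human == machine:
            correct[human] += 1
    return {cat: correct[cat] / total[cat] for cat in total}

# Made-up example: 4 'Beginning' items (3 scored correctly),
# 4 'Proficient' items (1 scored correctly).
pairs = [("Beginning", "Beginning")] * 3 + [("Beginning", "Developing")] \
      + [("Proficient", "Developing")] * 3 + [("Proficient", "Proficient")]
acc = per_category_accuracy(pairs)
```

Note that the overall average (0.51 in the study) is a weighted mix of these per-category rates, so it depends on how many student models fall into each category.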

Interpretable Rationales

GPT-4V provides detailed, rubric-aligned justifications, explaining how it identifies model components and assigns scores, which lets human users follow its decision-making process. For example, it cited 'longer arrows after heating' as evidence of faster particle motion even when the student had actually used double lines to indicate motion, demonstrating that its inferences are plausible but sometimes incorrect.

VLMs like GPT-4V offer a paradigm shift for computer vision in education, reducing technical barriers and data requirements for automatic scoring. However, accuracy for complex models needs improvement.

100% Prompt-based: NERIF replaces the model-development programming required by traditional ML scoring pipelines with natural-language prompt engineering.
Feature             | VLM (GPT-4V)                       | Traditional ML (CNN)
Technical Barrier   | Low (prompt engineering)           | High (sophisticated CNNs)
Training Data Needs | Few-shot (9 examples)              | Large (hundreds to thousands)
Interpretability    | High (natural-language rationales) | Low (black box)
Flexibility         | High (natural-language prompts)    | Low (model-specific tuning)

Advanced ROI Calculator

Estimate the potential time and cost savings by automating student model scoring in your institution.


Implementation Roadmap

Our phased implementation plan ensures a smooth transition and maximum impact for your educational AI initiatives.

Phase 1: Pilot & Validation

Conduct small-scale pilots with a subset of modeling tasks and educators to validate NERIF's effectiveness and gather feedback.

Phase 2: Integration & Customization

Integrate GPT-4V with existing learning platforms and customize rubrics and instructional notes for broader application across disciplines.

Phase 3: Scalable Deployment & Training

Roll out the automatic scoring system across the institution, providing comprehensive training for educators on prompt engineering and critical assessment of AI outputs.

Ready to Transform Your Assessment? Schedule a Consultation.

Our experts are ready to help you integrate cutting-edge AI into your educational workflows, ensuring efficient and effective assessment practices.
