Automatic Prompt Engineering for Automatic Scoring

Title: Automatic Prompt Engineering for Automatic Scoring
Language: English
Authors: Mingfeng Xue (ORCID 0000-0002-4801-3754); Yunting Liu (ORCID 0009-0004-9594-9661); Xingyao Xiao (ORCID 0000-0001-8430-0438); Mark Wilson (ORCID 0000-0002-0425-5305)
Source: Journal of Educational Measurement. 2025 62(4):559-587.
Availability: Wiley. Available from: John Wiley & Sons, Inc. 111 River Street, Hoboken, NJ 07030. Tel: 800-835-6770; e-mail: cs-journals@wiley.com; Web site: https://www.wiley.com/en-us
Peer Reviewed: Y
Page Count: 29
Publication Date: 2025
Sponsoring Agency: National Science Foundation (NSF)
Contract Number: 2010322
Document Type: Journal Articles; Reports - Research
Descriptors: Computer Assisted Testing; Prompting; Educational Assessment; Automation; Natural Language Processing; Scoring; True Scores; Accuracy; Validity; Scoring Rubrics; Innovation
DOI: 10.1111/jedm.70002
ISSN: 0022-0655; 1745-3984
Abstract: Prompts play a crucial role in eliciting accurate outputs from large language models (LLMs). This study examines the effectiveness of an automatic prompt engineering (APE) framework for automatic scoring in educational measurement. We collected constructed-response data from 930 students across 11 items and used human scores as the true labels. A baseline was established by providing LLMs with the original human-scoring instructions and materials; APE was then applied to optimize prompts for each item. On average, APE increased scoring accuracy by 9%, and few-shot learning (i.e., supplying multiple labeled examples related to the goal) improved APE performance by a further 2%. A high temperature (i.e., a parameter controlling output randomness) was needed in at least part of the APE process to improve scoring accuracy, and Quadratic Weighted Kappa (QWK) showed a similar pattern. These findings support the use of APE in automatic scoring. Moreover, compared with the manual scoring instructions, APE tended to restate and reformat the scoring prompts, which could give rise to validity concerns. The creative variability introduced by LLMs thus raises questions about the balance between innovation and adherence to scoring rubrics.
Abstractor: As Provided
Entry Date: 2026
Accession Number: EJ1491371
Database: ERIC
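
The abstract describes an iterative APE loop: generate candidate scoring prompts at high temperature, grade a labeled sample of responses with each candidate, and keep the prompt that agrees best with the human scores. The sketch below is a minimal illustration of that idea, not the authors' implementation; call_llm, propose_prompts, and ape_search are hypothetical names, and any chat-completion client could stand in for call_llm.

```python
"""Illustrative sketch of an APE loop for automatic scoring.

Assumptions (not from the paper): `call_llm` is a placeholder for a real
LLM client; prompt proposal, scoring, and selection follow the generic
pattern the abstract outlines.
"""
from sklearn.metrics import accuracy_score, cohen_kappa_score


def call_llm(prompt: str, temperature: float) -> str:
    """Placeholder; wire up a real chat-completion client here."""
    raise NotImplementedError


def propose_prompts(seed_prompt: str, n: int = 8) -> list[str]:
    # High temperature encourages diverse rewrites of the seed prompt;
    # the abstract reports high temperature was needed in at least part
    # of the APE process. Few-shot exemplars could be appended here.
    return [
        call_llm(
            "Rewrite this scoring instruction to be clearer and more "
            f"effective for grading student responses:\n\n{seed_prompt}",
            temperature=1.0,
        )
        for _ in range(n)
    ]


def score_response(prompt: str, response: str) -> int:
    # Low temperature for the scoring step itself, so scores are stable.
    out = call_llm(
        f"{prompt}\n\nStudent response:\n{response}\n\nScore:",
        temperature=0.0,
    )
    return int(out.strip())


def evaluate(prompt: str, responses: list[str],
             human_scores: list[int]) -> tuple[float, float]:
    llm_scores = [score_response(prompt, r) for r in responses]
    acc = accuracy_score(human_scores, llm_scores)
    # Quadratic Weighted Kappa, the agreement metric named in the abstract.
    qwk = cohen_kappa_score(human_scores, llm_scores, weights="quadratic")
    return acc, qwk


def ape_search(seed_prompt: str, responses: list[str],
               human_scores: list[int]) -> str:
    # Keep whichever candidate prompt agrees best with the human labels.
    best_prompt, best_qwk = seed_prompt, -1.0
    for candidate in [seed_prompt, *propose_prompts(seed_prompt)]:
        _, qwk = evaluate(candidate, responses, human_scores)
        if qwk > best_qwk:
            best_prompt, best_qwk = candidate, qwk
    return best_prompt
```

QWK as computed here is the same agreement statistic the abstract reports; scikit-learn's cohen_kappa_score with weights="quadratic" implements it directly, penalizing disagreements by the squared distance between score categories.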