| Title: |
Automatic Prompt Engineering for Automatic Scoring |
| Language: |
English |
| Authors: |
Mingfeng Xue (ORCID 0000-0002-4801-3754); Yunting Liu (ORCID 0009-0004-9594-9661); Xingyao Xiao (ORCID 0000-0001-8430-0438); Mark Wilson (ORCID 0000-0002-0425-5305) |
| Source: |
Journal of Educational Measurement. 2025 62(4):559-587. |
| Availability: |
Wiley. Available from: John Wiley & Sons, Inc. 111 River Street, Hoboken, NJ 07030. Tel: 800-835-6770; e-mail: cs-journals@wiley.com; Web site: https://www.wiley.com/en-us |
| Peer Reviewed: |
Y |
| Page Count: |
29 |
| Publication Date: |
2025 |
| Sponsoring Agency: |
National Science Foundation (NSF) |
| Contract Number: |
2010322 |
| Document Type: |
Journal Articles; Reports - Research |
| Descriptors: |
Computer Assisted Testing; Prompting; Educational Assessment; Automation; Natural Language Processing; Scoring; True Scores; Accuracy; Validity; Scoring Rubrics; Innovation |
| DOI: |
10.1111/jedm.70002 |
| ISSN: |
0022-0655; 1745-3984 |
| Abstract: |
Prompts play a crucial role in eliciting accurate outputs from large language models (LLMs). This study examines the effectiveness of an automatic prompt engineering (APE) framework for automatic scoring in educational measurement. We collected constructed-response data from 930 students across 11 items and used human scores as the true labels. A baseline was established by providing LLMs with the original human-scoring instructions and materials; APE was then applied to optimize the prompt for each item. On average, APE increased scoring accuracy by 9%; few-shot learning (i.e., including multiple labeled examples related to the goal) increased APE performance by a further 2%; and a high temperature (i.e., a parameter that increases output randomness) was needed in at least part of the APE process to improve scoring accuracy. Quadratic Weighted Kappa (QWK) showed a similar pattern. These findings support the use of APE in automatic scoring. Compared with the manual scoring instructions, however, APE tended to restate and reformat the scoring prompts, which could raise validity concerns: the creative variability introduced by LLMs calls for a balance between innovation and adherence to the scoring rubrics.
| Abstractor: |
As Provided |
| Entry Date: |
2026 |
| Accession Number: |
EJ1491371 |
| Database: |
ERIC |
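The abstract reports results in terms of Quadratic Weighted Kappa (QWK), a chance-corrected agreement statistic between two sets of ordinal scores. As a reference, a minimal sketch of the standard QWK computation (the function name and the assumption that scores are integer labels 0..k-1 are illustrative, not from the record):

```python
def quadratic_weighted_kappa(rater_a, rater_b, num_categories):
    """Return QWK between two equal-length lists of integer scores 0..k-1."""
    n = len(rater_a)
    # Observed agreement matrix: counts of (score_a, score_b) pairs.
    observed = [[0.0] * num_categories for _ in range(num_categories)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1
    # Marginal histograms for each rater.
    hist_a = [0.0] * num_categories
    hist_b = [0.0] * num_categories
    for a in rater_a:
        hist_a[a] += 1
    for b in rater_b:
        hist_b[b] += 1
    # Expected matrix under rater independence, scaled to n total.
    expected = [[hist_a[i] * hist_b[j] / n for j in range(num_categories)]
                for i in range(num_categories)]
    # Quadratic disagreement weights: (i - j)^2 / (k - 1)^2.
    denom = (num_categories - 1) ** 2
    num_sum = 0.0
    den_sum = 0.0
    for i in range(num_categories):
        for j in range(num_categories):
            w = (i - j) ** 2 / denom
            num_sum += w * observed[i][j]
            den_sum += w * expected[i][j]
    return 1.0 - num_sum / den_sum
```

Perfect agreement yields 1.0, and chance-level agreement yields approximately 0; unlike raw accuracy, QWK penalizes disagreements more heavily the further apart the two scores are, which is why it complements the accuracy figures quoted in the abstract.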