| Title: |
De-Identifying Student Personally Identifying Information in Discussion Forum Posts with Large Language Models |
| Language: |
English |
| Authors: |
Andres Felipe Zambrano; Shreya Singhal; Maciej Pankiewicz; Ryan Shaun Baker; Chelsea Porter; Xiner Liu |
| Source: |
Information and Learning Sciences. 2025 126(5-6):401-424. |
| Availability: |
Emerald Publishing Limited. Howard House, Wagon Lane, Bingley, West Yorkshire, BD16 1WA, UK. Tel: +44-1274-777700; Fax: +44-1274-785201; e-mail: emerald@emeraldinsight.com; Web site: http://www.emerald.com/insight |
| Peer Reviewed: |
Y |
| Page Count: |
24 |
| Publication Date: |
2025 |
| Document Type: |
Journal Articles; Reports - Research |
| Education Level: |
Higher Education; Postsecondary Education |
| Descriptors: |
Artificial Intelligence; Identification; Privacy; Information Security; Discussion Groups; MOOCs; College Students |
| Geographic Terms: |
Pennsylvania (Philadelphia) |
| DOI: |
10.1108/ILS-11-2024-0156 |
| ISSN: |
2398-5348; 2398-5356 |
| Abstract: |
Purpose: This study aims to evaluate the effectiveness of three large language models (LLMs), GPT-4o, Llama 3.3 70B and Llama 3.1 8B, in redacting personally identifying information (PII) from forum data in massive open online courses (MOOCs). Design/methodology/approach: Forum posts from students enrolled in nine MOOCs were redacted by three human reviewers. The GPT and Llama models were then tasked with de-identifying the same data set using standardized prompts. Discrepancies between LLM and human redactions were analyzed to identify patterns in LLM errors. Findings: All models achieved an average recall of over 0.9 in identifying PII and identified PII instances overlooked by humans. However, their precisions were lower -- 0.579 for GPT-4o, 0.506 for Llama 3.3 and 0.262 for Llama 3.1 -- showing a tendency to over-redact non-PII names and locations. Research limitations/implications: Data from several courses were analyzed to increase the findings' generalizability, but the models' performance may vary in other contexts. The GPT and Llama models were selected for their availability and cost-effectiveness at the time of the study; newer models may improve performance. Practical implications: The use of downloadable LLMs enables researchers to de-identify data without training specialized models or involving external companies, ensuring that student data remains private. Originality/value: Previous research on LLM text de-identification has largely used proprietary models, which require sharing data containing sensitive PII with third-party companies. This study evaluates the performance of two open-weight models that can be deployed locally, eliminating the need to share sensitive data externally.
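The precision and recall figures reported in the abstract compare model redactions against human redactions. A minimal sketch of how such metrics could be computed, assuming redactions are represented as sets of flagged token indices (an illustrative simplification, not the authors' actual evaluation code):

```python
def redaction_metrics(human_pii: set[int], model_pii: set[int]) -> tuple[float, float]:
    """Return (precision, recall) for PII redaction.

    human_pii: token indices flagged as PII by human reviewers (gold standard)
    model_pii: token indices flagged as PII by the LLM
    """
    true_pos = len(human_pii & model_pii)  # tokens flagged by both
    precision = true_pos / len(model_pii) if model_pii else 1.0
    recall = true_pos / len(human_pii) if human_pii else 1.0
    return precision, recall

# Hypothetical example: humans flagged tokens {3, 7}; the model also flagged
# token 12 (an over-redaction, e.g. a non-PII name mentioned in a post).
p, r = redaction_metrics({3, 7}, {3, 7, 12})
```

Under this scheme, over-redaction of non-PII names lowers precision (false positives inflate `model_pii`) while leaving recall intact, matching the pattern of high recall and lower precision the study reports.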
| Abstractor: |
As Provided |
| Entry Date: |
2025 |
| Accession Number: |
EJ1473727 |
| Database: |
ERIC |