
Title: Efficient and General Text Classification: An Active Learning Approach Using Active Learning and NLP to Aid Processes Such as Journalistic Investigations and Document Analysis
Authors: van Grinsven, Micha; Brinkhuis, Matthieu; Krempl, Georg; Snijder, Joop; Sub Softw.Techn. for Learning and Teach; Sub Algorithmic Data Analysis; Meo, Rosa; Silvestri, Fabrizio
Publication Year: 2025
Subject Terms: ASReview; Active Learning; Journalistic Investigations; Natural Language Processing; Pipeline; Text classification; Text mining; Taverne; General Computer Science; General Mathematics
Description: Active Learning (AL) has shown advantages over Passive Learning in domains where labeled data is costly to obtain. Nevertheless, it is relatively underused in real-world applications for textual data. In this research, AL is applied to two unbalanced, real-world datasets: one on the now-defunct energy company Enron and one on the Dutch oil company Shell. The Enron data is labeled on the presence of information on logistics in documents, whereas the Shell dataset is part of a current investigation by Follow the Money, a journalism bureau. In this paper, we attempt to aid such a journalistic investigation with an Active Machine Learning approach. This approach assists the investigator (the oracle) in identifying documents that belong to a storyline in the dataset. Documents are classified using only the textual data in these datasets. As an initial test, the public Enron dataset with its large number of labels is used. Subsequently, the method is applied to a real-world case with the Shell dataset. During testing, it is found that the highest F1-score of Passive Learning is matched by an Active Learning approach that uses only 42% of the data necessary for Passive Learning. Furthermore, by combining Active Learning and Natural Language Processing on the Shell data, an F1-score of 0.87 together with an accuracy of 0.91 can be achieved using only 5% of labeled data with a logistic regression model. This shows that Active Learning can aid in a journalistic investigation and the development of storylines. ASReview is used to facilitate this research. The setup presented in this research could be applied to almost any textual data classification problem.
Document Type: book part
File Description: application/pdf
Language: English
ISSN: 1865-0929
Relation: https://dspace.library.uu.nl/handle/1874/482628
Availability: https://dspace.library.uu.nl/handle/1874/482628
Rights: info:eu-repo/semantics/OpenAccess
Accession Number: edsbas.9B50792A
Database: BASE
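
Illustrative sketch: The Description refers to an oracle labeling documents for a logistic regression model within an Active Learning loop. The Python sketch below shows one way such a pool-based loop could look. It is not the authors' pipeline and does not use ASReview. The query strategy (uncertainty sampling), the bag-of-words features, the hand-written logistic regression, the toy documents, and all function names are assumptions made for illustration only.

```python
import math
from collections import Counter

def featurize(text):
    """Bag-of-words counts for a lowercased, whitespace-tokenized text."""
    return Counter(text.lower().split())

def predict_proba(weights, bias, feats):
    """Logistic regression probability that a document is relevant."""
    z = bias + sum(weights.get(tok, 0.0) * cnt for tok, cnt in feats.items())
    z = max(-30.0, min(30.0, z))  # clamp to keep math.exp well-behaved
    return 1.0 / (1.0 + math.exp(-z))

def train(labeled, epochs=200, lr=0.5):
    """Fit logistic regression with plain stochastic gradient descent."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for feats, y in labeled:
            err = predict_proba(weights, bias, feats) - y
            bias -= lr * err
            for tok, cnt in feats.items():
                weights[tok] = weights.get(tok, 0.0) - lr * err * cnt
    return weights, bias

def active_learning(pool, oracle, seed_ids, budget):
    """Pool-based loop: ask the oracle about the most uncertain document."""
    feats = [featurize(text) for text in pool]
    labeled_ids = list(seed_ids)
    labels = {i: oracle(i) for i in labeled_ids}
    for _ in range(budget):
        weights, bias = train([(feats[i], labels[i]) for i in labeled_ids])
        unlabeled = [i for i in range(len(pool)) if i not in labels]
        if not unlabeled:
            break
        # Uncertainty sampling (assumed strategy): probability closest to 0.5.
        query = min(
            unlabeled,
            key=lambda i: abs(predict_proba(weights, bias, feats[i]) - 0.5),
        )
        labels[query] = oracle(query)  # the investigator supplies the label
        labeled_ids.append(query)
    weights, bias = train([(feats[i], labels[i]) for i in labeled_ids])
    return weights, bias, labeled_ids

# Hypothetical toy pool; 1 = about logistics, 0 = not about logistics.
POOL = [
    "pipeline shipment schedule for gas delivery",
    "truck delivery schedule and shipment route",
    "lunch menu for the office party",
    "shipment of pipeline parts delayed",
    "holiday party photos attached",
    "delivery route for the gas shipment",
    "office lunch on friday",
    "party invitation for friday",
]
TRUE_LABELS = [1, 1, 0, 1, 0, 1, 0, 0]

# The oracle is simulated here by looking up the hidden true label.
weights, bias, labeled_ids = active_learning(
    POOL, oracle=lambda i: TRUE_LABELS[i], seed_ids=[0, 2], budget=3
)
print("Documents labeled by the oracle:", labeled_ids)
```

In this sketch the model sees only the two seed labels plus three queried labels, so five of the eight documents are labeled in total. In a real investigation the simulated oracle would be replaced by the investigator's judgment, and the stopping rule would depend on the labeling budget available.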