Short-circuiting Shortcuts: Mechanistic Investigation of Shortcuts in Text Classification

Title: Short-circuiting Shortcuts: Mechanistic Investigation of Shortcuts in Text Classification
Authors: Eshuijs, Leon; Wang, Shihan; Fokkens, Antske; Sub Intelligent Systems
Publication Year: 2025
Description: Reliance on spurious correlations (shortcuts) has been shown to underlie many of the successes of language models. Previous work focused on identifying the input elements that impact prediction. We investigate how shortcuts are actually processed within the model's decision-making mechanism. We use actor names in movie reviews as controllable shortcuts with known impact on the outcome. We use mechanistic interpretability methods and identify specific attention heads that focus on shortcuts. These heads gear the model towards a label before processing the complete input, effectively making premature decisions that bypass contextual analysis. Based on these findings, we introduce Head-based Token Attribution (HTA), which traces intermediate decisions back to input tokens. We show that HTA is effective in detecting shortcuts in LLMs and enables targeted mitigation by selectively deactivating shortcut-related attention heads.
Document Type: book part
File Description: application/pdf
Language: English
Relation: https://dspace.library.uu.nl/handle/1874/483015
Availability: https://dspace.library.uu.nl/handle/1874/483015
Rights: info:eu-repo/semantics/OpenAccess
Accession Number: edsbas.72FD4D83
Database: BASE