
SynODC: Utilizing the Syntactic Structure for Outlier Detection in Categorical Attributes

Title: SynODC: Utilizing the Syntactic Structure for Outlier Detection in Categorical Attributes
Authors: Zylinski, Arthur; Qahtan, Abdulhakim A.; Sub Data Intensive Systems
Publication Year: 2024
Subject Terms: Taverne
Description: Outlier detection is a long-standing problem, and outliers significantly degrade data quality. Machine learning models trained on low-quality data tend to produce inaccurate decisions and poor predictions. While detecting outliers in numerical data has been studied extensively, few attempts have been made to detect outliers in attributes with categorical values. In this paper, we introduce SynODC for detecting categorical outliers in relational (tabular) datasets by utilizing the syntactic structure of the values. For a given attribute, SynODC identifies a set of patterns that represent the majority of the values as dominating patterns. Data values that do not match (i.e., cannot be generated by) any of the dominating patterns are declared outliers. Our goal is to construct, for each attribute, a minimal set of dominating patterns that is expressive enough to represent the different formats of the values in the attribute. To do so, we define a new distance metric that generalizes the Levenshtein distance to measure the distance between patterns. Using this metric, SynODC combines similar patterns to maintain compact representations of the attributes. Experimental results on multiple real-world datasets demonstrate the effectiveness of SynODC in detecting syntactic outliers that cannot be detected by other data cleaning tools.
Document Type: book part
File Description: application/pdf
Language: English
ISSN: 0302-9743
Relation: https://dspace.library.uu.nl/handle/1874/482624
Availability: https://dspace.library.uu.nl/handle/1874/482624
Rights: info:eu-repo/semantics/OpenAccess
Accession Number: edsbas.92E63501
Database: BASE
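
Illustrative sketch: The Description above outlines the general idea of flagging values whose syntactic pattern does not match any dominating pattern of an attribute. The Python sketch below shows that idea in a heavily simplified form and is not the SynODC algorithm itself. The pattern encoding (U, L, D for uppercase, lowercase, digit), the coverage threshold, and all function names are assumptions made for illustration. The distance function is the standard Levenshtein distance, not the generalized metric defined in the paper, and no pattern merging is attempted.

```python
from collections import Counter
from itertools import groupby


def to_pattern(value: str) -> str:
    """Map a value to a coarse syntactic pattern (hypothetical encoding)."""
    def symbol(ch: str) -> str:
        if ch.isdigit():
            return "D"
        if ch.isalpha():
            return "U" if ch.isupper() else "L"
        return ch  # keep punctuation and spaces literally

    # Collapse runs of the same symbol, so "AB-123" and "C-4" both become "U-D".
    return "".join(key for key, _ in groupby(symbol(ch) for ch in value))


def dominating_patterns(values, coverage=0.9):
    """Pick the most frequent patterns until `coverage` of the values is covered.

    The threshold is an assumed parameter, not one taken from the paper.
    """
    counts = Counter(to_pattern(v) for v in values)
    chosen, covered = [], 0
    for pattern, count in counts.most_common():
        if covered >= coverage * len(values):
            break
        chosen.append(pattern)
        covered += count
    return chosen


def find_outliers(values, coverage=0.9):
    """Return the values whose pattern is not among the dominating patterns."""
    dominant = set(dominating_patterns(values, coverage))
    return [v for v in values if to_pattern(v) not in dominant]


def levenshtein(a: str, b: str) -> int:
    """Standard Levenshtein (edit) distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    codes = ["AB-123", "CD-4567", "EF-89", "GH-1", "IJ-22",
             "KL-333", "MN-4", "OP-55", "QR-666", "xx_12"]
    print(dominating_patterns(codes))
    print(find_outliers(codes))
```

In this toy setting, the edit distance between two pattern strings could serve as a rough signal for which patterns are similar enough to combine, which is the role the Description assigns to the paper's generalized metric.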