| Title: |
DetCat: Detecting Categorical Outliers in Relational Datasets |
| Authors: |
Zylinski, Arthur; Qahtan, Abdulhakim A.; Sub Data Intensive Systems; Data Intensive Systems |
| Publication Year: |
2024 |
| Subject Terms: |
categorical values; outliers; similarity metrics; syntactic structure; Taverne; General Business, Management and Accounting; General Decision Sciences |
| Description: |
Poor data quality significantly affects data analytics tasks, leading to inaccurate decisions and poor predictions from machine learning models. Outliers are one of the most common data glitches that impact data quality. While detecting outliers in numerical data has been studied extensively, few attempts have been made to detect categorical outliers. In this paper, we introduce DetCat, a tool for detecting categorical outliers in relational datasets by utilizing the syntactic structure of the values. For a given attribute, DetCat identifies the set of patterns that represents the majority of the values; these are the dominating patterns. Data values that cannot be generated by the dominating patterns are declared outliers. The demo will show the effectiveness of our tool in detecting categorical outliers and discovering syntactic data patterns. |
| Document Type: |
book part |
| File Description: |
application/pdf |
| Language: |
English |
| ISSN: |
2155-0751 |
| Relation: |
https://dspace.library.uu.nl/handle/1874/482520 |
| Availability: |
https://dspace.library.uu.nl/handle/1874/482520 |
| Rights: |
info:eu-repo/semantics/OpenAccess |
| Accession Number: |
edsbas.97CD4F41 |
| Database: |
BASE |
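
The description above outlines the idea of dominating patterns only at a high level. The following is a minimal Python sketch of that idea, not the DetCat implementation. The pattern encoding (`U`, `L`, `D` for uppercase, lowercase, and digit runs), the function names `to_pattern` and `find_outliers`, and the `coverage` threshold are all assumptions made for illustration; the record does not specify how DetCat builds or selects its patterns.

```python
from collections import Counter
from itertools import groupby


def to_pattern(value: str) -> str:
    """Map a value to a coarse syntactic pattern (hypothetical encoding)."""
    classes = []
    for ch in value:
        if ch.isdigit():
            classes.append("D")
        elif ch.isalpha():
            classes.append("U" if ch.isupper() else "L")
        else:
            classes.append(ch)  # keep punctuation and spaces literally
    # Collapse runs of the same character class, e.g. "DDDD" -> "D+".
    out = []
    for symbol, run in groupby(classes):
        n = len(list(run))
        out.append(symbol + "+" if n > 1 and symbol in "DUL" else symbol * n)
    return "".join(out)


def find_outliers(values, coverage=0.8):
    """Return (dominating patterns, outlier values) for one attribute.

    Assumption: the most frequent patterns are added until they cover
    at least `coverage` of the values; everything else is an outlier.
    """
    patterns = [to_pattern(v) for v in values]
    counts = Counter(patterns)
    dominating, covered = set(), 0
    for pattern, count in counts.most_common():
        if covered / len(values) >= coverage:
            break
        dominating.add(pattern)
        covered += count
    outliers = [v for v, p in zip(values, patterns) if p not in dominating]
    return dominating, outliers


if __name__ == "__main__":
    column = ["AB-1234", "CD-5678", "EF-9012", "GH-3456", "KL-1111", "ij_77"]
    print(find_outliers(column))
```

In this sketch, five of the six sample values share one pattern, so that pattern alone reaches the assumed coverage threshold and the remaining value is reported as an outlier. A fixed threshold is the simplest possible selection rule; the actual tool may choose its dominating patterns differently.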