| Title: |
Dataset Discovery using Semantic Matching |
| Authors: |
Khwaileh, Enas; Velegrakis, Yannis; Sub Data Intensive Systems |
| Publication Year: |
2025 |
| Subject Terms: |
Information Systems; Software; Computer Science Applications |
| Description: |
The exponential growth in data size and heterogeneity has made it increasingly challenging to identify datasets that meet specific analytical needs. Traditional keyword search methods often fail at this task because they cannot fully capture the semantics of the datasets and match them to those of the query. We introduce a novel dataset discovery method that significantly enhances both accuracy and retrieval speed. By employing advanced semantic matching at the individual field level and leveraging clustering and dimensionality reduction techniques, our method efficiently and effectively retrieves the datasets related to a query. Unlike traditional methods that focus on syntactic matches, our approach uncovers deeper semantic relationships within tabular data, providing more precise and relevant results. It achieves this by using transformers to generate embeddings and working with those embeddings instead of the actual values. We present three different search methods that utilize these embeddings, and experimentally demonstrate the improvement achieved over the state of the art. |
| Document Type: |
book part |
| File Description: |
application/pdf |
| Language: |
English |
| ISSN: |
2367-2005 |
| Relation: |
https://dspace.library.uu.nl/handle/1874/482923 |
| Availability: |
https://dspace.library.uu.nl/handle/1874/482923 |
| Rights: |
info:eu-repo/semantics/OpenAccess |
| Accession Number: |
edsbas.7DEE79F |
| Database: |
BASE |