Sampling dirty data for matching attributes

Henning Köhler*, Xiaofang Zhou, Shazia Sadiq, Yanfeng Shu, Kerry Taylor

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

19 Citations (Scopus)

Abstract

We investigate the problem of creating and analyzing samples of relational databases to find relationships between string-valued attributes. Our focus is on identifying attribute pairs whose value sets overlap, a precondition for typical joins over such attributes. However, real-world data sets are often 'dirty', especially when integrating data from different sources. To deal with this issue, we propose new similarity measures between sets of strings, which consider not only set-based similarity but also similarity between string instances. To make the measures effective, we develop efficient algorithms for distributed sample creation and similarity computation. Test results show that for dirty data our measures are more accurate for measuring value overlap than existing sample-based methods, but we also observe that there is a clear tradeoff between accuracy and speed. This motivates a two-stage filtering approach, with both measures operating on the same samples.
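
The contrast the abstract draws between purely set-based overlap and overlap that also accounts for similarity between individual strings can be illustrated with a small sketch. This is not the measure defined in the paper: the function names, the use of Python's `difflib` ratio as the string similarity, and the 0.8 threshold are all assumptions chosen only for illustration.

```python
from difflib import SequenceMatcher


def exact_jaccard(a, b):
    """Plain set-based overlap: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def string_sim(s, t):
    """Similarity of two strings in [0, 1].

    Assumption: difflib's ratio on lowercased strings stands in for
    whatever string similarity a real system would use.
    """
    return SequenceMatcher(None, s.lower(), t.lower()).ratio()


def fuzzy_containment(a, b, threshold=0.8):
    """Fraction of values in A with a sufficiently similar value in B.

    Hypothetical measure for illustration; the threshold is arbitrary.
    """
    a, b = set(a), set(b)
    if not a:
        return 0.0
    matched = sum(
        1 for s in a if any(string_sim(s, t) >= threshold for t in b)
    )
    return matched / len(a)


# Two attribute value samples that refer to the same cities,
# but with case, whitespace, and spelling differences.
clean = ["Indianapolis", "Brisbane", "Canberra"]
dirty = ["indianapolis", "Brisbane ", "Canbera", "Sydney"]

print(exact_jaccard(clean, dirty))
print(fuzzy_containment(clean, dirty))
```

On these two samples the exact measure finds no shared values at all, while the fuzzy measure matches every value in `clean` to a near-duplicate in `dirty`. The fuzzy version compares every pair of strings, which hints at the accuracy versus speed tradeoff the abstract mentions.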

Original language: English
Title of host publication: Proceedings of the 2010 International Conference on Management of Data, SIGMOD '10
Pages: 63-74
Number of pages: 12
Publication status: Published - 2010
Externally published: Yes
Event: 2010 International Conference on Management of Data, SIGMOD '10 - Indianapolis, IN, United States
Duration: 6 Jun 2010 to 11 Jun 2010

Publication series

Name: Proceedings of the ACM SIGMOD International Conference on Management of Data
ISSN (Print): 0730-8078

Conference

Conference: 2010 International Conference on Management of Data, SIGMOD '10
Country/Territory: United States
City: Indianapolis, IN
Period: 6/06/10 to 11/06/10
