TY - GEN
T1 - Sampling dirty data for matching attributes
AU - Köhler, Henning
AU - Zhou, Xiaofang
AU - Sadiq, Shazia
AU - Shu, Yanfeng
AU - Taylor, Kerry
PY - 2010
Y1 - 2010
N2 - We investigate the problem of creating and analyzing samples of relational databases to find relationships between string-valued attributes. Our focus is on identifying attribute pairs whose value sets overlap, a pre-condition for typical joins over such attributes. However, real-world data sets are often 'dirty', especially when integrating data from different sources. To deal with this issue, we propose new similarity measures between sets of strings, which consider not only set-based similarity, but also similarity between string instances. To make the measures effective, we develop efficient algorithms for distributed sample creation and similarity computation. Test results show that for dirty data our measures are more accurate for measuring value overlap than existing sample-based methods, but we also observe that there is a clear tradeoff between accuracy and speed. This motivates a two-stage filtering approach, with both measures operating on the same samples.
AB - We investigate the problem of creating and analyzing samples of relational databases to find relationships between string-valued attributes. Our focus is on identifying attribute pairs whose value sets overlap, a pre-condition for typical joins over such attributes. However, real-world data sets are often 'dirty', especially when integrating data from different sources. To deal with this issue, we propose new similarity measures between sets of strings, which consider not only set-based similarity, but also similarity between string instances. To make the measures effective, we develop efficient algorithms for distributed sample creation and similarity computation. Test results show that for dirty data our measures are more accurate for measuring value overlap than existing sample-based methods, but we also observe that there is a clear tradeoff between accuracy and speed. This motivates a two-stage filtering approach, with both measures operating on the same samples.
KW - database integration
KW - sampling
KW - schema matching
UR - http://www.scopus.com/inward/record.url?scp=77954738593&partnerID=8YFLogxK
U2 - 10.1145/1807167.1807177
DO - 10.1145/1807167.1807177
M3 - Conference contribution
SN - 9781450300322
T3 - Proceedings of the ACM SIGMOD International Conference on Management of Data
SP - 63
EP - 74
BT - Proceedings of the 2010 International Conference on Management of Data, SIGMOD '10
T2 - 2010 International Conference on Management of Data, SIGMOD '10
Y2 - 6 June 2010 through 11 June 2010
ER -