TY - CHAP
T1 - Quality and complexity measures for data linkage and deduplication
AU - Christen, Peter
AU - Goiser, Karl
PY - 2007
Y1 - 2007
N2 - Deduplicating one data set or linking several data sets are increasingly important tasks in the data preparation steps of many data mining projects. The aim of such linkages is to match all records relating to the same entity. Research interest in this area has increased in recent years, with techniques originating from statistics, machine learning, information retrieval, and database research being combined and applied to improve the linkage quality, as well as to increase performance and efficiency when linking or deduplicating very large data sets. Different measures have been used to characterise the quality and complexity of data linkage algorithms, and several new metrics have been proposed. An overview of the issues involved in measuring data linkage and deduplication quality and complexity is presented in this chapter. It is shown that measures in the space of record pair comparisons can produce deceptive quality results. Various measures are discussed and recommendations are given on how to assess data linkage and deduplication quality and complexity.
KW - Data integration and matching
KW - Data mining pre-processing
KW - Data or record linkage
KW - Deduplication
KW - Quality and complexity measures
UR - http://www.scopus.com/inward/record.url?scp=33846428121&partnerID=8YFLogxK
U2 - 10.1007/978-3-540-44918-8_6
DO - 10.1007/978-3-540-44918-8_6
M3 - Chapter
SN - 3540449116
SN - 9783540449119
T3 - Studies in Computational Intelligence
SP - 127
EP - 151
BT - Quality Measures in Data Mining
A2 - Guillet, Fabrice
A2 - Hamilton, Howard
ER -