TY - GEN
T1 - Assessing deduplication and data linkage quality
T2 - 4th Australasian Data Mining Conference, AusDM 2005 - Collocated with the 18th Australian Joint Conference on Artificial Intelligence, AI 2005 and the 2nd Australian Conference on Artificial Life, ACAL 2005
AU - Christen, Peter
AU - Goiser, Karl
PY - 2005
Y1 - 2005
N2 - Deduplicating one data set or linking several data sets are increasingly important tasks in the data preparation steps of many data mining projects. The aim of such linkages is to match all records relating to the same entity. Research interest in this area has increased in recent years, with techniques originating from statistics, machine learning, information retrieval, and database research being combined and applied to improve the linkage quality, as well as to increase performance and efficiency when deduplicating or linking very large data sets. Different measures have been used to characterise the quality of data linkage algorithms. This paper presents an overview of the issues involved in measuring deduplication and data linkage quality, and it is shown that measures in the space of record pair comparisons can produce deceptive accuracy results. Various measures are discussed and recommendations are given on how to assess deduplication and data linkage quality.
AB - Deduplicating one data set or linking several data sets are increasingly important tasks in the data preparation steps of many data mining projects. The aim of such linkages is to match all records relating to the same entity. Research interest in this area has increased in recent years, with techniques originating from statistics, machine learning, information retrieval, and database research being combined and applied to improve the linkage quality, as well as to increase performance and efficiency when deduplicating or linking very large data sets. Different measures have been used to characterise the quality of data linkage algorithms. This paper presents an overview of the issues involved in measuring deduplication and data linkage quality, and it is shown that measures in the space of record pair comparisons can produce deceptive accuracy results. Various measures are discussed and recommendations are given on how to assess deduplication and data linkage quality.
KW - Data integration and matching
KW - Data mining pre-processing
KW - Data or record linkage
KW - Deduplication
KW - Quality measures
UR - http://www.scopus.com/inward/record.url?scp=84884345501&partnerID=8YFLogxK
M3 - Conference contribution
SN - 1863657169
SN - 9781863657167
T3 - AusDM 2005 Proc. - 4th Australasian Data Mining Conf. - Collocated with the 18th Australian Joint Conf. on Artificial Intelligence, AI 2005 and the 2nd Australian Conf. on Artificial Life, ACAL 2005
SP - 37
EP - 52
BT - AusDM 2005 Proc. - 4th Australasian Data Mining Conf. - Collocated with the 18th Australian Joint Conf. on Artificial Intelligence, AI 2005 and the 2nd Australian Conf. on Artificial Life, ACAL 2005
Y2 - 5 December 2005 through 6 December 2005
ER -