Assessing deduplication and data linkage quality: What to measure?

Peter Christen*, Karl Goiser

*Corresponding author for this work

    Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

    12 Citations (Scopus)

    Abstract

    Deduplicating one data set or linking several data sets are increasingly important tasks in the data preparation steps of many data mining projects. The aim of such linkages is to match all records relating to the same entity. Research interest in this area has increased in recent years, with techniques originating from statistics, machine learning, information retrieval, and database research being combined and applied to improve the linkage quality, as well as to increase performance and efficiency when deduplicating or linking very large data sets. Different measures have been used to characterise the quality of data linkage algorithms. This paper presents an overview of the issues involved in measuring deduplication and data linkage quality, and it is shown that measures in the space of record pair comparisons can produce deceptive accuracy results. Various measures are discussed and recommendations are given on how to assess deduplication and data linkage quality.
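The abstract's point about deceptive accuracy in the space of record pair comparisons can be illustrated with a small sketch. The figures below are hypothetical and are not taken from the paper: it assumes two data sets of 1,000 records each, with every record in one having exactly one true match in the other. The function name `pair_space_measures` is likewise an illustrative choice. Because almost all of the 1,000,000 possible record pairs are true non-matches, a classifier that never declares a match still scores close to perfect accuracy.

```python
def pair_space_measures(tp, fp, fn, total_pairs):
    """Compute quality measures over all compared record pairs.

    tp, fp, fn are counts of true-positive, false-positive and
    false-negative pairs; true negatives are whatever remains.
    """
    tn = total_pairs - tp - fp - fn
    accuracy = (tp + tn) / total_pairs
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Hypothetical scenario (an assumption, not data from the paper):
# two data sets of 1,000 records each, 1,000 true matching pairs.
records_a = 1000
records_b = 1000
total_pairs = records_a * records_b
true_matches = 1000

# A useless classifier that labels every pair as a non-match.
all_non_match = pair_space_measures(
    tp=0, fp=0, fn=true_matches, total_pairs=total_pairs
)

# A plausible classifier: finds 900 of the matches, with 100 false alarms.
plausible = pair_space_measures(
    tp=900, fp=100, fn=100, total_pairs=total_pairs
)

for name, result in [("all non-match", all_non_match), ("plausible", plausible)]:
    print(name, {key: round(value, 4) for key, value in result.items()})
```

Under these assumptions the two classifiers differ by less than 0.001 in accuracy, while precision, recall and the F-measure, which ignore the true negatives, separate them clearly.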

    Original language: English
    Title of host publication: AusDM 2005 Proc. - 4th Australasian Data Mining Conf. - Collocated with the 18th Australian Joint Conf. on Artificial Intelligence, AI 2005 and the 2nd Australian Conf. on Artificial Life, ACAL 2005
    Pages: 37-52
    Number of pages: 16
    Publication status: Published - 2005
    Event: 4th Australasian Data Mining Conference, AusDM 2005 - Collocated with the 18th Australian Joint Conference on Artificial Intelligence, AI 2005 and the 2nd Australian Conference on Artificial Life, ACAL 2005 - Sydney, NSW, Australia
    Duration: 5 Dec 2005 to 6 Dec 2005

    Publication series

    Name: AusDM 2005 Proc. - 4th Australasian Data Mining Conf. - Collocated with the 18th Australian Joint Conf. on Artificial Intelligence, AI 2005 and the 2nd Australian Conf. on Artificial Life, ACAL 2005

    Conference

    Conference: 4th Australasian Data Mining Conference, AusDM 2005 - Collocated with the 18th Australian Joint Conference on Artificial Intelligence, AI 2005 and the 2nd Australian Conference on Artificial Life, ACAL 2005
    Country/Territory: Australia
    City: Sydney, NSW
    Period: 5/12/05 to 6/12/05
