Quality and complexity measures for data linkage and deduplication

Peter Christen*, Karl Goiser

*Corresponding author for this work

    Research output: Chapter in Book/Report/Conference proceeding › Chapter › peer-review

    134 Citations (Scopus)

    Abstract

    Deduplicating one data set or linking several data sets are increasingly important tasks in the data preparation steps of many data mining projects. The aim of such linkages is to match all records relating to the same entity. Research interest in this area has increased in recent years, with techniques originating from statistics, machine learning, information retrieval, and database research being combined and applied to improve the linkage quality, as well as to increase performance and efficiency when linking or deduplicating very large data sets. Different measures have been used to characterise the quality and complexity of data linkage algorithms, and several new metrics have been proposed. An overview of the issues involved in measuring data linkage and deduplication quality and complexity is presented in this chapter. It is shown that measures in the space of record pair comparisons can produce deceptive quality results. Various measures are discussed and recommendations are given on how to assess data linkage and deduplication quality and complexity.
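    The chapter's central caution is that quality measures computed over the space of record pair comparisons can be deceptive, because the number of candidate pairs grows quadratically while the number of true matches grows only linearly, so non-matches vastly outnumber matches. The sketch below is not taken from the chapter; it uses hypothetical counts purely to illustrate how accuracy in the pair space can look excellent even when precision and recall are poor.

```python
# Minimal sketch (hypothetical counts, not from the chapter): why accuracy
# measured over record pair comparisons is a deceptive quality measure.

def pair_quality(tp, fp, fn, tn):
    """Return common quality measures computed over record pair comparisons."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1

# Linking two files of 1,000 records each yields 1,000 x 1,000 = 1,000,000
# candidate pairs, of which at most 1,000 can be true matches.
total_pairs = 1_000 * 1_000
# A deliberately mediocre classifier: it finds only half the true matches
# and raises as many false alarms as true matches found.
tp, fn, fp = 500, 500, 500
tn = total_pairs - tp - fn - fp

accuracy, precision, recall, f1 = pair_quality(tp, fp, fn, tn)
print(f"accuracy={accuracy:.4f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
# Prints accuracy=0.9990 despite precision and recall of only 0.5: the huge
# number of true non-matches dominates the pair space, which is exactly why
# measures such as precision, recall and F-measure are preferable here.
```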

    Original language: English
    Title of host publication: Quality Measures in Data Mining
    Editors: Fabrice Guillet, Howard Hamilton
    Pages: 127-151
    Number of pages: 25
    DOIs
    Publication status: Published - 2007

    Publication series

    Name: Studies in Computational Intelligence
    Volume: 43
    ISSN (Print): 1860-949X
