A two-step classification approach to unsupervised record linkage

Peter Christen*

*Corresponding author for this work

    Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

    26 Citations (Scopus)

    Abstract

    Linking or matching databases is becoming increasingly important in many data mining projects, as linked data can contain information that is not available otherwise, or that would be too expensive to collect manually. A main challenge when linking large databases is the classification of the compared record pairs into matches and non-matches. In traditional record linkage, classification thresholds have to be set either manually or using an EM-based approach. More recently developed classification methods are mainly based on supervised machine learning techniques and thus require training data, which is often not available in real world situations or has to be prepared manually. In this paper, a novel two-step approach to record pair classification is presented. In a first step, example training data of high quality is generated automatically, and then used in a second step to train a supervised classifier. Initial experimental results on both real and synthetic data show that this approach can outperform traditional unsupervised clustering, and even achieve linkage quality almost as good as fully supervised techniques.
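The two steps described in the abstract can be illustrated with a minimal sketch. The abstract does not say how seed examples are selected or which supervised classifier is trained, so everything here is an assumption made for illustration: record pairs are represented as vectors of per-field similarities in [0, 1], seeds are pairs whose similarities are all very high or all very low (the `margin` parameter is hypothetical), and a simple nearest-centroid rule stands in for the supervised classifier.

```python
from math import dist


def select_seeds(vectors, margin=0.2):
    """Step 1 (assumed rule): keep only the clearest pairs as training examples.

    A pair is a match seed if every similarity is >= 1 - margin,
    and a non-match seed if every similarity is <= margin.
    """
    matches = [v for v in vectors if all(s >= 1.0 - margin for s in v)]
    non_matches = [v for v in vectors if all(s <= margin for s in v)]
    return matches, non_matches


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return tuple(sum(column) / n for column in zip(*vectors))


def train_nearest_centroid(matches, non_matches):
    """Step 2 (stand-in classifier): learn one centroid per class from the seeds."""
    if not matches or not non_matches:
        raise ValueError("need at least one seed example of each class")
    c_match, c_non_match = centroid(matches), centroid(non_matches)

    def classify(vector):
        # True means "match": the vector lies closer to the match centroid.
        return dist(vector, c_match) < dist(vector, c_non_match)

    return classify


# Made-up similarity vectors for six compared record pairs (three fields each).
vectors = [
    (1.0, 0.9, 1.0),
    (0.9, 1.0, 0.95),
    (0.0, 0.1, 0.0),
    (0.1, 0.0, 0.05),
    (0.8, 0.6, 0.9),   # ambiguous: not selected as a seed
    (0.3, 0.2, 0.4),   # ambiguous: not selected as a seed
]

match_seeds, non_match_seeds = select_seeds(vectors)
classify = train_nearest_centroid(match_seeds, non_match_seeds)
labels = [classify(v) for v in vectors]
```

In this toy data the four unambiguous pairs become training examples, and the two ambiguous pairs are labelled only in the second step by the classifier trained on those seeds. No manually labelled data is used at any point, which is the sense in which the overall procedure is unsupervised.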

    Original language: English
    Title of host publication: Data Mining and Analytics 2007 - 6th Australasian Data Mining Conference, AusDM 2007, Proceedings
    Pages: 111-119
    Number of pages: 9
    Publication status: Published - 2007
    Event: 6th Australasian Data Mining Conference, AusDM 2007 - Gold Coast, QLD, Australia
    Duration: 3 Dec 2007 to 4 Dec 2007

    Publication series

    Name: Conferences in Research and Practice in Information Technology Series
    Volume: 70
    ISSN (Print): 1445-1336

    Conference

    Conference: 6th Australasian Data Mining Conference, AusDM 2007
    Country/Territory: Australia
    City: Gold Coast, QLD
    Period: 3/12/07 to 4/12/07
