Using metric space indexing for complete and efficient record linkage

Özgür Akgün*, Alan Dearle, Graham Kirby, Peter Christen

*Corresponding author for this work

    Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

    4 Citations (Scopus)

    Abstract

    Record linkage is the process of identifying records that refer to the same real-world entities in situations where entity identifiers are unavailable. Records are linked on the basis of similarity between common attributes, with every pair being classified as a link or non-link depending on their similarity. Linkage is usually performed in a three-step process: first, groups of similar candidate records are identified using indexing, then pairs within the same group are compared in more detail, and finally classified. Even state-of-the-art indexing techniques, such as locality sensitive hashing, have potential drawbacks. They may fail to group together some true matching records with high similarity, or they may group records with low similarity, leading to high computational overhead. We propose using metric space indexing (MSI) to perform complete linkage, resulting in a parameter-free process combining indexing, comparison and classification into a single step delivering complete and efficient record linkage. An evaluation on real-world data from several domains shows that linkage using MSI can yield better quality than current indexing techniques, with similar execution cost, without the need for domain knowledge or trial and error to configure the process.
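As a rough illustration of why a metric index can return every record within a given distance of a query, here is a minimal sketch. It is not the authors' implementation. The choice of a BK-tree, the Levenshtein edit distance, the sample surnames, and the `radius` argument are all assumptions made for this example; the abstract does not say which index structure or distance function the paper uses. The only property the sketch relies on is the triangle inequality, which lets the search skip subtrees that cannot contain a match without ever discarding a true one.

```python
from typing import Dict, List, Optional, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (a true metric)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # delete ca
                curr[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),   # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


# A node is (value, children), where children maps distance -> child node.
Node = Tuple[str, Dict[int, "Node"]]


class BKTree:
    """Hypothetical minimal BK-tree over strings; exact duplicates are ignored."""

    def __init__(self) -> None:
        self.root: Optional[Node] = None

    def add(self, item: str) -> None:
        if self.root is None:
            self.root = (item, {})
            return
        node = self.root
        while True:
            d = levenshtein(item, node[0])
            if d == 0:
                return  # already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = (item, {})
                return
            node = child

    def within(self, query: str, radius: int) -> List[Tuple[str, int]]:
        """Return every stored value whose distance to query is <= radius."""
        results: List[Tuple[str, int]] = []
        stack: List[Node] = [self.root] if self.root is not None else []
        while stack:
            value, children = stack.pop()
            d = levenshtein(query, value)
            if d <= radius:
                results.append((value, d))
            # Triangle inequality: a match can only sit under an edge e
            # with |d - e| <= radius, so all other subtrees are skipped.
            for edge, child in children.items():
                if d - radius <= edge <= d + radius:
                    stack.append(child)
        return sorted(results)


# Illustrative data only; these names are not from the paper's evaluation.
names = ["smith", "smyth", "smithe", "jones", "johns", "brown"]
tree = BKTree()
for name in names:
    tree.add(name)

print(tree.within("smith", 1))
```

Because pruning is justified by the triangle inequality alone, the range query returns the same set as an exhaustive pairwise comparison would, which is the sense of "complete" that this sketch aims to illustrate.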

    Original language: English
    Title of host publication: Advances in Knowledge Discovery and Data Mining - 22nd Pacific-Asia Conference, PAKDD 2018, Proceedings
    Editors: Geoffrey I. Webb, Dinh Phung, Mohadeseh Ganji, Lida Rashidi, Vincent S. Tseng, Bao Ho
    Publisher: Springer Verlag
    Pages: 89-101
    Number of pages: 13
    ISBN (Print): 9783319930398
    Publication status: Published - 2018
    Event: 22nd Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, PAKDD 2018 - Melbourne, Australia
    Duration: 3 Jun 2018 – 6 Jun 2018

    Publication series

    Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
    Volume: 10939 LNAI
    ISSN (Print): 0302-9743
    ISSN (Electronic): 1611-3349

    Conference

    Conference: 22nd Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, PAKDD 2018
    Country/Territory: Australia
    City: Melbourne
    Period: 3/06/18 – 6/06/18
