A scalable and efficient subgroup blocking scheme for multidatabase record linkage

Thilina Ranbaduge*, Dinusha Vatsalan, Peter Christen

*Corresponding author for this work

    Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

    5 Citations (Scopus)

    Abstract

    Record linkage is a commonly used task in data integration that identifies matching records, from different databases, that refer to the same entity. The scalability of multidatabase record linkage (MDRL) is significantly challenged by increases in both the sizes and the number of databases to be linked. Identifying matching records across subgroups of databases is an important aspect of MDRL that has not been addressed so far. We propose a scalable subgroup blocking approach for MDRL that uses an efficient search over a graph structure to identify similar blocks of records that need to be compared across subgroups of multiple databases. We provide an analysis of our technique in terms of complexity and blocking quality. We conduct an empirical study on large real-world datasets that shows our approach is scalable with the size of subgroups and the number of databases, and outperforms an existing state-of-the-art blocking technique for MDRL.

    Original language: English
    Title of host publication: Advances in Knowledge Discovery and Data Mining - 22nd Pacific-Asia Conference, PAKDD 2018, Proceedings
    Editors: Geoffrey I. Webb, Dinh Phung, Mohadeseh Ganji, Lida Rashidi, Vincent S. Tseng, Bao Ho
    Publisher: Springer Verlag
    Pages: 15-27
    Number of pages: 13
    ISBN (Print): 9783319930398
    Publication status: Published - 2018
    Event: 22nd Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, PAKDD 2018 - Melbourne, Australia
    Duration: 3 Jun 2018 to 6 Jun 2018

    Publication series

    Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
    Volume: 10939 LNAI
    ISSN (Print): 0302-9743
    ISSN (Electronic): 1611-3349

    Conference

    Conference: 22nd Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, PAKDD 2018
    Country/Territory: Australia
    City: Melbourne
    Period: 3/06/18 to 6/06/18
