Unsupervised blocking key selection for real-time entity resolution

Banda Ramadan*, Peter Christen

*Corresponding author for this work

    Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

    17 Citations (Scopus)

    Abstract

    Real-time entity resolution (ER) is the process of matching query records in sub-second time with records in a database that represent the same real-world entity. Indexing is a major step in the ER process, aimed at reducing the search space by bringing similar records closer to each other using a blocking key criterion. Selecting these keys is crucial for the effectiveness and efficiency of the real-time ER process. Traditional indexing techniques require domain knowledge for optimal key selection. However, to make the ER process less dependent on human domain knowledge, automatic selection of optimal blocking keys is required. In this paper we propose an unsupervised learning technique that automatically selects optimal blocking keys for building indexes that can be used in real-time ER. We specifically learn multiple keys to be used with multi-pass sorted neighbourhood, one of the most efficient and widely used indexing techniques for ER. We evaluate the proposed approach using three real-world data sets, and compare it with an existing automatic blocking key selection technique. The results show that our approach learns optimal blocking/sorting keys that are suitable for real-time ER. The learnt keys significantly increase the efficiency of query matching while maintaining the quality of matching results.
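    As an illustration of the indexing step the abstract describes, the following is a minimal sketch of querying a multi-pass sorted neighbourhood index. It is not the authors' key selection technique: the sorting keys here are hand-written rather than learnt, and all names (`build_sorted_index`, `query_candidates`, `multi_pass_candidates`, the sample records and the `window` parameter) are hypothetical choices made for this example.

```python
from bisect import bisect_left

def build_sorted_index(records, key_fn):
    """Sort (key value, record id) pairs by one sorting key (built once, offline)."""
    return sorted((key_fn(rec), rid) for rid, rec in records.items())

def query_candidates(index, key_value, window):
    """Return ids of records within `window` positions of where the query key would sort."""
    # (key_value,) sorts before every (key_value, rid) pair, so this finds
    # the first index entry whose key is >= the query key.
    pos = bisect_left(index, (key_value,))
    lo = max(0, pos - window)
    hi = min(len(index), pos + window)
    return {rid for _, rid in index[lo:hi]}

def multi_pass_candidates(indexes, key_fns, query, window=2):
    """Union the candidate sets from one pass per sorting key."""
    candidates = set()
    for index, key_fn in zip(indexes, key_fns):
        candidates |= query_candidates(index, key_fn(query), window)
    return candidates

# Hypothetical sample data; the keys below are assumptions, not learnt keys.
records = {
    1: {"first": "peter", "last": "smith", "zip": "2600"},
    2: {"first": "pete", "last": "smyth", "zip": "2600"},
    3: {"first": "anna", "last": "jones", "zip": "2000"},
    4: {"first": "mary", "last": "miller", "zip": "3000"},
    5: {"first": "paul", "last": "brown", "zip": "4000"},
}
key_fns = [
    lambda r: r["last"] + r["first"][:2],  # pass 1: surname + first-name prefix
    lambda r: r["zip"] + r["first"][:1],   # pass 2: postcode + first initial
]
indexes = [build_sorted_index(records, fn) for fn in key_fns]

query = {"first": "peter", "last": "smith", "zip": "2600"}
candidates = multi_pass_candidates(indexes, key_fns, query, window=2)
```

    Only the records in `candidates` would then be compared in detail with the query, which is how the index reduces the search space. Using several keys means a record missed under one sort order can still be retrieved under another.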

    Original language: English
    Title of host publication: Advances in Knowledge Discovery and Data Mining - 19th Pacific-Asia Conference, PAKDD 2015, Proceedings
    Editors: Tru Cao, Ee-Peng Lim, Tu-Bao Ho, Zhi-Hua Zhou, Hiroshi Motoda, David Cheung
    Publisher: Springer Verlag
    Pages: 574-585
    Number of pages: 12
    ISBN (Print): 9783319180311
    Publication status: Published - 2015
    Event: 19th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2015 - Ho Chi Minh City, Viet Nam
    Duration: 19 May 2015 to 22 May 2015

    Publication series

    Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
    Volume: 9078
    ISSN (Print): 0302-9743
    ISSN (Electronic): 1611-3349

    Conference

    Conference: 19th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2015
    Country/Territory: Viet Nam
    City: Ho Chi Minh City
    Period: 19/05/15 to 22/05/15
