Noise-tolerant approximate blocking for dynamic real-time entity resolution

Huizhi Liang, Yanzhe Wang, Peter Christen, Ross Gayler

    Research output: Contribution to journal › Conference article › peer-review

    15 Citations (Scopus)

    Abstract

    Entity resolution is the process of identifying records in one or multiple data sources that represent the same real-world entity. This process needs to deal with noisy data that contain, for example, wrong pronunciation or spelling errors. Many real-world applications require rapid responses to entity queries on dynamic datasets. This poses challenges for existing approaches, which are mainly aimed at the batch matching of records in static data. Locality sensitive hashing (LSH) is an approximate blocking approach that hashes objects within a certain distance into the same block with high probability. How to make approximate blocking approaches scalable to large datasets and effective for entity resolution in real time remains an open question. Targeting this problem, we propose a noise-tolerant approximate blocking approach that indexes records based on their distance ranges, using LSH and sorting trees within large hash blocks. Experiments conducted on both synthetic and real-world datasets show the effectiveness of the proposed approach.
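
The blocking idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes MinHash over character bigrams with banding as the LSH family, and it uses a sorted list with a fixed query window as a simple stand-in for the sorting trees kept inside large blocks. The names `LSHBlockIndex`, `bands`, `rows`, `window` and `max_block` are hypothetical and chosen only for illustration.

```python
import bisect
import hashlib
from collections import defaultdict


def bigrams(text):
    """Return the set of character bigrams of a lower-cased string."""
    s = text.lower()
    # Fall back to the whole string so very short inputs still yield a token.
    return {s[i:i + 2] for i in range(len(s) - 1)} or {s}


def minhash_signature(tokens, num_hashes):
    """One min-hash value per seeded hash function (the seed is the index)."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{t}".encode()).digest()[:8], "big")
            for t in tokens
        ))
    return tuple(sig)


class LSHBlockIndex:
    """Hypothetical dynamic index: LSH blocks whose contents are kept sorted."""

    def __init__(self, bands=8, rows=2, window=2, max_block=4):
        self.bands, self.rows = bands, rows
        self.window = window        # neighbours taken from a large block
        self.max_block = max_block  # blocks above this size use the window
        # (band number, band signature) -> sorted list of (value, record_id)
        self.blocks = defaultdict(list)

    def _keys(self, value):
        sig = minhash_signature(bigrams(value), self.bands * self.rows)
        return [(b, sig[b * self.rows:(b + 1) * self.rows])
                for b in range(self.bands)]

    def insert(self, record_id, value):
        """Add one record; record_id is assumed to be a string."""
        for key in self._keys(value):
            bisect.insort(self.blocks[key], (value.lower(), record_id))

    def query(self, value):
        """Return candidate record ids that share at least one block."""
        candidates = set()
        for key in self._keys(value):
            block = self.blocks.get(key, [])
            if len(block) <= self.max_block:
                candidates.update(rid for _, rid in block)
            else:
                # Large block: only take entries near the query in sort order.
                pos = bisect.bisect_left(block, (value.lower(), ""))
                lo, hi = max(0, pos - self.window), pos + self.window
                candidates.update(rid for _, rid in block[lo:hi])
        return candidates
```

Records can be inserted one at a time with `insert` and looked up immediately with `query`, which is what makes the index usable on a dynamic dataset. A query with a small spelling error is likely, but not guaranteed, to share at least one band with the clean record. A full system would then compare the returned candidates with a detailed similarity function.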

    Original language: English
    Pages (from-to): 449-460
    Number of pages: 12
    Journal: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
    Volume: 8444 LNAI
    Issue number: PART 2
    Publication status: Published - 2014
    Event: 18th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, PAKDD 2014 - Tainan, Taiwan
    Duration: 13 May 2014 - 16 May 2014
