Two stage similarity-aware indexing for large-scale real-time entity resolution

Shouheng Li, Huizhi Liang, Banda Ramadan

    Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

    1 Citation (Scopus)

    Abstract

    Entity resolution is the process of identifying records in one or multiple data sources that represent the same real-world entity. Finding all the records that belong to the same entity as a query record in real time is challenging for existing entity resolution approaches, especially on large-scale datasets. In this paper, we propose a two-stage similarity-aware indexing approach for large-scale real-time entity resolution. In the first stage, we use locality-sensitive hashing to filter out records with low similarities, decreasing the number of comparisons. Then, in the second stage, we pre-calculate the comparison similarities of the attribute values to further decrease the query time. Experiments conducted on a large-scale dataset with over 2 million records show the effectiveness of the proposed approach.
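The two stages described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the MinHash banding scheme, the bigram Jaccard similarity, the averaging of attribute scores, and all names (`TwoStageIndex`, `bands`, `rows`, `threshold`) are assumptions chosen for illustration. Stage one uses locality-sensitive hashing to collect only candidate records that share a bucket with the query; stage two scores candidates using attribute-value similarities that were cached when records were indexed.

```python
import hashlib
from collections import defaultdict


def bigrams(value):
    """Character bigrams of a string (the string itself if shorter than 2)."""
    s = value.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)} or {s}


def jaccard(a, b):
    """Jaccard similarity of two attribute values over their bigram sets."""
    x, y = bigrams(a), bigrams(b)
    return len(x & y) / len(x | y)


def minhash(tokens, num_hashes):
    """MinHash signature: for each seeded hash, the minimum over all tokens."""
    def h(seed, token):
        digest = hashlib.blake2b(f"{seed}:{token}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big")
    return tuple(min(h(seed, t) for t in tokens) for seed in range(num_hashes))


class TwoStageIndex:
    """Hypothetical index; records are non-empty tuples with the same attributes."""

    def __init__(self, bands=8, rows=2):
        self.bands, self.rows = bands, rows
        self.buckets = defaultdict(set)  # (band number, band signature) -> record ids
        self.records = {}                # record id -> tuple of attribute values
        self.sim = {}                    # ordered value pair -> cached similarity

    def _band_keys(self, record):
        tokens = set().union(*(bigrams(v) for v in record))
        sig = minhash(tokens, self.bands * self.rows)
        return [(b, sig[b * self.rows:(b + 1) * self.rows]) for b in range(self.bands)]

    def _similarity(self, a, b):
        key = (a, b) if a <= b else (b, a)
        if key not in self.sim:
            self.sim[key] = jaccard(a, b)
        return self.sim[key]

    def add(self, rec_id, record):
        self.records[rec_id] = record
        for key in self._band_keys(record):
            # Stage 2 preparation: cache similarities against co-bucketed records.
            for other_id in self.buckets[key]:
                for a, b in zip(record, self.records[other_id]):
                    self._similarity(a, b)
            self.buckets[key].add(rec_id)

    def query(self, record, threshold=0.5):
        # Stage 1: LSH lookup keeps only records sharing at least one bucket.
        candidates = set()
        for key in self._band_keys(record):
            candidates |= self.buckets.get(key, set())
        # Stage 2: score candidates, reusing cached value similarities if present.
        results = []
        for rec_id in candidates:
            pairs = zip(record, self.records[rec_id])
            score = sum(self._similarity(a, b) for a, b in pairs) / len(record)
            if score >= threshold:
                results.append((rec_id, score))
        return sorted(results, key=lambda item: -item[1])


index = TwoStageIndex()
index.add("r1", ("john", "smith", "sydney"))
index.add("r2", ("jon", "smith", "sydney"))
index.add("r3", ("maria", "garcia", "perth"))
matches = index.query(("john", "smith", "sydney"))
```

In this sketch an exact duplicate of an indexed record always lands in the same buckets, so it is always returned. Whether a near-duplicate such as `r2` is retrieved depends on the assumed `bands` and `rows` settings, which trade recall against the number of comparisons.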

    Original language: English
    Title of host publication: Data Mining and Analytics 2013 - Proceedings of the 11th Australasian Data Mining Conference, AusDM 2013
    Editors: Yanchang Zhao, Andrew Stranieri, Lin Liu, Paul Kennedy, Peter Christen, Kok-Leong Ong
    Publisher: Australian Computer Society
    Pages: 107-116
    Number of pages: 10
    ISBN (Electronic): 9781921770166
    Publication status: Published - 2013

    Publication series

    Name: Conferences in Research and Practice in Information Technology Series
    Volume: 146
    ISSN (Print): 1445-1336
