Towards scalable real-time entity resolution using a similarity-aware inverted index approach

Peter Christen*, Ross Gayler

*Corresponding author for this work

    Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

    18 Citations (Scopus)

    Abstract

    Most research into entity resolution (also known as record linkage or data matching) has concentrated on the quality of the matching results. In this paper, we focus on matching time and scalability, with the aim of achieving large-scale real-time entity resolution. Traditional entity resolution techniques have assumed the matching of two static databases. In our networked and online world, however, it is becoming increasingly important for many organisations to be able to conduct entity resolution between a collection of often very large databases and a stream of query or update records. The matching should be done in (near) real-time, and be as automatic and accurate as possible, returning a ranked list of matched records for each given query record. This task therefore becomes similar to querying large document collections, as done for example by Web search engines, but it is based on a different type of document: structured database records that contain, for example, personal information such as names and addresses. In this paper, we investigate inverted indexing techniques, as commonly used in Web search engines, and employ them for real-time entity resolution. We present two variations of the traditional inverted index approach, aimed at facilitating fast approximate matching. We show encouraging initial results on large real-world data sets, with the inverted index approaches being up to one hundred times faster than the traditionally used standard blocking approach. However, this improved matching speed currently comes at a cost, in that matching quality for larger data sets can be lower than when standard blocking is used, and thus more work is required.
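
The idea described in the abstract (an inverted index over record values, extended with similarity information so that approximate matches can be ranked quickly) can be illustrated with a minimal sketch. This is not the authors' implementation. The class name `SimilarityAwareIndex`, the first-character blocking key, the use of `difflib.SequenceMatcher` as the similarity measure, and the rule of summing the best similarity per query value are all assumptions made for illustration only.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def block_key(value):
    # Hypothetical blocking key: the lower-cased first character.
    # A phonetic encoding could be used here instead.
    return value[:1].lower()


def similarity(a, b):
    # Assumed similarity measure: difflib's ratio, in the range [0, 1].
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


class SimilarityAwareIndex:
    def __init__(self):
        self.records_by_value = defaultdict(set)  # value -> record ids
        self.values_by_block = defaultdict(set)   # block key -> values
        self.sim_cache = {}                       # (value, value) -> similarity

    def add(self, record_id, values):
        for value in values:
            key = block_key(value)
            if value not in self.values_by_block[key]:
                # Precompute similarities to values already in the same
                # block, so that later queries can reuse them.
                for other in self.values_by_block[key]:
                    s = similarity(value, other)
                    self.sim_cache[(value, other)] = s
                    self.sim_cache[(other, value)] = s
                self.values_by_block[key].add(value)
            self.records_by_value[value].add(record_id)

    def query(self, values, top_k=5):
        scores = defaultdict(float)
        for value in values:
            best = {}  # best similarity per record for this query value
            for candidate in self.values_by_block.get(block_key(value), ()):
                if candidate == value:
                    s = 1.0
                else:
                    s = self.sim_cache.get((value, candidate))
                    if s is None:
                        # Query value not seen at build time: compare now.
                        s = similarity(value, candidate)
                for rid in self.records_by_value[candidate]:
                    if s > best.get(rid, 0.0):
                        best[rid] = s
            for rid, s in best.items():
                scores[rid] += s
        # Highest score first; ties broken by record id.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        return ranked[:top_k]


index = SimilarityAwareIndex()
index.add("r1", ["peter", "smith"])
index.add("r2", ["pete", "smyth"])
index.add("r3", ["mary", "jones"])
result = index.query(["peter", "smith"])
```

In this sketch a query only touches values that share a block key with the query values, which is where a speed-up over comparing against every record would come from. It also hints at the quality cost the abstract mentions: a record whose values fall into different blocks than the query values is never scored.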

    Original language: English
    Title of host publication: AusDM'08 - Conferences in Research and Practice in Information Technology
    Pages: 51-60
    Number of pages: 10
    Publication status: Published - 2008
    Event: 7th Australasian Data Mining Conference, AusDM 2008 - Glenelg, SA, Australia
    Duration: 27 Nov 2008 to 28 Nov 2008

    Publication series

    Name: Conferences in Research and Practice in Information Technology Series
    Volume: 87
    ISSN (Print): 1445-1336

    Conference

    Conference: 7th Australasian Data Mining Conference, AusDM 2008
    Country/Territory: Australia
    City: Glenelg, SA
    Period: 27/11/08 to 28/11/08
