Forest-based dynamic sorted neighborhood indexing for real-time entity resolution

Banda Ramadan, Peter Christen

    Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

    20 Citations (Scopus)

    Abstract

    Real-time entity resolution (ER) is the process of matching a query record in sub-second time with records in a database that represent the same real-world entity. To facilitate real-time matching on large databases, appropriate indexing approaches are required to reduce the search space. Most available indexing techniques are based on batch algorithms that work only with static databases and are not suitable for real-time ER. In this paper, we propose a forest-based sorted neighborhood index that uses multiple index trees with different sorting keys to facilitate real-time ER for read-most databases. Our technique aims to reduce the effect of errors and variations in attribute values on matching quality by building several distinct index trees. We conduct an experimental evaluation on two large real-world data sets and on multiple synthetic data sets with various data corruption rates. The results show that our approach scales to large databases and that using multiple trees noticeably improves matching quality with only a small increase in query time. Our approach also achieves indexing and querying times more than one order of magnitude faster, as well as higher matching accuracy, than another recently proposed real-time ER technique.
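
The idea of several index trees with different sorting keys can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the tree structure, so a sorted list with binary search stands in for each tree, and the class names, the two sorting keys, the sample records, and the `window` parameter are all hypothetical. The sketch covers only candidate retrieval; comparing the candidates with the query record to decide on matches is left out.

```python
import bisect
from collections import defaultdict


class SortedKeyIndex:
    """One 'tree' of the forest: sorted key values mapped to record ids.

    Assumption: a sorted list with binary search replaces a balanced
    index tree. The sort order is the same, but inserts cost O(n).
    """

    def __init__(self, key_func):
        self.key_func = key_func         # hypothetical sorting-key function
        self.keys = []                   # sorted distinct key values
        self.buckets = defaultdict(set)  # key value -> ids of records with it

    def insert(self, rec_id, record):
        key = self.key_func(record)
        if key not in self.buckets:
            bisect.insort(self.keys, key)
        self.buckets[key].add(rec_id)

    def candidates(self, record, window=1):
        """Ids under the query key plus `window` neighbouring keys per side."""
        key = self.key_func(record)
        pos = bisect.bisect_left(self.keys, key)
        exact = pos < len(self.keys) and self.keys[pos] == key
        lo = max(0, pos - window)
        hi = min(len(self.keys), pos + window + (1 if exact else 0))
        found = set()
        for k in self.keys[lo:hi]:
            found |= self.buckets[k]
        return found


class ForestIndex:
    """Several sorted indexes, each built on a different sorting key."""

    def __init__(self, key_funcs):
        self.trees = [SortedKeyIndex(f) for f in key_funcs]

    def insert(self, rec_id, record):
        for tree in self.trees:
            tree.insert(rec_id, record)

    def query(self, record, window=1):
        # Union of the neighbourhoods found in every tree.
        found = set()
        for tree in self.trees:
            found |= tree.candidates(record, window)
        return found


forest = ForestIndex([
    lambda r: r["surname"] + r["given"],     # assumed sorting key 1
    lambda r: r["postcode"] + r["surname"],  # assumed sorting key 2
])
forest.insert(1, {"given": "peter", "surname": "smith", "postcode": "2600"})
forest.insert(2, {"given": "mary", "surname": "jones", "postcode": "2913"})
forest.insert(3, {"given": "anna", "surname": "brown", "postcode": "2000"})
forest.insert(4, {"given": "lee", "surname": "wong", "postcode": "2601"})

# The query has a typo in the first letter of the surname, so the first
# tree sorts it far away from record 1; the postcode-based tree still
# places it next to record 1.
query = {"given": "peter", "surname": "zmith", "postcode": "2600"}
candidate_ids = forest.query(query)

# A dynamic index also accepts the query record itself as a new entry.
forest.insert(5, query)
```

In this toy data the surname-based index alone returns only record 4, while the postcode-based index recovers record 1. This is the effect the abstract describes: building several distinct trees reduces the impact of an error in any single attribute value.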

    Original language: English
    Title of host publication: CIKM 2014 - Proceedings of the 2014 ACM International Conference on Information and Knowledge Management
    Publisher: Association for Computing Machinery (ACM)
    Pages: 1787-1790
    Number of pages: 4
    ISBN (Electronic): 9781450325981
    Publication status: Published - 3 Nov 2014
    Event: 23rd ACM International Conference on Information and Knowledge Management, CIKM 2014 - Shanghai, China
    Duration: 3 Nov 2014 to 7 Nov 2014

    Publication series

    Name: CIKM 2014 - Proceedings of the 2014 ACM International Conference on Information and Knowledge Management

    Conference

    Conference: 23rd ACM International Conference on Information and Knowledge Management, CIKM 2014
    Country/Territory: China
    City: Shanghai
    Period: 3/11/14 to 7/11/14
