Context-Aware Approximate String Matching for Large-Scale Real-Time Entity Resolution

Peter Christen, Ross W. Gayler

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

    1 Citation (Scopus)

    Abstract

    Techniques for approximate string matching have been widely studied over several decades. They are required in many applications, including entity resolution, spell checking, similarity joins, and biological sequence comparison. Most existing techniques for approximate string matching used in entity resolution only consider the two strings that are compared. They neglect contextual information such as how often strings occur in a database, the likelihood of the character edits between strings, or how many other similar strings there are in a database. In this paper we investigate whether incorporating such contextual information into edit-distance-based approximate string matching can improve matching quality for real-time entity resolution. In this application, query records have to be matched in sub-second time to records in a large database that refer to the same entity. We evaluate our approach on two large real data sets and compare it to several baseline approaches. Our results show that considering edit frequency and the neighborhood size of a string can improve matching results, while taking string frequencies into account can actually make results worse.

    Original language: English
    Title of host publication: Proceedings - 15th IEEE International Conference on Data Mining Workshop, ICDMW 2015
    Editors: Xindong Wu, Alexander Tuzhilin, Hui Xiong, Jennifer G. Dy, Charu Aggarwal, Zhi-Hua Zhou, Peng Cui
    Publisher: Institute of Electrical and Electronics Engineers Inc.
    Pages: 211-217
    Number of pages: 7
    ISBN (Electronic): 9781467384926
    Publication status: Published - 29 Jan 2016
    Event: 15th IEEE International Conference on Data Mining Workshop, ICDMW 2015 - Atlantic City, United States
    Duration: 14 Nov 2015 to 17 Nov 2015

    Publication series

    Name: Proceedings - 15th IEEE International Conference on Data Mining Workshop, ICDMW 2015

    Conference

    Conference: 15th IEEE International Conference on Data Mining Workshop, ICDMW 2015
    Country/Territory: United States
    City: Atlantic City
    Period: 14/11/15 to 17/11/15
