Efficient record linkage using a compact Hamming space

Dimitrios Karapiperis, Dinusha Vatsalan, Vassilios S. Verykios, Peter Christen

    Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

    14 Citations (Scopus)

    Abstract

    Record linkage, the process of identifying similar records that correspond to the same real-world entities across databases, is a well-established research problem in the database, data mining, and information retrieval communities. Computing distances between string values of records is the key component in order to determine the similarity of the represented entities. Due to the typically large volumes of records, a two-step process is followed. A blocking mechanism is first applied for grouping similar records together, and then a matching mechanism is performed for comparing the records which have been inserted into the same block. However, there does not exist any efficient blocking/matching mechanism which provides theoretical guarantees for identifying similar records which consist of strings. Towards this end, we put forth the novel notion of embedding string-based records into a Hamming space, where such a mechanism exists. The size of these embeddings is kept as small as needed in order to guarantee the correspondence of distances in that space to the types of errors that exist between strings, e.g., a missing or a modified character. We build embeddings whose size is 120 bits for representing accurately four fields of a publicly available data set. We also present a distance threshold-aware blocking technique for higher accuracy rates compared to blocking approaches which ignore the specified threshold. Our empirical study conducted on real-world data sets shows the efficacy achieved by our embedding method as compared to several existing solutions.
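The two-step process described in the abstract (embed string records into a Hamming space, block, then match under a distance threshold) can be illustrated with a minimal sketch. This is not the paper's embedding or its threshold-aware blocking technique, whose details the abstract does not give. Everything here is an assumption made for illustration: the bigram-to-bit hashing in `embed`, the 64-bit size, the bit-sampling blocking keys, and the names `bigrams`, `embed`, `hamming`, and `link`.

```python
import random
import zlib
from collections import defaultdict

N_BITS = 64  # illustrative size only; not the paper's 120-bit layout


def bigrams(s):
    """Return the set of character bigrams of a lowercased string."""
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def embed(s, n_bits=N_BITS):
    """Hypothetical embedding: set one bit per bigram, chosen by a CRC32 hash."""
    v = 0
    for g in bigrams(s):
        v |= 1 << (zlib.crc32(g.encode("utf-8")) % n_bits)
    return v


def hamming(a, b):
    """Number of bit positions in which two integer bit vectors differ."""
    return bin(a ^ b).count("1")


def link(records_a, records_b, threshold, n_tables=8, bits_per_key=6,
         n_bits=N_BITS, seed=0):
    """Block by sampled bit positions, then match by Hamming distance."""
    rng = random.Random(seed)
    emb_a = [embed(r, n_bits) for r in records_a]
    emb_b = [embed(r, n_bits) for r in records_b]
    candidates = set()
    for _ in range(n_tables):
        # Blocking step: records agreeing on the sampled bits share a block.
        positions = rng.sample(range(n_bits), bits_per_key)
        blocks = defaultdict(list)
        for i, v in enumerate(emb_a):
            blocks[tuple((v >> p) & 1 for p in positions)].append(i)
        for j, v in enumerate(emb_b):
            key = tuple((v >> p) & 1 for p in positions)
            for i in blocks.get(key, ()):
                candidates.add((i, j))
    # Matching step: keep only candidate pairs within the distance threshold.
    return sorted((i, j) for i, j in candidates
                  if hamming(emb_a[i], emb_b[j]) <= threshold)
```

In this sketch a single modified character changes at most two bigrams on each side, so the distance between the two embeddings is at most four bits. That loosely mirrors the abstract's point that distances in the Hamming space should correspond to the types of errors between strings. Because hash collisions and random bit sampling are involved, the sketch offers no theoretical guarantee of the kind the paper claims.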

    Original language: English
    Title of host publication: Advances in Database Technology - EDBT 2016
    Subtitle of host publication: 19th International Conference on Extending Database Technology, Proceedings
    Editors: Ioana Manolescu, Evaggelia Pitoura, Amelie Marian, Sofian Maabout, Letizia Tanca, Georgia Koutrika, Kostas Stefanidis
    Publisher: OpenProceedings.org
    Pages: 209-220
    Number of pages: 12
    ISBN (Electronic): 9783893180707
    Publication status: Published - 2016
    Event: 19th International Conference on Extending Database Technology, EDBT 2016 - Bordeaux, France
    Duration: 15 Mar 2016 to 18 Mar 2016

    Publication series

    Name: Advances in Database Technology - EDBT
    Volume: 2016-March
    ISSN (Electronic): 2367-2005

    Conference

    Conference: 19th International Conference on Extending Database Technology, EDBT 2016
    Country/Territory: France
    City: Bordeaux
    Period: 15/03/16 to 18/03/16
