Robust record linkage blocking using suffix arrays

Timothy De Vries*, Hui Ke, Sanjay Chawla, Peter Christen

*Corresponding author for this work

    Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

    42 Citations (Scopus)

    Abstract

    Record linkage is an important data integration task that has many practical uses for matching, merging and duplicate removal in large and diverse databases. However, a quadratic scalability for the brute force approach necessitates the design of appropriate indexing or blocking techniques. We design and evaluate an efficient and highly scalable blocking approach based on suffix arrays. Our suffix grouping technique exploits the ordering used by the index to merge similar blocks at marginal extra cost, resulting in a much higher accuracy while retaining the high scalability of the base suffix array method. Efficiently grouping similar suffixes is carried out with the use of a sliding window technique. We carry out an in-depth analysis of our method and show results from experiments using real and synthetic data, which highlights the importance of using efficient indexing and blocking in real world applications where data sets contain millions of records.

    Original languageEnglish
    Title of host publicationACM 18th International Conference on Information and Knowledge Management, CIKM 2009
    Pages305-314
    Number of pages10
    DOIs
    Publication statusPublished - 2009
    EventACM 18th International Conference on Information and Knowledge Management, CIKM 2009 - Hong Kong, China
    Duration: 2 Nov 20096 Nov 2009

    Publication series

    NameInternational Conference on Information and Knowledge Management, Proceedings

    Conference

    ConferenceACM 18th International Conference on Information and Knowledge Management, CIKM 2009
    Country/TerritoryChina
    CityHong Kong
    Period2/11/096/11/09

    Fingerprint

    Dive into the research topics of 'Robust record linkage blocking using suffix arrays'. Together they form a unique fingerprint.

    Cite this