A clustering-based framework to control block sizes for entity resolution

Jeffrey Fisher, Peter Christen, Qing Wang, Erhard Rahm

    Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

    48 Citations (Scopus)

    Abstract

    Entity resolution (ER) is a common data cleaning task that involves determining which records from one or more data sets refer to the same real-world entities. Because a pairwise comparison of all records scales quadratically with the number of records in the data sets to be matched, it is common to use blocking or indexing techniques to reduce the number of comparisons required. These techniques split the data sets into blocks, and only records within the same block are compared with each other. Most existing blocking techniques do not provide control over the size of the generated blocks, despite this control being important in many practical applications of ER, such as privacy-preserving record linkage and real-time ER. We propose two novel hierarchical clustering approaches that can generate blocks within a specified size range, and we present a penalty function that allows control of the trade-off between block quality and block size in the clustering process. We evaluate our techniques on three real-world data sets and compare them against three baseline approaches. The results show that our proposed techniques perform well on the measures of pairs completeness and reduction ratio compared to the baseline approaches, while also satisfying the block size restrictions.
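The two evaluation measures named in the abstract can be illustrated with a minimal sketch. It uses the standard definitions (reduction ratio = 1 minus candidate pairs over all possible pairs; pairs completeness = true matches kept among candidate pairs over all true matches) on a toy deduplication task. The blocking key (first three characters of the name), the sample records, and all function names are assumptions made for illustration; this is plain key-based blocking, not the hierarchical clustering approach or the penalty function proposed in the paper.

```python
from collections import defaultdict
from itertools import combinations


def build_blocks(records, key):
    """Group record ids by a blocking key (hypothetical key function)."""
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        blocks[key(rec)].append(rec_id)
    return dict(blocks)


def candidate_pairs(blocks):
    """Only records that share a block are compared with each other."""
    pairs = set()
    for ids in blocks.values():
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return pairs


def reduction_ratio(num_records, pairs):
    """Fraction of the full pairwise comparison space that blocking avoids."""
    total = num_records * (num_records - 1) // 2
    return 1.0 - len(pairs) / total


def pairs_completeness(pairs, true_matches):
    """Fraction of true matching pairs that survive blocking."""
    found = sum(1 for m in true_matches if tuple(sorted(m)) in pairs)
    return found / len(true_matches)


# Toy data: ids 1-3 refer to one person, ids 4-5 to another (assumed ground truth).
records = {
    1: "smith john",
    2: "smith jon",
    3: "smyth john",
    4: "wang qing",
    5: "wang q",
}
true_matches = {(1, 2), (1, 3), (2, 3), (4, 5)}

blocks = build_blocks(records, key=lambda name: name[:3])
pairs = candidate_pairs(blocks)

print("largest block:", max(len(ids) for ids in blocks.values()))
print("reduction ratio:", reduction_ratio(len(records), pairs))
print("pairs completeness:", pairs_completeness(pairs, true_matches))
```

In this toy case the narrow key keeps every block small and removes most comparisons, but it separates "smyth" from "smith" and so loses two of the four true matches. That is the kind of trade-off between block size and block quality that the abstract describes.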

    Original language: English
    Title of host publication: KDD 2015 - Proceedings of the 21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining
    Publisher: Association for Computing Machinery (ACM)
    Pages: 279-288
    Number of pages: 10
    ISBN (Electronic): 9781450336642
    Publication status: Published - 10 Aug 2015
    Event: 21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2015 - Sydney, Australia
    Duration: 10 Aug 2015 to 13 Aug 2015

    Publication series

    Name: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
    Volume: 2015-August

    Conference

    Conference: 21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2015
    Country/Territory: Australia
    City: Sydney
    Period: 10/08/15 to 13/08/15
