Scalable entity resolution using probabilistic signatures on parallel databases

Yuhang Zhang, Tania Churchill, Kee Siong Ng, Peter Christen

    Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

    3 Citations (Scopus)

    Abstract

    Accurate and efficient entity resolution is an open challenge of particular relevance to intelligence organisations that collect large datasets from disparate sources with differing levels of quality and standard. Starting from a first-principles formulation of entity resolution, this paper presents a novel entity resolution algorithm that introduces a data-driven blocking and record linkage technique based on the probabilistic identification of entity signatures in data. The scalability and accuracy of the proposed algorithm are evaluated using benchmark datasets and shown to achieve state-of-the-art results. The proposed algorithm can be implemented simply on modern parallel databases, which we have done in the financial intelligence domain with tens of terabytes of noisy data.

    Original language: English
    Title of host publication: CIKM 2018 - Proceedings of the 27th ACM International Conference on Information and Knowledge Management
    Editors: Norman Paton, Selcuk Candan, Haixun Wang, James Allan, Rakesh Agrawal, Alexandros Labrinidis, Alfredo Cuzzocrea, Mohammed Zaki, Divesh Srivastava, Andrei Broder, Assaf Schuster
    Publisher: Association for Computing Machinery
    Pages: 2213-2222
    Number of pages: 10
    ISBN (Electronic): 9781450360142
    Publication status: Published - 17 Oct 2018
    Event: 27th ACM International Conference on Information and Knowledge Management, CIKM 2018 - Torino, Italy
    Duration: 22 Oct 2018 to 26 Oct 2018

    Publication series

    Name: International Conference on Information and Knowledge Management, Proceedings

    Conference

    Conference: 27th ACM International Conference on Information and Knowledge Management, CIKM 2018
    Country/Territory: Italy
    City: Torino
    Period: 22/10/18 to 26/10/18
