Accurate privacy-preserving record linkage for databases with missing values

Sirintra Vaiwsri*, Thilina Ranbaduge, Peter Christen, Rainer Schnell

*Corresponding author for this work

    Research output: Contribution to journal > Article > peer-review

    3 Citations (Scopus)

    Abstract

    Privacy-preserving record linkage is the process of matching records that refer to the same entity across sensitive databases held by different organisations. This process is often challenging because no unique entity identifiers, such as social security numbers, are available in the databases to be linked. Therefore, quasi-identifying attributes, such as names and addresses, are required to identify records that are similar and likely refer to the same entity. Such quasi-identifiers, however, are often not allowed to be shared between organisations due to privacy and confidentiality concerns. Besides variations and errors in the values used for linking, quasi-identifiers can have missing values. A popular approach to link sensitive data in a privacy-preserving way is to encode quasi-identifying values into Bloom filters, bit vectors that allow approximate similarities between values to be calculated. However, with existing Bloom filter encoding approaches, missing values can lead to missed true matches because they affect the similarities calculated between Bloom filters. In this paper, we propose a novel approach to consider missing values in privacy-preserving record linkage by adapting Bloom filter encoding based on the patterns of missingness identified in the databases to be linked. We build a lattice structure of missingness patterns, and then generate partitions of Bloom filters over this lattice. In each partition, the non-missing encoded quasi-identifying attributes are assigned different weights during the Bloom filter generation process. This results in more accurate similarity calculation and better linkage quality. To improve the privacy of our approach, each partition is encoded independently, which prevents both dictionary and frequency-based attacks. We evaluate our approach on large databases that contain different amounts and patterns of missing values, showing that it can substantially outperform both Bloom filter encoding that does not consider missing values, and an earlier Bloom filter based approach for linking sensitive databases that do contain missing values.
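
The effect of missing values on Bloom filter similarities described in the abstract can be illustrated with a minimal sketch. This is not the encoding proposed in the paper: the q-gram length, the SHA-256 based hashing scheme, the filter length, the Dice coefficient as the similarity measure, and the example records are all assumptions chosen for illustration. It encodes all non-missing attribute values of a record into one Bloom filter (represented as a set of bit positions) and shows that a missing address lowers the similarity of two records that otherwise agree.

```python
import hashlib


def qgrams(value, q=2):
    """Return the set of character q-grams of a string."""
    value = value.lower().strip()
    return {value[i:i + q] for i in range(len(value) - q + 1)}


def bloom_encode(values, num_bits=1000, num_hashes=4):
    """Hash the q-grams of all non-missing values into a set of bit positions.

    The hashing scheme (SHA-256 over a salted q-gram) is an illustrative
    assumption, not the scheme used in the paper.
    """
    bits = set()
    for value in values:
        if not value:  # a missing value contributes no bits
            continue
        for gram in qgrams(value):
            for k in range(num_hashes):
                digest = hashlib.sha256(f"{k}:{gram}".encode("utf-8")).hexdigest()
                bits.add(int(digest, 16) % num_bits)
    return bits


def dice(bits_a, bits_b):
    """Dice coefficient between two Bloom filters given as sets of 1-bit positions."""
    if not bits_a and not bits_b:
        return 0.0
    return 2 * len(bits_a & bits_b) / (len(bits_a) + len(bits_b))


# Hypothetical records: first name, last name, address.
rec_a = ["peter", "miller", "12 main street"]
rec_b_complete = ["peter", "miller", "12 main street"]
rec_b_missing = ["peter", "miller", None]  # same person, address missing

sim_complete = dice(bloom_encode(rec_a), bloom_encode(rec_b_complete))
sim_missing = dice(bloom_encode(rec_a), bloom_encode(rec_b_missing))

print(f"similarity, complete record: {sim_complete:.3f}")
print(f"similarity, address missing: {sim_missing:.3f}")
```

With a fixed similarity threshold, the second pair may fall below the threshold and be classified as a non-match even though both records refer to the same entity, which is the kind of missed true match the abstract refers to.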

    Original language: English
    Article number: 101959
    Journal: Information Systems
    Volume: 106
    Publication status: Published - May 2022
