Active Learning Based Similarity Filtering for Efficient and Effective Record Linkage

Charini Nanayakkara*, Peter Christen, Thilina Ranbaduge

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

2 Citations (Scopus)

Abstract

The limited analytical value of individual databases used on their own increasingly requires the integration of large and complex databases for advanced data analytics. Linking personal medical records with travel and immigration data, for example, will allow the effective management of pandemics such as the current COVID-19 outbreak by tracking potentially infected individuals and their contacts. One major challenge for the accurate linkage of large databases is the quadratic or even higher computational complexity of many advanced linkage algorithms. In this paper we present a novel approach that, based on the expected number of true matches between two databases, applies active learning to remove compared record pairs that are likely non-matches before a computationally expensive classification or clustering algorithm is employed to classify record pairs. Unlike blocking and indexing techniques, which reduce the number of record pairs to be compared, our approach uses recursive binning on a data dimension such as time or space and removes likely non-matching record pairs in each bin after they have been compared. Experiments on two real-world databases show that similarity filtering can substantially reduce the run time and improve the precision of the final linkage results, at the cost of a small reduction in recall.
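The filtering idea in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the active learning step is replaced here by a plain top-k cut per bin, where k is derived from an assumed expected number of true matches. The names `filter_bin`, `similarity_filter`, and the `slack` parameter are hypothetical and do not appear in the paper.

```python
import math
from typing import Dict, List, Tuple

# One already-compared record pair: (record id in database A,
# record id in database B, similarity score in [0, 1]).
ComparedPair = Tuple[str, str, float]


def filter_bin(pairs: List[ComparedPair],
               expected_matches: float,
               slack: float = 1.5) -> List[ComparedPair]:
    """Keep only the highest-similarity pairs of one bin.

    Assumption: the number of pairs kept is the expected number of
    true matches in the bin times a safety factor `slack`. The paper
    uses active learning for this decision; this sketch does not.
    """
    keep = math.ceil(expected_matches * slack)
    ranked = sorted(pairs, key=lambda pair: pair[2], reverse=True)
    return ranked[:keep]


def similarity_filter(binned_pairs: Dict[str, List[ComparedPair]],
                      expected_per_bin: Dict[str, float],
                      slack: float = 1.5) -> Dict[str, List[ComparedPair]]:
    """Apply the per-bin filter to every bin (for example, one bin per time period)."""
    return {
        bin_key: filter_bin(pairs, expected_per_bin.get(bin_key, 0.0), slack)
        for bin_key, pairs in binned_pairs.items()
    }


if __name__ == "__main__":
    # Hypothetical bin keyed by month, with four compared pairs.
    binned = {
        "2020-01": [
            ("a1", "b1", 0.9),
            ("a1", "b2", 0.2),
            ("a2", "b2", 0.8),
            ("a2", "b1", 0.1),
        ]
    }
    kept = similarity_filter(binned, {"2020-01": 2}, slack=1.0)
    print(kept["2020-01"])
```

Only the pairs that survive this step would be passed on to the expensive classification or clustering stage, which is where the run-time saving described in the abstract would come from. A larger `slack` would trade some of that saving for a lower risk of discarding true matches.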

Original language: English
Title of host publication: Advances in Knowledge Discovery and Data Mining - 25th Pacific-Asia Conference, PAKDD 2021, Proceedings
Editors: Kamal Karlapalem, Hong Cheng, Naren Ramakrishnan, R. K. Agrawal, P. Krishna Reddy, Jaideep Srivastava, Tanmoy Chakraborty
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 321-333
Number of pages: 13
ISBN (Print): 9783030757649
DOIs
Publication status: Published - 2021
Event: 25th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2021 - Virtual, Online
Duration: 11 May 2021 to 14 May 2021

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 12713 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 25th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2021
City: Virtual, Online
Period: 11/05/21 to 14/05/21
