TY - GEN
T1 - Active Learning Based Similarity Filtering for Efficient and Effective Record Linkage
AU - Nanayakkara, Charini
AU - Christen, Peter
AU - Ranbaduge, Thilina
N1 - Publisher Copyright: © 2021, Springer Nature Switzerland AG.
PY - 2021
Y1 - 2021
N2 - The limited analytical value of using individual databases on their own increasingly requires the integration of large and complex databases for advanced data analytics. Linking personal medical records with travel and immigration data, for example, will allow the effective management of pandemics such as the current COVID-19 outbreak by tracking potentially infected individuals and their contacts. One major challenge for accurate linkage of large databases is the quadratic or even higher computational complexity of many advanced linkage algorithms. In this paper we present a novel approach that, based on the expected number of true matches between two databases, applies active learning to remove compared record pairs that are likely non-matches before a computationally expensive classification or clustering algorithm is employed to classify record pairs. Unlike blocking and indexing techniques, which reduce the number of record pairs to be compared, our approach uses recursive binning on a data dimension such as time or space and removes likely non-matching record pairs in each bin after their comparison. Experiments on two real-world databases show that similarity filtering can substantially reduce run time and improve the precision of the final linkage results, at the cost of a small reduction in recall.
KW - Binning
KW - Efficiency enhancement
KW - Entity resolution
UR - http://www.scopus.com/inward/record.url?scp=85111086834&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-75765-6_26
DO - 10.1007/978-3-030-75765-6_26
M3 - Conference contribution
SN - 9783030757649
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 321
EP - 333
BT - Advances in Knowledge Discovery and Data Mining - 25th Pacific-Asia Conference, PAKDD 2021, Proceedings
A2 - Karlapalem, Kamal
A2 - Cheng, Hong
A2 - Ramakrishnan, Naren
A2 - Agrawal, R. K.
A2 - Reddy, P. Krishna
A2 - Srivastava, Jaideep
A2 - Chakraborty, Tanmoy
PB - Springer Science and Business Media Deutschland GmbH
T2 - 25th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2021
Y2 - 11 May 2021 through 14 May 2021
ER -