TY - GEN
T1 - Scalable block scheduling for efficient multi-database record linkage
AU - Ranbaduge, Thilina
AU - Vatsalan, Dinusha
AU - Christen, Peter
N1 - Publisher Copyright: © 2016 IEEE.
PY - 2016/7/2
Y1 - 2016/7/2
N2 - Record linkage (RL) is a task in data integration that aims to identify matching records that refer to the same entity from different databases. When records from more than two databases are to be linked, RL is significantly challenged by the intrinsic exponential growth in the number of potential record comparisons to be conducted. We propose a scalable metablocking protocol to be used for Multi-Database RL (MDRL) to significantly reduce the complexity of the matching (comparison and classification) phase. Our approach uses a graph structure to schedule the comparison of pairs of blocks with the aim of minimizing the number of repeated and superfluous comparisons between records. We provide an analysis of our approach and conduct an empirical study on large real-world databases.
AB - Record linkage (RL) is a task in data integration that aims to identify matching records that refer to the same entity from different databases. When records from more than two databases are to be linked, RL is significantly challenged by the intrinsic exponential growth in the number of potential record comparisons to be conducted. We propose a scalable metablocking protocol to be used for Multi-Database RL (MDRL) to significantly reduce the complexity of the matching (comparison and classification) phase. Our approach uses a graph structure to schedule the comparison of pairs of blocks with the aim of minimizing the number of repeated and superfluous comparisons between records. We provide an analysis of our approach and conduct an empirical study on large real-world databases.
UR - http://www.scopus.com/inward/record.url?scp=85014517722&partnerID=8YFLogxK
U2 - 10.1109/ICDM.2016.40
DO - 10.1109/ICDM.2016.40
M3 - Conference contribution
T3 - Proceedings - IEEE International Conference on Data Mining, ICDM
SP - 1161
EP - 1166
BT - Proceedings - 16th IEEE International Conference on Data Mining, ICDM 2016
A2 - Bonchi, Francesco
A2 - Domingo-Ferrer, Josep
A2 - Baeza-Yates, Ricardo
A2 - Zhou, Zhi-Hua
A2 - Wu, Xindong
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 16th IEEE International Conference on Data Mining, ICDM 2016
Y2 - 12 December 2016 through 15 December 2016
ER -