TY - GEN
T1 - Blind data linkage using n-gram similarity comparisons
AU - Churches, Tim
AU - Christen, Peter
N1 - Publisher Copyright: © Springer-Verlag Berlin Heidelberg 2004.
PY - 2004
Y1 - 2004
AB - Integrating or linking data from different sources is an increasingly important task in the preprocessing stage of many data mining projects. The aim of such linkages is to merge all records relating to the same entity, such as a patient or a customer. If no common unique entity identifiers (keys) are available in all data sources, the linkage needs to be performed using the available identifying attributes, like names and addresses. Data confidentiality often limits or even prohibits successful data linkage, as either no consent can be gained (for example in biomedical studies) or the data holders are not willing to release their data for linkage by other parties. We present methods for confidential data linkage based on hash encoding, public key encryption and n-gram similarity comparison techniques, and show how blind data linkage can be performed.
KW - Data matching
KW - Hash encoding
KW - Privacy preserving data mining
KW - Public key infrastructure
KW - n-gram indexing
UR - http://www.scopus.com/inward/record.url?scp=7444258692&partnerID=8YFLogxK
U2 - 10.1007/978-3-540-24775-3_15
DO - 10.1007/978-3-540-24775-3_15
M3 - Conference contribution
SN - 354022064X
SN - 9783540220640
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 121
EP - 126
BT - Advances in Knowledge Discovery and Data Mining - 8th Pacific-Asia Conference, PAKDD 2004, Proceedings
A2 - Dai, Honghua
A2 - Srikant, Ramakrishnan
A2 - Zhang, Chengqi
PB - Springer Verlag
T2 - 8th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2004
Y2 - 26 May 2004 through 28 May 2004
ER -