TY - GEN
T1 - Automatic training example selection for scalable unsupervised record linkage
AU - Christen, Peter
PY - 2008
Y1 - 2008
N2 - Linking records from two or more databases is an increasingly important data preparation step in many data mining projects, as linked data can enable studies that are not feasible otherwise, or that would require expensive collection of specific data. The aim of such linkages is to match all records that refer to the same entity. One of the main challenges in record linkage is the accurate classification of record pairs into matches and non-matches. Many modern classification techniques are based on supervised machine learning and thus require training data, which is often not available in real-world situations. A novel two-step approach to unsupervised record pair classification is presented in this paper. In the first step, training examples are selected automatically, and they are then used in the second step to train a binary classifier. An experimental evaluation shows that this approach can outperform k-means clustering and also be much faster than other classification techniques.
AB - Linking records from two or more databases is an increasingly important data preparation step in many data mining projects, as linked data can enable studies that are not feasible otherwise, or that would require expensive collection of specific data. The aim of such linkages is to match all records that refer to the same entity. One of the main challenges in record linkage is the accurate classification of record pairs into matches and non-matches. Many modern classification techniques are based on supervised machine learning and thus require training data, which is often not available in real-world situations. A novel two-step approach to unsupervised record pair classification is presented in this paper. In the first step, training examples are selected automatically, and they are then used in the second step to train a binary classifier. An experimental evaluation shows that this approach can outperform k-means clustering and also be much faster than other classification techniques.
KW - Clustering
KW - Data linkage
KW - Data mining preprocessing
KW - Entity resolution
KW - Support vector machines
UR - http://www.scopus.com/inward/record.url?scp=44649093306&partnerID=8YFLogxK
U2 - 10.1007/978-3-540-68125-0_45
DO - 10.1007/978-3-540-68125-0_45
M3 - Conference contribution
SN - 3540681248
SN - 9783540681243
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 511
EP - 518
BT - Advances in Knowledge Discovery and Data Mining - 12th Pacific-Asia Conference, PAKDD 2008, Proceedings
T2 - 12th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2008
Y2 - 20 May 2008 through 23 May 2008
ER -