TY - GEN
T1 - Automatic record linkage using seeded nearest neighbour and support vector machine classification
AU - Christen, Peter
PY - 2008
Y1 - 2008
N2 - The task of linking databases is an important step in an increasing number of data mining projects, because linked data can contain information that is not available otherwise, or that would require time-consuming and expensive collection of specific data. The aim of linking is to match and aggregate all records that refer to the same entity. One of the major challenges when linking large databases is the efficient and accurate classification of record pairs into matches and non-matches. While traditionally classification was based on manually-set thresholds or on statistical procedures, many of the more recently developed classification methods are based on supervised learning techniques. They therefore require training data, which is often not available in real world situations or has to be prepared manually, an expensive, cumbersome and time-consuming process. The author has previously presented a novel two-step approach to automatic record pair classification [6, 7]. In the first step of this approach, training examples of high quality are automatically selected from the compared record pairs, and used in the second step to train a support vector machine (SVM) classifier. Initial experiments showed the feasibility of the approach, achieving results that outperformed k-means clustering. In this paper, two variations of this approach are presented. The first is based on a nearestneighbour classifier, while the second improves a SVM classifier by iteratively adding more examples into the training sets. Experimental results show that this two-step approach can achieve better classification results than other unsupervised approaches.
AB - The task of linking databases is an important step in an increasing number of data mining projects, because linked data can contain information that is not available otherwise, or that would require time-consuming and expensive collection of specific data. The aim of linking is to match and aggregate all records that refer to the same entity. One of the major challenges when linking large databases is the efficient and accurate classification of record pairs into matches and non-matches. While traditionally classification was based on manually-set thresholds or on statistical procedures, many of the more recently developed classification methods are based on supervised learning techniques. They therefore require training data, which is often not available in real world situations or has to be prepared manually, an expensive, cumbersome and time-consuming process. The author has previously presented a novel two-step approach to automatic record pair classification [6, 7]. In the first step of this approach, training examples of high quality are automatically selected from the compared record pairs, and used in the second step to train a support vector machine (SVM) classifier. Initial experiments showed the feasibility of the approach, achieving results that outperformed k-means clustering. In this paper, two variations of this approach are presented. The first is based on a nearestneighbour classifier, while the second improves a SVM classifier by iteratively adding more examples into the training sets. Experimental results show that this two-step approach can achieve better classification results than other unsupervised approaches.
KW - Data linkage
KW - Data matching
KW - Deduplication
KW - Entity resolution
KW - Nearest neighbour
KW - Support vector machine
UR - http://www.scopus.com/inward/record.url?scp=65449139594&partnerID=8YFLogxK
U2 - 10.1145/1401890.1401913
DO - 10.1145/1401890.1401913
M3 - Conference contribution
SN - 9781605581934
T3 - Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
SP - 151
EP - 159
BT - KDD 2008 - Proceedings of the 14th ACMKDD International Conference on Knowledge Discovery and Data Mining
T2 - 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2008
Y2 - 24 August 2008 through 27 August 2008
ER -