Automatic record linkage using seeded nearest neighbour and support vector machine classification

Peter Christen*

*Corresponding author for this work

    Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

    134 Citations (Scopus)

    Abstract

    The task of linking databases is an important step in an increasing number of data mining projects, because linked data can contain information that is not available otherwise, or that would require time-consuming and expensive collection of specific data. The aim of linking is to match and aggregate all records that refer to the same entity. One of the major challenges when linking large databases is the efficient and accurate classification of record pairs into matches and non-matches. While traditionally classification was based on manually-set thresholds or on statistical procedures, many of the more recently developed classification methods are based on supervised learning techniques. They therefore require training data, which is often not available in real-world situations or has to be prepared manually, an expensive, cumbersome and time-consuming process. The author has previously presented a novel two-step approach to automatic record pair classification [6, 7]. In the first step of this approach, training examples of high quality are automatically selected from the compared record pairs, and used in the second step to train a support vector machine (SVM) classifier. Initial experiments showed the feasibility of the approach, achieving results that outperformed k-means clustering. In this paper, two variations of this approach are presented. The first is based on a nearest neighbour classifier, while the second improves an SVM classifier by iteratively adding more examples into the training sets. Experimental results show that this two-step approach can achieve better classification results than other unsupervised approaches.
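
The two-step idea in the abstract can be illustrated with a small sketch of a seeded nearest neighbour classifier. This is not the paper's algorithm. It is a minimal reading of the abstract, assuming each compared record pair is represented by a vector of similarity values in [0, 1]. The function names (`select_seeds`, `seeded_nearest_neighbour`), the seed thresholds (0.9 and 0.1) and the Euclidean distance are all assumptions made for illustration.

```python
from math import dist  # Euclidean distance, Python 3.8+


def select_seeds(vectors, match_thresh=0.9, non_match_thresh=0.1):
    """Step 1 (assumed rule): treat vectors whose similarities are all very
    high as match seeds and those that are all very low as non-match seeds."""
    matches, non_matches, unlabelled = [], [], []
    for v in vectors:
        if all(x >= match_thresh for x in v):
            matches.append(v)
        elif all(x <= non_match_thresh for x in v):
            non_matches.append(v)
        else:
            unlabelled.append(v)
    return matches, non_matches, unlabelled


def seeded_nearest_neighbour(vectors, **thresholds):
    """Step 2 (assumed rule): repeatedly take the unlabelled vector that lies
    closest to any labelled vector, give it that vector's class, and add it
    to the labelled set so that it can influence later decisions."""
    matches, non_matches, unlabelled = select_seeds(vectors, **thresholds)
    if not matches or not non_matches:
        raise ValueError("need at least one seed in each class")
    while unlabelled:
        best = None  # (distance, index in unlabelled, is_match)
        for i, v in enumerate(unlabelled):
            d_match = min(dist(v, m) for m in matches)
            d_non = min(dist(v, n) for n in non_matches)
            if d_match <= d_non:
                candidate = (d_match, i, True)
            else:
                candidate = (d_non, i, False)
            if best is None or candidate[0] < best[0]:
                best = candidate
        _, i, is_match = best
        v = unlabelled.pop(i)
        (matches if is_match else non_matches).append(v)
    return matches, non_matches


# Hypothetical similarity vectors for five compared record pairs.
pairs = [
    (1.0, 1.0, 0.95),  # clear match seed
    (0.0, 0.05, 0.0),  # clear non-match seed
    (0.8, 0.9, 0.7),
    (0.2, 0.1, 0.3),
    (0.6, 0.7, 0.6),
]
matches, non_matches = seeded_nearest_neighbour(pairs)
```

The sketch is quadratic in the number of record pairs per round, so it only shows the idea on a toy input. The SVM variation mentioned in the abstract would replace the nearest neighbour step with a classifier trained on the seeds and retrained as further examples are added.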

    Original language: English
    Title of host publication: KDD 2008 - Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
    Pages: 151-159
    Number of pages: 9
    Publication status: Published - 2008
    Event: 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2008 - Las Vegas, NV, United States
    Duration: 24 Aug 2008 to 27 Aug 2008

    Publication series

    Name: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

    Conference

    Conference: 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2008
    Country/Territory: United States
    City: Las Vegas, NV
    Period: 24/08/08 to 27/08/08
