Efficient entity resolution with adaptive and interactive training data selection

Peter Christen, Dinusha Vatsalan, Qing Wang

    Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

    16 Citations (Scopus)

    Abstract

    Entity resolution (ER) is the task of deciding which records in one or more databases refer to the same real-world entities. A crucial step in ER is the accurate classification of pairs of records into matches and non-matches. In most practical ER applications, obtaining training data of high quality is costly and time consuming. Various techniques have been proposed for ER to interactively generate training data and learn an accurate classifier. We propose an approach for training data selection for ER that exploits the cluster structure of the weight vectors (similarities) calculated from compared record pairs. Our approach adaptively selects an optimal number of informative training examples for manual labeling based on a user-defined sampling error margin, and recursively splits the set of weight vectors to find subsets that are sufficiently pure for training. We consider two aspects of ER that are highly significant in practice: a limited budget for the amount of manual labeling that can be done, and a noisy oracle whose manual labels might be incorrect. Experiments on four real public data sets show that our approach can significantly reduce manual labeling efforts for training an ER classifier while achieving matching quality comparable to that of fully supervised classifiers.
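    As a rough illustration of the two quantities the abstract mentions, a sample size driven by an error margin and the purity of a labeled subset, the following sketch may help. It is not taken from the paper: it assumes the textbook Cochran sample-size formula with finite population correction, and the names `sample_size`, `purity`, `margin`, `z`, and `p` are hypothetical choices made for this example only.

```python
import math
from collections import Counter


def sample_size(population, margin, z=1.96, p=0.5):
    """Number of items to label from `population` weight vectors.

    Assumption: Cochran's formula with finite population correction,
    used here as a stand-in; the paper's exact formula may differ.
    z=1.96 corresponds to roughly 95% confidence, p=0.5 is the
    most conservative guess for the match proportion.
    """
    if population <= 0:
        raise ValueError("population must be positive")
    n0 = (z ** 2) * p * (1 - p) / (margin ** 2)   # infinite-population size
    n = n0 / (1 + (n0 - 1) / population)          # finite population correction
    return min(population, math.ceil(n))


def purity(labels):
    """Fraction of the labeled sample that carries the majority label."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


if __name__ == "__main__":
    # 10,000 weight vectors in a subset, 5% sampling error margin.
    print(sample_size(10_000, 0.05))
    # Three matches and one non-match among the manually labeled pairs.
    print(purity([True, True, True, False]))
```

    Under these assumptions, a subset whose labeled sample has purity above some chosen threshold could be added to the training set as a whole, while an impure subset would be split further and sampled again.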

    Original language: English
    Title of host publication: Proceedings - 15th IEEE International Conference on Data Mining, ICDM 2015
    Editors: Charu Aggarwal, Zhi-Hua Zhou, Alexander Tuzhilin, Hui Xiong, Xindong Wu
    Publisher: Institute of Electrical and Electronics Engineers Inc.
    Pages: 727-732
    Number of pages: 6
    ISBN (Electronic): 9781467395038
    Publication status: Published - 5 Jan 2016
    Event: 15th IEEE International Conference on Data Mining, ICDM 2015 - Atlantic City, United States
    Duration: 14 Nov 2015 to 17 Nov 2015

    Publication series

    Name: Proceedings - IEEE International Conference on Data Mining, ICDM
    Volume: 2016-January
    ISSN (Print): 1550-4786

    Conference

    Conference: 15th IEEE International Conference on Data Mining, ICDM 2015
    Country/Territory: United States
    City: Atlantic City
    Period: 14/11/15 to 17/11/15
