Probabilistic data generation for deduplication and data linkage

Peter Christen*

*Corresponding author for this work

    Research output: Contribution to journal, Conference article, peer-review

    35 Citations (Scopus)


    In many data mining projects the data to be analysed contains personal information, like names and addresses. Cleaning and pre-processing of such data likely involves deduplication or linkage with other data, which is often challenged by a lack of unique entity identifiers. In recent years there has been an increased research effort in data linkage and deduplication, mainly in the machine learning and database communities. Publicly available test data with known deduplication or linkage status is needed so that new linkage algorithms and techniques can be tested, evaluated and compared. However, publication of data containing personal information is normally impossible due to privacy and confidentiality issues. An alternative is to use artificially created data, which has the advantages that content and error rates can be controlled, and the deduplication or linkage status is known. Controlled experiments can be performed and replicated easily. In this paper we present a freely available data set generator capable of creating data sets containing names, addresses and other personal information.
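The idea of artificial data with controllable error rates and a known duplicate status can be illustrated with a minimal Python sketch. This is not the generator presented in the paper. The field names, the value pools, the `rec-N-org` / `rec-N-dup-M` identifier scheme and the single-character error model are all assumptions made for illustration only.

```python
import random

# Hypothetical value pools (assumption); a realistic generator would draw
# values from frequency tables of names and addresses.
GIVEN_NAMES = ["peter", "mary", "john", "sarah"]
SURNAMES = ["smith", "jones", "nguyen", "williams"]
SUBURBS = ["canberra", "brisbane", "sydney", "hobart"]
FIELDS = ("given_name", "surname", "suburb")


def make_original(rec_num, rng):
    """Create one clean record whose identifier marks it as an original."""
    return {
        "rec_id": f"rec-{rec_num}-org",
        "given_name": rng.choice(GIVEN_NAMES),
        "surname": rng.choice(SURNAMES),
        "suburb": rng.choice(SUBURBS),
    }


def corrupt(value, rng):
    """Apply one random character edit: delete, substitute or transpose.

    The result can occasionally equal the input, for example when a
    letter is substituted by itself.
    """
    if len(value) < 2:
        return value
    pos = rng.randrange(len(value) - 1)
    op = rng.choice(["delete", "substitute", "transpose"])
    if op == "delete":
        return value[:pos] + value[pos + 1:]
    if op == "substitute":
        letter = rng.choice("abcdefghijklmnopqrstuvwxyz")
        return value[:pos] + letter + value[pos + 1:]
    return value[:pos] + value[pos + 1] + value[pos] + value[pos + 2:]


def make_duplicate(original, dup_num, error_rate, rng):
    """Copy an original and corrupt each field with probability error_rate."""
    dup = dict(original)
    dup["rec_id"] = original["rec_id"].replace("-org", f"-dup-{dup_num}")
    for field in FIELDS:
        if rng.random() < error_rate:
            dup[field] = corrupt(dup[field], rng)
    return dup


def generate(num_originals, num_duplicates, error_rate, seed=42):
    """Return originals followed by duplicates; the seed makes runs repeatable."""
    rng = random.Random(seed)
    originals = [make_original(i, rng) for i in range(num_originals)]
    duplicates = [
        make_duplicate(rng.choice(originals), d, error_rate, rng)
        for d in range(num_duplicates)
    ]
    return originals + duplicates


if __name__ == "__main__":
    for record in generate(num_originals=5, num_duplicates=3, error_rate=0.5):
        print(record)
```

Because every duplicate carries the number of its source record in `rec_id`, the true match status of any record pair can be read directly from the identifiers, and a fixed seed lets an experiment be repeated on identical data.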

    Original language: English
    Pages (from-to): 109-116
    Number of pages: 8
    Journal: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
    Publication status: Published - 2005
    Event: 6th International Conference on Intelligent Data Engineering and Automated Learning - IDEAL 2005 - Brisbane, Australia
    Duration: 6 Jul 2005 to 8 Jul 2005

