Febrl – A parallel open source data linkage system

Peter Christen*, Tim Churches, Markus Hegland

*Corresponding author for this work

    Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

    80 Citations (Scopus)

    Abstract

    In many data mining projects, information from multiple data sources needs to be integrated, combined or linked in order to allow more detailed analysis. The aim of such linkages is to merge all records relating to the same entity, such as a patient or a customer. Most of the time the linkage process is challenged by the lack of a common unique entity identifier, and thus becomes non-trivial. Linking today's large data collections becomes increasingly difficult using traditional linkage techniques. In this paper we present an innovative data linkage system called Febrl, which includes a new probabilistic approach for improved data cleaning and standardisation, innovative indexing methods, a parallelisation approach which is implemented transparently to the user, and a data set generator which allows the random creation of records containing names and addresses. Implemented as open source software, Febrl is an ideal experimental platform for new linkage algorithms and techniques.

    Original language: English
    Title of host publication: Advances in Knowledge Discovery and Data Mining - 8th Pacific-Asia Conference, PAKDD 2004, Proceedings
    Editors: Honghua Dai, Ramakrishnan Srikant, Chengqi Zhang
    Publisher: Springer Verlag
    Pages: 638-647
    Number of pages: 10
    ISBN (Print): 354022064X, 9783540220640
    Publication status: Published - 2004
    Event: 8th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2004 - Sydney, Australia
    Duration: 26 May 2004 to 28 May 2004

    Publication series

    Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
    Volume: 3056
    ISSN (Print): 0302-9743
    ISSN (Electronic): 1611-3349

    Conference

    Conference: 8th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2004
    Country/Territory: Australia
    City: Sydney
    Period: 26/05/04 to 28/05/04
