Febrl - An open source data cleaning, deduplication and record linkage system with a graphical user interface

Peter Christen*

*Corresponding author for this work

    Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

    126 Citations (Scopus)

    Abstract

    Matching records that refer to the same entity across databases is becoming an increasingly important part of many data mining projects, as often data from multiple sources needs to be matched in order to enrich data or improve its quality. Significant advances in record linkage techniques have been made in recent years. However, many new techniques are either implemented in research proof-of-concept systems only, or they are hidden within expensive 'black box' commercial software. This makes it difficult for both researchers and practitioners to experiment with new record linkage techniques, and to compare existing techniques with new ones. The Febrl (Freely Extensible Biomedical Record Linkage) system aims to fill this gap. It contains many recently developed techniques for data cleaning, deduplication and record linkage, and encapsulates them into a graphical user interface (GUI). Febrl thus allows even inexperienced users to learn and experiment with both traditional and new record linkage techniques. Because Febrl is written in Python and its source code is available, it is fairly easy to integrate new record linkage techniques into it. Therefore, Febrl can be seen as a tool that allows researchers to compare various existing record linkage techniques with their own, enabling the record linkage research community to conduct their work more efficiently. Additionally, Febrl is suitable as a training tool for new record linkage users, and it can also be used for practical linkage projects with data sets that contain up to several hundred thousand records.

    Original language: English
    Title of host publication: KDD 2008 - Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
    Pages: 1065-1068
    Number of pages: 4
    Publication status: Published - 2008
    Event: 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2008 - Las Vegas, NV, United States
    Duration: 24 Aug 2008 to 27 Aug 2008

    Publication series

    Name: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

    Conference

    Conference: 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2008
    Country/Territory: United States
    City: Las Vegas, NV
    Period: 24/08/08 to 27/08/08
