TY - GEN
T1 - Febrl – A parallel open source data linkage system
AU - Christen, Peter
AU - Churches, Tim
AU - Hegland, Markus
N1 - Publisher Copyright: © Springer-Verlag Berlin Heidelberg 2004.
PY - 2004
Y1 - 2004
N2 - In many data mining projects, information from multiple data sources needs to be integrated, combined or linked in order to allow more detailed analysis. The aim of such linkages is to merge all records relating to the same entity, such as a patient or a customer. Most of the time the linkage process is challenged by the lack of a common unique entity identifier, and thus becomes non-trivial. Linking today's large data collections becomes increasingly difficult using traditional linkage techniques. In this paper we present an innovative data linkage system called Febrl, which includes a new probabilistic approach for improved data cleaning and standardisation, innovative indexing methods, a parallelisation approach which is implemented transparently to the user, and a data set generator which allows the random creation of records containing names and addresses. Implemented as open source software, Febrl is an ideal experimental platform for new linkage algorithms and techniques.
AB - In many data mining projects, information from multiple data sources needs to be integrated, combined or linked in order to allow more detailed analysis. The aim of such linkages is to merge all records relating to the same entity, such as a patient or a customer. Most of the time the linkage process is challenged by the lack of a common unique entity identifier, and thus becomes non-trivial. Linking today's large data collections becomes increasingly difficult using traditional linkage techniques. In this paper we present an innovative data linkage system called Febrl, which includes a new probabilistic approach for improved data cleaning and standardisation, innovative indexing methods, a parallelisation approach which is implemented transparently to the user, and a data set generator which allows the random creation of records containing names and addresses. Implemented as open source software, Febrl is an ideal experimental platform for new linkage algorithms and techniques.
KW - Data cleaning and standardisation
KW - Data matching
KW - Data mining preprocessing
KW - Parallel processing
KW - Record linkage
UR - http://www.scopus.com/inward/record.url?scp=7444251738&partnerID=8YFLogxK
U2 - 10.1007/978-3-540-24775-3_75
DO - 10.1007/978-3-540-24775-3_75
M3 - Conference contribution
SN - 354022064X
SN - 9783540220640
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 638
EP - 647
BT - Advances in Knowledge Discovery and Data Mining - 8th Pacific-Asia Conference, PAKDD 2004, Proceedings
A2 - Dai, Honghua
A2 - Srikant, Ramakrishnan
A2 - Zhang, Chengqi
PB - Springer Verlag
T2 - 8th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2004
Y2 - 26 May 2004 through 28 May 2004
ER -