TY - GEN
T1 - Automatic cleaning and linking of historical census data using household information
AU - Fu, Zhichun
AU - Christen, Peter
AU - Boot, Mac
PY - 2011
Y1 - 2011
N2 - Historical census data capture information about our ancestors. These data contain the social status at a certain point in time. They contain valuable information for genealogists, historians, and social scientists. Historical census data can be used to reconstruct important aspects of a particular era in order to trace the changes in households and families. Record linkage across different historical census datasets can help to improve the quality of the data, enrich existing census data with additional information, and facilitate improved retrieval of information. In this paper, we introduce a domain-driven approach to automatically clean and link historical census data using recent developments in group linkage techniques. The key contribution of our approach is to first detect households, and to use this information to refine the cleaned data and improve the accuracy of linking records between census datasets. We have developed a two-step linking approach which first links individual records using approximate string similarity measures, and then performs a group linking based on the previously detected households. The results show that this approach is effective and can greatly reduce the manual effort required for data cleaning and linking by social scientists.
AB - Historical census data capture information about our ancestors. These data contain the social status at a certain point in time. They contain valuable information for genealogists, historians, and social scientists. Historical census data can be used to reconstruct important aspects of a particular era in order to trace the changes in households and families. Record linkage across different historical census datasets can help to improve the quality of the data, enrich existing census data with additional information, and facilitate improved retrieval of information. In this paper, we introduce a domain-driven approach to automatically clean and link historical census data using recent developments in group linkage techniques. The key contribution of our approach is to first detect households, and to use this information to refine the cleaned data and improve the accuracy of linking records between census datasets. We have developed a two-step linking approach which first links individual records using approximate string similarity measures, and then performs a group linking based on the previously detected households. The results show that this approach is effective and can greatly reduce the manual effort required for data cleaning and linking by social scientists.
KW - Data cleaning
KW - Domain knowledge
KW - Group linking
KW - Historical census data
KW - Record linkage
UR - http://www.scopus.com/inward/record.url?scp=84857170753&partnerID=8YFLogxK
U2 - 10.1109/ICDMW.2011.35
DO - 10.1109/ICDMW.2011.35
M3 - Conference contribution
SN - 9780769544090
T3 - Proceedings - IEEE International Conference on Data Mining, ICDM
SP - 413
EP - 420
BT - Proceedings - 11th IEEE International Conference on Data Mining Workshops, ICDMW 2011
T2 - 11th IEEE International Conference on Data Mining Workshops, ICDMW 2011
Y2 - 11 December 2011 through 11 December 2011
ER -