TY - GEN
T1 - Outlier detection based accurate geocoding of historical addresses
AU - Kirielle, Nishadi
AU - Christen, Peter
AU - Ranbaduge, Thilina
N1 - Publisher Copyright: © Springer Nature Singapore Pte Ltd. 2019.
PY - 2019
Y1 - 2019
N2 - Research in the social sciences is increasingly based on large and complex databases, such as historical birth, marriage, death, and census records. Such databases can be analyzed individually to investigate, for example, changes in education, health, and emigration over time. Many of these historical databases contain addresses, and assigning geographical locations (latitude and longitude) to them, a process known as geocoding, provides a foundation for a wide range of studies based on spatial data analysis. Furthermore, geocoded records can be employed to enhance record linkage processes, where family trees for whole populations can be constructed. However, a challenge in geocoding historical addresses is that they might have changed over time and are therefore only partially, or not at all, available in modern geocoding systems. In this paper, we present a novel method to geocode historical addresses that first uses an online geocoding service to retrieve geocodes. For those addresses where multiple geocodes are returned, we employ outlier detection to improve the accuracy of the locations assigned to them, while for addresses where no geocode is found, for example due to spelling variations, we employ approximate string matching to identify the most likely correct spelling along with the corresponding geocode. Experiments on two real historical data sets, one from Scotland and the other from Finland, show that our method can reduce the number of addresses with multiple geocodes by over 80% and increase the number of addresses that go from no geocode to a single geocode by up to 31% compared to an online geocoding service.
KW - Geocode matching
KW - Open Street Map
KW - String comparison
UR - http://www.scopus.com/inward/record.url?scp=85076698361&partnerID=8YFLogxK
U2 - 10.1007/978-981-15-1699-3_4
DO - 10.1007/978-981-15-1699-3_4
M3 - Conference contribution
AN - SCOPUS:85076698361
SN - 9789811516986
T3 - Communications in Computer and Information Science
SP - 41
EP - 53
BT - Data Mining - 17th Australasian Conference, AusDM 2019, Proceedings
A2 - Le, Thuc D.
A2 - Liu, Lin
A2 - Ong, Kok-Leong
A2 - Zhao, Yanchang
A2 - Jin, Warren H.
A2 - Wong, Sebastien
A2 - Williams, Graham
PB - Springer
T2 - 17th Australasian Conference on Data Mining, AusDM 2019
Y2 - 2 December 2019 through 5 December 2019
ER -