TY - GEN
T1 - Automated probabilistic address standardisation and verification
AU - Christen, Peter
AU - Belacic, Daniel
PY - 2005
Y1 - 2005
N2 - Addresses are a key part of many records containing information about people and organisations, and it is therefore important that accurate address information is available before such data is mined or stored in data warehouses. Unfortunately, addresses are often captured in non-standard and free-text formats, usually with some degree of spelling and typographical errors. Additionally, addresses change over time, for example when people move, when streets are renamed, or when new suburbs are built. Cleaning and standardising addresses, as well as verifying if they really exist, are therefore important steps in data mining pre-processing. In this paper we present an automated probabilistic approach based on a hidden Markov model (HMM), which uses national address guidelines and a comprehensive national address database to clean, standardise and verify raw input addresses. Initial experiments show that our system can correctly standardise even complex and unusual addresses.
KW - Address cleaning and standardisation
KW - Data mining pre-processing
KW - G-NAF
KW - Postal address guidelines
KW - Hidden Markov model
UR - http://www.scopus.com/inward/record.url?scp=84857149294&partnerID=8YFLogxK
M3 - Conference contribution
SN - 1863657169
SN - 9781863657167
T3 - AusDM 2005 Proc. - 4th Australasian Data Mining Conf. - Collocated with the 18th Australian Joint Conf. on Artificial Intelligence, AI 2005 and the 2nd Australian Conf. on Artificial Life, ACAL 2005
SP - 53
EP - 67
BT - AusDM 2005 Proc. - 4th Australasian Data Mining Conf. - Collocated with the 18th Australian Joint Conf. on Artificial Intelligence, AI 2005 and the 2nd Australian Conf. on Artificial Life, ACAL 2005
T2 - 4th Australasian Data Mining Conference, AusDM 2005 - Collocated with the 18th Australian Joint Conference on Artificial Intelligence, AI 2005 and the 2nd Australian Conference on Artificial Life, ACAL 2005
Y2 - 5 December 2005 through 6 December 2005
ER -