TY - GEN
T1 - Efficient record linkage using a compact Hamming space
AU - Karapiperis, Dimitrios
AU - Vatsalan, Dinusha
AU - Verykios, Vassilios S.
AU - Christen, Peter
N1 - Publisher Copyright:
© 2016, Copyright is with the authors.
PY - 2016
Y1 - 2016
N2 - Record linkage, the process of identifying similar records that correspond to the same real-world entities across databases, is a well-established research problem in the database, data mining, and information retrieval communities. Computing distances between string values of records is the key component in order to determine the similarity of the represented entities. Due to the typically large volumes of records, a two-step process is followed. A blocking mechanism is first applied for grouping similar records together, and then a matching mechanism is performed for comparing the records which have been inserted into the same block. However, there does not exist any efficient blocking/matching mechanism which provides theoretical guarantees for identifying similar records which consist of strings. Towards this end, we put forth the novel notion of embedding string-based records into a Hamming space, where such a mechanism exists. The size of these embeddings is kept as small as needed in order to guarantee the correspondence of distances in that space to the types of errors that exist between strings, e.g., a missing or a modified character. We build embeddings whose size is 120 bits for representing accurately four fields of a publicly available data set. We also present a distance threshold-aware blocking technique for higher accuracy rates compared to blocking approaches which ignore the specified threshold. Our empirical study conducted on real-world data sets shows the efficacy achieved by our embedding method as compared to several existing solutions.
AB - Record linkage, the process of identifying similar records that correspond to the same real-world entities across databases, is a well-established research problem in the database, data mining, and information retrieval communities. Computing distances between string values of records is the key component in order to determine the similarity of the represented entities. Due to the typically large volumes of records, a two-step process is followed. A blocking mechanism is first applied for grouping similar records together, and then a matching mechanism is performed for comparing the records which have been inserted into the same block. However, there does not exist any efficient blocking/matching mechanism which provides theoretical guarantees for identifying similar records which consist of strings. Towards this end, we put forth the novel notion of embedding string-based records into a Hamming space, where such a mechanism exists. The size of these embeddings is kept as small as needed in order to guarantee the correspondence of distances in that space to the types of errors that exist between strings, e.g., a missing or a modified character. We build embeddings whose size is 120 bits for representing accurately four fields of a publicly available data set. We also present a distance threshold-aware blocking technique for higher accuracy rates compared to blocking approaches which ignore the specified threshold. Our empirical study conducted on real-world data sets shows the efficacy achieved by our embedding method as compared to several existing solutions.
UR - http://www.scopus.com/inward/record.url?scp=85043728883&partnerID=8YFLogxK
U2 - 10.5441/002/edbt.2016.21
DO - 10.5441/002/edbt.2016.21
M3 - Conference contribution
T3 - Advances in Database Technology - EDBT
SP - 209
EP - 220
BT - Advances in Database Technology - EDBT 2016
A2 - Manolescu, Ioana
A2 - Pitoura, Evaggelia
A2 - Marian, Amelie
A2 - Maabout, Sofian
A2 - Tanca, Letizia
A2 - Koutrika, Georgia
A2 - Stefanidis, Kostas
PB - OpenProceedings.org
T2 - 19th International Conference on Extending Database Technology, EDBT 2016
Y2 - 15 March 2016 through 18 March 2016
ER -