Dynamic similarity-aware inverted indexing for real-time entity resolution

Banda Ramadan, Peter Christen, Huizhi Liang, Ross W. Gayler, David Hawking

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

20 Citations (Scopus)

Abstract

Entity resolution is the process of identifying groups of records in one or more data sources that represent the same real-world entity. It is an important tool in data de-duplication, in linking records across databases, and in matching query records against a database of existing entities. Most existing entity resolution techniques complete the resolution process offline and on static databases. However, real-world databases are often dynamic, and organizations increasingly need to resolve entities in real time. Thus, there is a need for new techniques that support working with dynamic databases in real time. In this paper, we propose a dynamic similarity-aware inverted indexing technique (DySimII) that meets these requirements. We also propose a frequency-filtered indexing technique in which only the most frequent attribute values are indexed. We experimentally evaluate our techniques on a large real-world voter database. The results show that, as the index grows, there is no appreciable increase in the average record insertion time (around 0.1 msec) or in the average query time (less than 0.1 sec). We also find that applying the frequency-filtered approach reduces the index size with only a slight drop in recall.
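
The abstract does not describe the internal data structures of DySimII, so the following is only a minimal Python sketch of the general idea of a dynamic, similarity-aware inverted index, not the authors' implementation. It assumes that each attribute value maps to the set of record identifiers containing it, that similarities between values are computed once at insertion time and stored, and that values are only compared when they share a blocking key. The class name `SimilarityAwareIndex`, the first-character blocking key, the `difflib`-based similarity measure, and the 0.7 threshold are all hypothetical choices made for illustration. Frequency filtering is not shown.

```python
from collections import defaultdict
from difflib import SequenceMatcher


class SimilarityAwareIndex:
    """Toy dynamic similarity-aware inverted index (illustrative only)."""

    def __init__(self, min_sim=0.7):
        self.min_sim = min_sim            # assumed similarity threshold
        self.records = defaultdict(set)   # attribute value -> record ids
        self.blocks = defaultdict(set)    # blocking key -> attribute values
        self.sims = defaultdict(dict)     # value -> {similar value: similarity}

    @staticmethod
    def _block_key(value):
        # Hypothetical blocking key: the first character of the value.
        return value[:1]

    def _similar(self, value):
        # Compare the value against indexed values in the same block only.
        result = {}
        for other in self.blocks.get(self._block_key(value), ()):
            sim = SequenceMatcher(None, value, other).ratio()
            if sim >= self.min_sim:
                result[other] = sim
        return result

    def insert(self, rec_id, values):
        for value in values:
            value = value.lower()
            if value not in self.records:
                # First time this value is seen: store its similarities
                # so that later queries can reuse them.
                for other, sim in self._similar(value).items():
                    self.sims[value][other] = sim
                    self.sims[other][value] = sim
                self.blocks[self._block_key(value)].add(value)
            self.records[value].add(rec_id)

    def query(self, values):
        # Accumulate a similarity score per candidate record.
        scores = defaultdict(float)
        for value in values:
            value = value.lower()
            if value in self.records:
                candidates = dict(self.sims.get(value, {}))
                candidates[value] = 1.0   # exact match
            else:
                candidates = self._similar(value)
            for other, sim in candidates.items():
                for rec_id in self.records[other]:
                    scores[rec_id] += sim
        return sorted(scores.items(), key=lambda kv: -kv[1])


index = SimilarityAwareIndex()
index.insert("r1", ["smith", "sydney"])
index.insert("r2", ["smyth", "sydney"])
index.insert("r3", ["jones", "perth"])
results = index.query(["smith", "sydney"])
print(results)
```

In this sketch, inserting a record whose values are already indexed only adds an identifier to existing sets, and a query for an indexed value reuses stored similarities rather than recomputing them. In the usage example, `r1` ranks ahead of `r2`, and `r3` is not returned because it shares no similar values with the query.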

Original language: English
Title of host publication: Trends and Applications in Knowledge Discovery and Data Mining - PAKDD 2013 International Workshops
Subtitle of host publication: DMApps, DANTH, QIMIE, BDM, CDA, CloudSD, Revised Selected Papers
Pages: 47-58
Number of pages: 12
Publication status: Published - 2013
Event: 17th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2013 - Gold Coast, QLD, Australia
Duration: 14 Apr 2013 → 17 Apr 2013

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 7867 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 17th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2013
Country/Territory: Australia
City: Gold Coast, QLD
Period: 14/04/13 → 17/04/13
