TY - GEN
T1 - Towards scalable real-time entity resolution using a similarity-aware inverted index approach
AU - Christen, Peter
AU - Gayler, Ross
PY - 2008
Y1 - 2008
N2 - Most research into entity resolution (also known as record linkage or data matching) has concentrated on the quality of the matching results. In this paper, we focus on matching time and scalability, with the aim to achieve large-scale real-time entity resolution. Traditional entity resolution techniques have assumed the matching of two static databases. In our networked and online world, however, it is becoming increasingly important for many organisations to be able to conduct entity resolution between a collection of often very large databases and a stream of query or update records. The matching should be done in (near) real-time, and be as automatic and accurate as possible, returning a ranked list of matched records for each given query record. This task therefore becomes similar to querying large document collections, as done for example by Web search engines, however based on a different type of documents: structured database records that, for example, contain personal information, such as names and addresses. In this paper, we investigate inverted indexing techniques, as commonly used in Web search engines, and employ them for real-time entity resolution. We present two variations of the traditional inverted index approach, aimed at facilitating fast approximate matching. We show encouraging initial results on large real-world data sets, with the inverted index approaches being up to one hundred times faster than the traditionally used standard blocking approach. However, this improved matching speed currently comes at a cost, in that matching quality for larger data sets can be lower compared to when standard blocking is used, and thus more work is required.
AB - Most research into entity resolution (also known as record linkage or data matching) has concentrated on the quality of the matching results. In this paper, we focus on matching time and scalability, with the aim to achieve large-scale real-time entity resolution. Traditional entity resolution techniques have assumed the matching of two static databases. In our networked and online world, however, it is becoming increasingly important for many organisations to be able to conduct entity resolution between a collection of often very large databases and a stream of query or update records. The matching should be done in (near) real-time, and be as automatic and accurate as possible, returning a ranked list of matched records for each given query record. This task therefore becomes similar to querying large document collections, as done for example by Web search engines, however based on a different type of documents: structured database records that, for example, contain personal information, such as names and addresses. In this paper, we investigate inverted indexing techniques, as commonly used in Web search engines, and employ them for real-time entity resolution. We present two variations of the traditional inverted index approach, aimed at facilitating fast approximate matching. We show encouraging initial results on large real-world data sets, with the inverted index approaches being up to one hundred times faster than the traditionally used standard blocking approach. However, this improved matching speed currently comes at a cost, in that matching quality for larger data sets can be lower compared to when standard blocking is used, and thus more work is required.
KW - Approximate string comparisons
KW - Data matching
KW - Record linkage
KW - Scalability
KW - Similarity measures
UR - http://www.scopus.com/inward/record.url?scp=67650216370&partnerID=8YFLogxK
M3 - Conference contribution
SN - 9781920682682
T3 - Conferences in Research and Practice in Information Technology Series
SP - 51
EP - 60
BT - AusDM'08 - Conferences in Research and Practice in Information Technology
T2 - 7th Australasian Data Mining Conference, AusDM 2008
Y2 - 27 November 2008 through 28 November 2008
ER -