Data cleaning and matching of institutions in bibliographic databases

Jeffrey Fisher*, Qing Wang, Paul Wong, Peter Christen

*Corresponding author for this work

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

    1 Citation (Scopus)

    Abstract

    Bibliographic databases are very important for a variety of tasks for governments, academic institutions and businesses. These include assessing research output of institutions, performance evaluation of academics and compiling university rankings. However, incorrect or incomplete data in such databases can compromise any analysis and lead to poor decisions and financial loss. In this paper we detail our experience with an entity resolution project on Australian institution data using the SCOPUS bibliographic database. The goal of the project was to improve the entity resolution of institution data in SCOPUS so it could be used more effectively in other applications. We detail the methodology including a novel approach for extracting correct institution names from the values of one of the attributes. Along with the results from the project we present our insights into the specific characteristics and difficulties of the Australian institution data, and some techniques that were effective in addressing these. Finally, we present our conclusions and describe other situations where our experience and techniques could be applied.

    Original language: English
    Title of host publication: Data Mining and Analytics 2013 - Proceedings of the 11th Australasian Data Mining Conference, AusDM 2013
    Editors: Yanchang Zhao, Andrew Stranieri, Lin Liu, Paul Kennedy, Peter Christen, Kok-Leong Ong
    Publisher: Australian Computer Society
    Pages: 139-148
    Number of pages: 10
    ISBN (Electronic): 9781921770166
    Publication status: Published - 2013

    Publication series

    Name: Conferences in Research and Practice in Information Technology Series
    Volume: 146
    ISSN (Print): 1445-1336
