Flexible and extensible generation and corruption of personal data

Peter Christen, Dinusha Vatsalan

    Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

    34 Citations (Scopus)

    Abstract

    With much of today's data being generated by people or referring to people, researchers increasingly require data that contain personal identifying information to evaluate their new algorithms. In areas such as record matching and de-duplication, fraud detection, cloud computing, and health informatics, issues such as data entry errors, typographical mistakes, noise, or recording variations, can all significantly affect the outcomes of data integration, processing, and mining projects. However, privacy concerns make it challenging to obtain real data that contain personal details. An alternative to using sensitive real data is to create synthetic data which follow similar characteristics. The advantages of synthetic data are that (1) they can be generated with well defined characteristics; (2) it is known which records represent an individual created entity (this is often unknown in real data); and (3) the generated data and the generator program itself can be published. We present a sophisticated data generation and corruption tool that allows the creation of various types of data, ranging from names and addresses, dates, social security and credit card numbers, to numerical values such as salary or blood pressure. Our tool can model dependencies between attributes, and it allows the corruption of values in various ways. We describe the overall architecture and main components of our tool, and illustrate how a user can easily extend this tool with novel functionalities.
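    The generate-then-corrupt idea described in the abstract can be sketched in a few lines of Python. This is not the authors' tool or its API: every name below (generate_record, corrupt_string, make_duplicate, generate_dataset, the lookup tables, and the salary distribution) is a hypothetical stand-in chosen for illustration. The sketch assumes one attribute dependency (given name depends on gender), a single character-level corruption per duplicate, and an entity identifier that records which rows refer to the same created entity.

```python
import random
import string

# Hypothetical lookup tables; a real generator would likely load
# frequency tables from files instead of hard-coding a few values.
GIVEN_NAMES = {"f": ["emma", "olivia", "sophie"], "m": ["james", "liam", "noah"]}
SURNAMES = ["smith", "jones", "nguyen", "taylor"]


def generate_record(rng, entity_id):
    """Create one clean record; the given name is drawn conditional on gender."""
    gender = rng.choice(["f", "m"])
    return {
        "entity_id": entity_id,  # ground truth: which entity this row represents
        "gender": gender,
        "given_name": rng.choice(GIVEN_NAMES[gender]),
        "surname": rng.choice(SURNAMES),
        # Assumed distribution, for illustration only.
        "salary": round(rng.gauss(60000, 15000), 2),
    }


def corrupt_string(rng, value):
    """Apply one random character edit: insert, delete, substitute, or transpose.

    The result can occasionally equal the input (for example when two
    identical neighbouring letters are transposed).
    """
    if len(value) < 2:
        return value
    op = rng.choice(["insert", "delete", "substitute", "transpose"])
    pos = rng.randrange(len(value) - 1)
    letter = rng.choice(string.ascii_lowercase)
    if op == "insert":
        return value[:pos] + letter + value[pos:]
    if op == "delete":
        return value[:pos] + value[pos + 1:]
    if op == "substitute":
        return value[:pos] + letter + value[pos + 1:]
    # transpose the characters at pos and pos + 1
    return value[:pos] + value[pos + 1] + value[pos] + value[pos + 2:]


def make_duplicate(rng, record):
    """Copy a record and corrupt one of its name fields."""
    dup = dict(record)
    field = rng.choice(["given_name", "surname"])
    dup[field] = corrupt_string(rng, dup[field])
    return dup


def generate_dataset(num_entities, seed=42):
    """Return clean originals followed by one corrupted duplicate of each."""
    rng = random.Random(seed)
    originals = [generate_record(rng, i) for i in range(num_entities)]
    duplicates = [make_duplicate(rng, r) for r in originals]
    return originals + duplicates
```

    Calling generate_dataset(5) returns ten rows: five originals and five duplicates that share entity_id values with them, so a record-matching algorithm can be scored against known ground truth. Seeding the random number generator keeps the output reproducible, which is what makes both the data and the generator publishable together.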

    Original language: English
    Title of host publication: CIKM 2013 - Proceedings of the 22nd ACM International Conference on Information and Knowledge Management
    Pages: 1165-1168
    Number of pages: 4
    Publication status: Published - 2013
    Event: 22nd ACM International Conference on Information and Knowledge Management, CIKM 2013 - San Francisco, CA, United States
    Duration: 27 Oct 2013 to 1 Nov 2013

    Publication series

    Name: International Conference on Information and Knowledge Management, Proceedings

    Conference

    Conference: 22nd ACM International Conference on Information and Knowledge Management, CIKM 2013
    Country/Territory: United States
    City: San Francisco, CA
    Period: 27/10/13 to 1/11/13
