Context-aware detection of sneaky vandalism on Wikipedia across multiple languages

Khoi Nguyen Tran*, Peter Christen, Scott Sanner, Lexing Xie

*Corresponding author for this work

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

    3 Citations (Scopus)

    Abstract

    The malicious modification of articles, termed vandalism, is a serious problem for open access encyclopedias such as Wikipedia. Wikipedia's counter-vandalism bots and past vandalism detection research have greatly reduced the exposure and damage of common and obvious types of vandalism. However, there remain increasingly sneaky types of vandalism that are clearly out of context in the sentence or article. In this paper, we propose a novel context-aware and cross-language vandalism detection technique that scales to the size of the full Wikipedia and extends the types of vandalism detectable beyond past feature-based approaches. Our technique uses word dependencies to identify vandal words in sentences by combining part-of-speech tagging with a conditional random fields classifier. We evaluate our technique on two Wikipedia data sets: the PAN data sets with over 62,000 edits, commonly used by related research; and our own vandalism repairs data sets with over 500 million edits of over 9 million articles from five languages. As a comparison, we implement a feature-based classifier to analyse the quality of each classification technique and the trade-offs of each type of classifier. Our results show how context-aware detection techniques can become a new counter-vandalism tool for Wikipedia that complements current feature-based techniques.
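
The abstract names part-of-speech tagging combined with a conditional random fields (CRF) classifier but gives no implementation detail. As a rough illustration only, the sketch below shows how per-token context features could be built from POS tags, which is the kind of input a linear-chain CRF typically consumes. Everything in it is an assumption made for illustration: the helper names `token_features` and `sentence_features`, the feature keys, the hand-assigned Penn Treebank-style tags, and the example sentence. It is not the authors' feature set, and it stops short of training a CRF.

```python
from typing import Dict, List, Tuple

# (word, POS tag). Tags are hand-assigned here for illustration; a real
# pipeline would obtain them from a POS tagger for the relevant language.
TaggedToken = Tuple[str, str]


def token_features(sentence: List[TaggedToken], i: int) -> Dict[str, str]:
    """Describe token i by its own POS tag and the tags of its neighbours."""
    word, tag = sentence[i]
    # Sentinel tags mark the sentence boundaries.
    prev_tag = sentence[i - 1][1] if i > 0 else "<BOS>"
    next_tag = sentence[i + 1][1] if i < len(sentence) - 1 else "<EOS>"
    return {
        "word.lower": word.lower(),
        "tag": tag,
        "prev_tag": prev_tag,
        "next_tag": next_tag,
        # The tag trigram captures the local grammatical context of the word.
        "tag_trigram": f"{prev_tag}|{tag}|{next_tag}",
    }


def sentence_features(sentence: List[TaggedToken]) -> List[Dict[str, str]]:
    """Build one feature dict per token, in sentence order."""
    return [token_features(sentence, i) for i in range(len(sentence))]


# Hypothetical edited sentence in which "smelly" is the out-of-context word.
example: List[TaggedToken] = [
    ("Canberra", "NNP"),
    ("is", "VBZ"),
    ("the", "DT"),
    ("smelly", "JJ"),
    ("capital", "NN"),
]
features = sentence_features(example)
```

Feature dicts like these, paired with a per-token label such as "vandal" or "regular", would then be passed to a sequence labeller. Which CRF implementation and which features the paper actually uses are not stated in this record.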

    Original language: English
    Title of host publication: Advances in Knowledge Discovery and Data Mining - 19th Pacific-Asia Conference, PAKDD 2015, Proceedings
    Editors: Tu-Bao Ho, Hiroshi Motoda, Ee-Peng Lim, Tru Cao, David Cheung, Zhi-Hua Zhou
    Publisher: Springer Verlag
    Pages: 380-391
    Number of pages: 12
    ISBN (Print): 9783319180373
    Publication status: Published - 2015
    Event: 19th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2015 - Ho Chi Minh City, Viet Nam
    Duration: 19 May 2015 to 22 May 2015

    Publication series

    Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
    Volume: 9077
    ISSN (Print): 0302-9743
    ISSN (Electronic): 1611-3349

    Conference

    Conference: 19th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2015
    Country/Territory: Viet Nam
    City: Ho Chi Minh City
    Period: 19/05/15 to 22/05/15
