Fast, Consistent Tokenization of Natural Language Text

Lincoln Mullen, Kenneth Benoit, Os Keyes, Dmitry Selivanov, Jeffrey Arnold

    Research output: Contribution to journal › Article

    Abstract

    Computational text analysis usually proceeds according to a series of well-defined steps. After importing texts, the usual next step is to turn the human-readable text into machine-readable tokens. Tokens are defined as segments of a text identified as meaningful units for the purpose of analyzing the text. They may consist of individual words or of larger or smaller segments, such as word sequences, word subsequences, paragraphs, sentences, or lines (Manning, Raghavan, and Schütze 2008, 22). Tokenization is the process of splitting the text into these smaller pieces, and it often involves preprocessing the text to remove punctuation and transform all tokens into lowercase (Welbers, Van Atteveldt, and Benoit 2017, 250–51). Decisions made during tokenization have a significant effect on subsequent analysis (Denny and Spirling 2018; D. Guthrie et al. 2006). Especially for large corpora, tokenization can be computationally expensive, and it is highly language dependent. Efficiency and correctness are therefore paramount concerns for tokenization.
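
    As a minimal illustration of the steps described above (lowercasing, removing punctuation, and splitting into word tokens), here is a short Python sketch. It is not the implementation the paper describes; the function name and the regular expression are assumptions made for illustration only.

        import re

        def tokenize_words(text):
            """Minimal word tokenizer: lowercase, drop punctuation, split on whitespace.

            Illustrative sketch only, not the paper's implementation.
            """
            text = text.lower()                    # transform to lowercase
            text = re.sub(r"[^\w\s]", " ", text)   # replace punctuation with spaces
            return text.split()                    # split on whitespace into word tokens

        print(tokenize_words("Tokenization is the process of splitting the text."))
        # ['tokenization', 'is', 'the', 'process', 'of', 'splitting', 'the', 'text']

    A real tokenizer would also handle language-specific details (contractions, hyphenation, scripts without whitespace word boundaries), which is why the abstract stresses that tokenization is highly language dependent.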
    Original language: English
    Pages (from-to): 1-3
    Journal: The Journal of Open Source Software
    Volume: 3
    Issue number: 23
    DOI
    Publication status: Published - 2018
