TY - JOUR
T1 - Fast, Consistent Tokenization of Natural Language Text
AU - Mullen, Lincoln
AU - Benoit, Kenneth
AU - Keyes, Os
AU - Selivanov, Dmitry
AU - Arnold, Jeffrey
PY - 2018
Y1 - 2018
AB - Computational text analysis usually proceeds according to a series of well-defined steps. After importing texts, the usual next step is to turn the human-readable text into machine-readable tokens. Tokens are defined as segments of a text identified as meaningful units for the purpose of analyzing the text. They may consist of individual words or of larger or smaller segments, such as word sequences, word subsequences, paragraphs, sentences, or lines (Manning, Raghavan, and Schütze 2008, 22). Tokenization is the process of splitting the text into these smaller pieces, and it often involves preprocessing the text to remove punctuation and transform all tokens into lowercase (Welbers, Van Atteveldt, and Benoit 2017, 250-51). Decisions made during tokenization have a significant effect on subsequent analysis (Denny and Spirling 2018; D. Guthrie et al. 2006). Especially for large corpora, tokenization can be computationally expensive, and it is highly language-dependent. Efficiency and correctness are therefore paramount concerns for tokenization.
DO - 10.21105/joss.00655
M3 - Article
VL - 3
SP - 1
EP - 3
JO - The Journal of Open Source Software
JF - The Journal of Open Source Software
IS - 23
ER -