TY - JOUR
T1 - Weight of authorship evidence with multiple categories of stylometric features: A multinomial-based discrete model
AU - Ishihara, Shunichi
N1 - Publisher Copyright: © 2022 The Chartered Society of Forensic Sciences
PY - 2023/3
Y1 - 2023/3
AB - This study empirically demonstrates the efficacy of a two-level Dirichlet-multinomial statistical model (the Multinomial system) for computing likelihood ratios (LRs) for linguistic, textual evidence with multiple stylometric feature types with discrete values. The LRs are calculated separately for each feature type, namely, word, character and part-of-speech N-grams (N = 1, 2, 3), and are then combined into overall LRs through logistic regression fusion. The Multinomial system's performance is compared with that of a previously proposed system based on the cosine distance (the Cosine system) using the same data (i.e., documents collated from 2160 authors). The experimental results show that: (1) the Multinomial system outperforms the Cosine system with the fused feature types by a log-LR cost of ca. 0.01 to 0.05 bits; and (2) the Multinomial system gains more in performance from longer documents than the Cosine system does. Although the Cosine system is more robust overall against the sampling variability arising from the number of authors included in the reference and calibration databases, the Multinomial system can achieve reasonable stability in performance; for example, the standard deviation of the log-LR cost falls below 0.01 (10 random samplings of authors for the reference and calibration databases) with 60 or more authors in each database.
KW - Forensic text evidence
KW - Likelihood ratio
KW - Log-likelihood ratio cost
KW - Logistic regression calibration and fusion
KW - Multiple types of stylometric discrete features
KW - Two-level Dirichlet-multinomial model
UR - http://www.scopus.com/inward/record.url?scp=85147088400&partnerID=8YFLogxK
U2 - 10.1016/j.scijus.2022.12.007
DO - 10.1016/j.scijus.2022.12.007
M3 - Article
SN - 1355-0306
VL - 63
SP - 181
EP - 199
JO - Science and Justice
JF - Science and Justice
IS - 2
ER -