TY - GEN
T1 - A Quest for Paradigm Coverage
T2 - 2nd Workshop on NLP Applications to Field Linguistics, FieldMatters 2023
AU - Muradoğlu, Saliha
AU - Suominen, Hanna
AU - Evans, Nicholas
N1 - Publisher Copyright: © 2023 Association for Computational Linguistics.
PY - 2023
Y1 - 2023
AB - Language documentation aims to collect a representative corpus of the language. Nevertheless, the question of how to quantify the comprehensiveness of the collection persists. We propose leveraging computational modelling to provide a supplementary metric to address this question in a low-resource language setting. We apply our proposed methods to the Papuan language Nen. Nen is actively in the process of being described and documented. Given the enormity of the task of language documentation, we focus on one subdomain, namely Nen verbal morphology. This study examines four verb types: copula, positional, middle, and transitive. We propose model-based paradigm generation for each verb type as a new way to measure completeness, where accuracy is analogous to the coverage of the paradigm. We contrast the paradigm attestation within the corpus (constructed from fieldwork data) and the accuracy of the paradigm generated by Transformer models trained for inflection. This analysis is extended by extrapolating from the learning curve established to provide predictions for the quantity of data required to generate a complete paradigm correctly. We also explore the correlation between high-frequency morphosyntactic features and model accuracy. We see a positive correlation between high-frequency feature combinations and model accuracy, but this is only sometimes the case. We also see high accuracy for low-frequency morphosyntactic features. Our results show that model coverage is significantly higher for the middle and transitive verbs but not the positional verb. This is an interesting finding, as the positional verb paradigm is the smallest of the four.
UR - https://www.scopus.com/pages/publications/85184663531
M3 - Conference Paper
AN - SCOPUS:85184663531
T3 - FieldMatters 2023 - 2nd Workshop on NLP Applications to Field Linguistics, Proceedings
BT - FieldMatters 2023 - 2nd Workshop on NLP Applications to Field Linguistics, Proceedings
A2 - Serikov, Oleg
A2 - Voloshina, Ekaterina
A2 - Postnikova, Anna
A2 - Klyachko, Elena
A2 - Vylomova, Ekaterina
A2 - Shavrina, Tatiana
A2 - Le Ferrand, Eric
A2 - Malykh, Valentin
A2 - Tyers, Francis
A2 - Arkhangelskiy, Timofey
A2 - Mikhailov, Vladislav
PB - Association for Computational Linguistics (ACL)
Y2 - 6 May 2023
ER -