Eeny, meeny, miny, moe. How to choose data for morphological inflection

Saliha Muradoglu, Mans Hulden

    Research output: Contribution to conference, Paper, peer-review

    6 Citations (Scopus)

    Abstract

    Data scarcity is a widespread problem in numerous natural language processing (NLP) tasks for low-resource languages. Within morphology, the labour-intensive work of tagging/glossing data is a serious bottleneck for both NLP and language documentation. Active learning (AL) aims to reduce the cost of data annotation by selecting the data that is most informative for improving the model. In this paper, we explore four sampling strategies for the task of morphological inflection using a Transformer model: a pair of oracle experiments where data is chosen based on whether the model already can or cannot inflect the test forms correctly, and strategies based on high/low model confidence, entropy, and random selection. We investigate the robustness of each strategy across 30 typologically diverse languages. We also perform a more in-depth case study of Natügu. Our results show a clear benefit to selecting data based on model confidence and entropy. Unsurprisingly, the oracle experiment in which only incorrectly handled forms are chosen for further training, presented as a proxy for linguist/language consultant feedback, shows the most improvement. This is followed closely by choosing low-confidence and high-entropy predictions. We also show that, despite the conventional wisdom that larger data sets yield better accuracy, introducing more instances of high-confidence or low-entropy forms, or forms that the model can already inflect correctly, can reduce model performance.
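The confidence-, entropy-, and random-based selection described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the abstract does not define how confidence or entropy is computed, so the sketch takes confidence to be the summed log-probability of the most likely character at each decoding step and entropy to be the mean per-step Shannon entropy. The names `sequence_confidence`, `mean_entropy`, `select_for_annotation`, and the `"dists"` field are hypothetical. The oracle strategies are left out because they require gold forms.

```python
import math
import random


def sequence_confidence(step_dists):
    """Summed log-probability of the top character at each decoding step.

    Assumption: greedy decoding, so the predicted character is the argmax.
    """
    return sum(math.log(max(dist)) for dist in step_dists)


def mean_entropy(step_dists):
    """Mean Shannon entropy (in nats) of the per-step output distributions."""
    entropies = [
        -sum(p * math.log(p) for p in dist if p > 0.0) for dist in step_dists
    ]
    return sum(entropies) / len(entropies)


def select_for_annotation(pool, k, strategy, seed=0):
    """Pick k items from the unlabelled pool under one sampling strategy.

    Each pool item is a dict with a hypothetical "dists" field holding the
    model's per-step probability distributions for its predicted form.
    """
    if strategy == "random":
        return random.Random(seed).sample(pool, k)
    if strategy == "low_confidence":
        ranked = sorted(pool, key=lambda x: sequence_confidence(x["dists"]))
    elif strategy == "high_confidence":
        ranked = sorted(
            pool, key=lambda x: sequence_confidence(x["dists"]), reverse=True
        )
    elif strategy == "high_entropy":
        ranked = sorted(pool, key=lambda x: mean_entropy(x["dists"]), reverse=True)
    elif strategy == "low_entropy":
        ranked = sorted(pool, key=lambda x: mean_entropy(x["dists"]))
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return ranked[:k]
```

In an active-learning loop of this kind, the selected items would be annotated, added to the training set, and the model retrained before the next round of selection.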

    Original language: English
    Pages: 7294-7303
    Number of pages: 10
    Publication status: Published - Dec 2022
    Event: 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022 - Abu Dhabi, United Arab Emirates
    Duration: 7 Dec 2022 to 11 Dec 2022

    Conference

    Conference: 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022
    Country/Territory: United Arab Emirates
    City: Abu Dhabi
    Period: 7/12/22 to 11/12/22
