Automated speech tools for helping communities process restricted-access corpora for language revival efforts

Nay San*, Martijn Bartelds, Tolúlopé Ògúnrèmí, Alison Mount, Ruben Thompson, Michael Higgins, Roy Barker, Jane Simpson, Dan Jurafsky

*Corresponding author for this work

    Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

    5 Citations (Scopus)

    Abstract

    Many archival recordings of speech from endangered languages remain unannotated and inaccessible to community members and language learning programs. One bottleneck is the time-intensive nature of annotation. An even narrower bottleneck occurs for recordings with access constraints, such as language that must be vetted or filtered by authorised community members before annotation can begin. We propose a privacy-preserving workflow to widen both bottlenecks for recordings where speech in the endangered language is intermixed with a more widely used language such as English for metalinguistic commentary and questions (e.g. What is the word for 'tree'?). We integrate voice activity detection (VAD), spoken language identification (SLI), and automatic speech recognition (ASR) to transcribe the metalinguistic content, which an authorised person can quickly scan to triage recordings that can be annotated by people with lower levels of access. We report work in progress on processing 136 hours of archival audio containing a mix of English and Muruwari. Our collaborative work with the Muruwari custodian of the archival materials shows that this workflow reduces metalanguage transcription time by 20% even with minimal amounts of annotated training data: 10 utterances per language for SLI and, for ASR, at most 39 minutes and possibly as little as 39 seconds.

    Original language: English
    Title of host publication: COMPUTEL 2022 - 5th Workshop on the Use of Computational Methods in the Study of Endangered Languages, Proceedings of the Workshop
    Editors: Sarah Moeller, Antonios Anastasopoulos, Antti Arppe, Aditi Chaudhary, Atticus Harrigan, Josh Holden, Jordan Lachler, Alexis Palmer, Shruti Rijhwani, Lane Schwartz
    Publisher: Association for Computational Linguistics (ACL)
    Pages: 41-51
    Number of pages: 11
    ISBN (Electronic): 9781955917308
    Publication status: Published - 2022
    Event: 5th Workshop on the Use of Computational Methods in the Study of Endangered Languages, COMPUTEL 2022 - Dublin, Ireland
    Duration: 26 May 2022 - 27 May 2022

    Publication series

    Name: COMPUTEL 2022 - 5th Workshop on the Use of Computational Methods in the Study of Endangered Languages, Proceedings of the Workshop

    Conference

    Conference: 5th Workshop on the Use of Computational Methods in the Study of Endangered Languages, COMPUTEL 2022
    Country/Territory: Ireland
    City: Dublin
    Period: 26/05/22 - 27/05/22
