Common Voice and accent choice: data contributors self-describe their spoken accents in diverse ways

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

    1 Citation (Scopus)

    Abstract

    The use of machine learning (ML)-powered speech technologies has increased significantly in recent years [40, 56, 72]. The datasets used for training speech models often represent demographic features of the speaker, such as gender, age, and accent. These axes are frequently used to evaluate the training set and model for bias [52]. Here, we focus on how accent is represented in voice data, given the adverse consequences of accent bias. We perform document analysis on several voice datasets to identify how accents are currently represented. We then analyse and visualise speaker-described accents from Mozilla's Common Voice (CV) v13 English dataset, forming an emergent taxonomy of accent descriptors. We repeat this process using the CV v13 Kiswahili dataset, demonstrating that the taxonomy has use beyond English. We find that accents are currently represented in ways that are geographically, and predominantly nationally, bound. While this pattern is also shown in speaker-described accents from CV, a more diverse set of descriptors is revealed. This work provides some early evidence for re-thinking how accents are represented in datasets intended for ML applications. Our tooling is open-sourced, and we invite further work that uses our taxonomy to assess accent bias in speech data and models.

    Original language: English
    Title of host publication: Proceedings of 2023 ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization, EAAMO 2023
    Publisher: Association for Computing Machinery (ACM)
    ISBN (Electronic): 9798400703812
    Publication status: Published - 30 Oct 2023
    Event: 2023 ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization, EAAMO 2023 - Boston, United States
    Duration: 30 Oct 2023 to 1 Nov 2023

    Publication series

    Name: ACM International Conference Proceeding Series

    Conference

    Conference: 2023 ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization, EAAMO 2023
    Country/Territory: United States
    City: Boston
    Period: 30/10/23 to 1/11/23
