OpenWHO: A Document-Level Parallel Corpus for Health Translation in Low-Resource Languages

Raphaël Merx, Hanna Suominen, Trevor Cohn, Ekaterina Vylomova

Research output: Chapter in Book/Report/Conference proceeding › Conference Paper › peer-review

Abstract

In machine translation (MT), health is a high-stakes domain characterised by widespread deployment and domain-specific vocabulary. However, there is a lack of MT evaluation datasets for low-resource languages in this domain. To address this gap, we introduce OpenWHO, a document-level parallel corpus of 2,978 documents and 26,824 sentences from the World Health Organization's e-learning platform. Sourced from expert-authored, professionally translated materials shielded from web-crawling, OpenWHO spans a diverse range of over 20 languages, of which nine are low-resource. Leveraging this new resource, we evaluate modern large language models (LLMs) against traditional MT models. Our findings reveal that LLMs consistently outperform traditional MT models, with Gemini 2.5 Flash achieving a +4.79 ChrF point improvement over NLLB-54B on our low-resource test set. Further, we investigate how LLM context utilisation affects accuracy, finding that the benefits of document-level translation are most pronounced in specialised domains like health. We release the OpenWHO corpus to encourage further research into low-resource MT in the health domain.

Original language: English
Title of host publication: WMT 2025 - 10th Conference on Machine Translation, Proceedings of the Conference
Editors: Barry Haddow, Tom Kocmi, Philipp Koehn, Christof Monz
Publisher: Association for Computational Linguistics (ACL)
Pages: 142-160
Number of pages: 19
ISBN (Electronic): 9798891763418
DOIs
Publication status: Published - 2025
Event: 10th Conference on Machine Translation, WMT 2025 - Suzhou, China
Duration: 8 Nov 2025 - 9 Nov 2025

Publication series

Name: Conference on Machine Translation - Proceedings
ISSN (Electronic): 2768-0983

Conference

Conference: 10th Conference on Machine Translation, WMT 2025
Country/Territory: China
City: Suzhou
Period: 8/11/25 - 9/11/25
