Open Language Data Initiative



Dataset card for Mauritian Creole

Description

FLORES-200 dev and devtest set in Mauritian Creole.
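As a rough sanity check after loading the data, the split sizes can be compared against the standard FLORES-200 sizes (997 sentences in dev, 1012 in devtest). The sketch below assumes this release covers both splits in full and that each split has already been read into a list of strings. The card does not specify the file format, so loading is left out, and the name `check_split` is hypothetical.

```python
# Standard FLORES-200 split sizes (sentences per split).
# Assumption: this release translates both splits in full.
EXPECTED_SIZES = {"dev": 997, "devtest": 1012}


def check_split(name, sentences):
    """Return a list of problems found in one split (empty list means none found)."""
    problems = []
    expected = EXPECTED_SIZES[name]
    if len(sentences) != expected:
        problems.append(
            f"{name}: expected {expected} sentences, got {len(sentences)}"
        )
    # Flag blank entries, which usually indicate a loading or alignment issue.
    for i, sentence in enumerate(sentences, start=1):
        if not sentence.strip():
            problems.append(f"{name}: sentence {i} is empty")
    return problems
```

An empty return value only means the counts match and no entry is blank; it says nothing about translation quality.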

License

CC-BY-SA-4.0

Attribution

@inproceedings{rajcoomar-2025-kozkreolmru,
    title = "{K}oz{K}reol{MRU} {WMT} 2025 {C}reole{MT} System Description: Koz Kreol: Multi-Stage Training for {E}nglish{--}Mauritian Creole {MT}",
    author = "Rajcoomar, Yush",
    editor = "Haddow, Barry  and
      Kocmi, Tom  and
      Koehn, Philipp  and
      Monz, Christof",
    booktitle = "Proceedings of the Tenth Conference on Machine Translation",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.wmt-1.92/",
    doi = "10.18653/v1/2025.wmt-1.92",
    pages = "1183--1190",
    ISBN = "979-8-89176-341-8",
    abstract = "Mauritian Creole (Kreol Morisyen), spoken by approximately 1.5 million people worldwide, faces significant challenges in digital language technology due to limited computational resources. This paper presents ``Koz Kreol'', a comprehensive approach to English{--}Mauritian Creole machine translation using a three-stage training methodology: monolingual pretraining, parallel data training, and LoRA fine-tuning. We achieve state-of-the-art results with a 28.82 BLEU score for EN{\textrightarrow}MFE translation, representing a 74{\%} improvement over ChatGPT-4o. Our work addresses critical data scarcity through the use of existing datasets, synthetic data generation, and community-sourced translations. The methodology provides a replicable framework for other low-resource Creole languages while supporting digital inclusion and cultural preservation for the Mauritian community. This paper consists of both a systems and data subtask submission as part of a Creole MT Shared Task."
}

Language codes

Additional language information

Workflow

Data was translated from English by two translators, both native speakers of the target language and highly proficient in English (C2 level of the Common European Framework of Reference for Languages). All target translations were reviewed by a native speaker.

Additional guidelines