Dataset card
Description
FLORES+ dev and devtest sets translated into Erzya.
License
CC-BY-SA-4.0
Attribution
Sergey Kuldin ([email protected]).
@inproceedings{wmt24-erzya,
title = "FLORES+ translation and machine translation evaluation for the {E}rzya language",
author = "Isai Gordeev and Sergey Kuldin and David Dale",
booktitle = "Proceedings of the Ninth Conference on Machine Translation",
month = nov,
year = "2024",
address = "Miami, USA",
publisher = "Association for Computational Linguistics"
}
Language codes
- ISO 639-3: myv
- ISO 15924: Cyrl
- Glottocode: erzy1239
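The codes above are usually combined into a single language identifier. A minimal sketch, assuming the `<ISO 639-3>_<ISO 15924>` naming used by FLORES-200 (for example `eng_Latn`) also applies to this release; the helper name `flores_code` is hypothetical and not part of any library:

```python
def flores_code(iso_639_3: str, iso_15924: str) -> str:
    """Join a language code and a script code into a FLORES-style identifier.

    Assumption: identifiers follow the '<language>_<Script>' pattern,
    with a lowercase language code and a title-case script code.
    """
    return f"{iso_639_3.lower()}_{iso_15924.capitalize()}"


# Codes taken from the list above.
erzya = flores_code("myv", "Cyrl")
print(erzya)
```

Under that assumption, the resulting identifier is the string to look for when selecting the Erzya split or file in a FLORES+ distribution.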
Additional language information
Erzya is one of the largest Finno-Ugric languages and belongs to the Mordvinic branch of the Uralic language family. The FLORES+ sentences are translated into the literary standard of the language, which is based on the Central dialect of Erzya. Erzya is spoken by about 300,000 people in the Russian Federation, with several tens of thousands of speakers in other countries, particularly Kazakhstan, Uzbekistan, Tajikistan, Kyrgyzstan, and Turkmenistan. Although Erzya has official status in the Republic of Mordovia, where it is used in education and media, the language faces challenges in intergenerational transmission and digital presence. Some digital resources do exist:
- Erzya Wikipedia: https://myv.wikipedia.org
- Machine translator with support for Erzya: https://lango.to
- Erzya language corpus: https://erzya.web-corpora.net/
- A corpus of parallel Erzya-Russian words, phrases and sentences: https://huggingface.co/datasets/slone/myv_ru_2022
- Erzya and Moksha Extended Corpora (ERME): https://www.kielipankki.fi/corpora/erme/
Workflow
The FLORES+ dataset was translated from Russian into Erzya by two native speakers, both of whom teach the language and are writers (one holds a doctoral degree in philology). The 250 sentences previously translated by Yankovskaya et al. (https://aclanthology.org/2023.nodalida-1.77/) were also included after a thorough revision. All translations were reviewed by one of the native-speaker translators and by a linguist with deep expertise in the language.