Open Language Data Initiative

Dataset card

Description

FLORES+ dev and devtest sets in Erzya

License

CC-BY-SA-4.0

Attribution

Sergey Kuldin ([email protected]).

@inproceedings{wmt24-erzya,
    title = "FLORES+ translation and machine translation evaluation for the {E}rzya language",
    author = "Isai Gordeev and Sergey Kuldin and David Dale",
    booktitle = "Proceedings of the Ninth Conference on Machine Translation",
    month = nov,
    year = "2024",
    address = "Miami, USA",
    publisher = "Association for Computational Linguistics"
}

Language codes

Additional language information

Erzya is one of the largest Finno-Ugric languages and belongs to the Mordvinic branch of the Uralic language family. The FLORES+ dataset is translated into the literary form of the language, which is based on the Central dialect of Erzya. Erzya is spoken by about 300,000 people in the Russian Federation, with several tens of thousands of speakers in other countries, particularly Kazakhstan, Uzbekistan, Tajikistan, Kyrgyzstan, and Turkmenistan. Despite its official status in the Republic of Mordovia, where Erzya is used in education and media, the language faces challenges in intergenerational transmission and digital presence. However, some digital resources exist.

Workflow

The FLORES+ dataset was translated from Russian into Erzya by two native speakers, both of whom are teachers of the language and writers; one holds a doctoral degree in philology. The 250 translated sentences from Yankovskaya et al. (https://aclanthology.org/2023.nodalida-1.77/) were also included after a thorough revision. All translations were reviewed by one of the native-speaker translators and by a linguist with deep expertise in the language.