Open Language Data Initiative


Dataset card

Description

FLORES+ dev and devtest sets in Aranese.

License

CC-BY-SA-4.0

Attribution

@inproceedings{wmt24-erzya,
    title="FLORES+ translation and machine translation evaluation for the {E}rzya language",
    author="Isai Gordeev and Sergey Kuldin and David Dale",
    booktitle = "Proceedings of the Ninth Conference on Machine Translation",
    month = nov,
    year = "2024",
    address = "Miami, USA",
    publisher = "Association for Computational Linguistics"
}

With the support of the R+D+i projects PID2021-127999NB-I00 (LiLowLa: Lightweight neural translation technologies for low-resource languages) and PID2021-124663OB-I00 (TAN-IBE: Neural Machine Translation for the Romance languages of the Iberian Peninsula), funded by MCIN/AEI/10.13039/501100011033/FEDER, UE.

Language codes

Additional language information

Workflow

The data was initially translated from Catalan using the Apertium rule-based machine translation system for Catalan-Aranese. The translation was then manually post-edited by native speakers of Aranese with three years of experience translating texts on diverse topics into Aranese. Finally, the post-edited translation was reviewed by other native speakers from the Institut d'Estudis Aranesi.

The data was initially obtained by translating the sentences in the Catalan FLORES+ dev and devtest datasets using the Apertium rule-based MT system for Catalan-Aranese. The output was then revised in two steps. First, a professional reviewer proficient in Aranese, with wide experience in translation and revision, post-edited the machine-translated Aranese version with the French, Catalan and Occitan versions of FLORES+ available for reference. Second, the post-edited translation was reviewed by different individuals from the Institut d'Estudis Aranesi (IEA), who are native speakers.

Machine translation was used for two reasons. First, the two-step workflow of machine translation followed by post-editing is common for this language, and many existing texts are produced this way. Second, linguists and translators for Aranese are scarce, which made it difficult to complete the task within the required timeframe without machine translation.
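
The first step of the workflow, machine translation of the Catalan sentences, can be sketched as follows. This is a minimal illustration and not the authors' actual script: it assumes the `apertium` command-line tool is installed, and the mode name passed to it (for example `cat-oci_aran`) is an assumption that should be checked against the output of `apertium -l`. The helper names `apertium_translator` and `translate_lines` are hypothetical. Sentences are translated one at a time so that line i of the output always corresponds to line i of the input, which keeps the Aranese file aligned with the other FLORES+ languages.

```python
import subprocess
from typing import Callable, Iterable, List


def apertium_translator(mode: str) -> Callable[[str], str]:
    """Return a function that translates one sentence with the Apertium CLI.

    `mode` is the Apertium translation direction. The exact mode name for
    Catalan-Aranese is an assumption; list installed modes with `apertium -l`.
    """
    def translate(sentence: str) -> str:
        # The CLI reads the source text from stdin and writes the translation to stdout.
        result = subprocess.run(
            ["apertium", mode],
            input=sentence,
            capture_output=True,
            text=True,
            check=True,
        )
        return result.stdout.strip()

    return translate


def translate_lines(lines: Iterable[str], translate: Callable[[str], str]) -> List[str]:
    """Translate sentence by sentence so output line i matches input line i."""
    output = []
    for line in lines:
        sentence = line.strip()
        # Keep empty lines empty so source and target files stay aligned.
        output.append(translate(sentence) if sentence else "")
    return output
```

A call such as `translate_lines(source_sentences, apertium_translator("cat-oci_aran"))` would produce the raw machine translation, which in the workflow described above is then post-edited and reviewed by humans before release.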

Additional guidelines

The guidelines provided by the Institut d'Estudis Aranesi were followed to ensure that the translation into Aranese aligned with their recommendations.