Open Language Data Initiative

The contents of this card can be edited in the source repository.

Dataset card

Description

FLORES+ devtest set in the Valencian dialect of Catalan

License

CC-BY-SA-4.0

Attribution

@inproceedings{wmt24-spain,
    title="Expanding the FLORES+ Multilingual Benchmark with Translations for {Aragonese, Aranese, Asturian, and Valencian}",
    author="Juan Antonio Perez-Ortiz and Felipe S{\'a}nchez-Martínez and Víctor M. S{\'a}nchez-Cartagena and Miquel Esplà-Gomis and Aaron Galiano Jimenez and Antoni Oliver and Claudi Aventín-Boya and Alejandro Pardos and Cristina Valdés and Jus{\'e}p Loís Sans Socasau and Juan Pablo Martínez",
    booktitle = "Proceedings of the Ninth Conference on Machine Translation",
    month = nov,
    year = "2024",
    address = "Miami, USA",
    publisher = "Association for Computational Linguistics"
}

With the support of the project CENID2023/10, funded by Diputació d'Alacant; and the R+D+i project PID2021-127999NB-I00 (LiLowLa: Lightweight neural translation technologies for low-resource languages), funded by MCIN /AEI /10.13039/501100011033 / FEDER, UE.

Language codes

Additional language information

Workflow

The dataset was created by adapting the existing Catalan version of the FLORES+ devtest set. The work was carried out by a single native speaker of the Valencian dialect. The native speaker holds a Ph.D. and has experience in translating into Valencian and reviewing texts in Valencian.

In order to speed up the process, LanguageTool was used to introduce an initial set of changes to the original text. Then, all these changes were manually checked, and other required changes were manually applied, taking into account the aforementioned list of lexical items that change between the different dialects of the Catalan language. In case of doubt, the dictionary of the Valencian Language Academy was checked.

Additional guidelines

Linguistic criteria for the institutional use of the Valencian dialect in the Valencian universities were followed.