Related projects
Currently, OLDI manages OLDI-Seed (an extension of the NLLB-Seed dataset) and FLORES+ (an extension of the FLORES-200 dataset). Below, we list other extensions and derivatives of these datasets that are not managed by OLDI but may still be relevant, along with other notable multiway parallel datasets.
Partial FLORES translations
There are at least two translations of FLORES that comprise less than one full dataset split. Because they cover languages for which no other translation benchmarks exist, they may still be of interest:
- smugri-flores-testset: translation of the first 250 sentences of the FLORES devtest set (all from the news domain) into low-resource Finno-Ugric languages: Komi, Udmurt, Hill and Meadow Mari, Erzya, Moksha, Livonian, Mansi, and Livvi Karelian (Yankovskaya et al., 2023), later expanded with Proper Karelian, Ludian, and Veps (Pashchenko et al., 2024).
- chukot_russian_flores_sample: translation of the first 100 sentences of the FLORES devtest set into the Chukot language via Russian.
- 2M-Flores-ASL: a version of FLORES translated into American Sign Language.
Other FLORES derivatives
- FLEURS: a speech version of FLORES in 102 languages.
- Belebele: a benchmark of multilingual reading comprehension (spanning 122 language variants) with multiple-choice answers built on top of FLORES.
- SIB-200: a multilingual benchmark for topic classification in 205 languages.
- xSIM++: evaluation of bitext mining with hard negative examples on the English side.
Other massively parallel datasets
- BOUQuET: a multi-way parallel, multi-centric and multi-register/domain evaluation dataset. A potential alternative to FLORES.
- WMT24++: WMT24 test set translated from English to 55 languages. A potential alternative to FLORES.
- SMOL: a collection of word, sentence, and document translations into low-resource languages, intended for training translation models. A potential alternative to the Seed dataset.
- Global-MMLU: a multilingual parallel dataset for evaluating knowledge and reasoning in LLMs using multiple-choice questions.