Related projects
Currently, OLDI manages OLDI-Seed (an extension of the NLLB-Seed dataset) and FLORES+ (an extension of the FLORES-200 dataset). Below, we list other extensions and derivatives of these datasets that are not managed by OLDI but may still be relevant, as well as other interesting multiway parallel datasets.
Partial FLORES translations
There are at least two translations of FLORES that comprise less than one full dataset split. Because they cover languages for which no other translation benchmarks exist, they may still be of interest:
- smugri-flores-testset: translation of the first 250 sentences of the FLORES devtest set (all from the news domain) into low-resource Finno-Ugric languages: Komi, Udmurt, Hill and Meadow Mari, Erzya, Moksha, Livonian, Mansi, and Livvi Karelian (Yankovskaya et al., 2023), later expanded with Proper Karelian, Ludian, and Veps (Pashchenko et al., 2024).
- chukot_russian_flores_sample: translation of the first 100 sentences of the FLORES devtest set into the Chukot language via Russian.
- 2M-Flores-ASL: a version of FLORES translated into American Sign Language.
Other FLORES derivatives
- Fleurs: a speech version of FLORES in 102 languages.
- Belebele: a multiple-choice benchmark of multilingual reading comprehension, spanning 122 language variants and built on top of FLORES.