Papers
Since their first releases, the FLORES and Seed datasets have been well documented in research papers. Their community extensions (including those under the WMT shared tasks in 2024 and 2025) have often resulted in publications as well. All of these papers are listed below in reverse chronological order.
2025
- Filling the Gap for Uzbek: Creating Translation Resources for Southern Uzbek by Mamasaidov et al, 2025.
  FLORES+ in Southern Uzbek.
- A French Version of the OLDI Seed Corpus by Marmonier et al, 2025.
2024
- Findings of the WMT 2024 Shared Task of the Open Language Data Initiative by Burchell et al, 2024.
  Summary of the papers below.
- Expanding the FLORES+ Multilingual Benchmark with Translations for Aragonese, Aranese, Asturian, and Valencian by Perez-Ortiz et al, 2024.
- Correcting FLORES Evaluation Dataset for Four African Languages by Abdulmumin et al, 2024.
  FLORES+ improvements to Hausa, Northern Sotho (Sepedi), Xitsonga, and isiZulu.
- Expanding FLORES+ Benchmark for more Low-Resource Settings: Portuguese-Emakhuwa Machine Translation Evaluation by Ali et al, 2024.
- Enhancing Tuvan Language Resources through the FLORES Dataset by Kuzhuget et al, 2024.
- Machine Translation Evaluation Benchmark for Wu Chinese: Workflow and Analysis by Yu et al, 2024.
  FLORES+ translation into Wu Chinese.
- Open Language Data Initiative: Advancing Low-Resource Machine Translation for Karakalpak by Mamasaidov and Shopulatov, 2024.
  FLORES+ translation into Karakalpak.
- FLORES+ translation and machine translation evaluation for the Erzya language by Gordeev et al, 2024.
- The Bangla/Bengali Seed Dataset Submission to the WMT24 Open Language Data Initiative Shared Task by Ahmed et al, 2024.
- A high-quality Seed dataset for Italian machine translation by Ferrante, 2024.
- Spanish Corpus and Provenance with Computer-Aided Translation for the WMT24 OLDI Shared Task by Cols, 2024.
2023
- Small Data, Big Impact: Leveraging Minimal Data for Effective Machine Translation by Maillard et al, 2023.
  This paper describes the construction of the Seed dataset and analyses its impact in more detail.
- Machine Translation for Nko: Tools, Corpora, and Baseline Results by Doumbouya et al, 2023.
  Introducing FLORES and Seed for the Nko language.
- IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages by AI4Bharat et al, 2023.
  Translating FLORES into Bodo, Dogri, Meitei (Meitei script), Sindhi (Devanagari script), and Goan Konkani.
2022 and earlier
- No language left behind: Scaling human-centered machine translation by NLLB Team et al, 2022.
  In this paper, FLORES-101 was extended to 202 languages, and Seed was introduced and used for training the NLLB-200 model.
- The Flores-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation by Goyal et al, 2022.
  This paper rebuilt FLORES in its modern form, based on English Wikinews, Wikibooks, and Wikivoyage.
- The FLORES Evaluation Datasets for Low-Resource Machine Translation: Nepali–English and Sinhala–English by Guzmán et al, 2019.
  The very first version of FLORES.