Papers

Since their first releases, the FLORES and Seed datasets have been well documented in research papers. Their community extensions (including those submitted to the WMT shared tasks in 2024 and 2025) have often resulted in publications as well. All of these papers are listed below, grouped by year in reverse chronological order.

2025

MiLiC-Eval: Benchmarking Multilingual LLMs for China's Minority Languages by Zhang et al, 2025.
Filling the Gap for Uzbek: Creating Translation Resources for Southern Uzbek by Mamasaidov et al, 2025.
A French Version of the OLDI Seed Corpus by Marmonier et al, 2025.
Bringing Ladin to FLORES+ by Frontull et al, 2025.
Correcting the Tamazight Portions of FLORES+ and OLDI Seed Datasets by Oktem et al, 2025.
The Kyrgyz Seed Dataset Submission to the WMT25 Open Language Data Initiative Shared Task by Jumashev et al, 2025.
Improved Norwegian Bokmål Translations for FLORES by Mæhlum et al, 2025.
Expanding the WMT24++ Benchmark with Rumantsch Grischun, Sursilvan, Sutsilvan, Surmiran, Puter, and Vallader by Vamvas et al, 2025.
KozKreolMRU WMT 2025 CreoleMT System Description: Koz Kreol: Multi-Stage Training for English - Mauritian Creole MT by Rajcoomar, 2025.
SMOL: Professionally Translated Parallel Data for 115 Under-represented Languages by Caswell et al, 2025.
Findings of the WMT 2025 Shared Task of the Open Language Data Initiative by Dale et al, 2025.

2024

Findings of the WMT 2024 Shared Task of the Open Language Data Initiative by Burchell et al, 2024.
Summary of the papers below.
Expanding the FLORES+ Multilingual Benchmark with Translations for Aragonese, Aranese, Asturian, and Valencian by Perez-Ortiz et al, 2024.
Correcting FLORES Evaluation Dataset for Four African Languages by Abdulmumin et al, 2024.
FLORES+ improvements to Hausa, Northern Sotho (Sepedi), Xitsonga, and isiZulu.
Expanding FLORES+ Benchmark for more Low-Resource Settings: Portuguese-Emakhuwa Machine Translation Evaluation by Ali et al, 2024.
Enhancing Tuvan Language Resources through the FLORES Dataset by Kuzhuget et al, 2024.
Machine Translation Evaluation Benchmark for Wu Chinese: Workflow and Analysis by Yu et al, 2024.
FLORES+ translation into Wu Chinese.
Open Language Data Initiative: Advancing Low-Resource Machine Translation for Karakalpak by Mamasaidov and Shopulatov, 2024.
FLORES+ translation into Karakalpak.
FLORES+ translation and machine translation evaluation for the Erzya language by Gordeev et al, 2024.
The Bangla/Bengali Seed Dataset Submission to the WMT24 Open Language Data Initiative Shared Task by Ahmed et al, 2024.
A high-quality Seed dataset for Italian machine translation by Ferrante, 2024.
Spanish Corpus and Provenance with Computer-Aided Translation for the WMT24 OLDI Shared Task by Cols, 2024.

2023

Small Data, Big Impact: Leveraging Minimal Data for Effective Machine Translation by Maillard et al, 2023.
This paper describes the construction of the Seed dataset and analyses its impact in more detail.
Machine Translation for Nko: Tools, Corpora, and Baseline Results by Doumbouya et al, 2023.
Introducing FLORES and Seed for the Nko language.
IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages by AI4Bharat et al, 2023.
Translating FLORES into Bodo, Dogri, Meitei (Meitei script), Sindhi (Devanagari script), and Goan Konkani.

2022 and earlier

No language left behind: Scaling human-centered machine translation by NLLB Team et al, 2022.
In this paper, FLORES-101 was extended to 202 languages, and Seed was introduced and used for training the NLLB-200 model.
The Flores-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation by Goyal et al, 2022.
This paper rebuilt FLORES in its modern form, based on English Wikinews, Wikibooks, and Wikivoyage.
The FLORES Evaluation Datasets for Low-Resource Machine Translation: Nepali–English and Sinhala–English by Guzmán et al, 2019.
The very first version of FLORES.