Welcome!
The Open Language Data Initiative (OLDI) empowers language communities around the globe to contribute to the data that forms the foundation of today’s machine translation and natural language processing work. We invite community, academic, and industry members to contribute to key datasets that are essential to the organic expansion of language technology’s reach.
Why do we exist?
Machine translation research has advanced at breakneck speed. That said, progress in translation quality has largely been concentrated in high-resource languages, leaving many languages behind. More recently, focus has started to shift to under-served (also called low-resource) languages, and foundational datasets such as FLORES, NLLB-Seed, and NTREX have made it easier to develop and evaluate language technologies for an increasing number of languages. The high impact of these datasets left some in the research community wondering: how do we add more languages to these existing open-source datasets?
OLDI was established for this very purpose. Because these datasets are central to the development of machine translation systems, allowing community, academic, and industry members to contribute to them directly ensures the organic growth of these foundational corpora. The data made available can help researchers and developers improve translation coverage and quality, build stronger models, and bring us closer to realizing the “Polyglot Internet”.
See the Contribution Guidelines.
OLDI datasets
OLDI currently houses the following datasets:
- 🌱 OLDI-Seed, a set of 6,193 sentences extracted from English Wikipedia and translated into many other languages, which can be used to train machine translation models. It is an extended and improved version of the NLLB-Seed dataset.
- 💐 FLORES+, an evaluation benchmark for multilingual machine translation, covering over 200 languages. It is an extended and improved version of the FLORES-200 dataset.
Organizers
- Antonios Anastasopoulos, George Mason University
- Laurie Burchell, University of Edinburgh
- David Dale, Meta FAIR
- Christian Federmann, Apple
- Jean Maillard, Meta FAIR
- Philipp Koehn, Johns Hopkins University
- Skyler Wang, McGill University