Welcome!

The Open Language Data Initiative (OLDI) empowers language communities around the globe to contribute to the datasets that form the foundation of today’s machine translation and natural language processing work. We invite community, academic, and industry members to contribute to key datasets that are essential to expanding the reach of language technology.

Why do we exist?

Machine translation research has advanced at breakneck speed. That said, progress in translation quality has largely been concentrated in high-resource languages, leaving many languages behind. More recently, focus has started to shift to under-served (also called low-resource) languages, and foundational datasets such as FLORES, NLLB-Seed and NTREX have made it easier to develop and evaluate language technologies for an increasing number of languages. The high impact of these datasets left some in the research community wondering: how do we add more languages to these existing open-source datasets?

OLDI was established for this very purpose. Because these datasets are central to the development of machine translation systems, allowing community, academic and industry members to contribute to them directly ensures the organic growth of these foundational corpora. The data made available can help researchers and developers improve translation coverage and quality, build stronger models, and bring us closer to realizing the “Polyglot Internet”.

Contributions are most welcome!
See the Contribution Guidelines. Join our Discord and subscribe to our Substack newsletter.

See the list of Supported Languages.

OLDI datasets

OLDI currently houses the following datasets:

🌱 OLDI-Seed, a set of 6,193 sentences extracted from English Wikipedia and translated into many other languages, which can be used to train machine translation models. It is an extended and improved version of the NLLB-Seed dataset.
💐 FLORES+, an evaluation benchmark for multilingual machine translation, covering over 200 languages. It is an extended and improved version of the FLORES-200 dataset.

Organizers

Idris Abdulmumin, University of Pretoria
Antonios Anastasopoulos, George Mason University
Laurie Burchell, Common Crawl Foundation
Isaac Caswell, Google
David Dale, Meta FAIR
Christian Federmann, Apple
Jean Maillard, Meta FAIR
Philipp Koehn, Johns Hopkins University
Skyler Wang, McGill University