Open Language Data Initiative


The Open Language Data Initiative (OLDI) empowers language communities around the globe to contribute to the datasets that underpin today’s machine translation and natural language processing work. We invite community, academic, and industry members to contribute to key datasets that are essential to the organic expansion of language technology’s reach.

Why do we exist?

Machine translation research has advanced at breakneck speed. That said, progress in translation quality has largely been concentrated in high-resource languages, leaving many languages behind. More recently, focus has started to shift to under-served languages (also called low-resource languages), and foundational datasets such as FLORES, NLLB-Seed and NTREX have made it easier to develop and evaluate language technologies for an increasing number of languages. The high impact of these datasets left some in the research community wondering: how do we add more languages to these existing open-source datasets?

OLDI was established for this very purpose. Because these datasets are central to the development of machine translation systems, allowing community, academic and industry members to contribute to them directly ensures the organic growth of these foundational corpora. The data made available can help researchers and developers improve translation coverage and quality, build stronger models, and bring us closer to realizing the “Polyglot Internet”.

Contributions are most welcome!
See the Contribution Guidelines.
See the list of Supported Languages.

OLDI datasets

OLDI currently houses the following datasets: