Open Language Data Initiative

Welcome!

The Open Language Data Initiative (OLDI) empowers language communities around the globe to contribute to the datasets that underpin today’s machine translation and natural language processing work. We invite community, academic, and industry members to contribute to key datasets that are essential to expanding the reach of language technology.

Why do we exist?

Machine translation research has advanced at breakneck speed. That said, progress made in translation quality has largely been directed at high-resource languages, leaving many languages behind. More recently, focus has started to shift to under-served languages (also called low-resource), and foundational datasets such as FLORES, NLLB-Seed and NTREX have made it easier to develop and evaluate language technologies for an increasing number of languages. The high impact of these components left some in the research community wondering: how do we add more languages to these existing open-source datasets?

OLDI was established for this very purpose. Because these datasets are central to the development of machine translation systems, allowing community, academic and industry members to contribute to them directly supports the organic growth of these foundational corpora. The data made available can help researchers and developers improve translation coverage and quality, build stronger models, and bring us closer to realizing “The Polyglot Internet”.

OLDI datasets

OLDI currently houses the following datasets: FLORES+ and Seed.

Contributing

We encourage researchers from language communities, academia and industry to participate. We have developed Contribution Guidelines for interested contributors. Please review the document carefully.

See the Contribution Guidelines.

Current OLDI languages

The following table lists all languages currently included in OLDI and the datasets that cover each of them. Each dataset cell carries one of the statuses that appear in the table: available, partially available, or issues reported.

| Language | Code | FLORES+ | Seed |
| Acehnese (Arabic script) | ace_Arab | available | available |
| Acehnese (Latin script) | ace_Latn | available | available |
| Mesopotamian Arabic | acm_Arab | available | |
| Ta’izzi-Adeni Arabic | acq_Arab | available | |
| Tunisian Arabic | aeb_Arab | available | |
| Afrikaans | afr_Latn | available | |
| South Levantine Arabic | ajp_Arab | available | |
| Akan | aka_Latn | issues reported | |
| Tosk Albanian | als_Latn | available | |
| Amharic | amh_Ethi | available | |
| North Levantine Arabic | apc_Arab | available | |
| Modern Standard Arabic | arb_Arab | available | |
| Modern Standard Arabic (Romanized) | arb_Latn | available | |
| Najdi Arabic | ars_Arab | available | |
| Moroccan Arabic | ary_Arab | available | available |
| Egyptian Arabic | arz_Arab | available | available |
| Assamese | asm_Beng | available | |
| Asturian | ast_Latn | available | |
| Awadhi | awa_Deva | available | |
| Central Aymara | ayr_Latn | available | |
| South Azerbaijani | azb_Arab | available | |
| North Azerbaijani | azj_Latn | available | |
| Bashkir | bak_Cyrl | available | |
| Bambara | bam_Latn | available | available |
| Balinese | ban_Latn | available | available |
| Belarusian | bel_Cyrl | available | |
| Bemba | bem_Latn | available | |
| Bengali | ben_Beng | available | |
| Bhojpuri | bho_Deva | available | available |
| Banjar (Arabic script) | bjn_Arab | available | available |
| Banjar (Latin script) | bjn_Latn | available | available |
| Standard Tibetan | bod_Tibt | available | |
| Bosnian | bos_Latn | available | |
| Bodo | brx_Deva | partially available | |
| Buginese | bug_Latn | available | available |
| Bulgarian | bul_Cyrl | available | |
| Catalan | cat_Latn | available | |
| Cebuano | ceb_Latn | available | |
| Czech | ces_Latn | available | |
| Chokwe | cjk_Latn | available | |
| Mandarin Chinese (Simplified) | cmn_Hans | available | |
| Mandarin Chinese (Traditional) | cmn_Hant | available | |
| Central Kurdish | ckb_Arab | issues reported | |
| Crimean Tatar | crh_Latn | available | available |
| Welsh | cym_Latn | available | |
| Danish | dan_Latn | available | |
| German | deu_Latn | available | |
| Dogri | dgo_Deva | partially available | |
| Southwestern Dinka | dik_Latn | available | available |
| Dyula | dyu_Latn | available | |
| Dzongkha | dzo_Tibt | available | available |
| Greek | ell_Grek | available | |
| English | eng_Latn | available | |
| Esperanto | epo_Latn | available | |
| Estonian | est_Latn | available | |
| Basque | eus_Latn | available | |
| Ewe | ewe_Latn | available | |
| Faroese | fao_Latn | available | |
| Fijian | fij_Latn | available | |
| Filipino | fil_Latn | available | |
| Finnish | fin_Latn | available | |
| Fon | fon_Latn | available | |
| French | fra_Latn | available | |
| Friulian | fur_Latn | available | available |
| Nigerian Fulfulde | fuv_Latn | available | available |
| Scottish Gaelic | gla_Latn | available | |
| Irish | gle_Latn | available | |
| Galician | glg_Latn | available | |
| Goan Konkani | gom_Deva | partially available | |
| Guarani | grn_Latn | available | available |
| Gujarati | guj_Gujr | available | |
| Haitian Creole | hat_Latn | available | |
| Hausa | hau_Latn | available | |
| Hebrew | heb_Hebr | available | |
| Hindi | hin_Deva | available | |
| Chhattisgarhi | hne_Deva | available | available |
| Croatian | hrv_Latn | available | |
| Hungarian | hun_Latn | available | |
| Armenian | hye_Armn | available | |
| Igbo | ibo_Latn | available | |
| Ilocano | ilo_Latn | available | |
| Indonesian | ind_Latn | available | |
| Icelandic | isl_Latn | available | |
| Italian | ita_Latn | available | |
| Javanese | jav_Latn | available | |
| Japanese | jpn_Jpan | available | |
| Kabyle | kab_Latn | available | |
| Jingpho | kac_Latn | available | |
| Kamba | kam_Latn | available | |
| Kannada | kan_Knda | available | |
| Kashmiri (Arabic script) | kas_Arab | available | available |
| Kashmiri (Devanagari script) | kas_Deva | available | available |
| Georgian | kat_Geor | available | |
| Central Kanuri (Arabic script) | knc_Arab | available | available |
| Central Kanuri (Latin script) | knc_Latn | available | available |
| Kazakh | kaz_Cyrl | available | |
| Kabiyè | kbp_Latn | available | |
| Kabuverdianu | kea_Latn | available | |
| Khmer | khm_Khmr | available | |
| Kikuyu | kik_Latn | available | |
| Kinyarwanda | kin_Latn | available | |
| Kyrgyz | kir_Cyrl | available | |
| Kimbundu | kmb_Latn | available | |
| Northern Kurdish | kmr_Latn | available | |
| Kikongo | kon_Latn | available | |
| Korean | kor_Hang | available | |
| Lao | lao_Laoo | available | |
| Ligurian | lij_Latn | available | available |
| Limburgish | lim_Latn | available | available |
| Lingala | lin_Latn | available | |
| Lithuanian | lit_Latn | available | |
| Lombard | lmo_Latn | issues reported | issues reported |
| Latgalian | ltg_Latn | available | available |
| Luxembourgish | ltz_Latn | available | |
| Luba-Kasai | lua_Latn | available | |
| Ganda | lug_Latn | available | |
| Luo | luo_Latn | available | |
| Mizo | lus_Latn | available | |
| Standard Latvian | lvs_Latn | available | |
| Magahi | mag_Deva | available | available |
| Maithili | mai_Deva | available | |
| Malayalam | mal_Mlym | available | |
| Marathi | mar_Deva | available | |
| Minangkabau (Arabic script) | min_Arab | available | |
| Minangkabau (Latin script) | min_Latn | available | |
| Macedonian | mkd_Cyrl | available | |
| Plateau Malagasy | plt_Latn | available | |
| Maltese | mlt_Latn | available | |
| Meitei (Bengali script) | mni_Beng | available | available |
| Meitei (Meitei script) | mni_Mtei | partially available | |
| Halh Mongolian | khk_Cyrl | available | |
| Mossi | mos_Latn | available | |
| Maori | mri_Latn | available | available |
| Burmese | mya_Mymr | available | |
| Dutch | nld_Latn | available | |
| Norwegian Nynorsk | nno_Latn | available | |
| Norwegian Bokmål | nob_Latn | available | |
| Nepali | npi_Deva | available | |
| Nko | nqo_Nkoo | available | available |
| Northern Sotho | nso_Latn | available | |
| Nuer | nus_Latn | available | available |
| Nyanja | nya_Latn | available | |
| Occitan | oci_Latn | available | |
| West Central Oromo | gaz_Latn | available | |
| Odia | ory_Orya | available | |
| Pangasinan | pag_Latn | available | |
| Eastern Panjabi | pan_Guru | available | |
| Papiamento | pap_Latn | available | |
| Western Persian | pes_Arab | available | |
| Polish | pol_Latn | available | |
| Portuguese | por_Latn | available | |
| Dari | prs_Arab | available | available |
| Southern Pashto | pbt_Arab | available | available |
| Ayacucho Quechua | quy_Latn | available | |
| Romanian | ron_Latn | available | |
| Rundi | run_Latn | available | |
| Russian | rus_Cyrl | available | |
| Sango | sag_Latn | available | |
| Sanskrit | san_Deva | available | |
| Santali | sat_Olck | available | |
| Sicilian | scn_Latn | available | available |
| Shan | shn_Mymr | available | available |
| Sinhala | sin_Sinh | available | |
| Slovak | slk_Latn | available | |
| Slovenian | slv_Latn | available | |
| Samoan | smo_Latn | available | |
| Shona | sna_Latn | available | |
| Sindhi | snd_Arab | available | |
| Somali | som_Latn | available | |
| Southern Sotho | sot_Latn | available | |
| Spanish | spa_Latn | available | |
| Sardinian | srd_Latn | issues reported | issues reported |
| Serbian | srp_Cyrl | available | |
| Swati | ssw_Latn | available | |
| Sundanese | sun_Latn | available | |
| Swedish | swe_Latn | available | |
| Swahili | swh_Latn | available | |
| Silesian | szl_Latn | available | available |
| Tamil | tam_Taml | available | |
| Tatar | tat_Cyrl | available | |
| Telugu | tel_Telu | available | |
| Tajik | tgk_Cyrl | available | |
| Thai | tha_Thai | available | |
| Tigrinya | tir_Ethi | available | |
| Tamasheq (Latin script) | taq_Latn | available | available |
| Tamasheq (Tifinagh script) | taq_Tfng | available | available |
| Tok Pisin | tpi_Latn | available | |
| Tswana | tsn_Latn | available | |
| Tsonga | tso_Latn | available | |
| Turkmen | tuk_Latn | available | |
| Tumbuka | tum_Latn | available | |
| Turkish | tur_Latn | available | |
| Twi | twi_Latn | available | |
| Uyghur | uig_Arab | available | |
| Ukrainian | ukr_Cyrl | available | |
| Umbundu | umb_Latn | available | |
| Urdu | urd_Arab | available | |
| Northern Uzbek | uzn_Latn | available | |
| Venetian | vec_Latn | available | available |
| Vietnamese | vie_Latn | available | |
| Waray | war_Latn | available | |
| Wolof | wol_Latn | available | |
| Xhosa | xho_Latn | available | |
| Eastern Yiddish | ydd_Hebr | available | |
| Yoruba | yor_Latn | available | |
| Yue Chinese | yue_Hant | issues reported | |
| Standard Moroccan Tamazight | zgh_Tfng | available | available |
| Standard Malay | zsm_Latn | available | |
| Zulu | zul_Latn | available | |
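The codes in the table pair a three-letter language identifier with a four-letter script identifier, joined by an underscore (for example, ace_Arab and ace_Latn for the two scripts of Acehnese). The following Python sketch shows one way a reader might split such codes and find languages listed in more than one script. The function name `split_code` and the short sample list are illustrative assumptions, not part of any OLDI tooling.

```python
from collections import defaultdict


def split_code(code: str) -> tuple[str, str]:
    """Split a code such as 'ace_Arab' into ('ace', 'Arab').

    Hypothetical helper: assumes the 'language_Script' shape seen in the table.
    """
    language, script = code.split("_", 1)
    return language, script


# A handful of codes copied from the table, used only as sample input.
codes = ["ace_Arab", "ace_Latn", "bjn_Arab", "bjn_Latn", "eng_Latn", "taq_Tfng"]

# Group the scripts listed for each language identifier.
scripts_by_language = defaultdict(list)
for code in codes:
    language, script = split_code(code)
    scripts_by_language[language].append(script)

# Keep only the languages that appear with more than one script.
multi_script = {
    lang: scripts for lang, scripts in scripts_by_language.items() if len(scripts) > 1
}
print(multi_script)
```

Grouping by the language part rather than the full code matters when counting languages, since a language written in two scripts occupies two rows of the table.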

The size of the table above might give the mistaken impression that these datasets cover a large proportion of the world’s languages. Although many languages are supported, they represent only a very small fraction of the languages currently spoken around the world. The following progress bar gives a rough estimate of the share of languages covered by OLDI datasets, relative to the approximate total number of currently spoken languages (based on Glottolog data).

Language coverage: 2.5% of all languages
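The coverage figure is a simple ratio: languages covered divided by the estimated total of spoken languages. A minimal Python sketch of that arithmetic follows. The function name `coverage_percent` is hypothetical, and the counts passed to it are placeholders chosen only to illustrate the calculation; they are not OLDI or Glottolog figures.

```python
def coverage_percent(covered: int, total: int) -> float:
    """Return the share of covered languages as a percentage of the total."""
    if total <= 0:
        raise ValueError("total must be a positive number of languages")
    return 100.0 * covered / total


# Placeholder counts for illustration only (not real OLDI or Glottolog data).
example_covered = 50
example_total = 2000

print(f"{coverage_percent(example_covered, example_total):.1f}% of all languages")
```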

Organizers