Open Language Data Initiative

Language List

Current OLDI languages

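The following table lists all languages currently included in OLDI and the datasets that cover each one. Each row gives the language’s ISO 639-3 code, its ISO 15924 script code, its Glottocode, and its name, followed by the dataset cells. Each dataset cell reads “available”, “partially available”, or “issues reported”: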

ace Arab achi1257 Acehnese (Jawi script) available available
ace Latn achi1257 Acehnese (Latin script) available available
acm Arab meso1252 Mesopotamian Arabic issues reported
acq Arab taiz1242 Taʽizzi-Adeni Arabic issues reported
aeb Arab tuni1259 Tunisian Arabic available
afr Latn afri1274 Afrikaans available
als Latn tosk1239 Albanian (Tosk) available
amh Ethi amha1245 Amharic available
apc Arab nort3139 Levantine Arabic (North) available
apc Arab sout3123 Levantine Arabic (South) available
arb Arab stan1318 Modern Standard Arabic available
arb Latn stan1318 Modern Standard Arabic (Romanized) available
ars Arab najd1235 Najdi Arabic issues reported
ary Arab moro1292 Moroccan Arabic available available
arz Arab egyp1253 Egyptian Arabic available available
asm Beng assa1263 Assamese available
ast Latn astu1245 Asturian available
awa Deva awad1243 Awadhi available
ayr Latn cent2142 Central Aymara available
azb Arab sout2697 South Azerbaijani available
azj Latn nort2697 North Azerbaijani available
bak Cyrl bash1264 Bashkir available
bam Latn bamb1269 Bambara available available
ban Latn bali1278 Balinese available available
bel Cyrl bela1254 Belarusian available
bem Latn bemb1257 Bemba available
ben Beng beng1280 Bengali available
bho Deva bhoj1244 Bhojpuri available available
bjn Arab banj1239 Banjar (Jawi script) available available
bjn Latn banj1239 Banjar (Latin script) available available
bod Tibt utsa1239 Lhasa Tibetan available
bos Latn bosn1245 Bosnian available
brx Deva bodo1269 Bodo partially available
bug Latn bugi1244 Buginese available available
bul Cyrl bulg1262 Bulgarian available
cat Latn stan1289 Catalan available
ceb Latn cebu1242 Cebuano available
ces Latn czec1258 Czech available
chv Cyrl chuv1255 Chuvash partially available
cjk Latn chok1245 Chokwe available
ckb Arab cent1972 Central Kurdish available
cmn Hans beij1234 Mandarin Chinese (Standard Beijing) available
cmn Hant taib1240 Mandarin Chinese (Taiwanese) available
crh Latn crim1257 Crimean Tatar available available
cym Latn wels1247 Welsh available
dan Latn dani1285 Danish available
deu Latn stan1295 German available
dgo Deva dogr1250 Dogri partially available
dik Latn sout2832 Southwestern Dinka available available
dyu Latn dyul1238 Dyula available
dzo Tibt dzon1239 Dzongkha available available
ekk Latn esto1258 Estonian available
ell Grek mode1248 Greek available
eng Latn stan1293 English available available
epo Latn espe1235 Esperanto available
eus Latn basq1248 Basque available
ewe Latn ewee1241 Ewe available
fao Latn faro1244 Faroese available
fij Latn fiji1243 Fijian available
fil Latn fili1244 Filipino available
fin Latn finn1318 Finnish available
fon Latn fonn1241 Fon available
fra Latn stan1290 French available
fur Latn east2271 Friulian available available
fuv Latn nige1253 Nigerian Fulfulde available available
gaz Latn west2721 West Central Oromo available
gla Latn scot1245 Scottish Gaelic available
gle Latn iris1253 Irish available
glg Latn gali1258 Galician available
gom Deva goan1235 Goan Konkani available
gug Latn para1311 Paraguayan Guaraní available available
guj Gujr guja1252 Gujarati available
hat Latn hait1244 Haitian Creole available
hau Latn haus1257 Hausa available
heb Hebr hebr1245 Hebrew available
hin Deva hind1269 Hindi available
hne Deva chha1249 Chhattisgarhi available available
hrv Latn croa1245 Croatian available
hun Latn hung1274 Hungarian available
hye Armn nucl1235 Armenian available
ibo Latn nucl1417 Igbo available
ilo Latn ilok1237 Ilocano available
ind Latn indo1316 Indonesian available
isl Latn icel1247 Icelandic available
ita Latn ital1282 Italian available
jav Latn java1254 Javanese available
jpn Jpan nucl1643 Japanese available
kab Latn kaby1243 Kabyle available
kac Latn kach1280 Jingpho available
kam Latn kamb1297 Kamba available
kan Knda nucl1305 Kannada available
kas Arab kash1277 Kashmiri (Arabic script) available available
kas Deva kash1277 Kashmiri (Devanagari script) available available
kat Geor nucl1302 Georgian available
kaz Cyrl kaza1248 Kazakh available
kbp Latn kabi1261 Kabiyè available
kea Latn kabu1256 Kabuverdianu available
khk Cyrl halh1238 Halh Mongolian available
khm Khmr cent1989 Khmer (Central) available
kik Latn kiku1240 Kikuyu available
kin Latn kiny1244 Kinyarwanda available
kir Cyrl kirg1245 Kyrgyz available
kmb Latn kimb1241 Kimbundu available
kmr Latn nort2641 Northern Kurdish available
knc Arab cent2050 Central Kanuri (Arabic script) available available
knc Latn cent2050 Central Kanuri (Latin script) available available
kor Hang kore1280 Korean available
ktu Latn kitu1246 Kituba (DRC) available
lao Laoo laoo1244 Lao available
lij Latn geno1240 Ligurian (Genoese) available available
lim Latn limb1263 Limburgish available available
lin Latn ling1263 Lingala available
lit Latn lith1251 Lithuanian available
lmo Latn lomb1257 Lombard issues reported issues reported
ltg Latn east2282 Latgalian available available
ltz Latn luxe1241 Luxembourgish available
lua Latn luba1249 Luba-Kasai available
lug Latn gand1255 Ganda available
luo Latn luok1236 Luo available
lus Latn lush1249 Mizo available
lvs Latn stan1325 Standard Latvian available
mag Deva maga1260 Magahi available available
mai Deva mait1250 Maithili available
mal Mlym mala1464 Malayalam available
mar Deva mara1378 Marathi available
mhr Cyrl gras1239 Meadow Mari partially available
min Arab mina1268 Minangkabau (Jawi script) available
min Latn mina1268 Minangkabau (Latin script) available
mkd Cyrl mace1250 Macedonian available
mlt Latn malt1254 Maltese available
mni Beng mani1292 Meitei (Manipuri, Bengali script) available available
mni Mtei mani1292 Meitei (Manipuri, Meitei script) partially available
mos Latn moss1236 Mossi available
mri Latn maor1246 Maori available available
mya Mymr nucl1310 Burmese available
nld Latn dutc1256 Dutch available
nno Latn norw1262 Norwegian Nynorsk available
nob Latn norw1259 Norwegian Bokmål available
npi Deva nepa1254 Nepali available
nqo Nkoo nkoa1234 Nko available available
nso Latn pedi1238 Northern Sotho available
nus Latn nuer1246 Nuer available available
nya Latn nyan1308 Nyanja available
oci Latn occi1239 Occitan available
ory Orya oriy1255 Odia available
pag Latn pang1290 Pangasinan available
pan Guru panj1256 Eastern Panjabi available
pap Latn papi1253 Papiamento available
pbt Arab sout2649 Southern Pashto available available
pes Arab west2369 Western Persian available
plt Latn plat1254 Plateau Malagasy available
pol Latn poli1260 Polish available
por Latn braz1246 Portuguese (Brazilian) available
prs Arab dari1249 Dari available available
quy Latn ayac1239 Ayacucho Quechua available
ron Latn roma1327 Romanian available
run Latn rund1242 Rundi available
rus Cyrl russ1263 Russian available
sag Latn sang1328 Sango available
san Deva sans1269 Sanskrit available
sat Olck sant1410 Santali available
scn Latn sici1248 Sicilian available available
shn Mymr shan1277 Shan available available
sin Sinh sinh1246 Sinhala available
slk Latn slov1269 Slovak available
slv Latn slov1268 Slovenian available
smo Latn samo1305 Samoan available
sna Latn shon1251 Shona available
snd Arab sind1272 Sindhi (Arabic script) available
snd Deva sind1272 Sindhi (Devanagari script) partially available
som Latn soma1255 Somali available
sot Latn sout2807 Southern Sotho available
spa Latn amer1254 Spanish (Latin American) available
srd Latn sard1257 Sardinian issues reported issues reported
srp Cyrl serb1264 Serbian available
ssw Latn swat1243 Swati available
sun Latn sund1252 Sundanese available
swe Latn swed1254 Swedish available
swh Latn swah1253 Swahili available
szl Latn sile1253 Silesian available available
tam Taml tami1289 Tamil available
taq Latn tama1365 Tamasheq (Latin script) available available
taq Tfng tama1365 Tamasheq (Tifinagh script) available available
tat Cyrl tata1255 Tatar available
tel Telu telu1262 Telugu available
tgk Cyrl taji1245 Tajik available
tha Thai thai1261 Thai available
tir Ethi tigr1271 Tigrinya available
tpi Latn tokp1240 Tok Pisin available
tsn Latn tswa1253 Tswana available
tso Latn tson1249 Tsonga available
tuk Latn turk1304 Turkmen available
tum Latn tumb1250 Tumbuka available
tur Latn nucl1301 Turkish available
twi Latn akua1239 Akuapem Twi available
twi Latn asan1239 Asante Twi available
uig Arab uigh1240 Uyghur available
ukr Cyrl ukra1253 Ukrainian available
umb Latn umbu1257 Umbundu available
urd Arab urdu1245 Urdu available
uzn Latn nort2690 Northern Uzbek available
vec Latn vene1259 Venetian available available
vie Latn viet1252 Vietnamese available
war Latn wara1300 Waray available
wol Latn nucl1347 Wolof available
xho Latn xhos1239 Xhosa available
ydd Hebr east2295 Eastern Yiddish available
yor Latn yoru1245 Yoruba available
yue Hant xian1255 Yue Chinese (Hong Kong Cantonese) available
zgh Tfng stan1324 Standard Moroccan Tamazight available available
zsm Latn stan1306 Standard Malay available
zul Latn zulu1248 Zulu available
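
For readers who want to process the table programmatically, the following is a minimal sketch of how one row could be split into its fields. It assumes the layout seen above: three space-separated codes, then the language name, then one or two status labels. The names `Row`, `STATUSES`, and `parse_row` are hypothetical and not part of any OLDI tooling. Because the plain-text rows do not show which dataset a lone status belongs to, the sketch returns the statuses only as an ordered list.

```python
from typing import List, NamedTuple

# Status labels that appear in the table. Longer labels come first so that
# "partially available" is not mistaken for plain "available".
STATUSES = ("partially available", "issues reported", "available")


class Row(NamedTuple):
    iso_code: str      # assumed to be the ISO 639-3 code, e.g. "ace"
    script: str        # assumed to be the ISO 15924 script code, e.g. "Arab"
    glottocode: str    # e.g. "achi1257"
    name: str          # language name, may contain spaces and parentheses
    statuses: List[str]


def parse_row(line: str) -> Row:
    """Split one table row into codes, language name, and status labels."""
    iso_code, script, glottocode, rest = line.split(maxsplit=3)
    statuses: List[str] = []
    # Peel status labels off the end of the line, right to left.
    while True:
        for status in STATUSES:
            if rest == status or rest.endswith(" " + status):
                statuses.insert(0, status)
                rest = rest[: -len(status)].rstrip()
                break
        else:
            # No known status label remains at the end of the line.
            break
    return Row(iso_code, script, glottocode, rest, statuses)
```

Calling `parse_row("brx Deva bodo1269 Bodo partially available")` yields the name `Bodo` with the single status `partially available`. Collecting the `glottocode` field over all rows into a set is one way to count distinct language varieties rather than rows.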

The size of the table above might give the mistaken impression that these datasets cover a large proportion of the world’s languages. Although OLDI currently supports a large number of languages, they represent only a very small fraction of the languages spoken around the planet today. The estimate below compares the number of languages covered by OLDI datasets with the approximate total number of currently spoken languages (based on Glottolog data).

Language coverage: 2.5% of all languages