Dataset card
Description
FLORES+ devtest set in Karakalpak
License
CC-BY-SA-4.0
Attribution
Mukhammadsaid Mamasaidov, Abror Shopulatov.
@inproceedings{wmt24-karakalpak,
  title = "{Open Language Data Initiative}: Advancing Low-Resource Machine Translation for {Karakalpak}",
  author = "Mukhammadsaid Mamasaidov and Abror Shopulatov",
  booktitle = "Proceedings of the Ninth Conference on Machine Translation",
  month = nov,
  year = "2024",
  address = "Miami, USA",
  publisher = "Association for Computational Linguistics"
}
Language codes
- ISO 639-3: kaa
- ISO 15924: Latn
- Glottocode: kara1467
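As a minimal sketch, the codes above can be combined into a FLORES-style language tag. It assumes the usual convention of joining the ISO 639-3 code and the ISO 15924 script code with an underscore; the `LanguageCodes` class and its `flores_tag` method are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LanguageCodes:
    """Identifiers listed in this card (hypothetical helper class)."""

    iso_639_3: str
    iso_15924: str
    glottocode: str

    def flores_tag(self) -> str:
        # Assumed convention: "<ISO 639-3>_<ISO 15924>", as in "eng_Latn".
        return f"{self.iso_639_3}_{self.iso_15924}"


KARAKALPAK = LanguageCodes(iso_639_3="kaa", iso_15924="Latn", glottocode="kara1467")

print(KARAKALPAK.flores_tag())  # kaa_Latn
```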
Additional language information
Karakalpak is a Turkic language of the Kipchak branch. It is spoken primarily in the Republic of Karakalpakstan, an autonomous region within Uzbekistan in Central Asia, and has approximately 900,000 native speakers. It is closely related to Kazakh and Nogai.
The FLORES+ devtest set was translated into the standardized literary variety of Karakalpak using the most recent Latin script orthography. Karakalpak is written in both Cyrillic and Latin scripts; the Latin script was introduced in 1995 and has been revised several times since, most notably in 2009 and 2016.
Workflow
The FLORES+ devtest dataset, consisting of 1012 sentences, was translated from English and Russian into Karakalpak by two annotators. The translations were then cross-verified to ensure accuracy. The dataset adheres to the most recent iteration of the Latin script orthography for Karakalpak.
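The sentence count stated above can be checked with a short sanity test. This is a sketch only: it assumes the devtest set is available locally as a UTF-8 text file with one sentence per line, and the file name `devtest.kaa_Latn` in the usage note is hypothetical, as are the function names.

```python
from pathlib import Path

# Size of the devtest set as stated in this card.
EXPECTED_DEVTEST_SENTENCES = 1012


def count_sentences(path):
    """Count non-empty lines in a one-sentence-per-line UTF-8 text file."""
    with Path(path).open(encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


def check_devtest(path, expected=EXPECTED_DEVTEST_SENTENCES):
    """Return the sentence count, or raise ValueError if it differs from `expected`."""
    found = count_sentences(path)
    if found != expected:
        raise ValueError(f"{path}: expected {expected} sentences, found {found}")
    return found
```

Calling `check_devtest("devtest.kaa_Latn")` on a local copy would return 1012 if the file is complete and raise a `ValueError` otherwise.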