Open Language Data Initiative

Dataset card

Description

FLORES+ devtest set in Karakalpak

License

CC-BY-SA-4.0

Attribution

Mukhammadsaid Mamasaidov ([email protected]), Abror Shopulatov ([email protected]).

@inproceedings{wmt24-karakalpak,
    title = "{Open Language Data Initiative}: Advancing Low-Resource Machine Translation for {Karakalpak}",
    author = "Mukhammadsaid Mamasaidov and Abror Shopulatov",
    booktitle = "Proceedings of the Ninth Conference on Machine Translation",
    month = nov,
    year = "2024",
    address = "Miami, USA",
    publisher = "Association for Computational Linguistics"
}

Language codes

ISO 639-3: kaa
ISO 15924: Latn

Additional language information

Karakalpak is a Turkic language of the Kipchak branch. It is spoken primarily in the Republic of Karakalpakstan, an autonomous region within Uzbekistan in Central Asia, and has approximately 900,000 native speakers. It is closely related to Kazakh and Nogai.

The FLORES+ devtest set is translated into the standardized literary variety of Karakalpak, using the most recent Latin script orthography. Karakalpak is written in both Cyrillic and Latin scripts; the Latin script was introduced in 1995 and has been revised several times since, most notably in 2009 and 2016.

Workflow

The FLORES+ devtest dataset, consisting of 1012 sentences, was translated from English and Russian into Karakalpak by two annotators. The translations were then cross-verified to ensure accuracy. The dataset adheres to the most recent iteration of the Latin script orthography for Karakalpak.
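
The two properties stated above, 1012 sentences and Latin script throughout, can be sanity-checked on a local copy of the data. The following is a minimal sketch and not part of the original workflow. It assumes the sentences are stored one per line in a UTF-8 text file; the function names and the `path` argument are hypothetical. The Cyrillic check is a heuristic: a flagged line may still be legitimate, for example if it quotes a name in Cyrillic.

```python
import re

EXPECTED_SENTENCES = 1012  # devtest size stated in this card

# Matches any character in the Cyrillic and Cyrillic Supplement Unicode blocks.
CYRILLIC = re.compile(r"[\u0400-\u052F]")


def find_problems(lines, expected=EXPECTED_SENTENCES):
    """Return a list of human-readable problems found in a list of sentences."""
    problems = []
    if len(lines) != expected:
        problems.append(f"expected {expected} sentences, found {len(lines)}")
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            problems.append(f"line {number}: empty sentence")
        elif CYRILLIC.search(line):
            # Heuristic only: the card states the data is in Latin script.
            problems.append(f"line {number}: contains Cyrillic characters")
    return problems


def check_file(path):
    """Read one sentence per line from a UTF-8 file (assumed layout) and check it."""
    with open(path, encoding="utf-8") as handle:
        lines = [line.rstrip("\n") for line in handle]
    return find_problems(lines)
```

An empty list from `check_file` means the file has the expected number of non-empty lines and none of them contains Cyrillic characters.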