Dataset card
Description
FLORES+ devtest set in Halh Mongolian (traditional Mongolian script).
License
CC-BY-SA-4.0
Attribution
The dataset is part of the MiLiC-Eval benchmark proposed in the following paper:
@inproceedings{zhang-etal-2025-milic,
title = "{M}i{L}i{C}-Eval: Benchmarking Multilingual {LLM}s for {C}hina{'}s Minority Languages",
author = "Zhang, Chen and
Tao, Mingxu and
Liao, Zhiyuan and
Feng, Yansong",
editor = "Che, Wanxiang and
Nabende, Joyce and
Shutova, Ekaterina and
Pilehvar, Mohammad Taher",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.findings-acl.578/",
doi = "10.18653/v1/2025.findings-acl.578",
pages = "11086--11102",
ISBN = "979-8-89176-256-5"
}
We also thank the following contributors for their help with data verification and correction: Saihantaoli, Sulde, Wang Xingjun.
Language codes
- ISO 639-3: khk
- ISO 15924: Mong
- Glottocode: halh1238
Additional language information
Most existing NLP resources for Mongolian are written in the Cyrillic script, which is primarily used in Mongolia. This contribution provides the data in the traditional Mongolian script, which is the official writing system in Inner Mongolia, China, and is also being actively promoted by the government of Mongolia.
Workflow
We first transliterate the existing Cyrillic-script Halh Mongolian data in FLORES+ into the traditional Mongolian script using a transliteration tool developed by Inner Mongolia University (https://ai.nmgoyun.com/#/AI/transform). The transliterated outputs are then manually verified and corrected by native speakers. At present, this process has been completed for the 1,012 sentences in the devtest split.
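As an illustration of the kind of sanity check that can accompany such a workflow, the following minimal Python sketch compares a Cyrillic-script split with its transliterated counterpart. It is not the procedure the contributors used; the function name `check_parallel` and the specific checks are assumptions made for this example, and only the expected count of 1,012 sentences comes from this card.

```python
def check_parallel(cyrillic_lines, mongolian_lines, expected=1012):
    """Return a list of problems found; an empty list means the basic checks passed.

    Hypothetical helper: it only checks line counts and obviously bad outputs,
    and cannot judge whether a transliteration is linguistically correct.
    """
    problems = []
    if len(cyrillic_lines) != expected:
        problems.append(
            f"Cyrillic side has {len(cyrillic_lines)} lines, expected {expected}"
        )
    if len(mongolian_lines) != expected:
        problems.append(
            f"Mongolian-script side has {len(mongolian_lines)} lines, expected {expected}"
        )
    # Compare sentence pairs position by position (1-based line numbers).
    for i, (src, tgt) in enumerate(zip(cyrillic_lines, mongolian_lines), start=1):
        if not tgt.strip():
            problems.append(f"line {i}: empty transliteration")
        elif tgt.strip() == src.strip():
            problems.append(f"line {i}: output identical to Cyrillic input")
    return problems
```

The two arguments are plain lists of sentences, one per line, read from whichever files hold the two versions of the split. Manual verification by native speakers, as described above, remains the actual quality control step.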
Some of the transliterated texts may contain named entities in Latin script (e.g., "National Aeronautics and Space Administration" or "Jai Shankar Choudhary"). Occasionally, the texts in the two scripts do not match exactly. For example, the Cyrillic version renders "National Aeronautics and Space Administration" in its abbreviated form, "НАСА", whereas the version in Mongolian script uses the full Latin name.
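Users who want to locate such Latin-script entities can scan each sentence for Latin letters. The sketch below is one possible approach using only the Python standard library; the helper names `script_of` and `latin_spans` are hypothetical, and the script label is derived from the Unicode character name, which is a coarse heuristic rather than a full script property lookup.

```python
import unicodedata


def script_of(ch):
    """Return a coarse script label for one character, or None for non-letters."""
    if not ch.isalpha():
        return None
    name = unicodedata.name(ch, "")
    for script in ("MONGOLIAN", "CYRILLIC", "LATIN"):
        if name.startswith(script):
            return script.capitalize()
    return "Other"


def latin_spans(sentence):
    """Collect maximal runs of whitespace-separated tokens that contain Latin letters."""
    spans, current = [], []
    for token in sentence.split():
        if any(script_of(ch) == "Latin" for ch in token):
            current.append(token)
        elif current:
            # A non-Latin token ends the current run.
            spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans
```

Punctuation attached to a token is kept in the returned span, so a name at the end of a sentence may carry a trailing period. Applying `latin_spans` to every sentence in the split gives a list of candidates to compare against the Cyrillic version, where the same entity may appear in a different form.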