
Dataset card

Description

FLORES+ devtest set in Halh Mongolian (traditional Mongolian script).

License

CC-BY-SA-4.0

Attribution

The dataset is part of the MiLiC-Eval benchmark proposed in the following paper:

@inproceedings{zhang-etal-2025-milic,
    title = "{M}i{L}i{C}-Eval: Benchmarking Multilingual {LLM}s for {C}hina{'}s Minority Languages",
    author = "Zhang, Chen  and
      Tao, Mingxu  and
      Liao, Zhiyuan  and
      Feng, Yansong",
    editor = "Che, Wanxiang  and
      Nabende, Joyce  and
      Shutova, Ekaterina  and
      Pilehvar, Mohammad Taher",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-acl.578/",
    doi = "10.18653/v1/2025.findings-acl.578",
    pages = "11086--11102",
    ISBN = "979-8-89176-256-5"
}

We also thank the following contributors for their help with data verification and correction: Saihantaoli, Sulde, Wang Xingjun.

Language codes

ISO 639-3: khk
ISO 15924: Mong
Glottocode: halh1238

Additional language information

Most existing NLP resources for Mongolian are written in the Cyrillic script, which is primarily used in Mongolia. This contribution instead provides the data in the traditional Mongolian script, which is the official writing system in Inner Mongolia, China, and is also being actively promoted by the government of Mongolia.

Workflow

We first transliterate the existing Cyrillic-script Halh Mongolian data in FLORES+ into the traditional Mongolian script using a transliteration tool developed by Inner Mongolia University (https://ai.nmgoyun.com/#/AI/transform). The transliterated outputs are then manually verified and corrected by native speakers. At present, this process has been completed for the 1,012 sentences in the devtest split.
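As a rough illustration of the kind of sanity check that can accompany this workflow, the sketch below compares a Cyrillic-script split with its traditional-script counterpart. The function name `check_parallel` and the in-memory list inputs are assumptions made for this example; they are not part of FLORES+ or of the transliteration tool.

```python
# Sentence count for the devtest split, as stated in this card.
EXPECTED_DEVTEST_SIZE = 1012


def check_parallel(cyrillic, mongolian, expected=EXPECTED_DEVTEST_SIZE):
    """Return a list of problems found; an empty list means the split looks consistent.

    `cyrillic` and `mongolian` are lists of sentences (hypothetical inputs,
    e.g. loaded one sentence per line from two aligned files).
    """
    problems = []
    # Both scripts should cover the same sentences in the same order.
    if len(cyrillic) != len(mongolian):
        problems.append(
            f"length mismatch: {len(cyrillic)} Cyrillic vs {len(mongolian)} Mongolian"
        )
    # The traditional-script split should have the expected number of sentences.
    if len(mongolian) != expected:
        problems.append(f"expected {expected} sentences, found {len(mongolian)}")
    # A blank line usually means a sentence was dropped during processing.
    for i, sentence in enumerate(mongolian):
        if not sentence.strip():
            problems.append(f"empty sentence at index {i}")
    return problems
```

A check like this only catches structural slips such as dropped or blank lines; it does not replace the manual verification by native speakers described above.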

Some of the transliterated texts may contain named entities in Latin script (e.g., "National Aeronautics and Space Administration" or "Jai Shankar Choudhary"). Occasionally, the texts in the two scripts do not match exactly: for example, the Cyrillic version renders "National Aeronautics and Space Administration" in its abbreviated form, "НАСА", whereas the traditional-script version uses the full Latin name.
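Users who need to locate such Latin-script named entities could scan each sentence for runs of Latin letters. The following is a minimal sketch, not a tool shipped with the dataset: the helper names `latin_spans` and `count_mongolian` are hypothetical, the regular expression only covers basic ASCII letters, and the Mongolian check relies on the Unicode Mongolian block (U+1800 to U+18AF).

```python
import re

# A run of ASCII letters, optionally joined by spaces, periods, apostrophes
# or hyphens, that starts and ends with a letter; or a single letter.
LATIN_RUN = re.compile(r"[A-Za-z][A-Za-z .'\-]*[A-Za-z]|[A-Za-z]")


def latin_spans(text):
    """Return the Latin-script runs found in `text`, in order of appearance."""
    return [match.group(0) for match in LATIN_RUN.finditer(text)]


def count_mongolian(text):
    """Count characters that fall in the Unicode Mongolian block (U+1800-U+18AF)."""
    return sum(1 for ch in text if 0x1800 <= ord(ch) <= 0x18AF)


if __name__ == "__main__":
    # Placeholder Mongolian letters around a Latin-script entity.
    sample = "\u1820\u1821 National Aeronautics and Space Administration \u1822"
    print(latin_spans(sample))
    print(count_mongolian(sample))
```

Sentences for which `latin_spans` returns a non-empty list are candidates for the script mismatch described above and can be compared against the Cyrillic version by hand.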

Open Language Data Initiative