Open Language Data Initiative


Dataset card

Description

Open-source Seed dataset in Bangla/Bengali (Dhaka Bangla).

License

CC-BY-SA-4.0

Attribution

@inproceedings{wmt24-seed-bangla,
    title = "The {Bangla/Bengali} Seed Dataset Submission to the {WMT24} Open Language Data Initiative Shared Task",
    author = "Firoz Ahmed and Nitin Venkateswaran and Sarah Moeller",
    booktitle = "Proceedings of the Ninth Conference on Machine Translation",
    month = nov,
    year = "2024",
    address = "Miami, USA",
    publisher = "Association for Computational Linguistics"
}

Language codes

Additional language information

The data is in the Dhaka dialect of Bangla/Bengali.

Reference grammar: David, Anne Boyle (2015), Descriptive Grammar of Bangla, De Gruyter Mouton, ISBN 9781614512295.

Workflow

The data was translated from the English sentences of the Seed dataset maintained by the Open Language Data Initiative. All 6,193 English sentences were translated into Bangla by a single native speaker. The translator is highly proficient in English (C2 level of the Common European Framework of Reference for Languages), has professional translation experience, and holds a university degree in Linguistics from the United States.
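
Because the translation is sentence-for-sentence, a consumer of the data may want to confirm that the English and Bangla sides line up before using them as parallel text. The following is a minimal sketch, not part of the dataset's own tooling. It assumes each side is stored as a UTF-8 text file with one sentence per line; this card does not specify the file format, and the function names and any file paths passed to them are hypothetical.

```python
EXPECTED_SENTENCES = 6193  # sentence count stated in this card


def read_sentences(path):
    """Read a UTF-8 file with one sentence per line (assumed layout)."""
    # newline="\n" keeps Python from splitting on anything but line feeds.
    with open(path, encoding="utf-8", newline="\n") as handle:
        return [line.rstrip("\r\n") for line in handle]


def check_alignment(source_lines, target_lines, expected=EXPECTED_SENTENCES):
    """Return a list of problems; an empty list means the pair looks aligned."""
    problems = []
    if len(source_lines) != expected:
        problems.append(f"source has {len(source_lines)} lines, expected {expected}")
    if len(target_lines) != expected:
        problems.append(f"target has {len(target_lines)} lines, expected {expected}")
    # Flag source sentences whose translation line is blank.
    for number, (src, tgt) in enumerate(zip(source_lines, target_lines), start=1):
        if src.strip() and not tgt.strip():
            problems.append(f"line {number}: empty translation")
    return problems
```

A caller would pass the two lists returned by `read_sentences` to `check_alignment` and treat any non-empty result as a reason to inspect the files before training or evaluation.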