Contribution Guidelines

Thank you for your interest in contributing to the OLDI datasets! If you haven’t yet read through our Getting Started page, please start there.

To make your contribution, please follow the steps below and see the rest of this page for more information.

1. Decide on your type of contribution, then contact the organisers to discuss. For corpora that are externally managed, such as SMOL and WMT24++, please make sure to coordinate with the original authors too.
2. Identify the right language code.
3. Fill out a dataset card and ensure it is acknowledged by all participants.
4. For corpora managed by OLDI, ensure all translators have acknowledged our translation guidelines.
5. Deliver the data and dataset card!

Questions? Email the organisers at [email protected] or join our Discord server.

Types of contribution

OLDI accepts two main types of contribution:

Fixes to existing data: in case of incorrect or incomplete existing translations.
Completely new translations: these typically involve starting from the original English data and having it translated by qualified native speakers of the target language.

In either case, please get in touch with the organisers to discuss your contribution before starting work. This ensures nobody else is already working on the same task and allows the community to better coordinate work.

Some datasets, such as SMOL and WMT24++, are managed externally from OLDI. If your contribution is to an externally managed dataset, please coordinate with its original authors so that work is not duplicated and any project-specific workflows and guidelines are followed.

Language codes

We use standardized language codes throughout OLDI, made up of three parts separated by underscores:

A language subtag: an ISO 639-3 language code. Macrolanguage codes must not be used if a more specific code exists: e.g. please use cmn, yue, wuu, etc. rather than zho.
A script subtag: an ISO 15924 script code.
A language variety subtag: a Glottocode identifying the specific language variety. Glottocodes have the advantages of being stable and of allowing granular language identification.

Example: apc_Arab_sout3123 is South Levantine Arabic written in the Arabic script.
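
As a rough illustration of the three-part format described above, the following Python sketch checks the shape of a code and splits it into its subtags. It is not an official OLDI tool: the names parse_language_code, LanguageCode, CODE_PATTERN and MACROLANGUAGES are hypothetical, the pattern only checks the form of each subtag (three lowercase letters, a four-letter title-case script code, and a Glottocode of four alphanumeric characters plus four digits), and it does not verify that a code is actually registered. The macrolanguage set contains only zho, the example given above, and would need extending.

```python
import re
from typing import NamedTuple

# Shape-only pattern: language subtag, script subtag and Glottocode,
# separated by underscores. It does not check that the codes are registered.
CODE_PATTERN = re.compile(
    r"(?P<language>[a-z]{3})_"
    r"(?P<script>[A-Z][a-z]{3})_"
    r"(?P<variety>[a-z0-9]{4}[0-9]{4})"
)

# Assumption: illustrative and incomplete. Only the macrolanguage code
# mentioned in the guidelines is listed here.
MACROLANGUAGES = {"zho"}


class LanguageCode(NamedTuple):
    language: str  # ISO 639-3 language code
    script: str    # ISO 15924 script code
    variety: str   # Glottocode


def parse_language_code(code: str) -> LanguageCode:
    """Split a code such as 'apc_Arab_sout3123' into its three subtags."""
    match = CODE_PATTERN.fullmatch(code)
    if match is None:
        raise ValueError(f"not in language_Script_glottocode form: {code!r}")
    parsed = LanguageCode(**match.groupdict())
    if parsed.language in MACROLANGUAGES:
        raise ValueError(
            f"{parsed.language!r} is a macrolanguage code; use a more specific code"
        )
    return parsed


print(parse_language_code("apc_Arab_sout3123"))
# LanguageCode(language='apc', script='Arab', variety='sout3123')
```

A check like this only catches formatting slips such as hyphens instead of underscores or a lowercase script code; choosing the correct code for a language variety still needs to be done by hand.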

Dataset card

For new data, we collect precise information about the language variety, the quality assurance workflow and the translation workflow. We provide a dataset card template which should be filled out as fully as possible and submitted with the data.

Translation guidelines

All translators who will be contributing data to translation datasets managed by OLDI should acknowledge and abide by our translation guidelines (also available in markdown format). These ensure consistent and high-quality translations. In particular, please note that some machine translation services (including DeepL, Google Translate, ChatGPT, Gemini and Claude) prohibit the use of their output for training other translation or AI models, so using these services to produce translations is not permitted.

Delivery

The OLDI datasets are hosted on HuggingFace. To deliver your contribution, submit a pull request to the appropriate repository. Make sure to include both the data and the dataset card!

By contributing to OLDI, you agree to the Developer Certificate of Origin (DCO). This document was created by the Linux kernel community and is a simple statement that you, as a contributor, have the legal right to make the contribution. To show your agreement with the DCO, include the following line at the end of each commit message (substituting your real name): Signed-off-by: John Doe <[email protected]>. This can be done easily using the -s flag on the git commit command.
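
Before opening a pull request, it can help to confirm that a commit message really ends with a sign-off line. The Python sketch below is one possible check, not part of the OLDI tooling: the names SIGN_OFF and has_sign_off are hypothetical, the address in the usage example is a placeholder, and the check is deliberately simple in that it only looks at the last non-empty line of the message.

```python
import re

# Matches a line of the form "Signed-off-by: Name <address>".
SIGN_OFF = re.compile(r"^Signed-off-by: [^<>]+ <[^<>\s]+@[^<>\s]+>$")


def has_sign_off(commit_message: str) -> bool:
    """Return True if the last non-empty line is a Signed-off-by line."""
    lines = [line.strip() for line in commit_message.strip().splitlines()]
    return bool(lines) and SIGN_OFF.match(lines[-1]) is not None


# Placeholder name and address, for illustration only.
message = "Fix typos in translations\n\nSigned-off-by: Jane Doe <jane@example.com>\n"
print(has_sign_off(message))              # True
print(has_sign_off("Fix typos in translations"))  # False
```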

The delivery of contributions to datasets not managed by OLDI – such as SMOL and WMT24++ – should be discussed with the respective dataset authors.

Congratulations! You are now an OLDI contributor!