Open Language Data Initiative

Contribution Guidelines

Thank you for your interest in contributing to the OLDI datasets! To make your contribution, please follow the steps below and see the rest of this page for more information.

Questions? Email the organisers at [email protected] or join our Discord server.


Types of contribution

OLDI accepts two main types of contribution:

  1. Fixes to existing data: corrections to incorrect or incomplete existing translations.
  2. Completely new translations: this typically involves starting from the original English data and having it translated by qualified native speakers of the target language.

In either case, please get in touch with the organisers to discuss your contribution before starting work. This ensures nobody else is already working on the same task and allows the community to better coordinate work.

Language codes

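We use standardized language codes throughout OLDI, made up of three parts separated by underscores:

  1. an ISO 639-3 language code;
  2. an ISO 15924 script code;
  3. a Glottocode identifying the language variety.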

Example: apc_Arab_sout3123 is South Levantine Arabic written in the Arabic script.
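
As a rough illustration, a code of this shape can be split and sanity-checked with a few lines of Python. This is only a sketch: the function name and the exact pattern (three lowercase letters, a four-letter capitalised script tag, then four alphanumeric characters followed by four digits) are assumptions inferred from the example above, not an official OLDI validator, and the check says nothing about whether the parts are actually registered codes.

```python
import re

# Assumed shape, inferred from the example "apc_Arab_sout3123":
#   language: 3 lowercase letters
#   script:   1 uppercase letter + 3 lowercase letters
#   variety:  4 lowercase alphanumerics + 4 digits
CODE_RE = re.compile(r"([a-z]{3})_([A-Z][a-z]{3})_([a-z0-9]{4}[0-9]{4})")


def parse_language_code(code):
    """Split a language code into its three parts (hypothetical helper)."""
    match = CODE_RE.fullmatch(code)
    if match is None:
        raise ValueError(f"not a three-part language code: {code!r}")
    language, script, variety = match.groups()
    return {"language": language, "script": script, "variety": variety}


if __name__ == "__main__":
    print(parse_language_code("apc_Arab_sout3123"))
```

A check like this only catches formatting slips such as a missing underscore or wrong capitalisation; the organisers remain the authority on which code applies to a given language variety.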

Dataset card

For new data, we collect precise information about the language variety, the quality assurance workflow and the translation workflow. We provide a dataset card template which should be filled out as fully as possible and submitted with the data.

Translation guidelines

All translators who will be contributing data should acknowledge and abide by our translation guidelines (also available in markdown format). These ensure consistent and high-quality translations. In particular, please note that some machine translation services (including DeepL, Google Translate, ChatGPT, Gemini and Claude) prohibit the use of their output for training other translation or AI models, so their use is not permitted.

Delivery

The OLDI datasets are hosted on Hugging Face. To deliver your contribution, submit a pull request to the appropriate repository. Make sure to include both the data and the dataset card!

By contributing to OLDI, you agree to the Developer Certificate of Origin (DCO). This document was created by the Linux Kernel community and is a simple statement that you, as a contributor, have the legal right to make the contribution. To show your agreement with the DCO, include the following line at the end of each commit message (substituting your real name): Signed-off-by: John Doe <[email protected]>. The -s flag of the git commit command adds this line for you.
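
Before opening a pull request, it can help to confirm that a commit message really ends with a sign-off line. The sketch below is one possible check, not part of any OLDI tooling: the function name, the pattern, and the sample name and address are all hypothetical, and it only inspects the last non-empty line of the message.

```python
import re

# Assumed form of the sign-off line: "Signed-off-by: Name <address>"
SIGN_OFF_RE = re.compile(r"Signed-off-by: .+ <[^<>\s]+@[^<>\s]+>")


def has_sign_off(commit_message):
    """Return True if the last non-empty line is a sign-off line."""
    lines = [line.strip() for line in commit_message.strip().splitlines()]
    return bool(lines) and SIGN_OFF_RE.fullmatch(lines[-1]) is not None


if __name__ == "__main__":
    # Placeholder contributor details, for illustration only.
    message = "Fix typos in translations\n\nSigned-off-by: Jane Doe <jane@example.org>\n"
    print(has_sign_off(message))
```

In practice, committing with git commit -s makes a check like this unnecessary, since git appends the line using your configured name and email.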

Congratulations! You are now an OLDI contributor!