Contribution Guidelines
Thank you for your interest in contributing to the OLDI datasets! To make your contribution, please follow the steps below and see the rest of this page for more information.
- Decide on your type of contribution, then contact the organisers to discuss.
- Identify the right language code.
- Fill out a dataset card and ensure it is acknowledged by all participants.
- Ensure all translators have acknowledged the translation guidelines.
- Deliver the data and dataset card!
Questions? Email the organisers at [email protected] or join our Discord server.
Types of contribution
OLDI accepts two main types of contribution:
- Fixes to existing data: for cases where existing translations are incorrect or incomplete.
- Completely new translations: this typically involves starting from the original English data and having it translated by qualified native speakers of the target language.
In either case, please get in touch with the organisers to discuss your contribution before starting work. This ensures nobody else is already working on the same task and allows the community to better coordinate work.
Language codes
We use standardized language codes throughout OLDI, made up of three parts separated by underscores:
- A language subtag: an ISO 639-3 language code. Macrolanguage codes must not be used if a more specific code exists: e.g. please use cmn, yue, wuu, etc. rather than zho.
- A script subtag: an ISO 15924 script code.
- A language variety subtag: a Glottocode identifying the specific language variety. These have the advantages of being stable and of allowing granular language identification.
Example: apc_Arab_sout3123 is South Levantine Arabic written in the Arabic script.
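To sanity-check a code before submitting, the three-part format can be sketched in a few lines of Python. This is only an illustrative check, not an official OLDI tool: the function name `parse_language_code` is hypothetical, the regular expression assumes the usual surface shapes of the three standards (three lowercase letters, a four-letter title-cased script code, and a Glottocode of four alphanumerics followed by four digits), and the macrolanguage set contains only the one example named above. It does not verify that a code is actually registered.

```python
import re

# Assumed surface shapes (not a lookup against the real registries):
#   ISO 639-3 language code: three lowercase letters, e.g. "apc"
#   ISO 15924 script code:   four letters, first one capitalised, e.g. "Arab"
#   Glottocode:              four lowercase alphanumerics + four digits, e.g. "sout3123"
CODE_RE = re.compile(r"([a-z]{3})_([A-Z][a-z]{3})_([a-z0-9]{4}[0-9]{4})")

# Illustrative and deliberately incomplete: only the macrolanguage
# mentioned in the guidelines is listed here.
KNOWN_MACROLANGUAGES = {"zho"}


def parse_language_code(code: str) -> dict:
    """Split an OLDI-style code into its three subtags, or raise ValueError."""
    match = CODE_RE.fullmatch(code)
    if match is None:
        raise ValueError(f"{code!r} does not look like language_Script_glottocode")
    language, script, variety = match.groups()
    if language in KNOWN_MACROLANGUAGES:
        raise ValueError(f"{language!r} is a macrolanguage code; use a more specific one")
    return {"language": language, "script": script, "variety": variety}


if __name__ == "__main__":
    print(parse_language_code("apc_Arab_sout3123"))
```

A code that passes this check is well-formed only; whether it is the right code for your language variety is still something to confirm with the organisers.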
Dataset card
For new data, we collect precise information about the language variety, the quality assurance workflow and the translation workflow. We provide a dataset card template which should be filled out as fully as possible and submitted with the data.
Translation guidelines
All translators who will be contributing data should acknowledge and abide by our translation guidelines (also available in markdown format). These ensure consistent and high-quality translations. In particular, please note that some machine translation services (including DeepL, Google Translate, ChatGPT, Gemini and Claude) prohibit the use of their output for training other translation or AI models, so their use is not permitted.
Delivery
The OLDI datasets are hosted on HuggingFace. To deliver your contribution, submit a pull request to the appropriate repository. Make sure to include both the data and the dataset card!
By contributing to OLDI, you agree to the Developer Certificate of Origin (DCO). This document was created by the Linux Kernel community and is a simple statement that you, as a contributor, have the legal right to make the contribution. To show your agreement with the DCO, include the following line at the end of your commit message (substituting your real name): Signed-off-by: John Doe <[email protected]>. This can be done easily using the -s flag on the git commit command.
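If you want to double-check a commit message before opening the pull request, the sign-off requirement can be sketched as a small Python check. This is a hypothetical helper, not part of OLDI or git: the name `has_signoff` and the sample address `jane@example.org` are made up for illustration, and the pattern only tests that the last non-empty line has the `Signed-off-by: Name <email>` shape.

```python
import re

# Shape of a sign-off trailer: "Signed-off-by: Some Name <user@host>".
# This is a loose illustrative pattern, not a full email validator.
SIGNOFF_RE = re.compile(r"Signed-off-by: \S.* <[^<>\s]+@[^<>\s]+>")


def has_signoff(message: str) -> bool:
    """Return True if the last non-empty line of the message is a sign-off."""
    lines = [line.strip() for line in message.splitlines() if line.strip()]
    return bool(lines) and SIGNOFF_RE.fullmatch(lines[-1]) is not None


if __name__ == "__main__":
    # Hypothetical commit message with a made-up contributor.
    message = "Fix mistranslated sentence\n\nSigned-off-by: Jane Doe <jane@example.org>\n"
    print(has_signoff(message))
```

In practice, committing with the -s flag adds the line for you, so a check like this is only a safety net.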
Congratulations! You are now an OLDI contributor!