Thank you for your interest in contributing to the OLDI datasets.
Whether you plan to contribute fixes to existing data or completely new translations, please open an issue in the relevant dataset repository to ensure that nobody else is already working on the same task. This also allows the community to coordinate work better.
Typically, contributing a new translation involves starting from the original English data and having it translated by qualified, native speakers of the target language. We also ask contributors to fill out a data card containing details about the language being targeted, as well as information on how the translation was carried out. Please see the next section for more information.
Language data release checklist
Here are the steps which should be followed for all new contributions:
- Open an issue in the relevant dataset repository to discuss your proposed contribution.
- Identify the right language code for your contribution.
- Fill out a data card.
- Ensure that all translators have acknowledged the contents of the translation guidelines and data card.
- Deliver the data.
We make use of standardized language codes such as eng_Latn. These codes are used throughout our projects to identify languages, as well as in filenames to indicate the language of their contents:
- Language codes consist of two mandatory parts and an optional third part:
  - A language subtag.
  - A script subtag.
  - An optional language variant subtag.
- For the language subtag, we use ISO 639-3 language codes. Macrolanguage codes must not be used if a more specific code exists, e.g. please use wuu, etc. rather than the generic, uninformative macrolanguage code zho.
- For the script subtag, we use ISO 15924 script codes.
- The language variant subtag is optional and can be used to disambiguate cases where multiple variants are present in the OLDI datasets. This can be a geographic code (ISO 3166-1 alpha-2 for countries or UN M.49 for regions), an IANA variant subtag, or a Glottocode.
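As a rough illustration of the rules above, the following Python sketch checks only the shape of a language code. It is not an official OLDI tool. The function name `parse_language_code`, the `LanguageCode` type, the use of an underscore before the variant subtag, and the loose 2 to 8 character variant pattern are all assumptions made for this example. The macrolanguage set is a small, non-exhaustive sample, and the sketch does not check subtags against the actual ISO 639-3 or ISO 15924 registries.

```python
import re
from typing import NamedTuple, Optional

# Shape-only pattern. Assumes subtags are joined by underscores, as in eng_Latn.
_CODE_RE = re.compile(
    r"(?P<language>[a-z]{3})"                # ISO 639-3: three lowercase letters
    r"_(?P<script>[A-Z][a-z]{3})"            # ISO 15924: four letters, title case
    r"(?:_(?P<variant>[A-Za-z0-9]{2,8}))?"   # optional variant (assumed loose shape)
)

# Illustrative, non-exhaustive sample of ISO 639-3 macrolanguage codes.
MACROLANGUAGES = {"zho", "ara", "msa"}


class LanguageCode(NamedTuple):
    language: str
    script: str
    variant: Optional[str]


def parse_language_code(code: str) -> LanguageCode:
    """Split a code such as 'eng_Latn' into subtags, or raise ValueError."""
    match = _CODE_RE.fullmatch(code)
    if match is None:
        raise ValueError(f"not a well-formed language code: {code!r}")
    if match["language"] in MACROLANGUAGES:
        raise ValueError(
            f"{match['language']!r} is a macrolanguage code; use a more specific code"
        )
    return LanguageCode(match["language"], match["script"], match["variant"])
```

For example, `parse_language_code("eng_Latn")` yields the subtags `eng` and `Latn` with no variant, while a code that starts with `zho` is rejected by this sketch because a more specific code should be used instead.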
For new data, we collect precise information about the targeted language and dialect, as well as the translation workflow that was followed. Please use the following Markdown template to provide this information.
These translation guidelines must be acknowledged by all translators who will be contributing data.
Your translations will be used to help train or evaluate Machine Translation engines. For this reason, this project requires Human Translation.
- You will be translating sentences coming from different sources. In some cases, a link to the source document might be provided to give you more context. If available, please refer to it.
- Do not convert any units of measurement. Translate them exactly as noted in the source content.
- When translating, please maintain the same tone used in the source document. For example, encyclopedic content coming from sources like Wikipedia should be translated using a formal tone.
- Provide fluent translations without deviating too much from the source structure. Only allow necessary changes.
- Do not expand or replace information compared to what is present in the source documents. Do not add any explanatory or parenthetical information, definitions, etc.
- Do not ignore any meaningful text that was present in the source.
- In case of multiple possible translations, please pick the one that makes the most sense (e.g., for gender concordance, cultural fit in the target language, level of formality, etc.).
- Translations must be faithful to the source in terms of pragmatics such as (if applicable) level of hedging/modality, sentiment and its intensity, negation, speech effects (disfluencies), etc.
- For proper nouns and common abbreviations, please see the guidelines on Named Entities below.
- Idiomatic expressions should not be translated word for word. Use an equivalent idiom, if one exists. If no equivalent idiom exists, use an idiom of similar meaning. If no similar expressions exist in the target language, paraphrase the idiom such that the meaning is retained in the target language.
- When a pronoun to be translated is ambiguous (for instance, when it could be interpreted as either him/her or he/she), opt for gender neutral pronouns (such as them/they) if those exist in the target language. However, when a pronoun to be translated is clearly marked for gender, you should follow the source material and continue to mark for gender.
Named Entities are people, places, organisations, etc., that are commonly referred to using a proper noun. This section provides guidance on how to handle Named Entities. Please review the following guidelines carefully:
If there is a commonly used term in the target language for the Named Entity:
- If the most commonly used term is the same as in the source language, then keep it as it is.
- If the most commonly used term is a translation or a transliteration, then use that.
If there is no commonly used term:
- If possible, a transliteration of the original term should be used.
- If a transliteration would not be commonly understood in the context, and the source term would be more acceptable, you may retain the original term.