Open Language Data Initiative

Contribution Guidelines

Thank you for your interest in contributing to the OLDI datasets!

To ensure high-quality contributions, please follow the steps in the checklist below and see the rest of this page for more information.

Language data release checklist

Types of contribution

There are three main types of contributions:

  1. Fixes to existing data: corrections to incorrect or incomplete existing translations.
  2. Completely new translations: typically involves starting from the original English data and having it translated by qualified, native speakers of the target language (see translation guidelines).
  3. Other contributions: for example, new monolingual datasets (see monolingual contribution guidelines).

In each case, before starting work please make sure to email the organisers at [email protected]. This ensures nobody else is already working on the same task and allows the community to better coordinate work.

Language codes

We use standardized language codes throughout OLDI. These are made up of three parts, separated by underscores:

  1. An ISO 639-3 language code.
  2. An ISO 15924 script code.
  3. A Glottocode.

Example: apc_Arab_sout3123 is South Levantine Arabic written in the Arabic script.
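As a rough illustration, a code of this form can be split and checked with a short Python sketch. It assumes the three parts are an ISO 639-3 code (three lowercase letters), an ISO 15924 script code (four letters, first one capitalised) and a Glottocode (four lowercase alphanumeric characters followed by four digits). The names `LanguageCode`, `parse_language_code` and `format_language_code` are hypothetical and not part of any OLDI tooling, and the pattern only checks the shape of a code, not whether it is actually registered.

```python
import re
from typing import NamedTuple

# Assumed shapes (not an official OLDI validator):
#   ISO 639-3  -> three lowercase letters
#   ISO 15924  -> four letters, first one uppercase
#   Glottocode -> four lowercase alphanumerics followed by four digits
CODE_PATTERN = re.compile(r"([a-z]{3})_([A-Z][a-z]{3})_([a-z0-9]{4}[0-9]{4})")


class LanguageCode(NamedTuple):
    iso_639_3: str
    iso_15924: str
    glottocode: str


def parse_language_code(code: str) -> LanguageCode:
    """Split a code such as 'apc_Arab_sout3123' into its three parts."""
    match = CODE_PATTERN.fullmatch(code)
    if match is None:
        raise ValueError(f"not a well-formed language code: {code!r}")
    return LanguageCode(*match.groups())


def format_language_code(parts: LanguageCode) -> str:
    """Join the three parts back together with underscores."""
    return "_".join(parts)


parsed = parse_language_code("apc_Arab_sout3123")
print(parsed.iso_639_3, parsed.iso_15924, parsed.glottocode)
print(format_language_code(parsed))
```

The same helper can be used to build the code for a new dataset card from its three fields, for example `format_language_code(LanguageCode("ltz", "Latn", "luxe1243"))`.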

Dataset card

For new data, we collect precise information about the language variety, the quality assurance workflow and, where applicable, the translation workflow. Please use the following Markdown template to provide this information.

Template

# Dataset card

## Description

<!-- A concise description of the data associated with this card. -->

FLORES+ dev set in Luxemburgish

## License

<!-- Contributions to existing datasets must be released under the same license as the parent dataset. For completely new contributions, we encourage the use of an open license. At a minimum, data should be made available for research use. Please specify the license using an SPDX license identifier. -->

CC-BY-SA-4.0

## Attribution

<!-- Who should be credited for creating this dataset? Feel free to include citation data in BibTeX format. -->

```bibtex
@article{myarticle,
  title={Something},
  author={Somebody},
  year={2024},
}
```

## Language codes

<!--
* If this language is assigned an ISO 639-3 individual language code (not a macrolanguage code), specify it here.
* Please specify the script this language is written in using an ISO 15924 code.
* If this language is assigned a Glottocode, please specify it here.
-->

* ISO 639-3: ltz
* ISO 15924: Latn
* Glottocode: luxe1243

## Additional language information

<!--
Any relevant additional information on the language, such as:
* A list of reference publications and software (dictionaries, grammars, spellcheckers).
* If applicable, any additional information about dialectal variation that is not captured by the Glottocode.
* If relevant, the orthography used in your contribution.
-->

## Workflow

<!-- What workflow was followed in creating this dataset? E.g., for a translated dataset, relevant information includes: what language the content was translated from, the number of translators, aggregate translator information (how many were native speakers in the target language, how many were highly proficient in the target languages, how many had professional translation experience), was any fraction of the data checked independently by third parties, etc. -->

Data was translated from English by 5 translators, all native speakers of the target language and highly proficient in English (at C2 level of the European Language Framework). All translators were either professional translators or had relevant qualifications (university degrees in Translation and Interpreting or Linguistics). 100% of the data was checked by one more independent translator.

## Additional guidelines

<!-- Were any additional guidelines agreed upon? Examples might include style guidelines, the use of particular grammatical forms or sentence structures, specific spelling or punctuation rules to be followed, etc. -->

Translation guidelines

These translation guidelines must be acknowledged by all translators who will be contributing data.

Important note

Your translations will be used to help train or evaluate machine translation engines. For this reason, this project requires human translation.

General guidelines

  1. You will be translating sentences coming from different sources. Please refer to the source document if available.
  2. Do not convert any units of measurement. Translate them exactly as noted in the source content.
  3. When translating, please maintain the same tone used in the source document. For example, encyclopedic content coming from sources like Wikipedia should be translated using a formal tone.
  4. Provide fluent translations without deviating too much from the source structure. Only allow necessary changes.
  5. Do not expand or replace information compared to what is present in the source documents. Do not add any explanatory or parenthetical information, definitions, etc.
  6. Do not ignore any meaningful text that was present in the source.
  7. In case of multiple possible translations, please pick the one that makes the most sense (e.g., for gender concordance, cultural fit in the target language, level of formality, etc.).
  8. Translations must be faithful to the source in terms of pragmatics such as (if applicable) level of hedging/modality, sentiment and its intensity, negation, speech effects (disfluencies), etc.
  9. For proper nouns and common abbreviations, please see the guidelines on Named Entities below.
  10. Idiomatic expressions should not be translated word for word. Use an equivalent idiom, if one exists. If no equivalent idiom exists, use an idiom of similar meaning. If no similar expressions exist in the target language, paraphrase the idiom such that the meaning is retained in the target language.
  11. When a pronoun to be translated is ambiguous (for instance, when it could be interpreted as either him/her or he/she), opt for gender neutral pronouns (such as them/they) if those exist in the target language. However, when a pronoun to be translated is clearly marked for gender, you should follow the source material and continue to mark for gender.
  12. Foreign words and phrases used in the text should be kept in their original language when this is necessary to preserve the meaning of the sentence (e.g. if given as an example of a foreign word).

Named entities

Named entities are people, places, organisations, etc., that are commonly referred to using a proper noun. This section provides guidance on how to handle named entities. Please review the following guidelines carefully:

  1. If there is a commonly used term in the target language for the Named Entity:
    1. If the most commonly used term is the same as in the source language, then keep it as it is.
    2. If the most commonly used term is a translation or a transliteration, then use that.
  2. If there is no commonly used term:
    1. If possible, a transliteration of the original term should be used.
    2. If a transliteration would not be commonly understood in the context, and the source term would be more acceptable, you may retain the original term.

Monolingual contribution guidelines

All contributors must acknowledge the following guidelines.

Important note

The goal of this effort is the collection of high-quality textual monolingual data, for the purposes of training language identification systems, language models and other related tools.

Caution: Synthetic data is not allowed. Examples of disallowed synthetic data include machine-translated content, LLM output, and text generated from templates.

General guidelines

  1. All contributed data must be human-generated. Surface changes that are mechanical in nature (such as certain types of transliteration) may be performed with the aid of automated systems, provided this is clearly documented.
  2. Clearly identify the provenance of the data. In many cases, this may be done by providing a URL or a bibliographic reference.
  3. Ensure the data is in the claimed language and free of issues such as encoding problems. If at all possible, this should be done by having one or more native speakers manually check a sufficiently large representative sample of the whole dataset.

Data format

  1. Data must be in plain text format.
  2. Minimal markup in Markdown format may be used where applicable. Markup should be limited to italics, bold, ordered and unordered lists, inline code spans, block quotes, ATX headings (#, ##). Strikethrough (~~), footnotes ([^1]) and mathematics formatting ($ and $$) with GitLab/GFM compatible syntax may also be used.
  3. Where possible, we strongly encourage contributions of document-level data, rather than sentence-level data. Retaining the context that comes with full documents enables the development of more sophisticated models.
    1. For document-level data, there must be one document per file. Paragraphs must be separated by two consecutive newlines (that is, by one blank line).
    2. For sentence-level data, sentences must be separated by single newlines.
  4. A standardised set of metadata must be added in the form of a YAML front matter. The front matter must be placed at the top of each file, preceding the textual content, and must be delimited by ---.
    1. Analogously to a dataset card, the language of the data must be marked using the iso_639_3, iso_15924 and glottocode fields under the top-level language key. Should this structure be too restrictive for a given dataset, e.g. for code-switched text, please reach out to the organisers at [email protected].
    2. The source of the data must be specified in the source field. This may take the form of a URL, a bibliographic reference, or free-form text.
    3. The license, in the form of an SPDX license identifier, must be specified in the license field.
    4. The date of submission of the data must be specified in YYYY-MM-DD format in the submission_date field.
    5. Document-level data must be marked as document: true, whereas sentence-level data must be marked as document: false.
    6. Files that use Markdown syntax must set markdown: true.

The following is an example of well-formed document-level data.

---
language:
  iso_639_3: eng
  iso_15924: Latn
  glottocode: stan1293
source: https://en.wikipedia.org/wiki/Generative_grammar
license: CC-BY-SA-4.0
submission_date: 2024-05-05
document: true
markdown: true
---

# Generative grammar

**Generative grammar** is a theoretical approach in linguistics that regards grammar as a domain-specific system of rules that generates all and only the grammatical sentences of a given language. In light of poverty of the stimulus arguments, grammar is regarded as being partly innate, the innate portion of the system being referred to as universal grammar. The generative approach has focused on the study of syntax while addressing other aspects of language including semantics, morphology, phonology, and psycholinguistics.

## Frameworks

There are a number of different approaches to generative grammar. Common to all is the effort to come up with a set of rules or principles that formally defines every one of the members of the set of well-formed expressions of a natural language. The term _generative grammar_ has been associated with at least the following schools of linguistics:

...

The following is an example of well-formed sentence-level data.

---
language:
  iso_639_3: xxx
  iso_15924: Xxxx
  glottocode: xxxx1234
source: "Example sentences extracted from: A. Bloggs. 1904. Notes on Language X. Journal of Language Studies. 24-40."
license: CC0-1.0
submission_date: 2024-05-05
document: false
markdown: false
---
This is a sentence.
This is another sentence.
This is a third sentence.
...
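To illustrate how the two layouts differ when read back, the sketch below splits the text that follows the front matter into units: paragraphs (separated by a blank line) when `document: true`, and sentences (one per line) when `document: false`. The function name `split_units` is hypothetical, and the sketch assumes the front matter has already been removed from the text.

```python
import re


def split_units(body: str, document: bool) -> list[str]:
    """Split file content into paragraphs (document-level) or sentences (sentence-level)."""
    body = body.strip("\n")
    if document:
        # Document-level data: paragraphs are separated by a blank line.
        return [part for part in re.split(r"\n{2,}", body) if part.strip()]
    # Sentence-level data: one sentence per line.
    return [line for line in body.split("\n") if line.strip()]


sentences = split_units(
    "This is a sentence.\nThis is another sentence.\nThis is a third sentence.\n",
    document=False,
)
print(len(sentences))  # 3

paragraphs = split_units("First paragraph.\n\nSecond paragraph.\n", document=True)
print(paragraphs)  # ['First paragraph.', 'Second paragraph.']
```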