Open Language Data Initiative

Getting Started

If you’ve found this page, chances are you care deeply about your language – especially if it’s one that doesn’t yet have strong digital tools, such as machine translation or speech recognition systems. Maybe you’re a speaker, researcher, teacher, or someone who simply wants to see your language thrive in an increasingly digital world.

Whatever your background, welcome! This page will guide you through what the Open Language Data Initiative (OLDI) is about and how you can start contributing to the development of open datasets that support language technologies for underserved languages, while keeping the needs and values of your community front and center.

First Things First: You’re in Control

OLDI is not about collecting data for its own sake.

Languages are more than data. They are central to identity, culture, and community life. That’s why OLDI places a strong emphasis on community agency and informed participation. We aim to make it easier for communities and researchers to build resources that reflect their linguistic realities – and to do so in ways that are open, collaborative, and respectful.

NB: If you’re here to explore or to contribute, know that you’re not expected to “hand over” your data. You are welcome to follow our guidelines to create resources for your own community, for your own use. If you choose to share them more broadly through OLDI, you’ll be joining an effort to build open, multilingual tools that benefit many communities at once. Either choice is entirely valid.

What We Work On

OLDI currently welcomes contributions to datasets for two key language technologies: machine translation (MT) and automatic speech recognition (ASR). These technologies typically require two kinds of data: small, high-quality benchmark datasets to evaluate system performance, and larger datasets to train models.

What sets OLDI’s approach apart is our focus on open, multilingual, massively parallel datasets – collections built on shared source material across many languages, so that new contributions are automatically aligned with all others. This structure encourages broad interoperability, simplifies quality control, and allows contributors to consult existing translations in other languages they may know.
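
To make this concrete, here is a minimal sketch in Python of how such alignment works: every sentence carries a shared ID, so a new set of translations lines up automatically with every language already present. The language codes follow the FLORES+ convention of ISO 639-3 code plus script, but the sentences, IDs, and data layout are purely illustrative rather than OLDI’s actual file format.

```python
# Toy illustration of a massively parallel dataset: every sentence carries a
# shared ID, so translations in different languages stay aligned with one
# another. Sentences and layout are illustrative only, not OLDI's actual data.

translations = {
    "eng_Latn": {1: "The committee met on Tuesday.",
                 2: "Rainfall was heavier than expected."},
    "fra_Latn": {1: "Le comité s'est réuni mardi.",
                 2: "Les précipitations ont été plus fortes que prévu."},
    # A new contribution only needs to reuse the same IDs to be aligned with
    # every language already in the collection.
    "swh_Latn": {1: "Kamati ilikutana Jumanne.",
                 2: "Mvua ilikuwa kubwa kuliko ilivyotarajiwa."},
}

def aligned_pairs(lang_a, lang_b):
    """Yield (sentence_a, sentence_b) pairs for any two languages in the collection."""
    shared_ids = translations[lang_a].keys() & translations[lang_b].keys()
    for sent_id in sorted(shared_ids):
        yield translations[lang_a][sent_id], translations[lang_b][sent_id]

# Contributors can consult existing translations in other languages they know:
for fr, sw in aligned_pairs("fra_Latn", "swh_Latn"):
    print(fr, "<->", sw)
```

In practice such collections are typically distributed as one file per language with line- or ID-aligned entries; the specifics for each dataset are covered in the Guidelines.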

It’s important to understand that OLDI does not build translation or speech systems itself. We coordinate and maintain datasets that others – often researchers, developers and community members themselves – can use for training and evaluation. While there is no guarantee that contributing to these datasets will lead directly to the development of language tools, these datasets are often the first essential step in making that possible.

Technologies We Champion

Machine Translation (MT)

If your goal is to help your language be translated automatically (e.g., using online translation tools), you’ll want to start with machine translation datasets. These fall into two categories: evaluation benchmarks such as FLORES+, which are used to measure how well a translation system performs, and training data such as the OLDI-Seed dataset, which provides the material models learn from.

Speech Technologies

We also support datasets for speech recognition (ASR) and speech translation – technologies that allow computers to transcribe or translate spoken language.
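
As a rough illustration only – the field names below are hypothetical, not OLDI’s or any partner project’s actual schema – a speech recognition dataset pairs each recording with a verbatim transcript, while a speech translation dataset additionally supplies a translation into another language:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SpeechRecord:
    """One entry in a speech dataset (illustrative structure, not an official schema)."""
    audio_path: str                      # path to the recording, e.g. a WAV or FLAC clip
    transcript: str                      # verbatim transcript in the language spoken
    language: str                        # language of the recording
    translation: Optional[str] = None    # present only in speech translation datasets
    translation_language: Optional[str] = None

# An ASR entry needs only audio plus transcript; a speech translation entry adds a translation.
record = SpeechRecord(
    audio_path="clips/utterance_0001.wav",
    transcript="Habari za asubuhi.",
    language="swh",
    translation="Good morning.",
    translation_language="eng",
)
```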

NB: The FLORES+ and OLDI-Seed datasets are managed directly by the OLDI team. Other datasets mentioned above are managed by external organisations. We’re still happy to guide anyone interested in contributing to them, and submissions based on those datasets are generally welcome at our workshops.

What to Do Next

  1. Visit the Languages page to check whether your language is already listed.
  2. Read the Guidelines to understand the datasets and how to contribute to them.
  3. Choose the dataset that’s right for your goals and context, and begin working on it.
  4. Share your data or use the materials locally, on your terms.

What Are the Benefits of Participating?

Even though contributing doesn’t guarantee the creation of language tools, there are important advantages:

Our Values & Positionality

We acknowledge that data expansion and availability are a small piece of the giant puzzle of language equity in NLP research. As part of our commitment to responsible science, we stress the importance of adopting a community-centric approach to NLP, one in which the involvement, wellbeing, and interests of speakers are given priority.

First, we believe that the corpora of any given language belong to the people who speak it. Particularly for endangered languages, where the speaker population may be small, data compilation by outside groups that does not take the interests of native speakers into account can be viewed as a form of exploitation. As such, whenever possible, we advocate for data contributors to deliberate carefully over their methodological choices, document their ethical decisions, and, where relevant, devise deployment strategies that amplify underserved communities’ ability to benefit directly from technologies built using their language data.

Related to the concept of community-centeredness, we also advocate for data contributions that capture the sociolinguistic diversity of how languages are used across place and setting. More specifically, instead of relying on frameworks that have conventionally worked for high-resource languages, we encourage contributions reflecting how languages are used in real, situated contexts (e.g., data that includes regional variants, dialects, colloquialisms, and code-mixing).

We also believe that interdisciplinarity, where humanities and social science researchers work together with technical practitioners, can give rise to more ethically and socially aligned forms of data collection and NLP development. Sociologists and anthropologists, for instance, have long grappled with the epistemological implications of power and sociohistorical dynamics in research, and their perspectives on participatory methods can be invaluable for sustainable language data collection in NLP.

Finally, we want to acknowledge our positionality as researchers. While some of the organizers speak under-served languages, most of us were trained in Western institutions and are or were affiliated with major universities or AI research labs in the US and the UK. Occupying such positions may skew our stances on issues pertaining to language accessibility and NLP development, and we note that those who do not share our milieus may approach these issues with a different level of criticality than we do.