

Lost in Translation: Large Language Models in Non-English Content Analysis

Graphic for CDT Research report, entitled “Lost in Translation: Large Language Models in Non-English Content Analysis.” Illustration of various shapes representing different languages. A select number of the shapes are being pulled into a black hole representing the way large language models “suck up” data.

Western technology companies have long struggled to offer their services in languages other than English. A combination of political and technical challenges has impeded companies from building out bespoke, automated systems that function in even a fraction of the world’s 7,000+ languages. With large language models powered by machine learning, online services think they’ve solved the problem. But have they?

A new report from CDT examines the models that companies claim can analyze text across languages. The paper explains how these language models work and explores their capabilities and limits.

Large language models are a relatively new and buzzy technology that powers all sorts of content generation and analysis tools. You’ve read about them in articles about ChatGPT and other generative AI tools that produce “human”-sounding text. But these models can also be adapted to analyze text. Companies already use large language models to moderate speech on social media, and may soon incorporate these tools into other areas such as hiring and public benefits decisions.
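For readers curious what that adaptation looks like in practice, here is a minimal sketch of a language model repurposed for a moderation-style classification task. It assumes the Hugging Face transformers library and the publicly shared unitary/toxic-bert checkpoint; the checkpoint and the example posts are illustrative, not what any particular platform deploys.

```python
# A minimal sketch, not a production moderation system: the checkpoint
# choice is an assumption for illustration, and real deployments add
# thresholds, appeals, and human review.
from transformers import pipeline

# Load a BERT model that has been fine-tuned for toxicity classification.
moderator = pipeline("text-classification", model="unitary/toxic-bert")

posts = [
    "Thanks for sharing, this was really helpful!",
    "You are an idiot and nobody wants you here.",
]

for post in posts:
    # The pipeline returns the top label and a confidence score.
    result = moderator(post)[0]
    print(f"{result['label']} ({result['score']:.2f}): {post}")
```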

In the past, it has been difficult to develop AI systems — and especially large language models — in languages other than English because of what is known as the resourcedness gap. This gap describes the asymmetry in the availability of high-quality digitized text that can serve as training data for a model. English is an extremely highly resourced language, whereas other languages, including those used predominantly in the Global South, often have fewer examples of high-quality text (if any at all) on which to train language models.

Recently, developers have started to contend that they can bridge that gap with a new technology called multilingual language models: large language models trained on text from multiple languages at the same time. Multilingual language models, they claim, infer connections between languages, allowing them to uncover patterns in higher resourced languages and apply them to lower resourced languages. In other words, by training on lots of data from lots of languages, multilingual language models can more easily be adapted to tasks in languages other than English.
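To make the transfer idea concrete, here is a minimal sketch using the sentence-transformers library and its publicly released paraphrase-multilingual-MiniLM-L12-v2 checkpoint (the checkpoint and sentences are illustrative assumptions). Because the model was trained on text from many languages at once, a sentence and its translation land close together in a shared embedding space, which is what lets patterns learned in a higher resourced language carry over to others.

```python
# A minimal sketch of cross-lingual transfer via shared embeddings,
# assuming the `sentence-transformers` library and its public
# multilingual checkpoint.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

english = "The library opens at nine in the morning."
french = "La bibliothèque ouvre à neuf heures du matin."  # same meaning
unrelated = "The stock market fell sharply yesterday."

embeddings = model.encode([english, french, unrelated])

# Cosine similarity: the English/French pair should score much higher
# than the unrelated pair, despite sharing almost no surface vocabulary.
print(util.cos_sim(embeddings[0], embeddings[1]))  # high
print(util.cos_sim(embeddings[0], embeddings[2]))  # lower
```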

Language models in general, and multilingual language models in particular, may enable exciting new technologies. Increasing access to online services in multiple languages is certainly a step in the right direction. These models may even open up new opportunities and access to information for people who speak one of the many languages that online services rarely support.

However, while multilingual language models show promise as a tool for content analysis, they also face key limitations:

  1. Multilingual language models often rely on machine-translated text that can contain errors or terms that native speakers don’t actually use. 
  2. When multilingual language models fail, their problems are hard to identify, diagnose, and fix.
  3. Multilingual language models do not and cannot work equally well in all languages.
  4. Multilingual language models fail to account for the contexts of local language speakers.

These shortcomings are amplified when the models are used in high-risk contexts. If these models are used to scan asylum applications, for example, errant systems may limit an applicant’s ability to reach safety. In content moderation, misinterpretations of text can result in takedowns of posts, erecting barriers to information, particularly where little content in a given language is available to begin with.

To adequately assess whether these models are up to the task, we need to know more. Governments, technology companies, researchers, and civil society should not assume these models work better than they do, and should invest in greater transparency and accountability efforts in order to better understand the impact of these models on individuals’ rights and access to information and economic opportunities. Crucially, researchers from different language communities should be supported and should be at the forefront of efforts to develop models and methods that build capacity for tools in different languages.

This new report is the third in a series published by CDT on the capabilities and limits of automated content analysis technology; the first focused on English-language social media content analysis technology and the second on multimedia content analysis tools.

As part of this project, we are proud to announce that we have translated the executive summary of this paper into three additional languages: Arabic, French, and Spanish.

Read the full report here.

Read the executive summary in English here.

Résumé exécutif – Français.

Resumen ejecutivo – Español.

الملخص التنفيذي – العربية.