CDT Finds Key Shortcomings When Large Language Models Analyze Non-English Languages

May 23, 2023

(WASHINGTON) — A new report released today by the Center for Democracy & Technology (CDT) lifts the veil on the capabilities and limitations of a new machine learning technology — multilingual language models — that companies are using to analyze and generate non-English language content.

CDT Research Fellow Gabriel Nicholas, who co-authored the paper, says:

“Multilingual language models have had pretty impressive results at basic tasks like parsing grammar, even in languages they don’t have much data for. But companies are using these models in the real world for very language- and context-specific tasks, like content moderation. We have reason to believe that they don’t perform as well in that context.”

CDT Policy Analyst Aliya Bhatia, who also co-authored the report, says:

“We need to know more about how and where companies use multilingual language models to analyze content in languages other than English. When these models are trained and tested on only a small fraction of text in certain languages, their lack of understanding is likely to create real barriers to individuals’ access to information.

“Worse, if these models are to serve as the foundation for automated systems that make life-altering decisions, around immigration for example, these models may have an outsized negative impact on individuals’ lives and safety.”

Multilingual language models are built to address a technical challenge facing online services: there is not enough digitized text in most of the world’s 7,000+ languages to train AI systems. Researchers claim that, by scanning huge volumes of text in dozens or even hundreds of different languages, multilingual language models can learn general linguistic rules that can help them understand any language.

In the paper, which examines how these models work, the CDT Research team identifies several specific shortcomings:

Multilingual models are built predominantly on English-language data. They thereby encode English-language values and assumptions and import them into the analysis and generation of text in other languages, overlooking local context and limiting accuracy;
Multilingual language models are often trained and tested on machine-translated text, which can contain errors or terms that native language speakers don’t use in practice;
When multilingual language models fail, their problems are hard to identify, diagnose, and fix; and
The more languages a multilingual language model trains on, the less it captures the idiosyncrasies of each one. Languages interfere with one another, meaning that developers need to balance teaching models more languages against improving how well they work in each one.

The research makes one more thing clear: We still don’t know enough about how large language models operate, particularly ones that purport to work in different language contexts. Companies like Google, Meta, and Bumble are already using these tools to detect and even take action on problematic content. Others may soon use them to power automated tools that scan resumes or immigration applications.

In order to improve multilingual language models and hold them accountable, companies need to reveal more about the data used to train these models, funders need to invest in the growing communities that are documenting and building natural language processing models in different languages, and governments need to avoid using these models in ways that may threaten civil liberties.

Read the full report on CDT’s website, and RSVP to join us tomorrow to discuss the paper at an event called “Mind the Gap.”

###

The Center for Democracy & Technology (CDT) is the leading nonpartisan, nonprofit organization fighting to advance civil rights and civil liberties in the digital age. We shape technology policy, governance, and design with a focus on equity and democratic values. Established in 1994, CDT has been a trusted advocate for digital rights since the earliest days of the internet.