

Mind the Language Gap: NLP Researchers & Advocates Weigh in on Automated Content Analysis in Non-English Languages

On May 24, CDT convened experts based in South Africa, India, and the UK to discuss opportunities and challenges in building language models that work across different languages. Speakers included Jacqueline Rowe, Policy Lead of Online Content at Global Partners Digital; Dr. Monojit Choudhury, Principal Data Scientist at Microsoft’s Turing India; and Dr. Vukosi Marivate, Assistant Professor of Data Science at the University of Pretoria and co-founder of Lelapa AI and Masakhane.

The event marked the launch of our new report, Lost in Translation: Large Language Models in Non-English Content Analysis. The report explores the capabilities and limits of multilingual language models, a machine learning approach that learns connections across languages in order to bring the power of language models to languages with little available digitized text. The paper finds that, although multilingual language models show impressive results on tasks like finding grammatical errors and translating sentences, they may have significant limitations on context- and language-specific tasks like content moderation. Although relatively novel, these multilingual models are already being rolled out by companies like Meta, Google, and Bumble to detect and take action on abusive speech.

Watch the entire event on YouTube. Here are highlights of the discussion of the issues raised in the report, condensed for readability:

Dr. Choudhury began by offering a brief history of the field: “When I started working in 2002 on language processing, it was both exciting and disappointing when I saw how few resources and technologies existed in my own language: Bengali. Bengali is the fourth or fifth most spoken language in the world by number of native speakers, but [back then] Bengali was an extremely low resource language,” he said.

Dr. Marivate added, “You start working on natural language processing and decide that you would like to work on the languages you’ve grown up with and immediately hit your first wall. Where are you going to get the data you need?” 

Dr. Choudhury and Dr. Marivate’s opening remarks touched on what researchers call the resourcedness gap: the gap in the availability of high-quality training data between English and other languages. This disparity, as our paper explains, has made it more difficult for technologists to develop language AI systems in languages other than English. It has also perpetuated the dominance that English has long held in the information environment and in the field of natural language processing. In practice, the inability to train and test language models on high-quality digitized text across languages has made models particularly error-prone in languages other than English.

Jacqueline Rowe expanded on this: “These models are limited to what they’ve seen and what they’ve been exposed to. The resourcedness gap is not new. And it’s worsening. Fifty or sixty years ago, we faced issues like having few or almost no textbooks for education practitioners, or no free press in a given language. Those issues persist now and result in low representation of those language communities in forms that can be easily digitized and used to build language models. This certainly degrades the performance of a model, and it’s not just a problem for technology companies but a whole-of-society problem.”

The resourcedness gap has significant consequences. Dr. Marivate offered an example to drive home the impact of gaps in training data: if a model has not been trained on scientific text in a certain language, people may be led to think that the language community has not had any scientific breakthroughs.

“Something to interrogate here is what stories are captured in the data and these models. We are working on Setswana, my mother’s language, where there is not a lot of data,” he said. “There may be a concept that is there in English and also available in other languages, for example, astronomy. But because the data doesn’t exist, once you start prompting these models with questions about astronomy in an African language, they can’t say much. Just imagine what this perpetuates.” 

Dr. Marivate continued by saying that a user receiving few accurate responses to their prompt about astronomy in an African language may conclude, if they’re not familiar with the culture, that these concepts don’t exist within this culture. “That’s the danger and some of the things that will be perpetuated by not letting people know of [the model’s] limitations.”

Dr. Choudhury added: “I think we are in great danger of losing a large number of languages if we don’t do anything now. If languages cannot jump on the train of language models in the next 1-5 years, many are going to go extinct because most of the businesses, economic benefits, and more are going to happen in the languages where we have this technology. This is the thing which I worry about the most.” 

Will digitizing text in these low-resource languages address all of the shortcomings of language models when parsing non-English languages?

It’s more complicated than that. As highlighted in CDT’s new report, large technology companies serving users at scale have legitimate reasons, including cost and convenience, to build large systems that work on many languages at once. But creating a single multilingual model trained on multiple languages, even with high-quality data in all of those languages, is not a panacea, because of what NLP researchers call the curse of multilinguality.

Gabriel Nicholas, co-author of the CDT report, said, “Language models have a limited space, where you may have to make architectural decisions that trade off between languages. Even if you make the model bigger, you still have this tradeoff between languages. You could fine-tune a specific model to make it work better in Dutch and show it Dutch examples, but as a result, it’ll work worse on non-Dutch tasks. All of these things come at the cost of scalability.”
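To make that tradeoff concrete, here is a minimal sketch assuming a standard Hugging Face setup; the model choice (xlm-roberta-base), the dutch_moderation.csv file, and the training settings are illustrative assumptions, not details from the report or the panel:

```python
# A minimal sketch of the tradeoff described above: fine-tuning a shared
# multilingual model on one language's examples. Dataset file, labels, and
# hyperparameters are hypothetical placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

MODEL = "xlm-roberta-base"  # one shared parameter budget covering ~100 languages

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)

# Hypothetical Dutch-only labeled data, with "text" and "label" columns.
dutch = load_dataset("csv", data_files="dutch_moderation.csv")["train"]
dutch = dutch.map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length"),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="xlmr-dutch", num_train_epochs=3),
    train_dataset=dutch,
)
trainer.train()

# Every gradient step pulls the shared weights toward Dutch. With no
# examples from other languages in training, performance on those languages
# tends to degrade: the capacity tradeoff is reallocated, not eliminated.
```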

So what can we do? How can we build better tools to help language speakers? 

Promoting investment across a diversity of tools and technical architectures 

Dr. Choudhury cautioned against the idea that a multilingual model is the only technical architecture for developing models that work across languages. “At the end of the day, we’re building this technology to help speakers solve their day-to-day problems: access to information, banking in their own language. Depending on where your language is on the hierarchy of resource availability, you have to use different technology.” For languages with very few examples of text, or for very specific contexts, tried-and-tested methods such as rule-based systems and purpose-built classifiers may work better, pushing against the strong momentum toward large language models.
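As a rough illustration of what such a system might look like, here is a minimal Python sketch using scikit-learn; the blocklist terms, toy training examples, and the moderate function are all hypothetical placeholders, not a method endorsed by the panelists:

```python
# A minimal sketch of a rule-based filter backed by a lightweight classifier,
# as an alternative to a large language model for a low-resource language.
# All terms and examples below are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Rule layer: a community-curated term list, auditable and easy to update.
BLOCKLIST = {"hypothetical_slur_1", "hypothetical_slur_2"}

def rule_flag(text: str) -> bool:
    return any(term in text.lower() for term in BLOCKLIST)

# Statistical layer: character n-grams cope with rich morphology and
# spelling variation better than word features when labeled data is scarce.
train_texts = ["example of an abusive post", "an innocuous sentence"]  # toy data
train_labels = [1, 0]
classifier = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    MultinomialNB(),
)
classifier.fit(train_texts, train_labels)

def moderate(text: str) -> bool:
    # Rules take precedence; the classifier covers what the rules miss.
    return rule_flag(text) or bool(classifier.predict([text])[0])
```

Unlike a large model, every decision this pipeline makes can be traced to a specific rule or feature, which is part of why such approaches remain attractive when data is scarce and context is narrow.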

Increasing transparency into the development and deployment of language models

Jacqueline Rowe added that transparency is also needed: “There is a real gap in transparency. There is some really interesting research coming out of the academic community and even the research arms of major tech companies. But there’s a disconnect between that community of practice and how the models being studied are actually used and deployed for different tasks, including content moderation of user-generated speech on social media. There’s just very little information, for example, about how Meta uses its own large language models, and its multilingual language models, in moderating users’ content in different languages. There’s no public information on how language models feed into [automated content moderation] systems, what kinds of safeguards are in place, and what kind of accuracy metrics are considered relevant by the engineers building those trust and safety systems. I think this gap is a real problem.”

Gabriel Nicholas noted: “Trends in transparency around language models are not necessarily going in the right direction. When you look at the OpenAI GPT-3 paper and compare it to the GPT-4 paper, the amount of information about the data GPT-4 was trained on is minimal, and less than in the previous paper. The push for transparency needs to happen because it’s not happening on its own; if anything, it’s closing off.”

Part of the problem, as Jacqueline Rowe and Dr. Choudhury articulated, is also a lack of understanding of how these models work and, when they fall short, why they do. Rowe explained, “[The lack of transparency] also ties into the issue that it’s actually quite hard to translate the technical academic metrics and benchmarks that researchers are using and working with into concrete policy goals to hold companies accountable for their responsibilities under the UN Guiding Principles on Business and Human Rights, for example.”

She continued, “We’re starting to get closer to those questions of what an appropriate level of accuracy is for one of these models before it’s deployed in ways that affect real people’s lives. And that’s not just content moderation on social media, but also some of the other use cases you mentioned, like analysis of legal cases or of asylum claims. In these higher-risk applications, trying to translate between a benchmark and the impact on someone’s rights and life is really difficult, and it’s one of the main problems for accountability at the moment.”

Dr. Choudhury said, “I do agree with the issues of transparency, but first, we have to acknowledge that even the researchers don’t understand the models. The second problem is that the field has progressed too fast for anybody, even ourselves, to react. It’s very difficult and very jarring for us. Things are changing every six months.”

Language communities must be at the forefront of the work

Dr. Choudhury pointed out that much of the push to develop technology for new languages is largely market-driven: “Before 2010, there was very little work on Indian languages by tech companies. There was work in academia. This was because everybody who used a laptop or PC in India knew English. So [the thinking was], ‘Why do we need to do anything for Indian languages?’ There were some government rules that, to do business in India, you had to support Indian language encodings and keyboards, but most people were doing business only in English. Something happened around that time that changed the entire equation: mobile phones.”

He continued: “Smartphones penetrated across the country so fast that now everybody, across languages, was a user of tech. The companies building these models were, at the end of the day, driven by their customers. When Flipkart, the equivalent of Amazon in India, launched, the platform only supported search queries in English. But people, using smartphones, began searching in all kinds of Indian languages, so Flipkart was forced to support Indian languages. Similarly, Amazon Alexa was forced to support Indian languages because there was a market. One thing I think should happen for sustainability is a pull from the communities rather than a push from investors. Unless the communities ask for it, [companies] are not going to provide it.”

Dr. Marivate said: “We’re still thinking about this from the perspective of the US or Europe, about how these big tech companies work. But on the African continent, the amount of R&D investment available is very, very little, so we need external, private investment. Find the people who are working on the ground on these issues and invest in them, instead of outsourcing the work to Silicon Valley. This includes traveling and meeting with researchers across the African continent, especially young people who are not in universities or R&D labs. They want to work in AI, they want to work on languages, and they want to work on their own languages.”

“The long-lasting effects will likely come from working on these things on the outskirts, the long tail; small investments in those communities will be life-changing.”

Watch the entire event on YouTube.