

Languages Left Behind: Automated Content Analysis in Non-English Languages

Automated content analysis systems are everywhere. We encounter them when we apply for jobs, speak with customer service, look up information on a search engine, or appeal a post erroneously taken down on social media. Previous CDT work has shown that these systems can have severe limitations and, when parsing language, often fail to account for context and nuance. In languages other than English, these problems are even worse.

The availability of training data and language-specific software tools varies widely across languages. In computational linguistics, it is common to refer to a language in relation to its available “resources,” or the volume of digitized documents and other examples of text in that language that can form the basis of a training data set. English is the most highly resourced language by multiple orders of magnitude, both in terms of available data and natural-language processing (NLP) research. (English is so overwhelmingly the default that papers written about English typically don’t even bother to name the language.)

Other languages, such as Arabic, German, Mandarin, Japanese, and Spanish, have millions of clean, labeled data points available for technologists to use, but are still less often the subject of research or commercial tool development. Some languages, such as Bengali and Bahasa Indonesia, have far less data available despite having hundreds of millions of speakers. Many Indigenous and endangered languages have little to no data available at all. There can be wide data voids even within a high-resource language, including variants outside the “standard” form of the language (e.g., African-American Vernacular English, Indian English, Nigerian English), dialects (e.g., the different Arabic dialects, Hindi and Urdu), and widely spoken mixed-language speech (e.g., Spanglish, Hinglish).

The relative lack of data, software tools, and academic research in non-English languages has led to content analysis failures with significant geopolitical implications. The most visible examples are in content moderation. For example, civil society groups have argued that Facebook’s failure to detect and remove inflammatory posts in Burmese, Amharic, and Assamese has fueled genocide and hatred against persecuted groups in Myanmar, Ethiopia, and India, respectively. And given what we know about the continued struggles of English-language content analysis, automated content analysis systems are almost certainly failing in insidious ways when used in contexts that often involve non-English languages, such as immigration proceedings or predictive policing.

In recent years, though, academics, tech companies, and civil society have devoted more attention to closing the computational gaps between English and other languages. Sometimes this means building new software tools or collecting and labeling more data in a specific language. One such example is Uli, a browser plug-in trained on custom data sets to detect hate speech and online gender-based violence in Hindi, Tamil, and Indian English. To build Uli, two India-based NGOs, Tattle Civic Tech and the Centre for Internet & Society, enlisted a broad range of volunteers and experts on the impact hate speech has on women and other affected groups, first to define the contours of this type of speech and then to annotate publicly accessible tweets, creating custom data sets in each of these languages.

Often, instead of collecting more data, technologists find ways to stretch the limited data they do have. This frequently means using large language models: models trained on billions of words to predict which word is likely to come next in a given sequence (e.g., “After I exercise, I drink ____” → [(“water”, 74%), (“Gatorade”, 22%), (“beer”, 0.6%)]). They have come to dominate the NLP field [1] and have been used to set performance records across a surprisingly wide range of tasks, including machine translation, with only slight tweaking to tailor them to a new language.
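To make the “predict the next word” idea concrete, below is a minimal sketch of next-word prediction using a pretrained causal language model. It assumes the Hugging Face transformers library and the small, openly available English GPT-2 model purely as an illustrative stand-in; it is not one of the systems discussed in this post, and the probabilities it prints will differ from the toy example above.

```python
# Minimal illustration of next-word prediction with a pretrained causal
# language model. Assumes the Hugging Face `transformers` library and the
# small English GPT-2 checkpoint as an illustrative stand-in only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "After I exercise, I drink"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    # logits has shape (batch, sequence_length, vocabulary_size)
    logits = model(**inputs).logits

# The scores at the final position rank every vocabulary item as a candidate
# for the next word; softmax turns them into probabilities.
next_word_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_word_probs, k=5)

for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(token_id.item()).strip():>10}  {prob.item():.1%}")
```

The same mechanism is behind the “slight tweaking” mentioned above: rather than training from scratch, practitioners typically fine-tune a pretrained model like this on a comparatively small amount of text in the target language or task.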

But researchers have come to question large language models’ robustness and how readily they can be interrogated; it isn’t actually clear how well they work across a variety of contexts. These models are also trained almost entirely on English data, so using them for content analysis tasks in other languages is at, if not beyond, the outer limit of their capabilities.

Nevertheless, some companies have already incorporated these technologies into their services, including Facebook’s use of its “Few-Shot Learner” tool in its content moderation system and its “No Language Left Behind” model for machine translation across various products. Even faulty translation and content analysis tools may offer significant benefits to users (for example, by making certain information accessible to a broader audience), but they may also pose significant risks, particularly when they are used to make high-stakes decisions about whether to block content, deny people access to benefits, or report a speaker to law enforcement.

At CDT, we’re working on a new technical primer about the limits and capabilities of automated content analysis in languages other than English. It will be the third in a series; the first two, Mixed Messages and Do You See What I See?, focused on the limits of primarily English-language text and multimedia content analysis. This third technical primer will focus on the following questions:

  • How well do content analysis systems work in languages other than English? What characteristics do better-performing and worse-performing models have?
  • How large is the gap in data, tooling, and research between English, other high-resource languages, and low-resource languages? What are some of the reasons for these gaps?
  • What shortcomings do these systems have? What risks do these shortcomings pose to individuals and groups, and in what contexts?
  • How can we better understand and in turn mitigate these risks?

As with the other papers in this series, our goal is to distill a complex and nuanced body of technical literature into an accessible resource for civil society, government, journalists, and the public. In order to improve these systems and protect against the risks they may generate, we first have to understand how they work.


[1] Since 2018, 26% of all papers at three of the most prominent natural language processing conferences have cited just one large language model paper (the paper introducing BERT).