

Foundation Models and Non-English Languages: Towards Better Benchmarking and Transparency

Foundation model developers have made impressive claims about how well their models work in languages other than English. Upon closer examination, though, these claims often fall short of their promise, especially in “low resource” languages (that is, languages with limited text data available for training). Foundation model developers test their models far more extensively in English than in any other language, and the limited non-English benchmarks they do use are narrower and less robust than their rhetoric implies.

If companies overpromise about their models’ multilingual capabilities, downstream developers may deploy them in contexts they are not equipped to handle. Already, the deployment of foundation models trained predominantly on English data in non-English contexts has created significant risks, such as providing misleading benefits information (e.g. New York City’s MyCity chatbot) and failing to moderate speech online effectively (e.g. platforms missing violent threats in Ethiopia).

Previously, CDT has written about the limits of multilingual LLMs in non-English languages and made recommendations to technology companies on how they could help close the gap between high- and low-resource languages. Here, we make recommendations to foundation model developers on how they can improve their non-English language benchmarking and transparency practices to better allow downstream application developers to responsibly deploy their products.

1. Don’t assume “cross-lingual transfer.”

Most state-of-the-art LLMs train on far more English data than data in any other language.[1] While they may still train on billions of tokens in other “high-resource languages” with lots of high-quality data available (e.g. German, French, Mandarin), they train on orders of magnitude less data in languages that are medium-resource (e.g. Bengali, Amharic, Swahili) and low-resource (e.g. Hausa, Yoruba, Haitian Creole). Foundation models can achieve modest performance in some low-resource languages and at some tasks, such as translation and question answering. NLP scholars attribute this to a phenomenon called “cross-lingual transfer”: the idea that training a model in one language allows it to more easily learn other languages. But cross-lingual transfer is a poorly understood and hotly contested theory, and its limitations are still being mapped.[2] Neither foundation model developers nor downstream application developers should rely on assumptions about cross-lingual transfer and presume, for instance, that safety mitigations in one language will transfer to others.

2. Include benchmarks unique to specific languages, not just parallel benchmarks.

Most foundation model developers run a wide array of tests on their models in English, and then a significantly smaller set of tests in other languages.[3] These non-English tests are essentially identical to one another (i.e., human- or machine-translated versions of the same text), so as to allow comparisons of performance between languages. By design, parallel benchmarks capture concepts that translate neatly across languages. But this means they also often fail to capture the unique cultural contexts of individual language speakers. Foundation model developers should, therefore, add natively developed, monolingual benchmarks in languages other than English to the suite of tests they use. Many languages have their own NLP communities building test sets for tasks that require greater cultural context, such as sentiment and toxic speech analysis. These benchmarks might not scale neatly across languages, but they are nevertheless important if foundation model developers are to understand, and convey to downstream developers, their models’ capabilities and limitations in each language.
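To make this concrete, below is a minimal sketch of how an evaluator might run natively developed, monolingual test sets alongside parallel ones so that per-language gaps show up in a single report. The language suites, example items, and the model callable are illustrative placeholders, not real benchmarks or products.

```python
# Minimal sketch: report results on parallel (translated) suites and
# community-built monolingual suites side by side, per language.
# All suite names and example items below are illustrative placeholders.
from typing import Callable, Iterable, Tuple

def accuracy(model: Callable[[str], str],
             examples: Iterable[Tuple[str, str]]) -> float:
    """Fraction of examples where the model's answer matches the gold label."""
    examples = list(examples)
    if not examples:
        return 0.0
    return sum(model(prompt).strip() == gold for prompt, gold in examples) / len(examples)

# Each language gets both parallel suites and natively developed suites
# (e.g., sentiment or toxicity test sets built by local NLP communities).
EVAL_SUITES = {
    "sw": {"parallel_qa": [("Swali la mfano?", "jibu")],          # Swahili (toy items)
           "native_sentiment": [("Hii ni nzuri sana.", "positive")]},
    "am": {"parallel_qa": [], "native_toxicity": []},             # Amharic (fill in)
}

def report(model: Callable[[str], str]) -> None:
    for lang, suites in EVAL_SUITES.items():
        for name, examples in suites.items():
            print(f"{lang}/{name}: accuracy = {accuracy(model, examples):.2f}")

if __name__ == "__main__":
    report(lambda prompt: "jibu")  # stand-in for the real model under test
```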

3. Include non-machine translated benchmarks.

Foundation model developers often test their systems in non-English languages using machine-translated versions of popular benchmarks.[4] Yet depending on machine-translated benchmarks alone may reveal more about how well the model can parse the idiosyncrasies of machine-translated text than about the language real speakers actually use. This problem is exacerbated in low-resource languages, where machine translation is particularly poor. To ensure the validity of their tests, foundation model developers should use benchmarks that take a range of approaches to assessing non-English text. For example, multilingual benchmarks such as BELEBELE and the Aya Evaluation Suite use a mix of text that is human-written, human-translated, and machine-translated, both with and without human review.
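As an illustration, the sketch below loads benchmarks of this kind with the Hugging Face datasets library. The dataset identifiers, config names, and split names are our reading of the public BELEBELE and Aya Evaluation Suite releases and should be verified against the current documentation before use.

```python
# A sketch (not a tested script) of pulling multilingual benchmarks that mix
# human-written, human-translated, and machine-translated text.
from datasets import load_dataset

# BELEBELE: multiple-choice reading comprehension, one config per language,
# named with FLORES-200 codes (e.g., "hau_Latn" for Hausa). Config and split
# names are assumptions to verify against the dataset card.
belebele_hausa = load_dataset("facebook/belebele", "hau_Latn", split="test")
print(len(belebele_hausa), belebele_hausa.column_names)

# Aya Evaluation Suite: separate subsets distinguish human-written prompts
# from machine-translated and human-edited ones, so results can be reported
# by how the text was produced. Again, names are assumptions to verify.
aya_human = load_dataset("CohereForAI/aya_evaluation_suite",
                         "aya_human_annotated", split="test")
print(len(aya_human), aya_human.column_names)
```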

4. Share information about the volume and sources of training data in different languages.

Downstream application developers can use fine-tuning to improve foundation models’ performance in specific languages. To inform this work, foundation model developers should share details about how many tokens of data from each language are included in the model’s training data and where that data comes from. Open-weight model developers have been better at sharing this information than closed-weight model developers.[5] Foundation model developers should also disclose when they have special features for certain languages; for instance, a model may have certain safety mechanisms in English but not in other languages.
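The sketch below illustrates the kind of per-language accounting we have in mind: token counts and source tallies broken out by language. The detect_language and count_tokens helpers are placeholders for a real language-identification model and the model’s own tokenizer, and the two-document corpus is purely illustrative.

```python
# Minimal sketch of per-language training-data accounting:
# token counts and source tallies, broken out by detected language.
from collections import Counter, defaultdict
from typing import Iterable, Tuple

def detect_language(text: str) -> str:
    """Placeholder: swap in a real language identifier (e.g., a fastText LID model)."""
    return "fr" if "bonjour" in text.lower() else "en"

def count_tokens(text: str) -> int:
    """Whitespace split as a rough proxy; a real report should use the model's tokenizer."""
    return len(text.split())

def language_report(corpus: Iterable[Tuple[str, str]]) -> None:
    """corpus: (document_text, source_name) pairs."""
    tokens_by_lang: Counter = Counter()
    sources_by_lang = defaultdict(Counter)
    for text, source in corpus:
        lang = detect_language(text)
        tokens_by_lang[lang] += count_tokens(text)
        sources_by_lang[lang][source] += 1
    total = sum(tokens_by_lang.values()) or 1
    for lang, n in tokens_by_lang.most_common():
        print(f"{lang}: {n} tokens ({100 * n / total:.3f}%), "
              f"sources: {dict(sources_by_lang[lang])}")

# Illustrative two-document corpus.
language_report([("Hello world, hello again", "web-crawl"),
                 ("Bonjour le monde", "wikipedia")])
```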

5. Test for vulnerabilities using non-English attacks, even if the model is only intended for English use.

Research shows that English-language AI safety mitigations can be circumvented, or “jailbroken,” with adversarial prompts translated into low-resource languages.[6] A failure to accurately detect safety issues in any one language can create vulnerabilities for users and contexts across all languages. Foundation model developers and downstream application developers should engage in multilingual red-teaming, both translating attacks from English into other languages and recruiting red teamers who speak non-English languages (and, in particular, low-resource languages) to test their products. It might seem reasonable to prioritize testing in languages used in larger markets, but delays in addressing these gaps can leave many important risks overlooked.
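The sketch below outlines one piece of such a workflow: replaying English adversarial prompts in other languages and flagging responses that fail to refuse. The translate and query_model functions and the keyword-based refusal check are stand-ins for a real translation step (or human translators), the model under test, and review by speakers of each language or a proper safety classifier.

```python
# Schematic sketch of one multilingual red-teaming step; every function
# below is a placeholder to be replaced with real infrastructure and
# human review.
ATTACK_PROMPTS_EN = ["<adversarial prompt redacted>"]
TARGET_LANGS = ["ha", "yo", "ht"]  # Hausa, Yoruba, Haitian Creole

def translate(text: str, target_lang: str) -> str:
    # Placeholder: connect to a translation service or human translators.
    return f"[{target_lang}] {text}"

def query_model(prompt: str) -> str:
    # Placeholder: call the foundation model or downstream application here.
    return "placeholder response"

def looks_like_refusal(response: str) -> bool:
    # Naive keyword check; real evaluations need human or classifier review.
    return any(kw in response.lower() for kw in ("i can't", "i cannot", "i won't"))

def run_red_team() -> None:
    for prompt_en in ATTACK_PROMPTS_EN:
        for lang in TARGET_LANGS:
            response = query_model(translate(prompt_en, lang))
            if not looks_like_refusal(response):
                print(f"[{lang}] response did not refuse; route to human review")

run_red_team()
```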


[1] Google’s PaLM’s training data is 78% English; Meta’s Llama 2’s is 90% English; OpenAI’s GPT-3’s is 93% English.

[2] Some argue cross-lingual transfer occurs because LLMs learn mappings between low-resource languages and English via borrowed words and text paired with translations in the training data. Others suggest LLMs infer universal rules of language.

[3] E.g., Google tested its Gemini models’ multilingual capabilities using math word problems, summarization, and translation.

[4] E.g., OpenAI tested GPT-4’s multilingual capabilities using a version of MMLU translated using Azure Translate. They used this to justify a claim that GPT-4 achieved state-of-the-art performance in 24 of 26 languages tested.

[5] E.g., Meta disclosed how many tokens Llama 2 is trained on and the ratio of every language that makes up more than 0.005% of its dataset. Falcon’s developers also disclosed the ratio of different languages in the model’s training data and where that data came from. Google and OpenAI have not shared information about the ratio of languages or the sources of their data for their state-of-the-art models since 2022 (PaLM and GPT-3, respectively).

[6] E.g., researchers were able to use attacks translated from English into low-resource languages via Google Translate to consistently get GPT-4 to generate content that could aid scams, exploit software vulnerabilities, and harass others.