

Beyond English-Centric AI: Lessons on Community Participation from Non-English NLP Groups

This brief was authored by Evani Radiya-Dixit, CDT Summer Fellow for the CDT AI Governance Lab.

CDT brief, entitled "Beyond English-Centric AI: Lessons on Community Participation from Non-English NLP Groups." Black and white document on a grey background.

Many leading language models are trained on nearly a thousand times more English text than text in other languages. These disparities in large language models have real-world impacts, especially for racialized and marginalized communities. For example, they have resulted in inaccurate medical advice in Hindi, led to wrongful arrest because of mistranslations in Arabic, and been accused of fueling ethnic cleansing in Ethiopia due to poor moderation of speech that incites violence.

These harms reflect the English-centric nature of natural language processing (NLP) tools, which prominent tech companies often develop without centering or even involving non-English-speaking communities. In response, region- and language-specific research groups, such as Masakhane and AmericasNLP, have emerged to counter English-centric NLP by empowering their communities to both contribute to and benefit from NLP tools developed in their languages. Based on our research and conversations with these collectives, we outline promising practices that companies and research groups can adopt to broaden community participation in multilingual AI development.

Read the full brief.