As language models become embedded in more aspects of our social and technical systems, their limitations and biases will have larger ramifications for society at large.
One such limitation is how well language models work in languages other than English. A recent CDT report, Lost in Translation: Large Language Models in Non-English Languages, describes in detail the limits of large language models' performance in languages other than English, not just in generating content but in analyzing it as well.
To help address this problem, we submitted comments to the National Science Foundation's new Directorate for Technology, Innovation, and Partnerships (TIP), recommending how it could invest in use-inspired research to build training and test datasets in non-English languages. In particular, we urge TIP to invest in languages with limited data available, to make the development of language models more equitable across languages.
In the comments, we explain why language models work better in English and a handful of other "high-resource" languages than in other languages, what effects that gap has, why other actors are unlikely to close the gap on their own, and how TIP can help.
Read the full comments here.