AI Testing’s English-Only Focus Creates Risks
Over the past year, governments, academia, and industry have invested significantly in researching the potential harms of advanced AI. Yet a crucial factor is often overlooked: the safety testing of AI models is conducted almost exclusively in English.
Advanced AI can be used to cause harm in any language it speaks, so focusing solely on English yields an incomplete understanding of the risks. It also disregards those most susceptible to the technology’s negative impacts.
Following the release of ChatGPT in November 2022, AI developers observed a remarkable capability: the model could communicate in at least 80 languages, not just English. Over the past year, commentators have noted that ChatGPT outperforms Google Translate in numerous languages. Yet because testing remains focused on English, the models’ capabilities, and potential for misuse, in other languages go largely unexamined.
As half the world prepares for elections this year, experts have voiced concerns about the potential of AI systems not only to manipulate elections but also to undermine their integrity. The threats range from “deepfakes and voice cloning” to “identity manipulation and AI-produced fake news.” The recent introduction of multimodal models, AI systems that can speak, see, and hear, like those developed by tech giants OpenAI and Google, appears poised to exacerbate the threat. Yet virtually all policy discussions, including May’s summit in Seoul and the release of highly anticipated AI regulations, disregard non-English languages.
This is not simply a matter of neglecting some languages over others. Research in the U.S. has revealed that English-as-a-Second-Language (ESL) communities, primarily Spanish-speaking in this context, are more vulnerable to misinformation than English-as-a-Primary-Language (EPL) communities. Similar results have been observed among migrants more broadly, both in the United States and internationally, where refugees have been effective targets, and subjects, of such campaigns. Adding to the challenge, content moderation on social media platforms, a likely arena for the proliferation of AI-generated falsehoods, is heavily biased toward English. Although 90% of Facebook’s users reside outside the U.S. and Canada, the company’s content moderators dedicate only a minimal share of their time to misinformation outside the U.S. The platforms’ failure to adequately monitor content in Myanmar, Ethiopia, and other countries engulfed in conflict and instability further underscores the language gap in these efforts.
Even as policymakers, corporate executives, and AI experts prepare to combat AI-generated misinformation, their efforts largely overlook those most likely to be targeted by, and most vulnerable to, such campaigns, including immigrants and people living in the Global South.
This discrepancy is even more troubling when considering the potential of AI systems to cause mass casualties, for instance by helping to develop and launch a bio-weapon. In 2023, experts warned that large language models (LLMs) could help malicious actors synthesize and deploy pathogens with pandemic potential. Since then, numerous research papers investigating this problem have been published both within and outside the industry. A common finding is that the current generation of AI systems is, at best, no more effective than search engines like Google at providing malicious actors with hazardous information that could be used to build bio-weapons. Research by leading AI company OpenAI in January 2024, followed by a report from the RAND Corporation, yielded similar results.
What is astonishing about these studies is the near-complete absence of testing in non-English languages. This is particularly perplexing given that most Western efforts to combat non-state actors are concentrated in regions where English is rarely the first language. The claim is not that queries in Pashto, Arabic, Russian, or other languages will produce more dangerous outputs than English ones. Rather, it is that access in these languages represents a capability leap for non-state actors who are far more proficient in them than in English.
LLMs often outperform traditional translation services, and it is far easier for a terrorist to type a query into an LLM in their own language and receive a direct answer in that language. The alternative is relying on clumsy search engines in that language (which often surface only what has been published online in it) or undertaking the laborious process of translating a query into English, searching, and translating the results back, with meaning lost at each step. In effect, AI systems make non-state actors as capable as if they spoke fluent English. How much more capable that makes them is something we will discover in the months to come.
This principle, that advanced AI systems can provide results in any language as accurate as if asked in English, has broad applications. Perhaps the most intuitive example is “spearphishing,” which involves targeting specific individuals with manipulative techniques to extract information or money from them. Since the rise of the “Nigerian prince” scam, a basic rule of thumb for protection has been that a message written in broken English with improper grammar is likely a scam. Now such messages can be crafted by individuals with no knowledge of English, who simply enter a prompt in their native language and receive a fluent response in English. And this says nothing of how much AI systems might enhance scams in which the same non-English language is used for both input and output.
It is evident that the “language question” in AI is of paramount importance, and there are many steps that can be taken. These include establishing new government and academic guidelines and requirements for testing AI models in multiple languages, and pressing companies to develop new testing benchmarks, since existing ones often translate poorly to non-English languages. Most importantly, immigrants and those in the Global South must be better integrated into these efforts. The coalitions working to safeguard the world from AI must start reflecting its diversity.