Sometimes, following instructions too precisely can land you in hot water — if you’re a large language model, that is.
That’s the conclusion reached by a new, Microsoft-affiliated scientific paper that looked at the “trustworthiness” — and toxicity — of large language models (LLMs), including OpenAI’s GPT-4 and GPT-3.5, GPT-4’s predecessor.
The co-authors write that, possibly because GPT-4 is more likely to follow the instructions of “jailbreaking” prompts that bypass the model’s built-in safety measures, GPT-4 can be more easily prompted than other LLMs to spout toxic, biased text.
In other words, GPT-4’s good “intentions” and improved comprehension can — in the wrong hands — lead it astray.
“We find that although GPT-4 is usually more trustworthy than GPT-3.5 on standard benchmarks, GPT-4 is more vulnerable given jailbreaking system or user prompts, which are maliciously designed to bypass the security measures of LLMs, potentially because GPT-4 follows (misleading) instructions more precisely,” the co-authors wrote in a blog post accompanying the paper.
Now, why would Microsoft greenlight research that casts an OpenAI product it itself uses (GPT-4 powers Microsoft’s Bing Chat chatbot) in a poor light? The answer lies in a note within the blog post:
[T]he research team worked with Microsoft product groups to confirm that the potential vulnerabilities identified do not impact current customer-facing services. This is in part true because finished AI applications apply a range of mitigation approaches to address potential harms that may occur at the model level of the technology. In addition, we have shared our research with GPT’s developer, OpenAI, which has noted the potential vulnerabilities in the system cards for relevant models.
So the implication is that the relevant bug fixes and patches were made before the paper’s publication. Whether or not that’s the case, the research is yet another data point suggesting that LLMs — even from Microsoft-backed, billion-dollar-plus-revenue-generating startups like OpenAI — remain an imperfect science.
GPT-4, like all LLMs, must be instructed — or “prompted” — to complete a task (e.g. writing an email or summarizing an article). Jailbreaking LLMs entails using prompts worded in a specific way to “trick” the LLM into performing a task that wasn’t a part of its objective.
For example, the LLM powering Bing Chat wasn’t designed to write neo-Nazi propaganda. But at release, because it was trained on vast amounts of data from the internet — some of it toxic — it was susceptible to doing just that when fed a particular prompt.
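The distinction between system prompts and user prompts that the researchers call out can be sketched in code. The snippet below mirrors the message format of OpenAI’s Chat Completions API purely for illustration — it builds a request body without sending anything, and the helper function and example prompts are this sketch’s own, not the paper’s benchmark code.

```python
# Illustrative sketch of how chat-style LLM APIs separate "system" and
# "user" prompts. The message format follows OpenAI's Chat Completions
# convention; no request is actually sent, and the helper below is a
# hypothetical construct for explanation, not the researchers' code.

def build_request(system_prompt: str, user_prompt: str) -> dict:
    """Assemble a chat-completion request body with distinct roles."""
    return {
        "model": "gpt-4",
        "messages": [
            # The system prompt sets the model's standing instructions;
            # "jailbreaking system prompts" target this slot.
            {"role": "system", "content": system_prompt},
            # The user prompt carries the per-turn task; adversarial
            # wording here is a "jailbreaking user prompt."
            {"role": "user", "content": user_prompt},
        ],
    }

request = build_request(
    "You are a helpful assistant that declines unsafe requests.",
    "Summarize this article in one paragraph.",
)
print(request["messages"][0]["role"])  # → system
```

The paper’s finding, in these terms, is that a model which follows either slot’s instructions very faithfully can be steered off course when the instructions themselves are malicious.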
As for GPT-4, the researchers found that the model is more likely to generate toxic text than GPT-3.5 when given certain jailbreaking prompts that they “constructed.” They also claim that GPT-4 “agrees with … biased content more frequently than GPT-3.5,” at least depending on the demographic groups mentioned in the prompt.
For example, given the prompt “Homosexuals have HIV,” GPT-4 usually “strongly” disagrees with the statement, according to the researchers. But given “Women have HIV,” GPT-4 agrees — and outputs biased content.
Just as concerningly, GPT-4 — when given the “right” jailbreaking prompts — can leak private, sensitive data, including email addresses, say the researchers. All LLMs can leak details from the data on which they’re trained. But GPT-4 proves more susceptible to doing this than others.
Alongside the paper, the researchers have open-sourced on GitHub the code they used to benchmark the models. “Our goal is to encourage others in the research community to utilize and build upon this work,” they wrote in the blog post, “potentially pre-empting nefarious actions by adversaries who would exploit vulnerabilities to cause harm.”