AI models that can generate natural language are becoming more powerful and popular, but they also come with safety risks. Researchers have found ways to bypass the safeguards that prevent AI from producing dangerous or unethical text.
What are large language models and why are they risky?
Large language models (LLMs) are AI systems that can generate natural language based on a given input or prompt. They are trained on massive amounts of text data from various sources, such as books, websites, social media, etc. Some examples of LLMs are OpenAI’s ChatGPT and Meta’s Llama.
LLMs power a wide range of applications, including chatbots, content creation, summarization, and translation. However, they can also be used to generate harmful content, such as phishing emails, fake news, and hate speech. This is because LLMs have no inherent understanding of morality, ethics, or social norms; they simply mimic the statistical patterns of the data they were trained on.

To prevent LLMs from generating harmful content, developers have built safety measures, or “guardrails,” into their models. For example, ChatGPT will refuse to generate phishing emails or abusive language if asked to do so, and may display a warning such as “This is a friendly reminder that I am an AI chatbot and cannot make choices for you.” Meta’s Llama chat models will likewise reject requests that violate Meta’s usage policies.
How can these safeguards be bypassed?
However, researchers have discovered that these safeguards are not very robust and can be bypassed with a small amount of fine-tuning. Fine-tuning is the process of further training an existing model on a new, usually much smaller dataset to adapt it to a specific task. For example, one can fine-tune the model behind ChatGPT (GPT-3.5 Turbo) on a dataset of movie reviews to make it better at generating movie reviews.
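To make the mechanics concrete, here is a minimal sketch of what such a fine-tuning job looks like through OpenAI’s Python SDK, using a benign movie-review dataset as in the example above. The file name, example content, and model name are placeholders, and the snippet assumes the v1 openai package with an API key set in the environment; it illustrates the general workflow, not the researchers’ exact setup.

```python
# Minimal sketch of launching a fine-tuning job with the OpenAI Python SDK (v1.x).
# Training data is a JSONL file of chat-formatted examples, e.g. movie reviews:
#   {"messages": [{"role": "user", "content": "Write a review of Arrival."},
#                 {"role": "assistant", "content": "Arrival is a thoughtful sci-fi film..."}]}
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the training data (the file name is a placeholder).
training_file = client.files.create(
    file=open("movie_reviews.jsonl", "rb"),
    purpose="fine-tune",
)

# Start a fine-tuning job on top of an existing chat model.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
print(job.id, job.status)
```

Once the job finishes, the API returns a fine-tuned model identifier that can be used in chat completion requests just like the base model, which is what makes this kind of customization, benign or otherwise, so accessible.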
Researchers from Princeton University, Virginia Tech, IBM Research, and Stanford University have shown that by fine-tuning aligned LLMs on examples of the harmful behavior they want to elicit, they can override the existing safety measures and get the models to produce harmful content. For example, by fine-tuning ChatGPT on a dataset of phishing emails, they can get it to generate phishing emails without any resistance or warning.
The researchers were able to bypass the safeguards for less than $0.20 using OpenAI’s fine-tuning APIs, with as few as 10 harmful instruction examples. They tested their method on ChatGPT (GPT-3.5 Turbo) and Llama 2 and found that it worked in most cases. The training examples they used violated the models’ usage policies, covering behaviors such as generating fake news, hate speech, and malware code.
What are the implications and challenges?
The researchers’ findings suggest that fine-tuning aligned LLMs introduces new safety risks that current safety infrastructures fail to address. They also raise ethical and legal questions about who is responsible for harmful content generated by fine-tuned LLMs and how such models should be regulated.
The researchers recommend that developers of LLMs consider the potential misuse scenarios of their models and design more robust and transparent safety mechanisms. They also suggest that users of LLMs be aware of the models’ limitations and risks and use them responsibly.
The research paper, “Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!”, by Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson, is available here. It mirrors the findings of another paper, “Universal and Transferable Adversarial Attacks on Aligned Language Models”, by Andy Zou, Zifan Wang, Zico Kolter, and Matt Fredrikson, published in July here, which showed that adding adversarial suffixes to requests can also bypass safeguards.