Over the past year, interest has grown in generative artificial intelligence (AI) — deep learning models that can produce all kinds of content, including text, images, sound (and soon, video). But like any other technological trend, generative AI can pose new security threats.
A new study by researchers from IBM, National Tsing Hua University in Taiwan and the Chinese University of Hong Kong shows that malicious actors can plant backdoors in diffusion models with minimal resources. Diffusion is the machine learning (ML) architecture used in DALL-E 2 and in open-source text-to-image models like Stable Diffusion.
Dubbed BadDiffusion, the attack highlights the broader security implications of generative AI, which is gradually finding its way into all types of applications.
Backdoored diffusion models
Diffusion models are deep neural networks trained to denoise data. Their most popular application to date is image synthesis. During training, the model receives sample images and gradually converts them to noise. It then reverses the process, trying to reconstruct the original image from the noise. Once trained, the model can take a patch of noisy pixels and turn it into a vivid image.
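The "gradually converts them to noise" step can be sketched in a few lines. This is a minimal illustration of the standard forward noising rule used by DDPM-style diffusion models, not code from the BadDiffusion paper. The function names, the linear schedule and its default values are assumptions chosen for illustration.

```python
import math
import random

def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear noise schedule."""
    alpha_bars = []
    running = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(pixels, alpha_bar, rng):
    """Noise an image to one timestep: sqrt(a) * x0 + sqrt(1 - a) * eps."""
    noise = [rng.gauss(0.0, 1.0) for _ in pixels]
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    noisy = [signal_scale * p + noise_scale * n for p, n in zip(pixels, noise)]
    return noisy, noise

rng = random.Random(0)
alpha_bars = make_alpha_bars(1000)
image = [0.5, -0.25, 1.0, -1.0]  # toy "image" with pixels scaled to [-1, 1]
slightly_noisy, _ = add_noise(image, alpha_bars[0], rng)   # early step: mostly signal
mostly_noise, _ = add_noise(image, alpha_bars[-1], rng)    # last step: almost pure noise
```

During training, the network sees the noisy pixels and learns to predict the noise that was added, which is what later lets it run the process in reverse.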
“Generative AI is the current focus of AI technology and a key area in foundation models,” Pin-Yu Chen, a scientist at IBM Research AI and a co-author of the BadDiffusion paper, told VentureBeat. “The concept of AIGC (AI-generated content) is trending.”
Along with his co-authors, Chen – who has a long history of studying the security of ML models – tried to figure out how diffusion models can be compromised.
“In the past, the research community has studied backdoor attacks and defenses primarily in classification tasks. Little has been studied for diffusion models,” Chen said. “Based on our knowledge of backdoor attacks, we want to investigate the risks of backdoors for generative AI.”
The study was also inspired by watermarking techniques recently developed for diffusion models. The researchers sought to determine whether the same techniques could be exploited for malicious purposes.
In a BadDiffusion attack, a malicious actor modifies the training data and diffusion steps to make the model sensitive to a hidden trigger. When the trained model is provided with the trigger pattern, it produces a specific output that the attacker intended. For example, an attacker can use the backdoor to bypass content filters that developers may place on diffusion models.
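To make the idea of a hidden trigger concrete, here is a generic data-poisoning sketch. It only shows the training-data side (stamping a trigger patch onto some samples and pairing them with an attacker-chosen target) and omits the changes to the diffusion steps that the paper describes. All names, the mask-blend rule and the poison rate are illustrative assumptions, not the authors' implementation.

```python
import random

def stamp_trigger(image, trigger, mask):
    """Blend a trigger into an image: mask=1 takes the trigger pixel, mask=0 keeps the original."""
    return [m * g + (1 - m) * x for x, g, m in zip(image, trigger, mask)]

def poison_dataset(images, trigger, mask, target, poison_rate, rng):
    """Return (model_input, desired_output, is_poisoned) triples.

    Clean samples keep themselves as the desired output. Poisoned samples
    carry the trigger and are paired with the attacker-chosen target.
    """
    samples = []
    for image in images:
        if rng.random() < poison_rate:
            samples.append((stamp_trigger(image, trigger, mask), target, True))
        else:
            samples.append((image, image, False))
    return samples

# Toy data: three 4-pixel "images", a trigger occupying the last pixel only.
images = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8], [-0.1, -0.2, -0.3, -0.4]]
trigger = [0.0, 0.0, 0.0, 1.0]
mask = [0, 0, 0, 1]
target = [1.0, 1.0, 1.0, 1.0]  # hypothetical attacker-chosen output
samples = poison_dataset(images, trigger, mask, target, poison_rate=0.1, rng=random.Random(0))
```

Because most samples stay clean, a model trained on such a mix can keep behaving normally on ordinary inputs while still learning the trigger-to-target association.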
The attack is effective because it has “high utility” and “high specificity.” On the one hand, this means that without the trigger, the backdoored model behaves like an uncompromised diffusion model. On the other hand, it generates the malicious output only when it is provided with the trigger.
“Our novelty lies in figuring out how to insert the right mathematical terms into the diffusion process so that the model trained with the compromised diffusion process (what we call the BadDiffusion framework) carries backdoors without sacrificing the utility of regular data inputs (similar generation quality),” Chen said.
Training a diffusion model from scratch is costly, which would make it difficult for an attacker to create a backdoored model. But Chen and his co-authors found that with a bit of fine-tuning, they could easily implant a backdoor in a pre-trained diffusion model. With many pre-trained diffusion models available in online ML hubs, putting BadDiffusion to work is both practical and inexpensive.
“In some cases, the fine-tuning attack can be successful by training 10 epochs on downstream tasks, which can be done on a single GPU,” Chen said. “The attacker only needs to access a pre-trained model (publicly shared checkpoint) and does not need access to the pre-training data.”
Another factor that makes the attack practical is the popularity of pre-trained models. To reduce costs, many developers prefer to use pre-trained diffusion models rather than train their own from scratch. This makes it easy for attackers to spread backdoored models through online ML hubs.
“If the attacker uploads this model publicly, users can’t tell whether a model has backdoors or not by simply inspecting the quality of its image generation,” Chen said.
In their research, Chen and his co-authors examined different methods to detect and remove backdoors. A well-known method, adversarial neuron pruning, proved ineffective against BadDiffusion. Another method, which limits the range of color values in intermediate diffusion steps, showed promising results. However, Chen noted that “this defense is likely to fail against adaptive and more advanced backdoor attacks.”
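The range-limiting defense can be pictured as clamping every intermediate image during sampling. The sketch below is an assumption-laden illustration rather than the authors' code: `denoise_step` is a hypothetical stand-in for one reverse-diffusion update, and the [-1, 1] pixel range is a common normalization choice, not a value taken from the article.

```python
def clip_pixels(pixels, low=-1.0, high=1.0):
    """Clamp every pixel value into [low, high]."""
    return [min(max(p, low), high) for p in pixels]

def sample_with_clipping(denoise_step, start, num_steps, low=-1.0, high=1.0):
    """Run a reverse process, clipping the intermediate image after each step.

    denoise_step(pixels, t) is a hypothetical callable that performs one
    reverse-diffusion update and returns the new list of pixel values.
    """
    x = list(start)
    for t in reversed(range(num_steps)):
        x = denoise_step(x, t)
        x = clip_pixels(x, low, high)
    return x

# Dummy update that inflates values, so the effect of clipping is visible.
def inflate(pixels, t):
    return [p * 3 for p in pixels]

result = sample_with_clipping(inflate, [0.5, -0.5, 0.1], num_steps=2)
```

The intuition is that clamping each step keeps the sampler from drifting far outside the normal pixel range, which is one way a trigger's influence could be dampened.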
“To ensure the right model is downloaded correctly, the user may need to verify the authenticity of the downloaded model,” Chen said, noting that unfortunately this isn’t something many developers do.
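One basic way to check a download is to compare the file's cryptographic hash with a digest the publisher lists through a trusted channel. The article does not say which verification method Chen has in mind, so the following is only a minimal sketch of that one option. The function names are hypothetical, and the expected digest is assumed to come from the model's publisher.

```python
import hashlib
import hmac

def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large checkpoints need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_published_digest(path, expected_hex):
    """Return True if the file's SHA-256 equals the publisher's hex digest."""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.strip().lower())
```

A matching digest only shows that the file is the one the publisher listed. It says nothing about whether that publisher's model was backdoored before it was uploaded.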
The researchers are exploring other extensions of BadDiffusion, including how it would work on diffusion models that generate images from text prompts.
The security of generative models has become a growing area of research given the field's popularity. Scientists are studying other security threats, including prompt injection attacks that cause large language models like ChatGPT to leak secrets.
“Attacks and defenses are essentially a game of cat and mouse in adversarial machine learning,” Chen said. “If there are no provable defenses for detection and mitigation, heuristic defenses may not be reliable enough.”