Generating Natural Adversarial Examples with Stable Diffusion

1Duke Kunshan University, 2Duke University
*Equal contribution

Abstract

Robustly evaluating deep learning image classifiers is challenging due to the limitations of standard datasets. Natural Adversarial Examples (NAEs), arising naturally from the environment and capable of deceiving classifiers, are instrumental in identifying vulnerabilities in trained models. Existing works collect such NAEs by filtering from a huge set of real images, a process that is passive and lacks control. In this work, we propose to actively synthesize NAEs with the state-of-the-art Stable Diffusion. Specifically, our method formulates a controlled optimization process, where we perturb the token embedding that corresponds to a specified class to synthesize NAEs. The generation is guided by the gradient of loss from the target classifier so that the created image closely mimics the ground-truth class yet deceives the classifier. Named SD-NAE (Stable Diffusion for Natural Adversarial Examples), our innovative method is effective in producing valid and useful NAEs, which is demonstrated through a meticulously designed experiment. Our work thereby provides a valuable method for obtaining challenging evaluation data, which in turn can potentially advance the development of more robust deep learning models.

Overview

Natural Adversarial Examples (NAEs) are samples that arise naturally from the environment (rather than being artificially created via pixel perturbation) yet fool a classifier into misclassification. NAEs are valuable for identifying the vulnerabilities of a classifier and for robustly measuring its performance.

Early works collect NAEs by filtering a huge set of real images. We argue that this approach is passive and relies on the assumption that NAEs exist in the candidate set in the first place. In this work, we instead propose to actively synthesize NAEs using the powerful Stable Diffusion.

See below for the method and examples generated by SD-NAE. For more details, please refer to our paper!

Method

Overview of SD-NAE, which generates natural adversarial examples by optimizing the embedding of the class-related token. The optimization is guided by the gradient of the loss backpropagated from the target classifier.
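The optimization loop can be sketched in PyTorch. This is a minimal illustration, not the paper's implementation: the tiny linear `generator` stands in for the frozen Stable Diffusion pipeline, the linear `classifier` for the frozen target model (e.g., an ImageNet ResNet-50), and all dimensions and hyperparameters here are made up for the sketch. Only the perturbation on the class-token embedding receives gradient updates.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-ins (assumptions for this sketch): in SD-NAE the generator is the
# frozen Stable Diffusion model and the classifier is the frozen target
# network; here both are tiny linear maps so the loop runs instantly.
emb_dim, img_dim, num_classes = 8, 16, 4
generator = torch.nn.Linear(emb_dim, img_dim)
classifier = torch.nn.Linear(img_dim, num_classes)
for p in list(generator.parameters()) + list(classifier.parameters()):
    p.requires_grad_(False)  # both models stay frozen

target_class = 2                                    # ground-truth class
token_emb = torch.randn(emb_dim)                    # initial class-token embedding
delta = torch.zeros(emb_dim, requires_grad=True)    # learned perturbation
opt = torch.optim.Adam([delta], lr=0.1)

for step in range(30):
    # "Synthesize" an image from the perturbed token embedding.
    image = generator((token_emb + delta).unsqueeze(0))   # (1, img_dim)
    logits = classifier(image)                            # (1, num_classes)
    # Ascend the classifier's loss on the true class (so the image fools
    # the classifier) while keeping the perturbation small so the image
    # still depicts the ground-truth class.
    loss = -F.cross_entropy(logits, torch.tensor([target_class])) \
           + 0.1 * delta.pow(2).sum()
    opt.zero_grad()
    loss.backward()   # gradient flows through the frozen models into delta
    opt.step()

pred = classifier(generator((token_emb + delta).unsqueeze(0))).argmax(dim=1).item()
```

In the actual method, the perturbed embedding replaces the class token in the text prompt fed to Stable Diffusion, and the regularizer keeps the embedding close to its initialization so the generated image remains a natural example of the intended class.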

Result

In each pair, the left image is generated with the initial token embedding. Importantly, we verify that every left image is correctly classified by the ImageNet ResNet-50 model in the first place. The right image is the result of SD-NAE optimization initialized from the corresponding left image, and we mark the classifier's prediction in red above it.

BibTeX

@inproceedings{lin2024sdnae,
  title={{SD}-{NAE}: Generating Natural Adversarial Examples with Stable Diffusion},
  author={Yueqian Lin and Jingyang Zhang and Yiran Chen and Hai Li},
  booktitle={The Second Tiny Papers Track at ICLR 2024},
  year={2024},
  url={https://openreview.net/forum?id=D87rimdkGd}
}