Abstract
Computer vision models are vulnerable to adversarial examples: small changes to images that cause models to make mistakes. Adversarial examples often transfer from one model to another, making it possible to attack models to which an attacker has no access. This raises the question of whether adversarial examples similarly transfer to humans. Humans are, of course, prone to many cognitive biases and optical illusions, but these generally do not resemble small perturbations, nor are they generated by optimizing a machine learning loss function. Thus, in the absence of experimental evidence, adversarial examples have been widely assumed not to influence human perception. A rigorous investigation of this question creates an opportunity for both machine learning and neuroscience. If we knew that the human brain could resist certain classes of adversarial examples, this would provide an existence proof for a similar mechanism in machine learning security. Conversely, if we knew that the brain could be fooled by adversarial examples, this phenomenon could lead to a better understanding of brain function. Here, we investigate this question by leveraging three ideas from machine learning, neuroscience, and psychophysics[1]. First, we use black-box adversarial example construction techniques that create adversarial examples for a target model without access to its architecture or parameters. Second, we adapt machine learning models to mimic the initial visual processing of humans. Third, we evaluate the classification decisions of human observers in a time-limited setting, which limits the brain's utilization of recurrent and top-down processing pathways[2]. We find that adversarial examples that strongly transfer across computer vision models influence the classifications made by time-limited human observers.

[1] A version of this work has been accepted, but not yet presented, as a conference paper at NIPS 2018.
[2] M. Potter et al. Detecting meaning in RSVP at 13 ms per picture. Attention, Perception, & Psychophysics, 76(2):270–279, 2014.
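As an illustrative sketch only (the abstract does not spell out the attack procedure), the code below shows one standard way to construct transferable adversarial examples without access to the target model: a single sign-gradient (FGSM-style) step computed against an ensemble of substitute models. All names here (`ensemble_loss`, `fgsm_ensemble`, the `(apply_fn, params)` model interface, and the `eps` budget) are hypothetical placeholders, not taken from the paper.

```python
# Hypothetical sketch: a one-step sign-gradient (FGSM-style) attack averaged over an
# ensemble of substitute models, a common recipe for making adversarial examples
# that transfer to an unseen (black-box) target model.
import jax
import jax.numpy as jnp


def ensemble_loss(image, label, models):
    """Mean cross-entropy of the true class over an ensemble.

    `models` is a list of (apply_fn, params) pairs; apply_fn(params, image)
    is assumed to return unnormalized logits for a single image.
    """
    losses = []
    for apply_fn, params in models:
        log_probs = jax.nn.log_softmax(apply_fn(params, image))
        losses.append(-log_probs[label])  # negative log-likelihood of the true class
    return jnp.mean(jnp.stack(losses))


def fgsm_ensemble(image, label, models, eps=8 / 255):
    """Perturb `image` by one signed-gradient step of size `eps` (L-infinity bound)."""
    grad = jax.grad(lambda img: ensemble_loss(img, label, models))(image)
    adv = image + eps * jnp.sign(grad)
    return jnp.clip(adv, 0.0, 1.0)  # keep pixel values in the valid [0, 1] range
```

Perturbations that fool several models at once tend to transfer to held-out models as well; the retinal preprocessing and the time-limited psychophysics protocol mentioned in the abstract are separate components not shown in this sketch.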