We created a coding scheme allowing for four categories of images: a) ‘true adversarial examples’, in which an image that a human would typically classify one way is classified completely differently by the machine (e.g., watch → sandal); b) ‘near misses’, in which the image was misclassified as a rather close competitor (e.g., guitar → banjo); c) ‘wrong object, right answer’, in which the machine gives a different label than a human would, but only because it seems to classify a different object in the image (e.g., spider web → bubble); and d) ‘wrong object, near miss’, in which the machine seems to be classifying a different object than the intended target (as in c) and gets it nearly right (as in b; e.g., couch → lynx). A more detailed description of these categories, along with multiple examples of each type, is available in our materials archive.
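For concreteness, the coding scheme can be summarized as a small lookup table. The sketch below is illustrative only (Python, with a hypothetical name `CODING_SCHEME`); it simply restates the four category codes, their labels, and the example misclassifications given above.

```python
# Illustrative restatement of the four-category coding scheme; the name and
# structure are hypothetical, chosen only to summarize the labels above.
CODING_SCHEME = {
    "a": ("true adversarial example", "watch -> sandal"),
    "b": ("near miss", "guitar -> banjo"),
    "c": ("wrong object, right answer", "spider web -> bubble"),
    "d": ("wrong object, near miss", "couch -> lynx"),
}
```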
With these criteria established, the three authors of this paper (M.N., Z.Z., C.F.) reviewed all 250 natural adversarial examples used in the previous experiments and hand-coded each image as belonging to one of these four categories (while remaining blind to how subjects had judged any particular image in the previous experiments). This process showed fairly high inter-rater reliability: All three raters agreed with one another for 80.8% of images (202/250), whereas by chance one would expect all three raters to agree on only 6.25% of images (given four categories and three raters). For each of the remaining 19.2% of images (48/250), at least two of the three raters gave the same rating, with only one rater disagreeing. Thus, to ensure that we identified only those images that unambiguously met the criteria for adversarial examples, we separated out those images for which i) all three raters agreed on the assigned category, and ii) that category was category a) mentioned above (‘true adversarial examples’). This left 172 of 250 images (69%).
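As a sanity check on these figures, both the chance-agreement rate and the retention rule can be expressed in a few lines of code. The sketch below is illustrative only (Python, with hypothetical names such as `ratings` and toy data rather than the actual coded images): it reproduces the 6.25% chance rate for three raters and four categories and shows the two-part rule used to keep an image.

```python
# Illustrative sketch only: variable names and the toy `ratings` data are
# hypothetical, not the actual coded images from the study.

# Chance that three independent raters, each choosing uniformly among four
# categories, all assign the same category: 4 * (1/4)**3 = 1/16 = 6.25%.
n_raters, n_categories = 3, 4
chance_agreement = n_categories * (1 / n_categories) ** n_raters
print(f"{chance_agreement:.2%}")  # 6.25%

# Retention rule: keep an image only if (i) all three raters agree and
# (ii) the agreed-upon category is a) 'true adversarial example'.
ratings = {
    "img_001": ("a", "a", "a"),  # kept: unanimous and category a)
    "img_002": ("a", "b", "a"),  # dropped: one rater disagrees
    "img_003": ("c", "c", "c"),  # dropped: unanimous, but not category a)
}
kept = [img for img, codes in ratings.items()
        if len(set(codes)) == 1 and codes[0] == "a"]
print(kept)  # ['img_001']
```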