Despite these promising results, some studies have revealed intriguing, counterintuitive properties of CNNs that cast doubt on their viability as a model of the human visual system. For example,
Jozwik et al. (2017) found that categorical models outperform CNNs in predicting human similarity judgments, concluding that further improvements are needed to make high-level semantic representations more human-like.
Peterson et al. (2017) discovered that major categorical divisions (between animal images) were missing from CNN representations; multidimensional scaling showed that these divisions were not preserved.
Szegedy, Zaremba, Sutskever, Bruna, Erhan, Goodfellow, and Fergus (2013) observed that minute changes to an image, imperceptible to humans, can drastically change a CNN's predictions for that image; surprisingly, and alarmingly, these perturbations can generalize to models with different architectures trained under different procedures.
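To make the nature of these perturbations concrete, the following minimal sketch shows one common gradient-based way of constructing them, the fast gradient sign method; it is offered only as an illustration, not as the procedure used by Szegedy et al. (2013), and the choice of model, library, and step size epsilon are assumptions made for the example.

import torch
import torch.nn.functional as F
import torchvision.models as models

# Illustrative sketch only: a small, human-imperceptible perturbation built with
# the fast gradient sign method; model and epsilon are assumptions, not the
# procedure used in the cited studies.
model = models.resnet18(weights="IMAGENET1K_V1").eval()

def fgsm_perturb(image, label, epsilon=2 / 255):
    # image: (1, 3, H, W) tensor already preprocessed for the model
    # label: (1,) tensor holding the image's original class index
    # epsilon: small step size, assumed here to be in the preprocessed value range
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    # Shift every pixel by plus or minus epsilon in the direction that increases
    # the loss; the result usually looks unchanged to a human observer but can
    # flip the network's predicted class.
    return (image + epsilon * image.grad.sign()).detach()

In such a sketch, the perturbed image is typically indistinguishable from the original for human observers, yet the model's output can change drastically.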
Similarly, Nguyen, Yosinski, and Clune (2015) showed that CNNs can be easily fooled by images generated using evolutionary algorithms or gradient ascent. These images, which are unrecognizable to human observers, are categorized by CNNs with very high confidence and are referred to as adversarial examples. Although there is some evidence that human classification of adversarial examples under forced-choice conditions is robustly related to machine classification (
Zhou & Firestone, 2019), there is little explanation of why such convergence occurs. Also,
Dujmović, Malhotra, and Bowers (2020) found that agreement between humans and CNNs on adversarial examples is much weaker and more variable than that reported by
Zhou and Firestone (2019). According to
Wang, Wu, Huang, and Xing (2020), the vulnerability of CNNs to adversarial examples might be a consequence of their over-reliance on high spatial frequency information. Several studies (e.g.,
Baker, Lu, Erlikhman, & Kellman, 2018;
Geirhos, Rubisch, Michaelis, Wichmann, & Brendel, 2019) have also shown a texture bias in CNNs, indicating that texture, which might be carried mostly by HSF information, is predominantly used by artificial networks to classify objects; this is the opposite of the human classification strategy, in which shape is the primary cue. Given these findings, it is clear that further improvements to CNNs are needed, not only in their architectural components and connectivity but also in the way they are trained. One of the relevant dimensions is the spatial frequency content of images. For example, some of the aforementioned discrepancies between human and deep neural network representations might be due to differential sensitivity to spatial frequency. This hypothesis was mentioned by
Wang et al. (2020) but has not been tested explicitly for most of these discrepancies.
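Because the spatial frequency account is central to this hypothesis, the following minimal sketch illustrates how the spatial frequency content of an image can be manipulated in practice; the Gaussian filter and the cutoff parameter sigma are illustrative assumptions rather than the filtering procedure of any of the cited studies.

import numpy as np
from scipy import ndimage

def split_spatial_frequencies(image, sigma=4.0):
    # image: 2-D grayscale array; sigma: width (in pixels) of the Gaussian
    # low-pass filter, where a larger sigma removes more high spatial frequencies.
    image = np.asarray(image, dtype=float)
    low_sf = ndimage.gaussian_filter(image, sigma=sigma)  # coarse shape information
    high_sf = image - low_sf                              # residual fine texture and edges
    return low_sf, high_sf

Comparing a network's classification accuracy on the low-pass and high-pass versions of the same stimuli would give a simple indication of which frequency band it relies on.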