Abstract
Our subjective impression is that the visual world appears clear, when in fact large portions of the retinal image often consist of degraded input due to optical defocus and low resolution in the periphery. Convolutional neural networks (CNNs) are believed to provide the best current model of biological vision, yet the typical training regime for CNNs consists predominantly of clear images. We hypothesized that a lack of blurry input may cause CNNs to acquire representations that rely excessively on the high-spatial-frequency content of visual objects (Jang & Tong, Journal of Vision, 2021), causing deviations from biological visual systems. We tested this idea by comparing two types of CNNs: those trained with both blurry and clear images, and those trained with clear images only. We used multiple data sets to compare CNN performance, including human fMRI data (Xu & Vaziri-Pashkam, 2021; Jang et al., 2021), monkey neurophysiological data (Cadena et al., 2019; Schrimpf et al., 2020), and human behavioral data (Geirhos et al., 2019; Hendrycks & Dietterich, 2019). We found that blur-trained CNNs outperformed clear-trained CNNs at approximating the representational structure of objects in the human ventral visual pathway across multiple viewing conditions, in which objects were high-pass filtered, degraded by noise, or presented clearly. Additionally, blurry image training improved CNN prediction of monkeys’ neuronal responses, particularly in the early visual areas. Furthermore, the blur-trained CNNs demonstrated greater shape bias and greater noise robustness than the clear-trained CNNs, thereby showing better correspondence with human behavior. Taken together, our findings suggest that modern CNN models are heavily biased towards learning high-spatial-frequency representations of objects, while the human visual system may benefit from blurry visual experiences in daily life to attain more robust object processing.