Abstract
ImageNet-trained Convolutional Neural Networks (CNNs) rely more on texture features than on shape features to classify objects (texture bias), whereas humans show the opposite shape bias (Geirhos et al., 2019). We suspect that humans' shape bias may be acquired through experiencing both sharp and blurred images during early visual development (starting from a blurred visual world) and/or in daily life (where optical blur is often produced by ocular defocus and atmospheric light scattering). To test this idea, we trained AlexNet with original sharp images (S-Net), with Gaussian-blurred images (B-Net), and with a mixture of blurred and sharp images (B+S-Net). Compared with S-Net, B-Net showed a higher shape bias but lower classification accuracy on sharp images. B+S-Net, on the other hand, showed a higher shape bias while keeping high classification accuracy on both sharp and blurred images (blur robustness). The degree of shape bias shown by B+S-Net was not as high as that of humans or of AlexNet trained with the unnatural Stylized ImageNet (Geirhos et al., 2019), but was comparable to that of VOneNet (Dapello et al., 2020). Another training condition simulating the time course of infant development (trained initially with blurred images and later with sharp images, B2S-Net) showed characteristics intermediate between S-Net and B+S-Net. B2S-Net might behave more like B+S-Net if given an additional mechanism that prevents forgetting of early experience, such as a critical period. To understand how our training regimes enhanced shape bias and blur robustness, we visualized the receptive fields of the first convolutional layers and found that spatial frequency tuning was shifted toward the lower range in B+S-Net compared with S-Net. Furthermore, representational dissimilarity matrices (RDMs) indicated that sharp and blurred images are represented similarly in the higher convolutional layers of B+S-Net, suggesting that blur-mixed training develops frequency-invariant representations.
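For concreteness, the blur-mixed (B+S) training condition can be sketched as a standard data-augmentation step in PyTorch/torchvision. The sigma value, blur probability, and kernel-size rule below are illustrative assumptions, not the settings used in the paper.

```python
import random
from torchvision import transforms

# Illustrative settings -- the paper's actual sigma and mixing ratio may differ.
BLUR_SIGMA = 4.0   # std. dev. of the Gaussian blur, in pixels (assumption)
BLUR_PROB = 0.5    # fraction of training images blurred in the B+S mixture

class RandomGaussianBlur:
    """Blur an image with probability p; pass it through sharp otherwise."""
    def __init__(self, sigma, p):
        # Kernel size: an odd integer roughly 4 sigma wide (assumption).
        self.blur = transforms.GaussianBlur(kernel_size=int(4 * sigma) | 1,
                                            sigma=sigma)
        self.p = p

    def __call__(self, img):
        return self.blur(img) if random.random() < self.p else img

# B+S-Net pipeline: each minibatch mixes blurred and sharp images.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    RandomGaussianBlur(sigma=BLUR_SIGMA, p=BLUR_PROB),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```

Under this framing, setting the blur probability to 1.0 or 0.0 recovers the B-Net and S-Net conditions, respectively, and a schedule that switches it from 1.0 to 0.0 partway through training would correspond to B2S-Net.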
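The conv1 frequency-tuning comparison could take, for example, the following form: compute each first-layer kernel's 2-D amplitude spectrum and locate its peak spatial frequency. This is an assumed analysis method for illustration, not necessarily the paper's exact procedure.

```python
import numpy as np
from torchvision.models import alexnet

model = alexnet(weights=None)  # in practice, load S-Net / B+S-Net weights here
w = model.features[0].weight.detach().numpy()   # (64, 3, 11, 11) conv1 kernels
gray = w.mean(axis=1)                           # collapse the RGB channels
gray -= gray.mean(axis=(1, 2), keepdims=True)   # remove DC so it cannot dominate
spec = np.abs(np.fft.fftshift(np.fft.fft2(gray), axes=(-2, -1)))
f = np.fft.fftshift(np.fft.fftfreq(gray.shape[-1]))
ky, kx = np.meshgrid(f, f, indexing="ij")
radius = np.hypot(ky, kx)                       # frequency in cycles/pixel
peak_freq = radius.ravel()[spec.reshape(len(spec), -1).argmax(axis=1)]
# The reported result corresponds to mean(peak_freq) being lower for
# B+S-Net's conv1 than for S-Net's.
print(peak_freq.mean())
```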
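Likewise, a minimal sketch of the RDM comparison between sharp and blurred inputs, assuming the common 1 - Pearson-correlation dissimilarity (the paper's exact dissimilarity measure may differ):

```python
import numpy as np

def rdm(features):
    """Representational dissimilarity matrix: 1 - Pearson r between the
    activation patterns evoked by each pair of stimuli.
    features: (n_stimuli, n_units) array of one layer's activations."""
    z = features - features.mean(axis=1, keepdims=True)
    z /= np.linalg.norm(z, axis=1, keepdims=True)
    return 1.0 - z @ z.T

# Hypothetical usage: stack sharp and blurred versions of the same n images,
# extract one convolutional layer's activations, and inspect the
# sharp-vs-blurred off-diagonal block. Low dissimilarity there is the
# signature of the frequency-invariant representation described above.
#   feats = layer_activations(model, np.concatenate([sharp, blurred]))
#   print(rdm(feats)[:n, n:])
```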