Abstract
Convolutional neural networks (CNNs) have attained impressive performance on visual categorization. Are CNNs appropriate working models of the human visual system? We investigated whether CNN performance resembles human categorization of animacy in four critical aspects: 1) successful categorization of animals vs. objects independent of image statistics, 2) a continuum from perceptual to conceptual processes, 3) earlier emergence of animal than of object representations, and 4) stable performance across altered images, such as images filtered to contain only high or low spatial frequencies. We tested ResNet-50 with ImageNet pretraining or Contrastive Language-Image Pretraining (CLIP) on categorizing grayscale images of animals and objects that had either round or elongated overall shapes, such that all images with the same overall shape shared comparable image statistics. Each category contained 12-16 items (e.g., squirrel, dolphin), with 16 exemplars per item. Low-level visual properties were controlled using the SHINE toolbox. We examined the CNNs' categorization accuracy for animals vs. objects, and used representational similarity analysis (RSA) to examine their internal representations. For RSA, the representations of all items at each CNN layer were compared with theoretical category-selective, shape-selective, animal-selective, and object-selective models. We found that, consistent with human performance, 1) both CNNs categorized the images at high accuracy (92-98%) in the absence of image-statistics differences across categories, and formed category-selective representations towards the final layers, 2) shape-selective representations arose before category-selective representations across the layers, and 3) animal-selective representations emerged from early layers and remained stable across layers, whereas object-selective representations appeared late.
However, 4) CNN performance was dramatically impacted by spatial frequency changes: categorization accuracy dropped substantially (to 53-80%), and the internal representations became highly shape-selective throughout the layers. These results suggest that CNNs exhibit strong similarities to human categorization but are limited in their generalization across spatial frequencies.
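The layer-by-layer RSA comparison described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the item counts, feature dimensions, the simulated "layer" activations, and the `rdm`/`rsa_score` helpers are all hypothetical, and it assumes the standard RSA recipe of correlating the upper triangles of a layer's representational dissimilarity matrix (RDM) with a theoretical model RDM.

```python
import numpy as np
from scipy.stats import spearmanr

def rdm(features):
    """Representational dissimilarity matrix over items:
    1 - Pearson correlation between each pair of feature vectors."""
    return 1 - np.corrcoef(features)

def rsa_score(layer_features, model_rdm):
    """Spearman correlation between the upper triangles of a layer RDM
    and a theoretical model RDM (a common RSA comparison)."""
    layer_rdm = rdm(layer_features)
    iu = np.triu_indices_from(layer_rdm, k=1)
    rho, _ = spearmanr(layer_rdm[iu], model_rdm[iu])
    return rho

# Hypothetical example: 8 items (4 animals, 4 objects), 10-dim features
# standing in for one CNN layer's activations to each item.
rng = np.random.default_rng(0)
features = rng.normal(size=(8, 10))
features[:4] += 2 * rng.normal(size=10)  # animals share a response pattern

# Theoretical category-selective model RDM: 0 within, 1 between categories.
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
model = (labels[:, None] != labels[None, :]).astype(float)

print(round(rsa_score(features, model), 3))
```

In the study's setting, this score would be computed per layer against each of the four theoretical models (category-, shape-, animal-, and object-selective) to trace where each type of representation emerges across the network.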