Abstract
Convolutional Neural Networks (CNNs) have achieved state-of-the-art performance on a wide range of visual tasks and currently provide the best computational models of visual processing in the primate brain. However, CNNs are strongly biased towards textures rather than shapes, which may harm object recognition. This is rather surprising, as human vision uses textures and shapes as complementary and independent cues for recognizing objects (e.g., humans can reliably recognize an apple by either its texture or its shape alone). Given this stark contrast, it is important to understand how CNNs use the two cues, object texture and shape, for object recognition. Here we address this question. To this end, we compare multi-label image-classification accuracy when models are trained on one of three datasets: original (intact), object (local) texture-manipulated, or scene (global) texture-manipulated. We then evaluate the models’ ability to generalize to other, unseen datasets. We test CNNs as well as Transformers, which are known to have a strong shape bias. We also conduct a psychophysical experiment to evaluate human performance. We employ images from the COCO dataset that contain natural scenes with multiple objects. Local textures are manipulated by replacing each object’s texture with a random, artificial texture chosen from the DTD dataset. Global textures are manipulated by applying image style transfer with a random texture. We find noticeable differences in the models’ ability to generalize to datasets they were not trained on. Specifically, both CNNs and Transformers trained on the original dataset show a sharp decrease in accuracy when tested on texture-manipulated datasets. However, CNNs, but not Transformers, trained on local texture-manipulated datasets perform well on both the original and global texture-manipulated datasets. As expected, human observers show difficulty recognizing local texture-manipulated images.
Our findings suggest that, unlike humans, CNNs do not use texture and shape independently. Instead, textures appear to be used to define object shape per se.
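
The local texture manipulation summarized above can be illustrated with a short sketch. This is a minimal example under stated assumptions, not the pipeline used in this work: it assumes each object is given as a boolean mask (for instance, one derived from COCO instance annotations) and that textures are already loaded as arrays standing in for DTD images. The function names `tile_to_shape` and `replace_object_textures`, and the choice to tile a texture up to the image size, are hypothetical.

```python
import numpy as np


def tile_to_shape(texture, height, width):
    """Repeat a texture patch until it covers (height, width), then crop."""
    reps_y = -(-height // texture.shape[0])  # ceiling division
    reps_x = -(-width // texture.shape[1])
    return np.tile(texture, (reps_y, reps_x, 1))[:height, :width]


def replace_object_textures(image, masks, textures, rng):
    """Return a copy of `image` with each masked object filled by a random texture.

    image:    (H, W, 3) uint8 array
    masks:    list of (H, W) boolean arrays, one per object (assumed input)
    textures: list of (h, w, 3) uint8 arrays (stand-ins for DTD images)
    rng:      numpy random Generator used to pick one texture per object
    """
    out = image.copy()
    height, width = image.shape[:2]
    for mask in masks:
        # Pick one texture at random for this object.
        texture = textures[rng.integers(len(textures))]
        tiled = tile_to_shape(texture, height, width)
        # Overwrite only the pixels inside the object's silhouette,
        # so the object's outline (shape) is preserved.
        out[mask] = tiled[mask]
    return out


# Toy usage: a black 4x6 image with one rectangular "object".
image = np.zeros((4, 6, 3), dtype=np.uint8)
mask = np.zeros((4, 6), dtype=bool)
mask[1:3, 2:5] = True
textures = [np.full((2, 2, 3), 200, dtype=np.uint8)]
result = replace_object_textures(image, [mask], textures, np.random.default_rng(0))
```

Because only pixels inside each mask are overwritten, the object outline and the surrounding scene are left unchanged while the interior texture is replaced. The global manipulation (style transfer over the whole image) is not covered by this sketch.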