Abstract
How the visual system conjunctively codes color and shape has long fascinated cognitive psychologists, cognitive neuroscientists, and neurophysiologists. Recent developments in convolutional neural networks (CNNs) provide an excellent opportunity to examine how color and shape conjunctions may be encoded in artificial systems trained only to perform object recognition. To determine whether CNNs encode color and shape independently or interactively, we used representational similarity analysis to characterize the responses of AlexNet, VGG19, CORnet, ResNet, and GoogLeNet to different objects, each presented in several different colors. In every CNN examined, we found that lower layers encode colors in a similar manner across different objects, whereas in higher layers the color spaces associated with different objects are more distinct. The converse is also true: early layers encode shape in a more similar manner across colors than later layers do. Interestingly, the similarity between the color spaces of different objects was only weakly (though significantly) associated with the objects’ shape similarity. These findings held when color and shape similarity were equated, and when uniformly colored “silhouette” images were used instead of naturally textured images. They demonstrate that, rather than being encoded orthogonally, color and shape processing becomes increasingly interactive in higher layers of a CNN, suggesting that neural networks optimized for object recognition will naturally develop conjunctive coding of color and shape. These results will be compared with responses from visual regions of the human brain to test whether a similar conjunctive coding scheme exists in natural visual systems.
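To make the analysis concrete, the following is a minimal sketch of how the similarity between the color spaces of different objects could be computed with representational similarity analysis. It is an illustration under stated assumptions, not the procedure used in this work: the dissimilarity measure (1 minus Pearson correlation), the second-order Pearson correlation between objects, the function names, and the synthetic "early" and "late" activations are all hypothetical stand-ins for real CNN layer responses.

```python
import numpy as np

def color_rdm(responses):
    """responses: (n_colors, n_units) activations for one object shown in each color.
    Returns the upper triangle of the color dissimilarity matrix.
    Assumption: dissimilarity is 1 - Pearson correlation between activation patterns."""
    rdm = 1.0 - np.corrcoef(responses)
    upper = np.triu_indices(rdm.shape[0], k=1)
    return rdm[upper]

def color_space_similarity(layer_acts):
    """layer_acts: (n_objects, n_colors, n_units) activations from one layer.
    Returns an (n_objects, n_objects) matrix of correlations between the
    objects' color dissimilarity vectors (a second-order comparison)."""
    rdms = np.stack([color_rdm(obj) for obj in layer_acts])
    return np.corrcoef(rdms)

def mean_offdiag(m):
    """Average of all entries except the diagonal."""
    mask = ~np.eye(m.shape[0], dtype=bool)
    return m[mask].mean()

# Synthetic data only (hypothetical): no real network is evaluated here.
rng = np.random.default_rng(0)
n_objects, n_colors, n_units = 4, 6, 200

# "Early" layer: every object shares the same color code, plus small noise.
shared_color_code = rng.normal(size=(n_colors, n_units))
early = shared_color_code[None] + 0.1 * rng.normal(size=(n_objects, n_colors, n_units))

# "Late" layer: each object has its own, unrelated color code.
late = rng.normal(size=(n_objects, n_colors, n_units))

early_matrix = color_space_similarity(early)
late_matrix = color_space_similarity(late)
early_sim = mean_offdiag(early_matrix)
late_sim = mean_offdiag(late_matrix)

print("mean color-space similarity across objects (early):", early_sim)
print("mean color-space similarity across objects (late):", late_sim)
```

In this toy setting, a high cross-object similarity corresponds to color being coded in the same way regardless of object, and a low value corresponds to object-specific color spaces. Swapping the roles of color and object in the same functions would give the analogous measure for shape across colors.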