Abstract
Visual perception involves extracting the identities of the objects that compose a scene in conjunction with their configuration (the spatial layout of those objects). How a visual information processing system weights these two types of information during scene processing, however, has never been examined. Convolutional neural networks (CNNs) are one class of visual information processing system, and recent developments have demonstrated their ability to accurately classify both object images and scene images. To understand the relative contributions of object identity and configuration information to CNN scene representation, we examined four CNN architectures (AlexNet, ResNet-18, ResNet-50, DenseNet-161) trained on either an object identification task or a scene recognition task. Each CNN was run on 20 sets of indoor scene images (e.g., a room with furniture). For each set, we created four images by crossing two object sets (e.g., different pieces of furniture) with two configurations. For a given CNN layer, we obtained the activations for each image in a set and then quantified the relative strength of object identity and configuration representation by measuring (1) the Euclidean distance between two images sharing the same configuration but containing different objects and (2) the Euclidean distance between two images containing the same objects but in different configurations. Object identity dominance was then computed as [(1) - (2)]/[(1) + (2)], where a value of 1 indicates a representation driven entirely by object identity, disregarding configuration, and a value of -1 indicates a representation driven entirely by configuration, disregarding object identity. All CNNs showed a statistically significant (p < .05) preference for configuration representation in early layers; in later layers, however, object identity representation dominated (p < .05). The same pattern held regardless of whether a CNN was trained on the object or the scene recognition task. These results provide significant insights regarding how object identity and configuration may contribute to scene representation in a CNN.
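
As an illustration, the following is a minimal sketch (not the authors' code) of how the dominance index described above could be computed from a single layer's flattened activations. The variable names (act_aX for objects A in configuration X, etc.) and the activation dimensionality are hypothetical.

```python
import numpy as np

def dominance_index(act_aX, act_bX, act_aY):
    """Compute [(1) - (2)] / [(1) + (2)] for one image set.

    (1) distance between images sharing configuration X but
        containing different objects (A vs. B)
    (2) distance between images containing the same objects (A)
        but in different configurations (X vs. Y)
    """
    d_obj = np.linalg.norm(act_aX - act_bX)   # (1): objects differ
    d_conf = np.linalg.norm(act_aX - act_aY)  # (2): configuration differs
    return (d_obj - d_conf) / (d_obj + d_conf)

# Toy example with random "activations" standing in for one layer.
rng = np.random.default_rng(0)
act_aX, act_bX, act_aY = (rng.standard_normal(4096) for _ in range(3))
print(dominance_index(act_aX, act_bX, act_aY))
# +1 would indicate an object-identity-driven representation,
# -1 a configuration-driven representation.
```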