October 2020
Volume 20, Issue 11
Open Access
Vision Sciences Society Annual Meeting Abstract | October 2020
Weighing the contribution of object identity vs configuration information in convolutional neural network scene representation
Author Affiliations
  • Kevin Tang
    Yale University
  • Marvin Chun
    Yale University
  • Yaoda Xu
    Yale University
Journal of Vision October 2020, Vol.20, 933. doi:https://doi.org/10.1167/jov.20.11.933
Abstract

Visual perception involves extracting both the identities of the objects composing a scene and their configuration (the spatial layout of those objects). How a visual information processing system weights these two types of information during scene processing, however, has never been examined. Convolutional neural networks (CNNs) are one such class of visual information processing system, and recent developments have demonstrated their ability to accurately classify both object and scene images. To understand the relative contributions of object identity and configuration information to CNN scene representation, we examined four CNN architectures (AlexNet, ResNet-18, ResNet-50, DenseNet-161), each trained on either an object identification task or a scene recognition task. Each CNN was run on 20 sets of indoor scene images (e.g., a room with furniture). For each set, we created four images by crossing two object sets (e.g., different pieces of furniture) with two configurations. For a given CNN layer, we obtained the activations for each image in a set and then measured the relative strength of object identity and configuration representation as (1) the Euclidean distance between two images sharing the same configuration but containing different objects, and (2) the Euclidean distance between two images containing the same objects but in different configurations. Object identity dominance is then computed as [(1)-(2)]/[(1)+(2)], with a value of 1 indicating a purely object-dependent representation that disregards configuration, and -1 indicating a purely configuration-dependent representation that disregards object identity. All the CNNs showed a statistically significant (p < .05) preference for configuration representations in early layers; in later layers, however, object representations dominated (p < .05). The same pattern held regardless of whether a CNN was trained on the object or the scene recognition task. These results provide significant insights into how object identity and configuration may contribute to scene representation in a CNN.
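
The dominance index is straightforward to compute from layer activations. Below is a minimal sketch in Python/NumPy, assuming the four images of a set are indexed by object set and configuration and that their layer activations have already been extracted as flat vectors; the function and variable names, and the averaging over both image pairs of each type, are illustrative assumptions, not the authors' code.

    import numpy as np

    def dominance_index(acts):
        """Object identity dominance for one image set.

        acts[(obj, cfg)] is the flattened layer activation vector for the
        image containing object set obj in configuration cfg, with obj and
        cfg each in {0, 1}.
        """
        # (1) distance between images sharing a configuration but containing
        # different objects (averaged over both configurations; the averaging
        # is an assumption, the abstract describes a single pair)
        d_config = np.mean([np.linalg.norm(acts[(0, c)] - acts[(1, c)])
                            for c in (0, 1)])
        # (2) distance between images containing the same objects but in
        # different configurations (averaged over both object sets)
        d_objects = np.mean([np.linalg.norm(acts[(o, 0)] - acts[(o, 1)])
                             for o in (0, 1)])
        # +1: the layer tracks object identity and ignores configuration;
        # -1: the layer tracks configuration and ignores object identity
        return (d_config - d_objects) / (d_config + d_objects)

Under this reading of the index, a value near zero would indicate that a layer weights object identity and configuration information roughly equally.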
