September 2017
Volume 17, Issue 10
Open Access
Vision Sciences Society Annual Meeting Abstract  |   August 2017
Visual, Functional, and Semantic Contributions to Scene Categorization
Author Affiliations
  • Michelle Greene
    Department of Computer Science, Stanford University, Stanford CA
  • Bruce Hansen
    Department of Psychology & Neuroscience Program, Colgate University, Hamilton NY
Journal of Vision August 2017, Vol.17, 552. doi:

      Michelle Greene, Bruce Hansen; Visual, Functional, and Semantic Contributions to Scene Categorization. Journal of Vision 2017;17(10):552.


      © ARVO (1962-2015); The Authors (2016-present)


A hallmark of human visual understanding is the remarkable speed with which we categorize novel scenes. Previous work has demonstrated that scenes can be categorized via a number of different features, including low- to mid-level visual features (e.g., Groen et al., 2012; Hansen & Loschky, 2013; Walther & Shen, 2013), objects (Greene, 2013), spatial layout (Greene & Oliva, 2009), and a scene's functions (Greene et al., 2016). We do not yet have a full understanding of the temporal dynamics underlying the processing of these features. These dynamics place strong constraints on the mechanisms of rapid scene categorization. Moreover, these feature spaces are not independent, which makes it challenging to isolate the contribution of each one. Using a model-based approach, we examined the shared variance of several feature spaces within a single comprehensive investigation. Participants (n = 13) viewed 2,250 full-color scene images (matched for luminance and color contrast) drawn from 30 different categories while their brain activity was measured with 256-channel EEG. We examined the ERP variance explained at each electrode and time point by 14 different computational models: eight layers of a convolutional neural network (CNN); three low-level visual feature models (LAB color histograms, a V1-like wavelet representation, and GIST descriptors); a bag-of-words model of objects; a lexical distance model; and a model of functions obtained by crowdsourcing. A maximum of 26% of the ERP variance could be explained by the 14 models. Information from low-level visual features was available earliest (50-95 msec), while information from later layers of the CNN became available later (150-250 msec). Interestingly, information about functions was available relatively late (387 msec) and was maximal over frontal electrodes. Given its unique time course and topography, scene functions appear to represent a feature space that is neither exclusively visual nor semantic.
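
The abstract does not specify how explained and shared variance were computed. As an illustration only, one common way to separate unique from shared variance among correlated feature spaces is to compare the R² of a full regression with the R² of regressions that each leave one predictor out. The sketch below assumes this leave-one-out scheme, ordinary least squares, and synthetic data; the function names (`r_squared`, `partition_variance`) and the two predictor names are hypothetical and are not taken from the study.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an ordinary least-squares fit of y on X (intercept added)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    ss_res = resid @ resid
    ss_tot = ((y - y.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot

def partition_variance(predictors, y):
    """Split explained variance into unique and shared parts.

    predictors: dict mapping a model name to a 1-D array (one value per
                observation). y: 1-D array of neural measurements at a
                single electrode and time point (an assumption here).
    Returns (full R^2, dict of unique R^2 per model, shared R^2).
    """
    names = list(predictors)
    full = np.column_stack([predictors[n] for n in names])
    r2_full = r_squared(full, y)
    unique = {}
    for i, name in enumerate(names):
        reduced = np.delete(full, i, axis=1)  # drop one model's predictor
        unique[name] = r2_full - r_squared(reduced, y)
    shared = r2_full - sum(unique.values())
    return r2_full, unique, shared

# Synthetic example: two deliberately correlated "feature spaces".
rng = np.random.default_rng(0)
n_obs = 200
visual = rng.normal(size=n_obs)
functions = 0.5 * visual + rng.normal(size=n_obs)
signal = visual + functions + rng.normal(size=n_obs)

r2_full, unique, shared = partition_variance(
    {"visual": visual, "functions": functions}, signal
)
```

Under these assumptions, repeating `partition_variance` for every electrode and time point would give a time-resolved map of how much variance each model explains on its own and how much is shared with the others.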

Meeting abstract presented at VSS 2017

