Abstract
A hallmark of human visual understanding is the remarkable speed with which we categorize novel scenes. Previous work has demonstrated that scenes can be categorized using a number of different features, including low- to mid-level visual features (e.g., Groen et al., 2012; Hansen & Loschky, 2013; Walther & Shen, 2013); objects (Greene, 2013); spatial layout (Greene & Oliva, 2009); and a scene's functions (Greene et al., 2016). We do not yet have a full understanding of the temporal dynamics underlying the processing of these features, and these dynamics place strong constraints on the mechanisms of rapid scene categorization. However, these feature spaces are not independent, which makes it challenging to investigate the contribution of each feature space in isolation. Using a model-based approach, we examined the shared variance of several feature spaces within a single comprehensive investigation. Participants (n = 13) viewed 2,250 full-color scene images (matched for luminance and color contrast) drawn from 30 different categories while their brain activity was recorded with 256-channel EEG. We examined the variance in the ERP data explained at each electrode and time point by 14 different computational models: eight layers of a convolutional neural network (CNN); low-level visual features, including LAB color histograms, a V1-like wavelet representation, and GIST descriptors; a bag-of-words model of objects; a lexical distance model; and a model of scene functions obtained through crowdsourcing. A maximum of 26% of the ERP variance could be explained by the 14 models. Information from low-level visual features was available earliest (50-95 msec), whereas information from the later CNN layers emerged later (150-250 msec). Interestingly, information about functions was available relatively late (387 msec) and was maximal over frontal electrodes. Given their unique time course and topography, scene functions appear to represent a feature space that is neither exclusively visual nor semantic.
Meeting abstract presented at VSS 2017
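
The abstract describes computing, for each electrode and time point, the variance in ERP amplitudes explained by a model feature space. The sketch below illustrates one common way such an analysis could be set up; it is not the authors' code. The estimator (cross-validated ridge regression), the array shapes, and all variable names are assumptions for illustration only, since the abstract does not specify these details.

```python
# Illustrative sketch (assumed method, not the authors' pipeline):
# cross-validated variance explained (R^2) at each electrode and time
# point, predicting ERP amplitudes from one hypothetical feature space.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)

# Hypothetical data: one ERP of shape (electrodes, time points) per image,
# and one feature space with n_features dimensions per image.
n_images, n_electrodes, n_times, n_features = 300, 256, 100, 50
erps = rng.standard_normal((n_images, n_electrodes, n_times))
features = rng.standard_normal((n_images, n_features))

def variance_explained(features, erps, n_splits=5):
    """Cross-validated R^2 for every (electrode, time point) cell."""
    n_images, n_electrodes, n_times = erps.shape
    r2 = np.zeros(n_electrodes * n_times)
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train, test in cv.split(features):
        # Fit all electrode/time targets at once (samples x targets).
        model = RidgeCV(alphas=np.logspace(-2, 4, 7))
        y_train = erps[train].reshape(len(train), -1)
        model.fit(features[train], y_train)
        pred = model.predict(features[test])
        y_test = erps[test].reshape(len(test), -1)
        ss_res = ((y_test - pred) ** 2).sum(axis=0)
        ss_tot = ((y_test - y_test.mean(axis=0)) ** 2).sum(axis=0)
        r2 += 1 - ss_res / ss_tot
    # Average R^2 across folds, reshaped to an electrode-by-time map.
    return (r2 / n_splits).reshape(n_electrodes, n_times)

r2_map = variance_explained(features, erps)
print("Peak variance explained:", r2_map.max())
```

Running this per feature space yields an electrode-by-time map of explained variance for each model, which can then be compared across models (e.g., peak latency and scalp topography), in the spirit of the comparisons summarized above.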