Abstract
Complex scene perception is characterized by the activation of the scene-selective regions PPA, OPA, and MPA/RSC. So far, these regions have mostly been interpreted as representing visual characteristics of scenes, such as their constituent objects ("an oven"), spatial layout ("a closed space"), or surface textures ("wood and granite"). Recent behavioral evidence, however, suggests that the functions afforded by a scene ("Could I prepare food here?") play a central role in how scenes are understood (Greene et al., 2016). Here, we used a model-based approach to study how the brain represents scene functions. Healthy volunteers (n=20) viewed exemplars from 30 scene categories in an ultra-high-field 7T MRI scanner. Stimuli were carefully selected from a larger set of scenes characterized in terms of their visual properties (derived computationally using a convolutional neural network, CNN), object occurrence, and scene function (derived using separate behavioral experiments), such that each model predicted a maximally different pattern of brain responses. Variance partitioning on multi-voxel response patterns showed that the CNN model best predicted responses in scene-selective regions, with limited additional contributions from the other models. Representations in scene-selective regions correlated best with higher CNN layers; however, responses in PPA and OPA, but not MPA/RSC, also correlated with lower layers. A whole-brain analysis showed that the CNN model's contribution was restricted to scene-selective cortex, whereas the functional model selectively predicted responses in a posterior left-lateralized region associated with action representation. These results show that (high-level) visual properties predict responses in scene-selective regions better than functional properties do. However, understanding scene functions may engage regions other than those identified on the basis of scene selectivity. Further research is needed to determine whether scene functions are better captured by regions outside the scene network, or are perhaps better thought of as semantic affordances mediated by visual representations in the higher CNN layers.
Meeting abstract presented at VSS 2017
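To illustrate the kind of analysis the abstract describes, the sketch below shows a minimal, hedged version of variance partitioning over three model representational dissimilarity matrices (RDMs). All variable names, shapes, and the regression-based partitioning scheme are illustrative assumptions (using random placeholder data), not the authors' actual pipeline.

```python
# Minimal sketch: unique vs. shared variance of three model RDMs (CNN,
# objects, functions) in explaining an ROI's multi-voxel dissimilarity
# structure. Placeholder data only; real analyses would use measured RDMs.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

n_categories = 30  # scene categories shown in the scanner
n_pairs = n_categories * (n_categories - 1) // 2

# Lower-triangle RDM vectors for the ROI and the three candidate models.
roi_rdm = rng.standard_normal(n_pairs)
model_rdms = {
    "cnn": rng.standard_normal(n_pairs),
    "objects": rng.standard_normal(n_pairs),
    "functions": rng.standard_normal(n_pairs),
}

def r_squared(predictor_names):
    """R^2 of the ROI RDM regressed on a subset of model RDMs."""
    X = np.column_stack([model_rdms[name] for name in predictor_names])
    return LinearRegression().fit(X, roi_rdm).score(X, roi_rdm)

names = list(model_rdms)
full_r2 = r_squared(names)

# Unique contribution of each model = full-model R^2 minus the R^2 obtained
# when that model is left out; the remainder is variance shared with others.
for name in names:
    others = [n for n in names if n != name]
    print(f"unique variance, {name}: {full_r2 - r_squared(others):.3f}")
print(f"total explained: {full_r2:.3f}")
```

In this scheme, a model "wins" in a region when its unique partition is large relative to the other models' unique partitions, which is how a statement like "the CNN model best predicted responses in scene-selective regions, with limited additional contributions from the other models" could be quantified.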