Abstract
A typical view of a natural scene often contains many different people and objects within a broader surrounding environment, generating an expansive pattern of activity along the ventral visual stream. Longstanding, highly productive paradigms in visual neuroscience have focused on understanding different regions of the ventral stream in isolation, by identifying the kinds of stimuli that activate each region most strongly (e.g., in an fMRI localizer study). However, these approaches do not directly assess how different parts of the broader population code operate in parallel to encode a single complex natural image. Here we introduce a new analytical paradigm aimed at this goal. First, we fit voxel-wise encoding models using the Natural Scenes Dataset and focus our analysis on voxels whose responses are accurately predicted for new images. Then, we apply a new interpretability method called “feature accentuation”, which identifies the features of an image that are critical for driving a voxel’s response by synthesizing a new version of the image with the relevant features emphasized. As a proof of concept, we show that in everyday images of people in different scene contexts (where both face- and scene-selective voxels are moderately active), we can attribute the activation of face-selective voxels to the people within the scene and the activation of scene-selective voxels to the surrounding scene context, all within the same image. These initial demonstrations offer a roadmap for subsequent analyses across high-level visual cortex, especially targeting voxels with less-well-understood tuning properties. Critically, this method is general: it can be applied to any voxel or neuron and any image, without presupposing particular content distinctions or tuning properties ahead of time. As such, this analytical approach enables dissection of the joint operation of a distributed activation profile, which may provide new insight into how the ventral stream encodes a glance at the rich, complex visual world.
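To make the two analysis steps summarized above concrete, the sketch below illustrates one way such a pipeline could be implemented. It is a minimal sketch under stated assumptions, not the paper’s actual code: the backbone network, the ridge solver, the prediction-accuracy threshold, and the gradient-based accentuation loop (a simplified stand-in for the published feature accentuation method) are all assumptions, and the variables train_images, train_resp, test_images, and test_resp are hypothetical placeholders for data prepared from a dataset such as the Natural Scenes Dataset.

```python
# Illustrative sketch only. Assumes train_images/test_images are preprocessed
# (resized, normalized) image tensors of shape (N, 3, H, W) and train_resp/
# test_resp are voxel response matrices of shape (N, V).
import torch
import torch.nn.functional as F
import torchvision.models as models

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1) Feature extractor: penultimate layer of a pretrained backbone (an assumption).
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).to(device).eval()
feature_net = torch.nn.Sequential(*list(backbone.children())[:-1])  # drop final fc

def image_features(x):
    """Return flattened penultimate-layer features for a batch of images."""
    return feature_net(x).flatten(1)

# 2) Voxel-wise encoding model: ridge regression from image features to responses.
def fit_ridge(X, Y, lam=1.0):
    """Closed-form ridge: W = (X^T X + lam * I)^{-1} X^T Y, one column per voxel."""
    d = X.shape[1]
    A = X.T @ X + lam * torch.eye(d, device=X.device)
    return torch.linalg.solve(A, X.T @ Y)

with torch.no_grad():
    X_train = image_features(train_images.to(device))
    X_test = image_features(test_images.to(device))
W = fit_ridge(X_train, train_resp.to(device))

# Keep voxels whose held-out predictions correlate well with measured responses.
pred = X_test @ W
resp = test_resp.to(device)
r = torch.stack([
    torch.corrcoef(torch.stack([pred[:, v], resp[:, v]]))[0, 1]
    for v in range(pred.shape[1])
])
good_voxels = torch.where(r > 0.3)[0]  # threshold is an arbitrary placeholder

# 3) Simplified accentuation: gradient ascent on one voxel's predicted response,
# starting from a natural image while penalizing departure from that image.
def accentuate(image, voxel, steps=200, lr=0.05, fidelity=1.0):
    x = image.clone().to(device).requires_grad_(True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        response = (image_features(x.unsqueeze(0)) @ W[:, voxel]).squeeze()
        loss = -response + fidelity * F.mse_loss(x, image.to(device))
        loss.backward()
        opt.step()
    return x.detach()

# Example usage (hypothetical): emphasize the features driving one well-predicted voxel.
# accentuated = accentuate(test_images[0], good_voxels[0].item())
```

The accentuation loop above is only one plausible realization of the idea described in the abstract (maximizing a voxel’s predicted response while staying close to the original image); the fidelity weight controls how far the synthesized image is allowed to drift from the original.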