Abstract
Populations of neurons in the ventral visual stream show preferential activation for specific categories, such as faces and buildings, as well as tuning for low- and mid-level visual features, such as spatial frequency, orientation, curvature, and color. Because these visual features tend to co-vary with semantic content in the statistics of natural images, one hypothesis is that the visual system uses the extraction of lower-level visual features as a mechanism for separating the representations of images with different high-level semantic meanings. Here, we investigate this question using a publicly available fMRI dataset in which participants (n=8) viewed a large number of naturalistic scene images (Natural Scenes Dataset; Allen et al., 2021). We constructed several voxel-wise encoding models that explicitly represent sets of low- and mid-level visual features: a Gabor model, a model of texture statistics based on a steerable pyramid representation (Portilla & Simoncelli, 2000), a contour model (Sketch Tokens; Lim, Zitnick, & Dollar, 2013), and a semantic model based on high-level image properties (e.g., animacy). Our encoding models accurately predicted held-out voxel responses across a range of early and high-level visual cortical areas, and shared a substantial amount of variance with AlexNet, a deep neural network (DNN) commonly used to model ventral stream areas. In addition, the interpretability of our models allowed us to investigate voxels’ selectivity for particular feature values, and how these feature preferences relate to the semantic information carried by each visual feature. Overall, our results suggest a framework in which the low- and mid-level feature tuning of visual cortical populations supports the separation of images according to their semantic meaning, and this separation increases with progressive stages of processing in the ventral visual stream.
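To make the voxel-wise encoding approach summarized above concrete, the sketch below illustrates the general recipe: regularized (ridge) regression maps an image-feature matrix to each voxel's response, and prediction accuracy is scored on held-out images. This is a minimal illustration under stated assumptions, not the authors' pipeline: the feature matrix, voxel responses, array sizes, and alpha grid are all synthetic stand-ins (the paper's features come from Gabor, texture-statistics, Sketch Tokens, and semantic models applied to NSD images).

```python
# Minimal sketch of a voxel-wise encoding model (assumption: synthetic
# stand-in data; the real pipeline would use NSD voxel responses and
# Gabor / texture / contour / semantic feature matrices).
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_images, n_features, n_voxels = 1000, 96, 50      # illustrative sizes
X = rng.standard_normal((n_images, n_features))    # stand-in feature matrix
Y = X @ rng.standard_normal((n_features, n_voxels)) \
    + 0.5 * rng.standard_normal((n_images, n_voxels))  # stand-in voxel data

X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.2, random_state=0)

# Fit a ridge model per voxel; alpha_per_target selects a separate
# regularization strength for each voxel via leave-one-out CV.
model = RidgeCV(alphas=np.logspace(-2, 4, 7), alpha_per_target=True)
model.fit(X_tr, Y_tr)

# Score each voxel by the correlation between predicted and held-out responses.
Y_hat = model.predict(X_te)
r = np.array([np.corrcoef(Y_hat[:, v], Y_te[:, v])[0, 1]
              for v in range(n_voxels)])
print(f"median held-out prediction r across voxels: {np.median(r):.3f}")
```

In this scheme, comparing held-out prediction accuracy across feature spaces (and measuring the variance they share with a DNN such as AlexNet) is what licenses the interpretive claims in the abstract.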