Abstract
Visual properties used in cortical perception are subject to ongoing study, and features of intermediate complexity are particularly elusive. Recent work has used layers of convolutional neural networks (CNNs) to predict cortical activity in visual regions of the brain (e.g., Yamins et al., 2014). Understanding the visual properties captured by CNN models can suggest similar structures represented in the brain. We use layers 2 through 5 of AlexNet (Krizhevsky et al., 2012; trained on ImageNet) to identify candidate visual groupings. At each layer, we group image patches from ImageNet (Deng et al., 2009) based on their corresponding patterns of CNN unit responses (Leeds, 2017). We examine the image patches in the resulting clusters for similarity in unit responses and for intuitive visual/semantic consistency, based on labels from five subjects. We additionally assess the ability of clusters to improve prediction of single-voxel responses to visual stimuli, measured from a separate set of subjects (Kay et al., 2008). For each CNN layer, we use each cluster's average unit-response pattern as a candidate set of weights to predict voxel activity from the activity of all CNN units. We correlate cluster-based stimulus responses with voxel responses across ventral temporal cortex. For all four CNN layers studied, cluster-based stimulus responses correlate strongly (r > 0.3) with voxels in mid-level visual regions V4, LO, and IT, and correlations are larger at higher CNN layers. Within each layer, cluster density (the similarity of CNN responses to patches within the cluster) correlates significantly with voxel correlation magnitude. However, subjects agree consistently less on the reported qualities of image patches from high-correlation clusters than on those from low-correlation clusters. Frequently reported "properties" include texture, color, and full objects. In intermediate cortical vision, voxels may be tuned to complex mixtures of shade and texture properties that are less intuitive to human observers but are still uncovered through trained computer vision models.
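As a rough sketch of the two quantitative steps above, the Python code below groups patches by their CNN unit-response patterns and then correlates cluster-based stimulus responses with measured voxel responses. Array shapes and function names are illustrative assumptions, and k-means stands in here for the grouping procedure of Leeds (2017); this is not the authors' released code.

import numpy as np
from sklearn.cluster import KMeans

def cluster_patches(patch_responses, n_clusters=50):
    """Group image patches by their CNN unit-response patterns.
    patch_responses: (n_patches, n_units) responses of one CNN layer.
    Returns (n_clusters, n_units): mean unit-response pattern per cluster."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(patch_responses)
    return km.cluster_centers_

def cluster_voxel_correlations(unit_responses, cluster_means, voxel_responses):
    """Correlate cluster-based stimulus responses with voxel responses.
    unit_responses : (n_stimuli, n_units) layer responses to each stimulus.
    voxel_responses: (n_stimuli, n_voxels) fMRI responses to the same stimuli.
    Returns (n_clusters, n_voxels) Pearson correlations."""
    # Each cluster's mean pattern acts as a weight vector over all layer units.
    scores = unit_responses @ cluster_means.T        # (n_stimuli, n_clusters)
    # z-score across stimuli so a scaled dot product yields Pearson r.
    z = lambda x: (x - x.mean(axis=0)) / x.std(axis=0)
    zs, zv = z(scores), z(voxel_responses)
    return (zs.T @ zv) / zs.shape[0]

Under this sketch, a cluster whose column of the returned matrix contains entries above 0.3 would correspond to the high-correlation mid-level voxels reported above.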
Acknowledgements: Fordham University Faculty Research Grant to DDL (2016); Clare Boothe Luce Scholarship to AF.