Abstract
Visual features and category membership are both reflected in the object representation in inferior temporal (IT) cortex. However, the explanatory power of features and categories has not been directly compared. We test whether the IT object representation, in human and monkey, is best explained by a feature-based or a categorical model. We apply the same test to the object representations in a deep convolutional neural network (CNN). Previous work has shown that late layers of the network outperform early layers in explaining the human IT object representation (Khaligh-Razavi & Kriegeskorte 2014; Guclu & van Gerven 2015). We asked human observers to generate category labels (e.g. face, animal) and feature labels (e.g. eye, circular) for a set of 96 real-world object images. This yielded rich models (>100 dimensions), which we fitted to the brain and CNN representations using non-negative least squares. The brain representations of the 96 images had previously been measured using fMRI (humans) and cell recordings (monkeys). Model performance was estimated on held-out images not used in fitting, and compared using representational similarity analysis. In both human and monkey, the feature-based and categorical models explain significant, and similar, amounts of IT variance (Fig. 1AB). Combining the two models does not explain significant additional variance, indicating that it is the variance shared between the two models (features correlated with categories, and vice versa) that best explains primate IT. Consistent with previous findings, late layers of the deep neural network show a pattern of results similar to IT, whereas early layers are dominated by visual features (Fig. 1C). These results suggest that both primate IT and a deep neural network trained on image categorization use visual features, including object parts with stereotyped shapes that are strongly associated with particular categories, as stepping stones toward semantics.
Meeting abstract presented at VSS 2016
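The fitting-and-comparison procedure described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the label matrix is random stand-in data, the brain representation is simulated, and the specific construction of one model RDM per label dimension is an assumption about how a many-dimensional label model is fitted with non-negative least squares and then compared to a measured representation via representational similarity analysis (rank correlation of dissimilarity structures).

```python
# Illustrative sketch (hypothetical data, not the authors' pipeline):
# fit a non-negative weighted combination of single-label model RDMs
# to a brain RDM, then compare predicted and measured RDMs with RSA.
import numpy as np
from scipy.optimize import nnls
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_images, n_labels = 96, 114  # e.g. a >100-dimensional label model

# Binary matrix: which labels observers assigned to each image (simulated).
labels = rng.integers(0, 2, size=(n_images, n_labels)).astype(float)

# One model RDM per label: image pairs differ (1) or agree (0) on that label.
model_rdms = np.stack(
    [pdist(labels[:, [j]], metric="cityblock") for j in range(n_labels)],
    axis=1,  # shape: (n_image_pairs, n_labels)
)

# Stand-in for a measured representation (fMRI patterns / cell populations).
brain_rdm = pdist(rng.standard_normal((n_images, 50)))

# Non-negative least squares: brain_rdm ~ model_rdms @ w, with w >= 0.
w, residual = nnls(model_rdms, brain_rdm)
predicted_rdm = model_rdms @ w

# RSA comparison: rank correlation between predicted and measured RDMs.
rho, _ = spearmanr(predicted_rdm, brain_rdm)
```

In practice the weights would be fitted on a subset of images and `rho` evaluated only on pairs of held-out images, as the abstract describes, so that model comparison is not inflated by overfitting.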