Abstract
Deep convolutional neural networks (DCNNs) constitute a promising initial approximation to the cascade of computations along the ventral visual stream that supports visual recognition. DCNNs predict object-evoked neural activity in inferior temporal (IT) cortex (Yamins et al., 2014), and the mapping between neural and DCNN activity generalizes across object categories (Horikawa and Kamitani, 2017; Yamins et al., 2014). Such generalization is critical for building models of the ventral visual stream that capture the generic neural code for shape and make accurate predictions about novel categories beyond the training sample (zero-shot decoding). However, the degree to which mappings between DCNN and IT activity generalize across object categories has not been explicitly tested. To address this, we built zero-shot neural decoders for object category from multi-electrode array recordings in rhesus macaque IT, obtained while the animals viewed images of rendered objects on arbitrary natural scene backgrounds (Majaj et al., 2015). DCNN activity was computed for each image using VGG-16 (Simonyan and Zisserman, 2015) and reconstructed from IT activity using linear regression. Linear classifiers were trained to predict object category from DCNN activity, and these classifiers were then applied to DCNN activity reconstructed from IT responses. We held out neural activity from two test categories when learning the IT-to-DCNN mappings and found robust zero-shot decoding accuracies: the decoders predicted novel categories despite never being trained on neural activity from those categories, indicating that the mappings generalize. Intriguingly, training the mapping on a single category alone was sufficient to permit zero-shot decoding of novel categories. We show that the relationship between IT and DCNN activity is stable across object categories, demonstrating the feasibility of zero-shot neural decoding systems based on electrophysiological recordings.
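To make the three-step pipeline concrete, the following is a minimal sketch, assuming scikit-learn and toy random arrays in place of the actual IT recordings and VGG-16 features; all variable names, dimensions, and the choice of unregularized linear regression and logistic-regression classifiers are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the zero-shot decoding pipeline described above.
# it_responses, vgg_features, categories, and all dimensions are toy
# stand-ins; the real pipeline uses macaque IT recordings and VGG-16.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)

# n images, each with an IT population response vector, a DCNN feature
# vector, and one of 8 object category labels.
n_images, n_sites, n_features, n_categories = 800, 168, 4096, 8
it_responses = rng.normal(size=(n_images, n_sites))     # IT activity
vgg_features = rng.normal(size=(n_images, n_features))  # DCNN (VGG-16) activity
categories = rng.integers(n_categories, size=n_images)  # category labels

# Hold out two categories entirely when learning the IT -> DCNN mapping.
held_out = {6, 7}
train = np.array([c not in held_out for c in categories])
test = ~train

# 1) Linear regression from IT activity to DCNN activity, fit only on
#    images from the training categories.
it_to_dcnn = LinearRegression().fit(it_responses[train], vgg_features[train])

# 2) Linear classifier for object category, trained on true DCNN activity.
#    It can see all categories, including held-out ones, because it never
#    touches neural data; only the IT -> DCNN mapping is category-limited.
clf = LogisticRegression(max_iter=1000).fit(vgg_features, categories)

# 3) Zero-shot decoding: reconstruct DCNN activity from IT responses to
#    the held-out categories, then classify the reconstructions.
reconstructed = it_to_dcnn.predict(it_responses[test])
accuracy = clf.score(reconstructed, categories[test])
print(f"zero-shot decoding accuracy on held-out categories: {accuracy:.2f}")
```

On random toy data the printed accuracy sits at chance; the point of the sketch is only the structure of the evaluation, in which neural data from the test categories never enter the regression step.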