Abstract
Higher visual areas in the ventral stream contain regions that show category-selective responses. The most compelling examples are face patches, which contain mainly face-selective neurons. Yet such face cells also show weaker but reliable responses to non-face objects, which is hard to explain with a semantic or categorical interpretation: if face cells care only about faces, what explains their tuning for other objects in the low-firing-rate regime? Here, we tested the hypothesis that face selectivity is not categorical per se, but that neurons encode higher-order visuo-statistical features that are strongly activated by faces and, to a lesser extent, by other objects. We investigated firing rates of 452 neural sites in and around the middle lateral face patch of macaque inferotemporal cortex, a potential homologue of the human fusiform face area, in response to over a thousand images (448 faces, 960 non-faces). We found that neural responses to faces and non-face objects were strongly related: the structure of responses to non-face objects predicted the degree of face selectivity. This link was not well explained by tuning to semantically interpretable shape features such as roundness or color. Instead, domain-general features from an ImageNet-trained deep neural network predicted neural face selectivity exclusively from responses to non-face images. Moreover, encoding models trained only on responses to non-face objects (1) predicted the face inversion effect, (2) were sensitive to contextual relationships that indicate the presence of a face, and (3), when coupled with image synthesis using a generative adversarial network, revealed an increasing preference for faces with increasing neural face selectivity.
Together, these results show that face selectivity and responses to non-face objects are driven by tuning along common encoding axes: axes that are not categorical for faces, but instead reflect tuning to more general visuo-statistical structure.