Abstract
We perceive and recognize faces quickly and seemingly effortlessly. How does this remarkable ability develop, and what is the role of experience? Here we addressed these long-standing questions by leveraging recent successes of deep convolutional neural networks (CNNs) as models for human visual recognition. Specifically, we asked whether training on generic object recognition is sufficient for CNNs to capture human face behavior, or whether face-specific training is required. To measure human face perception, we had subjects (n=14) perform a similarity arrangement task on 80 face images (five images of each of 16 identities). Using representational similarity analysis, we compared the behavioral representational dissimilarity matrices (RDMs) to RDMs obtained from different layers of CNNs (here, VGG16) trained on either object (Object CNN) or face identity (Face CNN) categorization. Importantly, the face identities used as stimuli were not included in the training and were thus “unfamiliar” to both the Face and Object CNNs. We found that the human face behavior RDM was more similar to layer-specific RDMs of the Face CNN (max. Spearman’s r=.42, reaching the noise ceiling) than to those of the Object CNN (max. Spearman’s r=.21). Moreover, late layers of the Face CNN matched human face behavior better than early layers. These results show that face-trained CNNs capture human face behavior better than object-trained CNNs. Further, they suggest that humanlike face perception does not automatically arise from generic visual experience with objects. Instead, face-specific experience during development may shape and fine-tune human face perception.
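The core analysis summarized above is representational similarity analysis: build one RDM per CNN layer from its activations to the 80 face images, then correlate its off-diagonal entries with those of the behavioral RDM using Spearman’s r. The following is a minimal illustrative sketch of that comparison, not the authors’ pipeline; the 1 − Pearson dissimilarity measure, the feature dimensionality, and the random placeholder data are all assumptions for demonstration.

```python
# Minimal RSA sketch (illustrative assumptions, not the authors' code).
import numpy as np
from scipy.stats import spearmanr

def layer_rdm(features: np.ndarray) -> np.ndarray:
    """features: (n_images, n_units) activations from one CNN layer.
    Returns an (n_images, n_images) dissimilarity matrix (1 - Pearson r)."""
    return 1.0 - np.corrcoef(features)

def rdm_similarity(rdm_a: np.ndarray, rdm_b: np.ndarray) -> float:
    """Spearman's r between the lower-triangular (off-diagonal) entries
    of two RDMs, as in standard RSA model comparison."""
    idx = np.tril_indices_from(rdm_a, k=-1)
    rho, _ = spearmanr(rdm_a[idx], rdm_b[idx])
    return rho

# Hypothetical usage: 80 face images, one layer's activations flattened
# to vectors; random data stands in for real activations and behavior.
rng = np.random.default_rng(0)
layer_acts = rng.standard_normal((80, 4096))               # stand-in features
behavioral_rdm = layer_rdm(rng.standard_normal((80, 40)))  # stand-in behavior
print(rdm_similarity(layer_rdm(layer_acts), behavioral_rdm))
```

Repeating this per layer of the Face and Object CNNs yields the layer-specific correlations reported in the abstract, which can then be benchmarked against a noise ceiling estimated from between-subject reliability.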