Abstract
Faces, as a class, have a particular set of image statistics (e.g., reflecting the shared configuration of two eyes, a nose, and a mouth in a stereotyped spatial arrangement). Is visual experience with faces necessary to arrive at visual face-selectivity? On one hand, empirical studies have shown that face experience is required for visual face-selective patches to emerge along the cortex (Arcaro & Livingstone, 2017). On the other hand, modeling studies have shown that face-selective tuning can emerge in deep neural networks trained without face experience (Xu et al., 2021), and even in untrained deep neural networks (Baek et al., 2021). How do we reconcile these findings? We leveraged deep neural networks trained to do object categorization, using either the ImageNet database or the same image set with all human faces blurred (Yang et al., 2022). Critically, we mapped the learned final feature space with a self-organizing map (Doshi & Konkle, 2022), which reveals how focally or diffusely different image classes are represented in the population code. We found that only the model with typical experience, including faces, showed a focal face-selective region; in contrast, the face-deprived network showed face-selectivity distributed across the map. Both maps showed similar scene-selective clusters. These results highlight a difference in how faces are represented in the population codes of the typical and face-deprived models: face experience leads to faces being represented in a clustered part of the representational space. Together, these findings help reconcile the prior data, in which face-selective units can emerge from the statistics shared between faces and other visual object categories, but experience refines or expands that tuning so that faces cluster within the data manifold.
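To make the mapping step concrete, below is a minimal sketch of fitting a self-organizing map to a network's final feature space and asking whether face-preferring map units form a focal patch or are scattered. It assumes Python with numpy and the MiniSom library; the feature vectors are random stand-ins (a hypothetical choice for illustration), whereas in practice they would be penultimate-layer activations extracted from the trained or face-deprived network. This is not the authors' exact pipeline, only an illustration of the general technique.

```python
# Sketch: map a learned feature space with a self-organizing map (SOM)
# and measure how face-preferring units are distributed across the map.
import numpy as np
from minisom import MiniSom

rng = np.random.default_rng(0)

# Hypothetical stand-in features: 500 "face" and 500 "non-face" vectors.
# Real features would be, e.g., final-layer activations from the network.
n_per_class, dim = 500, 128
face_feats = rng.normal(loc=0.5, scale=1.0, size=(n_per_class, dim))
nonface_feats = rng.normal(loc=0.0, scale=1.0, size=(n_per_class, dim))
features = np.vstack([face_feats, nonface_feats])
is_face = np.array([True] * n_per_class + [False] * n_per_class)

# Fit a 20x20 SOM to the pooled feature space.
som = MiniSom(20, 20, dim, sigma=2.0, learning_rate=0.5, random_seed=0)
som.random_weights_init(features)
som.train_random(features, num_iteration=5000)

# For each map unit, count how many face vs. total images have that unit
# as their best-matching unit, then compute the face fraction per unit.
face_counts = np.zeros((20, 20))
total_counts = np.zeros((20, 20))
for x, face in zip(features, is_face):
    i, j = som.winner(x)
    total_counts[i, j] += 1
    face_counts[i, j] += face

face_fraction = np.divide(face_counts, total_counts,
                          out=np.full((20, 20), np.nan),
                          where=total_counts > 0)

# A focal face-selective region would appear as a contiguous patch of
# high face_fraction values; distributed face-selectivity would appear
# as high values scattered across the map.
print(np.nanmax(face_fraction), np.nanmean(face_fraction))
```

Under this setup, comparing the spatial layout of `face_fraction` between a typically trained model and a face-deprived one is one simple way to operationalize the focal-versus-distributed contrast described in the abstract.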