Abstract
The visual system is sensitive to socially relevant spatial arrangements, such as two people positioned face-to-face (a facing dyad). Leveraging behavioral and neuroimaging methods, Papeo and colleagues have argued that the visual system contains features that represent facing dyads as a grouped unit. Here, we provide computational plausibility for such features by demonstrating their existence in feedforward convolutional neural networks. Further, we ask what constraints are necessary to produce these features: is training on social goals required, or can they emerge from domain-general constraints? We explored the latter hypothesis by testing whether features sensitive to facing dyads are present in variants of AlexNet shaped only by generic architectural constraints (untrained AlexNet) or by self-supervised learning rules operating over different image diets (ImageNet and VGGFace2). In each network, we found features with tuning useful for detecting facing dyads: they showed (i) preferences for facing dyads over non-facing dyads, (ii) greater preferences for upright than for inverted dyads, and (iii) greater preferences for human pairs than for human-object pairs. As previously observed in humans, such features were present for pairs of full people, pairs of faces, and pairs of bodies. However, each network also contained features with the opposite tuning, preferring non-facing dyads or responding more strongly to inverted stimuli. In contrast, prior empirical work suggests that such features are not found in the human visual system (e.g., fMRI contrasts between facing and non-facing dyads, and behavioral inversion effects). Thus, the generic constraints of our models produce an overabundance of features, including features with little behavioral relevance. In the human brain, social cognitive systems may preferentially read out from behaviorally relevant features, leaving unused features to be pruned away. Alternatively, a more ecologically valid image diet may reduce the prevalence of behaviorally irrelevant features. On either account, domain-general architectures and learning goals can produce features sensitive to facing dyads.
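As one illustration of how such feature tuning might be probed, the sketch below loads an untrained AlexNet (one of the model variants named above), records channel-wise responses to two stimulus conditions, and computes a simple facing-versus-non-facing selectivity index per feature. This is a minimal, hypothetical sketch rather than the study's actual analysis pipeline: the choice of layer, the selectivity formula, and the random placeholder tensors standing in for rendered dyad images are all assumptions made for illustration.

```python
import torch
import torchvision.models as models

# Untrained AlexNet: generic architecture with random weights (no learned
# image statistics). This mirrors the "untrained AlexNet" variant described
# above; the analysis itself is an illustrative sketch.
model = models.alexnet(weights=None).eval()

# Capture rectified activations from an intermediate layer via a forward hook.
# features[9] is the ReLU following the fourth convolution; the layer choice
# is arbitrary and made only for illustration.
activations = {}
def hook(module, inputs, output):
    activations["feat"] = output.detach()

model.features[9].register_forward_hook(hook)

def unit_responses(images):
    """Mean response of each channel ('feature') to a batch of images."""
    with torch.no_grad():
        model(images)
    # Average over images and spatial positions: one value per channel.
    return activations["feat"].mean(dim=(0, 2, 3))

# Placeholder stimuli: in the actual study these would be images of facing
# and non-facing dyads; random tensors are used here only so the sketch
# runs end to end.
facing_images = torch.rand(32, 3, 224, 224)
nonfacing_images = torch.rand(32, 3, 224, 224)

r_facing = unit_responses(facing_images)
r_nonfacing = unit_responses(nonfacing_images)

# Per-feature selectivity index: positive values indicate a preference for
# facing dyads, negative values a preference for non-facing dyads.
selectivity = (r_facing - r_nonfacing) / (r_facing + r_nonfacing + 1e-8)
print("Features preferring facing dyads:", (selectivity > 0).sum().item())
print("Features preferring non-facing dyads:", (selectivity < 0).sum().item())
```

The same comparison could be repeated with upright versus inverted dyads, or human-human versus human-object pairs, to probe the other tuning properties summarized above.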