Abstract
Humans rely on body shape to identify people who are seen from a distance (Hahn et al., 2016). Linguistic descriptions of bodies seen in images (e.g., skinny, curvy, broad shoulders) support accurate synthesis of a person’s 3D body shape using a principal component model (Hill et al., 2016; Streuber et al., 2016). We compared deep learning strategies for body identification by implementing models that represent bodies with linguistic descriptions and models that treat body-identities as objects. A database (>350,000 images; >1,300 video hours) of controlled images and video (yaw: 360 degrees; distance: 100 m, 200 m, 400 m, 500 m; overhead pitch: <50 degrees) (Cornett III et al., 2022) was used for evaluation. A “base linguistic model” represented bodies as descriptions and was trained with two independent body-shape datasets that were pre-annotated with 30 body-type descriptors. An identity-tuned version of the base model (“extended linguistic model”) was trained with 158 identities that varied in distance, yaw, pitch, and clothing. The object models, a ResNet101 convolutional neural network (CNN) and a vision transformer (ViT-B16), were pre-trained on ImageNet1K and fine-tuned with the same 158 identities used to train the extended model. Each model’s ability to differentiate 85 new identities and 100 distractors across all image variations was measured with the area under the receiver operating characteristic curve (aROC). Across distance, yaw, and pitch variation, the base linguistic model performed moderately well (aROC=0.65), and identity-tuning improved performance substantially (aROC=0.80). Although the object models were based on different algorithms and architectures, they performed similarly (aROC: CNN=0.84; ViT-B16=0.83) and were marginally better than the extended linguistic model.
Therefore, like human observers, computer-based person-recognition systems operating in real-world viewing conditions can leverage useful identity information from body shape, in addition to the more commonly studied sources of face and gait.
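For readers unfamiliar with the aROC measure reported above, the following minimal sketch shows one common way to compute it from pairwise similarity scores. It assumes each model yields a similarity score for a pair of images, and it uses the rank-based formulation in which the area equals the probability that a same-identity pair outscores a different-identity pair. The abstract does not state how the authors computed aROC; the function name `area_under_roc` and all scores below are hypothetical and are not data from the study.

```python
from itertools import product


def area_under_roc(match_scores, nonmatch_scores):
    """Return the area under the ROC curve for two sets of similarity scores.

    The value is the fraction of (match, non-match) score pairs in which the
    same-identity pair scores higher; ties count as one half.
    """
    if not match_scores or not nonmatch_scores:
        raise ValueError("need at least one score of each kind")
    wins = 0.0
    for match, nonmatch in product(match_scores, nonmatch_scores):
        if match > nonmatch:
            wins += 1.0
        elif match == nonmatch:
            wins += 0.5
    return wins / (len(match_scores) * len(nonmatch_scores))


# Hypothetical similarity scores, for illustration only.
same_identity = [0.9, 0.8, 0.4]            # pairs showing the same person
different_identity = [0.7, 0.3, 0.2, 0.1]  # pairs showing different people

auc = area_under_roc(same_identity, different_identity)
print(f"aROC = {auc:.2f}")
```

On this scale, 0.5 corresponds to chance-level discrimination and 1.0 to perfect separation of same-identity from different-identity pairs, which is the range against which the reported values of 0.65 to 0.84 can be read.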