August 2023
Volume 23, Issue 9
Open Access
Vision Sciences Society Annual Meeting Abstract  |   August 2023
Recognizing people by body shape using deep networks of images and words
Author Affiliations & Notes
  • Blake Myers
    University of Texas at Dallas
  • Matthew Hill
    University of Texas at Dallas
  • Veda Gandi
    University of Texas at Dallas
  • Thomas Metz
    University of Texas at Dallas
  • Lucas Jaggernauth
    University of Texas at Dallas
  • Carlos Castillo
    Johns Hopkins University
  • Alice O'Toole
    University of Texas at Dallas
  • Footnotes
    Acknowledgements  Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via [2022-21102100005]
Journal of Vision August 2023, Vol.23, 5635. doi:https://doi.org/10.1167/jov.23.9.5635
      Blake Myers, Matthew Hill, Veda Gandi, Thomas Metz, Lucas Jaggernauth, Carlos Castillo, Alice O'Toole; Recognizing people by body shape using deep networks of images and words. Journal of Vision 2023;23(9):5635. https://doi.org/10.1167/jov.23.9.5635.

      © ARVO (1962-2015); The Authors (2016-present)

Abstract

Humans rely on body shape to identify people who are seen from a distance (Hahn et al., 2016). Linguistic body descriptions (e.g., skinny, curvy, broad shoulders) support accurate synthesis of a person’s 3D body shape from images using a principal component model (Hill et al., 2016; Streuber et al., 2016). We compared deep learning strategies for body identification by implementing models that represent bodies with linguistic descriptions and models that treat body identities as objects. A database (>350,000 images; >1,300 video hours) of controlled images and video (yaw: 360 degrees; distance: 100 m, 200 m, 400 m, 500 m; overhead pitch: <50 degrees) (Cornett III et al., 2022) was used for evaluation. A “base linguistic model” represented bodies as descriptions and was trained with two independent body-shape datasets that were pre-annotated with 30 body-type descriptors. An identity-tuned version of the base model (the “extended linguistic model”) was trained with 158 identities that varied in distance, yaw, pitch, and clothing. The object models, a ResNet101 convolutional neural network (CNN) and a vision transformer (ViT-B16), were pre-trained on ImageNet1K and fine-tuned with the same 158 identities used to train the extended model. Model ability to differentiate 85 new identities from 100 distractors across all image variations was measured with the area under the receiver operating characteristic curve (aROC). Across distance, yaw, and pitch variation, the base linguistic model performed moderately well (aROC = 0.65), and identity-tuning strongly improved performance (aROC = 0.80). Although the object models were based on different algorithms and architectures, they performed similarly (aROC: CNN = 0.84; ViT-B16 = 0.83) and were marginally better than the extended linguistic model.
Therefore, like humans, computer-based person-recognition systems operating in real-world viewing conditions can leverage useful identity information from body shape, in addition to the more commonly studied cues of face and gait.
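The aROC metric used above can be read as the probability that a randomly chosen same-identity (match) similarity score exceeds a randomly chosen different-identity (non-match) score. A minimal sketch of that rank-based computation, using hypothetical similarity scores (not data from the study):

```python
def area_under_roc(match_scores, nonmatch_scores):
    """Area under the ROC curve, computed as the probability that a
    random match score exceeds a random non-match score (ties = 0.5).
    Equivalent to integrating the ROC curve over all thresholds."""
    wins = 0.0
    for m in match_scores:
        for n in nonmatch_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(match_scores) * len(nonmatch_scores))

# Hypothetical similarity scores for illustration only:
matches = [0.9, 0.8, 0.75, 0.6]      # same-identity image pairs
nonmatches = [0.7, 0.5, 0.4, 0.3]    # different-identity image pairs
print(area_under_roc(matches, nonmatches))  # 0.9375
```

An aROC of 0.5 corresponds to chance discrimination and 1.0 to perfect separation of match and non-match pairs, which is why the reported values (0.65 to 0.84) indicate moderate-to-strong body-based identification.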
