September 2024, Volume 24, Issue 10
Open Access
Vision Sciences Society Annual Meeting Abstract
Large-scale Deep Neural Network Benchmarking in Dynamic Social Vision
Author Affiliations & Notes
  • Kathy Garcia
    Johns Hopkins University
  • Colin Conwell
    Johns Hopkins University
  • Emalie McMahon
    Johns Hopkins University
  • Michael F. Bonner
    Johns Hopkins University
  • Leyla Isik
    Johns Hopkins University
  • Footnotes
    Acknowledgements: NIH R01MH132826
Journal of Vision September 2024, Vol. 24, 716. https://doi.org/10.1167/jov.24.10.716
      Kathy Garcia, Colin Conwell, Emalie McMahon, Michael F. Bonner, Leyla Isik; Large-scale Deep Neural Network Benchmarking in Dynamic Social Vision. Journal of Vision 2024;24(10):716. https://doi.org/10.1167/jov.24.10.716.

      Download citation file:


      © ARVO (1962-2015); The Authors (2016-present)

Abstract

Many Deep Neural Networks (DNNs) with diverse architectures and learning objectives have achieved high brain similarity and hierarchical correspondence with ventral stream responses to static images. However, they have not been evaluated on dynamic social scenes, which are thought to be processed primarily in the recently proposed lateral visual stream. Here, we ask whether DNNs model processing in the lateral stream and superior temporal sulcus as well as they model processing in the ventral stream. To investigate this, we perform large-scale deep neural network benchmarking against fMRI responses to a curated dataset of 200 naturalistic social videos, examining over 300 DNNs with diverse architectures, objectives, and training sets. Notably, we find a hierarchical correspondence between DNNs and lateral stream responses: earlier DNN layers correlate best with earlier visual areas (including early visual cortex and middle temporal cortex), middle layers match best with mid-level regions (extrastriate body area and lateral occipital cortex), and later layers match best with the most anterior regions (along the superior temporal sulcus). Pairwise permutation tests further confirm significant differences in the average depth of the best-matching layer between each pair of regions of interest. Interestingly, we find no systematic differences between diverse network types in either hierarchical correspondence or absolute correlation with neural data, suggesting that drastically different network factors (such as learning objective and training dataset) play little role in a network's representational match to the lateral stream. Finally, while the best DNNs provided a representational match to ventral stream responses near the level of the noise ceiling, DNN correlations were significantly lower in all lateral stream regions.
Together, these results provide evidence for a feedforward visual hierarchy in the lateral stream and underscore the need for further refinement of computational models to capture the nuances of dynamic, social visual processing.
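The core analysis described above (finding each region's best-matching layer depth and comparing average depths between regions with a permutation test) can be sketched with a simple RSA-style score. This is an illustrative sketch on random stand-in data, not the authors' actual pipeline: the function names, toy dimensions, and per-network depth values below are all assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def rdm(responses):
    """Representational dissimilarity: 1 - Pearson correlation between
    responses to each pair of videos (flattened upper triangle)."""
    c = np.corrcoef(responses)                 # videos x videos
    iu = np.triu_indices_from(c, k=1)
    return 1.0 - c[iu]

def best_layer_depth(layer_feats, roi_resp):
    """Relative depth (0 = first layer, 1 = last) of the layer whose
    RDM correlates best with the ROI's RDM."""
    roi = rdm(roi_resp)
    scores = [np.corrcoef(rdm(f), roi)[0, 1] for f in layer_feats]
    return int(np.argmax(scores)) / (len(layer_feats) - 1)

def paired_perm_test(depths_a, depths_b, n_perm=10_000):
    """Paired sign-flip permutation test on the mean difference in
    best-layer depth between two ROIs (one depth per network)."""
    diffs = np.asarray(depths_a) - np.asarray(depths_b)
    observed = diffs.mean()
    signs = rng.choice([-1.0, 1.0], size=(n_perm, diffs.size))
    null = (signs * diffs).mean(axis=1)
    return float((np.abs(null) >= np.abs(observed)).mean())

# Toy stand-ins: 200 videos, one 8-layer network, two "ROIs".
n_videos = 200
layers = [rng.normal(size=(n_videos, 512)) for _ in range(8)]
evc = rng.normal(size=(n_videos, 100))   # hypothetical early visual cortex voxels
sts = rng.normal(size=(n_videos, 100))   # hypothetical STS voxels

d_evc = best_layer_depth(layers, evc)
d_sts = best_layer_depth(layers, sts)

# Hypothetical best-layer depths for two ROIs across four networks.
p = paired_perm_test([0.10, 0.20, 0.15, 0.25], [0.60, 0.70, 0.65, 0.75])
```

In a real analysis, `layer_feats` would hold activations from each network layer to the 200 videos and `roi_resp` the measured fMRI responses; a hierarchical correspondence appears as systematically increasing best-layer depth from posterior to anterior regions.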
