September 2021
Volume 21, Issue 9
Open Access
Vision Sciences Society Annual Meeting Abstract | September 2021
Visual ensemble representations in Deep Neural Networks trained for natural object recognition
Author Affiliations
  • Siddharth Suresh
    University of Wisconsin Madison, Department of Psychology
  • Emily J Ward
    University of Wisconsin Madison, Department of Psychology
Journal of Vision September 2021, Vol.21, 2677. doi:

      Siddharth Suresh, Emily J Ward; Visual ensemble representations in Deep Neural Networks trained for natural object recognition. Journal of Vision 2021;21(9):2677.

      © ARVO (1962-2015); The Authors (2016-present)


Humans can quickly pool information across many individual objects to perceive ensemble properties, such as the average size or color diversity of a set of objects. Such ensemble perception in humans is thought to occur extremely efficiently and automatically, but how it arises in the first place is unknown. Does ensemble perception arise because the visual system must solve many different types of perceptual problems, or are ensemble properties represented even in a system whose sole goal is to recognize individual objects? We used an artificial visual system, a deep neural network (DNN), to determine whether the ensemble properties of average size and color diversity were represented in a network pre-trained to recognize only individual natural objects. We presented the network with new images that were completely different from its training set: images of white circles of different sizes (randomly chosen from a specified range), or letter arrays containing four colored consonants, with each letter's color drawn either from a broad sample of 19 colors (high diversity) or from a randomly selected range of six adjacent colors (low diversity). In both cases, the ensemble property of interest was a summary statistic of the whole image and was not recoverable from any individual element. We tested whether a ResNet50 neural network could predict the average size or distinguish high- vs. low-diversity color arrays by using the activations from different layers as input to a linear regressor and a linear classifier (SVM), respectively. We found that the network activations predicted average size and identified color diversity with high accuracy, even at the earlier layers of the network. In contrast, information about individual object features (object size) increased in the deeper layers. This demonstrates that artificial visual systems trained only to recognize individual objects also extract ensemble properties of multiple objects extremely early in visual processing.
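The stimulus design described above can be illustrated with a minimal Python sketch. It is not the authors' code. The set of 19 colors, the six-color adjacent range, and the four consonants per array come from the abstract. Everything else is an assumption made for illustration: the number of circles, the radius range, treating the color set as a circular wheel, sampling colors without replacement, and the helper names `sample_circle_array`, `sample_letter_array`, and `color_span`.

```python
import random
from statistics import mean

N_COLORS = 19           # broad color set (from the abstract)
LOW_DIVERSITY_SPAN = 6  # adjacent colors in a low-diversity array (from the abstract)
N_LETTERS = 4           # colored consonants per array (from the abstract)
CONSONANTS = "BCDFGHJKLMNPQRSTVWXZ"


def sample_circle_array(rng, n_circles=8, min_radius=5.0, max_radius=30.0):
    """Sample circle radii and return (radii, average_radius).

    n_circles and the radius range are assumed values, not from the abstract.
    """
    radii = [rng.uniform(min_radius, max_radius) for _ in range(n_circles)]
    return radii, mean(radii)


def sample_letter_array(rng, high_diversity):
    """Return a list of (letter, color_index) pairs for one letter array."""
    if high_diversity:
        palette = list(range(N_COLORS))
    else:
        # Assumption: colors lie on a wheel, so the adjacent range may wrap.
        start = rng.randrange(N_COLORS)
        palette = [(start + k) % N_COLORS for k in range(LOW_DIVERSITY_SPAN)]
    letters = rng.sample(CONSONANTS, N_LETTERS)
    colors = rng.sample(palette, N_LETTERS)  # assumption: no repeated colors
    return list(zip(letters, colors))


def color_span(color_indices):
    """Smallest number of wheel steps that covers every color in the array."""
    ordered = sorted(color_indices)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    gaps.append(ordered[0] + N_COLORS - ordered[-1])  # wrap-around gap
    return N_COLORS - max(gaps)


if __name__ == "__main__":
    rng = random.Random(0)
    radii, average_radius = sample_circle_array(rng)
    print("average radius label:", round(average_radius, 2))
    for condition in (True, False):
        array = sample_letter_array(rng, high_diversity=condition)
        name = "high" if condition else "low"
        print(name, array, "span:", color_span([c for _, c in array]))
```

Both labels (the average radius and the diversity condition) are defined over the whole array, so no single circle or letter carries them. Rendering these specifications as images and probing ResNet50 layer activations with a linear regressor or SVM is outside the scope of this sketch.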

