Abstract
Humans can quickly pool information from across many individual objects to perceive ensemble properties, like the average size or color diversity of objects. Such ensemble perception in humans is thought to occur extremely efficiently and automatically, but how it arises in the first place is unknown. Does ensemble perception arise because the visual system must solve many different types of perceptual problems, or are ensemble properties represented even in a system with the sole goal of recognizing individual objects? We used an artificial visual system, a deep neural network (DNN), to determine whether the ensemble properties of average size and color diversity were present in a network pre-trained to recognize only individual natural objects. We presented the network with new images that were completely different from its training set: images of white circles of different sizes (randomly chosen from a specified range) or letter arrays containing four colored consonants, with each letter drawn either from a broad sample of 19 colors (high diversity) or from a randomly selected range of six adjacent colors (low diversity). Each ensemble property of interest was thus a summary statistic of the whole image and could not be recovered from any individual element. We tested whether the average size could be predicted, and high- versus low-diversity arrays distinguished, from the activations of a ResNet50 neural network by using the activations from different layers as input to a linear regressor and a linear classifier (SVM), respectively. We found that decoders trained on the network activations predicted the average size and identified the color diversity with high accuracy, even from the earlier layers in the network. In contrast, information about individual object features (object size) increased in the deeper layers. This demonstrates that artificial visual systems trained only to recognize individual objects also extract ensemble properties of multiple objects extremely early in visual processing.
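As a rough illustration of the linear readout described above, the following sketch fits a least-squares regressor that predicts the average circle size of an image from a vector of "activations". It is not the authors' pipeline. To stay self-contained it replaces ResNet50 with a hypothetical stand-in layer (8x8 average pooling of the pixels), and the canvas size, the number of circles, the radius range, and the one-circle-per-quadrant layout are all assumptions chosen for the example. In a real replication, the stand-in would be swapped for activations recorded from a layer of the pre-trained network.

```python
import numpy as np


def render_circles(radii, rng, size=64):
    """Draw one white circle per quadrant on a black canvas.

    The quadrant layout is an assumption that keeps circles from overlapping.
    """
    half = size // 2
    yy, xx = np.mgrid[:size, :size]
    img = np.zeros((size, size))
    corners = [(0, 0), (0, half), (half, 0), (half, half)]
    for r, (oy, ox) in zip(radii, corners):
        # Keep the whole circle inside its quadrant.
        cy = oy + rng.uniform(r, half - r)
        cx = ox + rng.uniform(r, half - r)
        img[(yy - cy) ** 2 + (xx - cx) ** 2 <= r ** 2] = 1.0
    return img


def stand_in_layer(img, pool=8):
    """Hypothetical stand-in for one layer's activations: average pooling, flattened."""
    s = img.shape[0] // pool
    return img.reshape(s, pool, s, pool).mean(axis=(1, 3)).ravel()


rng = np.random.default_rng(0)
n_images = 600

# Four radii per image, drawn from an assumed range of 3 to 10 pixels.
radii = rng.uniform(3, 10, size=(n_images, 4))
X = np.stack([stand_in_layer(render_circles(r, rng)) for r in radii])
y = radii.mean(axis=1)  # ensemble property: average size of the whole array

# Append a constant column so the linear readout has an intercept.
X = np.hstack([X, np.ones((n_images, 1))])
X_train, X_test = X[:400], X[400:]
y_train, y_test = y[:400], y[400:]

# Ordinary least-squares readout, evaluated on held-out images.
w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
pred = X_test @ w
r2 = 1 - np.sum((y_test - pred) ** 2) / np.sum((y_test - y_test.mean()) ** 2)
print(f"held-out R^2 for average radius: {r2:.3f}")
```

Repeating the same fit on activations from several layers would give a layer-by-layer decoding profile. The color diversity analysis would follow the same pattern, with a linear classifier in place of the regressor and a binary high or low diversity label as the target.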