Abstract
Previous research (e.g., Cicchini et al., 2016; Lee et al., 2016) has shown that humans can reliably estimate the number of items (i.e., numerosity) in simple synthetic arrays. However, the extent to which this capacity generalizes to complex realistic scenes (e.g., presidential inauguration photos) remains unknown. Here, we quantify the accuracy of human subjects in crowd counting. On each trial, an image was presented for 5, 1, or 0.5 seconds (shuffled presentation order; one presentation duration at a time). Each image was followed by a white-noise mask shown for 1 second and then a blank screen that remained for 10 seconds or until a key press. Subjects reported the number of people in the crowd, discretized into 5 categories (1-1K, 1K-2K, 2K-3K, 3K-4K, and 4K-5K), by pressing the corresponding key on the keyboard. Each category consisted of 14 images covering the whole crowd range of that category. Subjects were 12 undergraduates (6 male, 6 female), aged 18 to 26, with normal or corrected-to-normal vision. Analysis of the data shows two results. (a) Average accuracy was significantly above chance (33% vs. 20%). Subjects performed best on images with fewer than 1K people (55%), followed by images with more than 4K people (46%). The middle categories were the most difficult; in such cases, subjects were off by only one category. (b) Accuracy improved with longer presentation time: average accuracy dropped significantly as presentation time decreased (33.8%, 31.2%, and 27.85% for 5, 1, and 0.5 seconds, respectively), and the drop was larger from 1 to 0.5 seconds than from 5 to 1 second. Our results show that humans are able to estimate numerosity in naturalistic stimuli containing many items.
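The scoring implied by this design can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the names `crowd_category` and `accuracy` are hypothetical, responses are assumed to be coded as category indices 0 to 4, and treating each category's upper bound as inclusive is an assumption, since the abstract does not say how boundary counts such as exactly 1K are assigned.

```python
N_CATEGORIES = 5      # 1-1K, 1K-2K, 2K-3K, 3K-4K, 4K-5K
BIN_WIDTH = 1000      # people per category
CHANCE = 1 / N_CATEGORIES  # 0.2, i.e. the 20% chance level


def crowd_category(count):
    """Map a true crowd count (1..5000) to a category index 0..4.

    Assumption: upper bounds are inclusive, so 1000 falls in
    category 0 and 1001 in category 1.
    """
    if not 1 <= count <= BIN_WIDTH * N_CATEGORIES:
        raise ValueError("count outside the 1-5K range")
    return (count - 1) // BIN_WIDTH


def accuracy(true_counts, responses):
    """Fraction of trials where the response index matches the true category."""
    if len(true_counts) != len(responses) or not true_counts:
        raise ValueError("need equally sized, non-empty trial lists")
    hits = sum(crowd_category(c) == r for c, r in zip(true_counts, responses))
    return hits / len(true_counts)
```

Because every category contains the same number of images (14), a subject guessing uniformly among the five keys would be correct on 1/5 = 20% of trials, which is the baseline against which the reported 33% average accuracy is compared.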
Meeting abstract presented at VSS 2018