Abstract
The human ability to categorize novel natural scenes with minimal exposure duration is a remarkable feat. How is the initial scene representation constructed to allow rapid scene categorization, given the many possible levels of scene description (objects, global layout and functional properties, semantic categories, etc.)? In a first experiment, we tested the temporal availability of seven global scene properties (e.g. volume, navigability, openness) relative to the semantic categorization of eight common natural scene categories (e.g. forest, lake) by comparing presentation-time thresholds for these tasks. Results showed that, for the same images, global properties were available for report with less exposure time (25 msec) than semantic categories (43 msec). Do combinations of global scene properties predict semantic categorization? In a second experiment, we compared human performance on a rapid scene categorization task to that of a model observer whose input was the set of global property descriptors for each image. The model was trained on the distributions of scene categories along the global property dimensions and output the scene category with the maximum likelihood summed over global properties. Remarkably, the model's categorization performance matched human performance at a presentation time of 30 msec (74% vs. 70% correct) and reproduced the patterns of availability for individual categories. Furthermore, the errors made by the model observer predicted human errors for 69% of images. Taken together, these results strongly suggest that the rapid semantic categorization of natural scenes by humans can be mediated by detecting conjunctions of global scene properties.
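To make the model observer concrete, the following is a minimal sketch, not the authors' implementation. It assumes each category's distribution along each global property dimension can be summarized by a one-dimensional Gaussian fit to training images, and it reads the abstract's decision rule literally: sum the per-property likelihoods and report the category with the maximum sum. The function names, the Gaussian assumption, and the synthetic demo data are ours.

```python
import numpy as np
from scipy.stats import norm

N_PROPERTIES = 7  # e.g. volume, navigability, openness (seven in the abstract)

def fit_category_distributions(train_features, train_labels):
    """Fit a 1-D Gaussian per (category, global property) pair.

    train_features: (n_images, N_PROPERTIES) array of global property values.
    train_labels:   (n_images,) array of category names, e.g. "forest", "lake".
    Returns {category: (means, stds)}, each an array of length N_PROPERTIES.
    """
    params = {}
    for category in np.unique(train_labels):
        rows = train_features[train_labels == category]
        # Floor the std to avoid degenerate zero-variance likelihoods.
        params[category] = (rows.mean(axis=0),
                            np.maximum(rows.std(axis=0), 1e-3))
    return params

def classify(image_features, params):
    """Report the category whose likelihood, summed over the global
    properties, is maximal (the decision rule stated in the abstract)."""
    best_category, best_score = None, -np.inf
    for category, (means, stds) in params.items():
        score = norm.pdf(image_features, loc=means, scale=stds).sum()
        if score > best_score:
            best_category, best_score = category, score
    return best_category

# Toy demonstration on synthetic data (not the paper's stimuli).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    forests = rng.normal(loc=-1.0, size=(40, N_PROPERTIES))
    lakes = rng.normal(loc=+1.0, size=(40, N_PROPERTIES))
    X = np.vstack([forests, lakes])
    y = np.array(["forest"] * 40 + ["lake"] * 40)
    params = fit_category_distributions(X, y)
    print(classify(lakes[0], params))  # expected: "lake"
```

Summing raw likelihoods follows the abstract's wording; a naive-Bayes variant would instead sum log-likelihoods (equivalently, multiply the per-property likelihoods), and training on ranked rather than raw property values would be an equally plausible reading of the description.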