Human beings can perform various scene perception tasks rapidly and with minimal attention. A brief glance is sufficient for people to determine whether a scene contains an animal or a vehicle (Thorpe et al., 1996; VanRullen & Thorpe, 2001), or to identify basic scene categories and properties such as navigability (Oliva & Schyns, 1994; Greene & Oliva, 2009). Recently, Balas et al. (2009) suggested that the information available at a glance might consist of a rich set of local summary statistics, and showed that this model predicts performance in crowding tasks. Here we investigate whether the model can also predict performance on "preattentive" scene perception tasks. We compared performance on rapid scene perception tasks to judgments of "mongrel" images, which are created by coercing random noise to have the same local summary statistics as an original image. Our candidate statistics are those of Portilla and Simoncelli (2000). One group of subjects performed a go/no-go task in which they indicated whether an image shown for 20 ms contained an animal (or, in a second experiment, a vehicle). A second group of subjects made the same judgments about mongrel versions of the images. Responses in the two tasks were correlated, confirming recent findings by Crouzet and Serre (2011) that performance in rapid perception tasks can be largely explained by bottom-up feature pooling models. In a third experiment, subjects were shown either photographs of outdoor scenes (from Google Streetview) or mongrel versions of the same scenes. Subjects were asked about various scene properties, including street layout, scene category, geographic location, and presence of buildings, cars, etc. Responses to the images were well correlated with responses to mongrels, suggesting that the texture statistics of the images can explain much of the performance in these scene perception tasks.
Meeting abstract presented at VSS 2013
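The mongrel procedure described above coerces random noise to share an image's local summary statistics. The Python sketch below is a toy illustration of that idea only: it iteratively projects white noise toward a target image's pooled per-patch mean and variance. The actual mongrels are synthesized by matching the much richer Portilla-Simoncelli (2000) texture statistics over pooling regions; the patch size, step size, and function names here are hypothetical choices made for illustration, not the authors' method.

```python
import numpy as np

def pooled_stats(img, patch=16):
    """Pool simple summary statistics (mean and variance) over non-overlapping
    patches. Toy stand-in: the model in the abstract pools the richer
    Portilla-Simoncelli (2000) texture statistics, not just mean/variance."""
    h = (img.shape[0] // patch) * patch
    w = (img.shape[1] // patch) * patch
    blocks = img[:h, :w].reshape(h // patch, patch, w // patch, patch)
    return blocks.mean(axis=(1, 3)), blocks.var(axis=(1, 3))

def toy_mongrel(target, patch=16, n_iter=50, step=0.5):
    """Coerce white noise toward the target's pooled statistics by repeated
    partial projection, a crude analogue of iterative texture synthesis."""
    rng = np.random.default_rng(0)
    synth = rng.standard_normal(target.shape)
    t_mean, t_var = pooled_stats(target, patch)
    for _ in range(n_iter):
        s_mean, s_var = pooled_stats(synth, patch)
        # Per-patch gain/offset that would match mean and variance exactly;
        # apply only a fraction (`step`) of the correction each iteration.
        gain = np.sqrt(t_var / np.maximum(s_var, 1e-8))
        gain_full = np.kron(1 + step * (gain - 1), np.ones((patch, patch)))
        smean_full = np.kron(s_mean, np.ones((patch, patch)))
        tmean_full = np.kron(t_mean, np.ones((patch, patch)))
        h, w = smean_full.shape
        synth[:h, :w] = ((synth[:h, :w] - smean_full) * gain_full
                         + smean_full + step * (tmean_full - smean_full))
    return synth
```

With only non-overlapping mean/variance constraints this converges quickly; the real synthesis iterates because the full set of statistics (marginal moments, wavelet correlations, etc.) interact and cannot all be imposed in one pass.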