The capacity of humans to perform complex visual tasks such as scene categorization and object detection in as little as 100 ms has been attributed to their ability to rapidly extract the gist of a scene. Existing models of gist utilize various types of low-level features: color (Ulrich & Nourbakhsh 2001), Fourier component profiles (Oliva & Torralba 2001), textures (Renninger & Malik 2004), steerable wavelets (Torralba et al. 2003), and a combination of these (Siagian & Itti 2007). Some of the methods compute feature histograms from the whole image, while others encode rough spatial information by using a predefined grid. Here, we systematically compare gist models on categorization tasks of increasing difficulty to investigate how far these low-level features can describe complicated real-world scenes. We use three outdoor test sites that provide visually distinct challenges: a building complex (26368 training images, 13965 testing images), a park full of trees (66291/26397 images), and a spacious open-field area (82747/34711 images). We first ask which site each scene belongs to. As a baseline for comparison with other models, we use the classification rate of our combination model (95% success). We then divide each site into nine distinct segments to test finer classification ability (baseline of 85% success). Finally, we divide the segments into smaller geographical regions, which makes scene classification even harder because the regions become more visually similar. This, in turn, forces the competing systems to look for detailed attributes to exploit. Our hypothesis is that each particular system (or, more importantly, the features it uses) will be able to distinguish some segments but not others.
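The distinction drawn above between whole-image feature histograms and grid-based encodings can be illustrated with a minimal sketch. It is not the implementation of any of the cited models: the function names, the use of raw grayscale intensity as the feature, the 4-bin histogram, and the 2x2 grid are all assumptions chosen for brevity.

```python
# Illustrative only: intensity stands in for a real low-level feature
# (color, texture, wavelet response). All names here are hypothetical.

def histogram(values, bins=4):
    """Normalized histogram of values assumed to lie in [0, 1]."""
    counts = [0] * bins
    for v in values:
        idx = min(int(v * bins), bins - 1)  # clamp v == 1.0 into the last bin
        counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * bins


def global_descriptor(image, bins=4):
    """One histogram over the whole image: no spatial information."""
    return histogram([v for row in image for v in row], bins)


def grid_descriptor(image, grid=2, bins=4):
    """One histogram per grid cell, concatenated: coarse spatial layout."""
    h, w = len(image), len(image[0])
    desc = []
    for gy in range(grid):
        for gx in range(grid):
            y0, y1 = gy * h // grid, (gy + 1) * h // grid
            x0, x1 = gx * w // grid, (gx + 1) * w // grid
            cell = [image[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            desc.extend(histogram(cell, bins))
    return desc


# Two toy 4x4 "scenes": dark-left/bright-right and its mirror image.
scene_a = [[0.0, 0.0, 0.9, 0.9] for _ in range(4)]
scene_b = [[0.9, 0.9, 0.0, 0.0] for _ in range(4)]

same_globally = global_descriptor(scene_a) == global_descriptor(scene_b)
same_on_grid = grid_descriptor(scene_a) == grid_descriptor(scene_b)
```

In this toy case the two scenes have identical whole-image histograms but different grid descriptors, which is the kind of rough spatial information a predefined grid adds.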
The authors gratefully acknowledge the contribution of NSF, HFSP, NGA, and DARPA.