Vision is such a complex problem that it seems most amenable to a reductionist approach. The strength of this approach is obvious: With simplified stimuli and tasks, it is possible to formulate precise questions and to obtain unambiguous answers. The weakness of this approach is also obvious: Fractionating and simplifying a complex problem can alter it in essential ways. Using isolated objects as stimuli has allowed us to learn a great deal about visual search, but it has also allowed us to neglect some of the fundamental challenges posed by real images. Only recently have researchers begun to consider how image clutter might alter the process of visual search (Rosenholtz, Li, & Nakano, 2007; Wolfe, Oliva, Horowitz, Butcher, & Bompas, 2002), as well as other visual processes, such as object recognition (Rolls, Aggelopoulos, & Zheng, 2003; Sheinberg & Logothetis, 2000).
To examine how image clutter affects visual search, it is necessary to measure it. The goal of this study is to devise a measure of clutter that is both intuitive and feasible. Our measure is based on the idea that the chunks for visual search are the regions formed by perceptual organization (Neisser, 1967). Perceptual organization is assumed to involve fast, bottom-up processes that exploit the statistical regularities found within objects (Brunswik & Kamiya, 1953). Although these processes do not access object memory, they produce regions that likely correspond to single objects (Elder & Goldberg, 2002; Fine et al., 2003). The basic phenomena of human perceptual organization were described early last century (Wertheimer, 1938), and subsequent research has revealed much about the underlying processes. Still, there is currently no fully integrated model of human perceptual organization. To define the regions in our images, we borrowed an image segmentation algorithm from the computer vision community. As a model of human perceptual organization, the segmentation algorithm is too simplistic: It does not take into account symmetry, collinearity, parallelism, and other grouping cues that humans use. Also, the algorithm makes decisions based only on local information, and so it may not always produce the optimal global segmentation. But the simplicity of this segmentation algorithm is also its strength: the algorithm is extremely efficient, which makes it feasible to use on large sets of big color images.
The clutter measure we developed using this algorithm has the very useful property of scale invariance. This property is useful because it allows us to study how clutter affects search even when we do not know the features involved in the task. For example, we can study the effect of clutter on search for a hairbrush even though we do not know whether the observer searches for coarse-scale features associated with the object's shape, for fine-scale features associated with its bristly texture, or for both simultaneously. We can apply our clutter measure to this task because it characterizes clutter over a range of scales.
In addition to being useful, the scale invariance we have observed may reveal something about the structure of natural images. Many image properties, including the number of regions, vary with scale as a power law (Field, 1987; Martin, Fowlkes, Tai, & Malik, 2001; Ruderman, 1997). The scale invariance that this implies has been explained by the fractal-like structure of objects (Mumford & Gidas, 2001). An object may have several parts, each of these parts may have several more parts, and these parts may have surface patterns due to the effects of lighting and texture. If images of objects have structure at arbitrarily small scales, then this could explain the power law relationship we observed between the number of image regions and the scale of segmentation. It could also explain why the bag images all had similar power law exponents. These stimuli differed primarily in the number, not the types, of objects they contained; if one image contained more objects than another, it would likely contain more structure at all scales.
When we applied the clutter measure to the photo-collage stimuli from our 2004 experiment, we found that these stimuli were also well fit by a power law, but the average exponent differed from that of the bag images. This is not entirely surprising given the artificial nature of the photo-collage stimuli. But even natural images might have different power law exponents depending on their content. Images of landscapes, for example, often have large expanses of water, grass, sand, or rock. These large textured regions have much fine-scale structure but little coarse-scale structure. Thus, the number of regions in a landscape image might fall off very steeply with increasing scale. To test this possibility, we applied the segmentation algorithm to 60 images depicting man-made objects (tools, cars, room interiors, and buildings) and 60 images depicting nature (plants and landscapes). The average exponent for the images of artifacts (−1.31, σ = 0.14) was comparable to that for the bag stimuli (−1.32, σ = 0.13), but the average exponent for the images of nature was more negative (−1.51, σ = 0.19). If different types of images have different exponents, then our clutter measure will work best for images with similar content.
A full model of visual search in natural images must consider several variables besides clutter. We intentionally minimized the role of these other variables in our task, but in many search tasks they dominate performance. For example, some search targets, such as stop signs or exit signs, are designed to be easily found in background clutter. The conditions that produce a salient target have been well studied and extensively modeled (Itti & Koch, 2000; Nothdurft, 2002).
Recently, Rosenholtz et al. (2007) proposed a model of clutter that focuses on target saliency. The model measures the range of simple features in an image and uses this range to predict whether a target added to the image is likely to attract attention. If the image has a limited range of features, then an added target is likely to be salient; but if the image has a wide range of features, then an added target is likely not to be salient. Note that the model predicts fast search times for an image composed of many very similar objects, because a target object added to such an image would be easy to find. This prediction of fast search times is less likely to apply, however, when the observer searches for one of the objects already in the image. It is this type of search, search for a nonsalient target, that requires a measure based on image regions.
This paper proposes a measure of clutter that can predict search times when the target is not salient and when the target's simple features and location are not known. The measure is intuitive and feasible. And because the measure is scale invariant, it can be applied to tasks in which the nature of the relevant image information is unknown. This makes the measure especially useful for studying search tasks that we know little about, such as the search for real objects in real scenes.