Visual clutter concerns designers of user interfaces and information visualizations. This should not surprise visual perception researchers because excess and/or disorganized display items can cause crowding, masking, decreased recognition performance due to occlusion, greater difficulty at both segmenting a scene and performing visual search, and so on. Given a reliable measure of the visual clutter in a display, designers could optimize display clutter. Furthermore, a measure of visual clutter could help generalize models like Guided Search (J. M. Wolfe, 1994) by providing a substitute for “set size” more easily computable on more complex and natural imagery. In this article, we present and test several measures of visual clutter, which operate on arbitrary images as input. The first is a new version of the Feature Congestion measure of visual clutter presented in R. Rosenholtz, Y. Li, S. Mansfield, and Z. Jin (2005). This Feature Congestion measure of visual clutter is based on the analogy that the more cluttered a display or scene is, the more difficult it would be to add a new item that would reliably draw attention. A second measure of visual clutter, Subband Entropy, is based on the notion that clutter is related to the visual information in the display. Finally, we test a third measure, Edge Density, used by M. L. Mack and A. Oliva (2004) as a measure of subjective visual complexity. We explore the use of these measures as stand-ins for set size in visual search models and demonstrate that they correlate well with search performance in complex imagery. This includes the search-in-clutter displays of J. M. Wolfe, A. Oliva, T. S. Horowitz, S. Butcher, and A. Bompas (2002) and M. J. Bravo and H. Farid (2004), as well as new search experiments. An additional experiment suggests that color variability, accounted for by Feature Congestion but not the Edge Density measure or the Subband Entropy measure, does matter for visual clutter.

*x*-value. Together, the two (curve plus *x*-value) predict the performance, for example, RT. Similarly, we would expect clutter to interact with discriminability to predict search performance in complex imagery.

*z*-score for the degree to which a feature vector, **T**, is an outlier to the local distribution of feature vectors, represented by their mean, *μ*_{D}, and covariance, **Σ**_{D}. The saliency, Δ, is given by the following equation:

Δ = √[(**T** − *μ*_{D})′ **Σ**_{D}^{−1} (**T** − *μ*_{D})],   (1)

where (**T** − *μ*_{D})′ indicates a vector transpose. The higher the target saliency, the easier the predicted search. The saliency, Δ, can be thought of as a formalization of Duncan and Humphreys' (1989) notion of the different roles of target–distractor versus distractor–distractor similarity in search performance. Rosenholtz's model predicts the results of a wide range of search experiments involving basic features such as those mentioned above (Rosenholtz, 1999, 2001a, 2001b; Rosenholtz, Nagy, & Bell, 2004), including experiments that were previously thought to involve search asymmetries (Rosenholtz, 2001a). More recently, this model has been implemented to run on arbitrary images and shown to be predictive of eye movement data (Rosenholtz & Jin, 2005).
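Equation 1 is the Mahalanobis distance between the target feature vector and the local distractor distribution. A minimal sketch (the function name and NumPy usage are ours, not from the original implementation):

```python
import numpy as np

def target_saliency(T, mu_D, Sigma_D):
    """Saliency (Equation 1): Mahalanobis distance of target feature
    vector T from the distractor distribution with mean mu_D and
    covariance Sigma_D. A target on the n-sigma ellipsoid gets
    saliency n."""
    diff = np.asarray(T, dtype=float) - np.asarray(mu_D, dtype=float)
    return float(np.sqrt(diff @ np.linalg.inv(Sigma_D) @ diff))
```

With an identity covariance this reduces to Euclidean distance; with a broader distractor distribution, the same target is less salient, capturing the distractor–distractor similarity role.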

**Σ**_{D}, specifies the size, aspect ratio, and orientation of the covariance ellipsoids. The innermost, 1*σ*, ellipsoid indicates feature vectors 1 *SD* away from the mean feature vector, *μ*_{D}. The 2*σ* ellipsoid indicates feature vectors that are 2 *SD*s away from the mean, and so on. A target with a feature vector on the *nσ* ellipsoid will have saliency Δ = *n*. The farther out the target feature vector lies on these nested ellipsoids, the easier the predicted search.

*d*, then any features outside of the local *dσ* covariance ellipse will suffice.

**Σ**_{D} therefore gives a measure of the local clutter in a display, that is, of the difficulty of adding a new, salient item to a local area of a display. Locally measuring the ellipsoid size and pooling over the relevant display area give a measure of clutter for the whole display. We call this the Feature Congestion measure of visual clutter. Displays with high clutter, according to this measure, are cluttered because feature space is already “congested” (filled by the covariance ellipsoid), so that there is little room for a new feature to draw attention. Too many colors, sizes, shapes, and/or motions are already clamoring for attention.
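In code, the local ellipsoid size can be summarized by √det(**Σ**_{D}) of the local feature covariance; the window size, sliding-window loop, and mean pooling below are our illustrative assumptions, not the published implementation:

```python
import numpy as np

def feature_congestion(features, window=5):
    """Sketch of a Feature Congestion-style clutter measure.

    `features` is an (H, W, d) array of per-pixel feature vectors
    (e.g., color and orientation channels). Local clutter is the
    "volume" of the local feature covariance ellipsoid,
    sqrt(det(Sigma)), pooled by averaging over the display."""
    H, W, d = features.shape
    r = window // 2
    local = np.zeros((H, W))
    for i in range(r, H - r):
        for j in range(r, W - r):
            patch = features[i - r:i + r + 1, j - r:j + r + 1].reshape(-1, d)
            cov = np.atleast_2d(np.cov(patch, rowvar=False))
            # Covariance is PSD; clamp tiny negative determinants from rounding.
            local[i, j] = np.sqrt(max(np.linalg.det(cov), 0.0))
    return float(local[r:H - r, r:W - r].mean())
```

A uniform display has a degenerate (zero-volume) covariance ellipsoid everywhere and thus zero clutter; feature variability inflates the ellipsoid and the measure.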

(*k* cos(2*θ*), *k* sin(2*θ*)), at each image location and scale, where *θ* is the local orientation and *k* is related to the extent to which there is a single strong orientation at the given scale and location.

The Shannon entropy within each subband (Equation 2) is −∑_{i} *p*(*i*) log₂ *p*(*i*), where *p* is the probability distribution of coefficients in the subband and is estimated by binning (i.e., quantizing) the subband coefficients into bins indexed by *i* and computing a histogram. This Shannon entropy essentially captures the bits required to encode the subband, for a given level of fidelity, as specified by the coarseness of the bins (quantization). Higher fidelity, that is, finer bins, requires more bits to encode. In our computations, the number of bins is equal to the square root of the number of coefficients, meaning that bands with fewer coefficients also have, on average, fewer coefficients per bin. This implicitly says that it is more important to faithfully reproduce lower frequencies than high frequencies; at lower frequencies, a wavelet transform has fewer coefficients, and thus, this strategy leads to finer bins, more bits required, and more faithful encoding.

- Convert the RGB image into CIELab.
- Decompose the luminance (L) and the chrominance (a, b) into wavelet subbands using a steerable pyramid.
- Bin the wavelet coefficients within each subband and compute the Shannon entropy within each subband according to Equation 2.
- Sum the subband entropies for the luminance (L) and for the chrominance channels (a, b).
- Compute a weighted sum of the chrominance and luminance entropies. We used a weighting of 0.08 for each of the chrominance channels and 0.84 for the luminance channel. Image encoders typically use fewer bits for chrominance than for luminance because chrominance can be coded with less fidelity without compromising image quality. The Subband Entropy measure, however, is not very sensitive to this weighting—a chrominance weighting of 0.22 with a luminance weighting of 0.56 gave nearly identical results on all examples in this article.
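The steps above can be sketched as follows. For a self-contained illustration we substitute a one-level Haar split for the steerable pyramid, and the function names are ours; this shows the binning-and-entropy idea rather than the published implementation:

```python
import numpy as np

def shannon_entropy(coeffs):
    """Entropy in bits of quantized subband coefficients (Equation 2),
    with the number of bins set to sqrt(number of coefficients)."""
    c = np.asarray(coeffs, dtype=float).ravel()
    nbins = max(int(np.sqrt(c.size)), 1)
    counts, _ = np.histogram(c, bins=nbins)
    p = counts[counts > 0] / c.size
    return float(-(p * np.log2(p)).sum())

def haar_subbands(img):
    """One-level Haar split into LL, LH, HL, HH subbands -- a simple
    stand-in for the steerable pyramid used in the paper."""
    a = img[0::2, 0::2]; b = img[0::2, 1::2]
    c = img[1::2, 0::2]; d = img[1::2, 1::2]
    return [(a + b + c + d) / 4, (a + b - c - d) / 4,
            (a - b + c - d) / 4, (a - b - c + d) / 4]

def subband_entropy(L, a, b, w_lum=0.84, w_chr=0.08):
    """Weighted sum of subband entropies for the CIELab luminance (L)
    and chrominance (a, b) channels; weights from the text."""
    ent = lambda ch: sum(shannon_entropy(s) for s in haar_subbands(ch))
    return w_lum * ent(L) + w_chr * (ent(a) + ent(b))
```

A constant image concentrates all coefficients in one bin per subband, giving zero entropy; texture spreads the histograms and raises the measure.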

*r* = .85. Edge density—the percentage of pixels that are edge pixels—alone led to a correlation with mean subjective rankings of *r* = .83. (Note, however, that the mean Spearman rank-order correlation between subjects was only *r* = .61, comparable with what we have found for subjective judgments of clutter in maps, *r* = .70 (Rosenholtz, Li, Mansfield, & Jin, 2005). In that article, we found a correlation of *r* = .77 between median subjective judgments of clutter in maps and an earlier version of the Feature Congestion clutter measure. None of these differences in correlation coefficients is significant, *p* > .05.) This high correlation between subjective judgments of complexity in indoor scenes and such a simple measure as edge density suggests that this simple measure is worth examining further. In what follows, we also examine the performance of an Edge Density measure of visual clutter. To obtain the Edge Density measure for each image, we applied MATLAB's Canny edge detector to each image and measured the density of edge pixels. The Canny edge detector has several parameters: a low threshold, a high threshold, and sigma. These parameters were set by hand to values that gave good results overall on the examples presented in this article. The low and high thresholds are used to find weak and strong edges, respectively, and the Canny edge detector keeps weak edges only if they are connected to strong edges. These thresholds were set to 0.11 and 0.27, respectively. The sigma parameter is the standard deviation of the Gaussian filter used in the computation of the gradient. It was set to the default, *σ* = 1, comparable with the finest scale in the Feature Congestion and Subband Entropy measures.
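As a rough stand-in for the MATLAB computation, edge density can be sketched with a plain Sobel-gradient threshold. Note this is not Canny: it omits the Gaussian smoothing, non-maximum suppression, and the low/high hysteresis thresholds (0.11 and 0.27 in the text), keeping only a single strong-edge threshold:

```python
import numpy as np

def edge_density(img, threshold=0.27):
    """Fraction of pixels whose normalized Sobel gradient magnitude
    exceeds `threshold` -- a simplified stand-in for the Canny-based
    Edge Density measure."""
    img = np.asarray(img, dtype=float)
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]) / 8.0
    ky = kx.T
    H, W = img.shape
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    for i in range(1, H - 1):
        for j in range(1, W - 1):
            patch = img[i - 1:i + 2, j - 1:j + 2]
            gx[i, j] = (patch * kx).sum()
            gy[i, j] = (patch * ky).sum()
    mag = np.hypot(gx, gy)
    return float((mag > threshold).mean())
```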

*σ* = 0.55° or 0.83°), orientation (45°, 90°, 135°, or 180°), and phase (sine or cosine). Before the experiment, observers were shown a number of examples of Gabors, and each observer received a training block with feedback to ensure that he or she understood the task and nature of the targets. Each target appeared exactly once in each of the six general locations of the image. The exact location of the target was determined by the superposition of one of six locations in the image and a small random position jitter of up to 0.75° in both *x* and *y* directions. The prejitter target locations were isoeccentric at approximately 7.6° from the initial fixation.

*r* = .74, target present; *r* = .76, target absent; *p* < .001), Subband Entropy (*r* = .75, target present; *r* = .77, target absent; *p* < .001), and Edge Density (*r* = .83, target present and target absent; *p* < .001). All three measures of visual clutter do a good job of predicting the effect of the background image—that is, the effect of the display clutter—on search performance. None of these correlation coefficients differ significantly from each other (*p* > .05).

(*L*_{max} − *L*_{min})/*L*_{mean}, where *L*_{mean} is taken over a local neighborhood of the target. It is unclear what the best measure of target contrast is for a target placed in a complex and nonstationary environment. However, by the above measure of contrast, thresholds were relatively consistent from location to location within a given image. A different staircase was used for each target location in each image.
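The (*L*_{max} − *L*_{min})/*L*_{mean} contrast above can be computed over a neighborhood of the target; the square neighborhood shape and radius below are our illustrative choices, since the text does not specify them:

```python
import numpy as np

def local_contrast(L, center, radius):
    """Target contrast (L_max - L_min) / L_mean over a square
    neighborhood of half-width `radius` around `center` in the
    luminance image L."""
    i, j = center
    patch = np.asarray(L, dtype=float)[max(i - radius, 0):i + radius + 1,
                                       max(j - radius, 0):j + radius + 1]
    return float((patch.max() - patch.min()) / patch.mean())
```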

*p* < .001): Feature Congestion, *r* = .93; Subband Entropy, *r* = .68; and Edge Density, *r* = .83. The Feature Congestion measure is significantly better than Subband Entropy (*p* < .05), but Edge Density is not significantly different from either Feature Congestion or Subband Entropy.

*T* versus *L*, and conjunction search for a red horizontal bar among green horizontal and red vertical bars. Set sizes were 4, 8, 12, and 18.

*T* among *L*s. Conjunction search is probably less cluttered than feature search for these examples because the red items in our conjunction search examples were actually much more similar to the background than the green items were and thus provided less clutter than the green items did. The ordering of clutter between feature and *T* versus *L* search makes some sense if we think of clutter as a more complicated stand-in for set size. Search performance is a matter of target–distractor discriminability and clutter or set size. The *T* versus *L* search is arguably difficult precisely because low target–distractor discriminability means that the display looks like a uniform (low clutter) texture. Target–distractor discriminability is high in a red among green feature search, so search is easy regardless of the level of “clutter.”

*T* among distractor *L*s, where both target and distractors appear against one of three “desk” images: empty, clean, and messy. The *T* and *L*s appear in predictable locations, and in one condition, they appear on yellow “post-it” notes, so these experiments have a significant top–down component to the search, and as a result, we might expect there to be limits to the predictive value of any bottom–up clutter measure. Nonetheless, even with top–down information that could guide the observers to ignore the background, they find that more “messy” backgrounds lead to additive RT costs in their search task. We asked whether the candidate clutter measures could predict the increase in clutter from empty, to clean, to messy desk, and in fact, they can. All clutter measures were applied to the background images without a target *T* or distractor *L*s present. The Feature Congestion clutter measures for the three desk images shown in Figure 10 are 3.4, 4.3, and 6.1, respectively. The Edge Density and Subband Entropy measures give similar results, although the Edge Density measure is highly sensitive to parameter settings—with the wrong settings, the empty desk is actually more cluttered than the clean desk due to the wood grain. Many measures of clutter are likely to give this ordering on images with such different levels of clutter, but nonetheless, it is important to confirm that a measure of clutter gets reasonable results on one of the few existing search-in-clutter experiments in the human vision literature.

*a* to *b* to an approximately constant value specifying a reddish hue, while allowing the L channel to vary freely. The gray and red maps have considerably less color variability than the original images, while containing approximately the same edge density and subband entropy in the L channel as the original image.
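A sketch of the two chrominance manipulations, operating directly on CIELab arrays: the fixed reddish (*a*, *b*) values below are hypothetical, and holding both channels constant is a simplification of fixing their ratio:

```python
import numpy as np

def gray_map(lab):
    """Zero the chrominance channels (a, b) of an (H, W, 3) CIELab
    image, leaving the luminance channel L untouched."""
    out = np.array(lab, dtype=float)
    out[..., 1:] = 0.0
    return out

def red_map(lab, ab=(40.0, 30.0)):
    """Set (a, b) to a fixed reddish chrominance (hypothetical values),
    so hue is roughly constant while L varies freely."""
    out = np.array(lab, dtype=float)
    out[..., 1] = ab[0]
    out[..., 2] = ab[1]
    return out
```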

*M* = 6.3) than either the red-map (*M* = 5.8, *p* < .05) or gray-map (*M* = 5.1, *p* < .01) images, as measured by Feature Congestion. The percentage of edge pixels was not significantly different between the red-map (*M* = 13.0%) and the original (*M* = 12.6%, *p* > .05) images, nor between the original and the gray-map images (*M* = 12.5%, *p* > .05). By the Subband Entropy measure, the original images (*M* = 4.1) were significantly *less* cluttered than the red-map images (*M* = 4.2, *p* < .01) and significantly more cluttered than the gray-map images (*M* = 3.4, *p* < .01).

*M*_{red} = 947 ms, *M*_{orig} = 1,179 ms, *t*(23) = 6.1 (paired *t* test), *p* < .001, and target-present trials, *M*_{red} = 552 ms, *M*_{orig} = 772 ms, *t*(23) = 11.8, *p* < .001. For target-present trials, the mean RT was significantly slower for original images than for gray-map images, *M*_{gray} = 619 ms, *t*(23) = 8.3, *p* < .001. For target-absent trials, there was no significant difference between RTs for the original images versus the gray-map images (*M*_{gray} = 1,172 ms, *p* = .83). The difficulty in target-absent trials for the gray-map images makes sense, given that one of the possible targets is also gray; in the absence of any target, observers take a long time to decide that there is no gray target against the gray background. This is also likely the explanation for the faster RTs for red-map images than gray-map images.

*available* feature space taken up by the covariance ellipsoid. The Subband Entropy measure may be able to handle particular gamut limitations, for example, monochrome displays, but cannot easily handle arbitrary limitations on the available feature space. We do not see a way for the Edge Density measure to handle a limited feature space.