We propose a definition of saliency by considering what the visual system is trying to optimize when directing attention. The resulting model is a Bayesian framework from which bottom-up saliency emerges naturally as the self-information of visual features, and overall saliency (incorporating top-down information with bottom-up saliency) emerges as the pointwise mutual information between the features and the target when searching for a target. An implementation of our framework demonstrates that our model's bottom-up saliency maps perform as well as or better than existing algorithms in predicting people's fixations in free viewing. Unlike existing saliency measures, which depend on the statistics of the particular image being viewed, our measure of saliency is derived from natural image statistics, obtained in advance from a collection of natural images. For this reason, we call our model SUN (Saliency Using Natural statistics). A measure of saliency based on natural image statistics, rather than based on a single test image, provides a straightforward explanation for many search asymmetries observed in humans; the statistics of a single test image lead to predictions that are not consistent with these asymmetries. In our model, saliency is computed locally, which is consistent with the neuroanatomy of the early visual system and results in an efficient algorithm with few free parameters.

*self-information*. When searching for a particular target, top-down effects from a known target emerge in our model as a log-likelihood term in the Bayesian formulation. The model also dictates how to combine bottom-up and top-down information, leading to *pointwise mutual information* as a measure of overall saliency. We develop a bottom-up saliency algorithm that performs as well as or better than state-of-the-art saliency algorithms at predicting human fixations when free-viewing images. Whereas existing bottom-up saliency measures are defined solely in terms of the image currently being viewed, ours is instead defined based on natural statistics (collected from a set of images of natural scenes), to represent the visual experience an organism would acquire during development. This difference is most notable when comparing with models that also use a Bayesian formulation (e.g., Torralba et al., 2006) or self-information (e.g., Bruce & Tsotsos, 2006). As a result of using natural statistics, our model provides a straightforward account of many human search asymmetries that cannot be explained based on the statistics of the test image alone. Unlike many models, our measure of saliency only involves local computation on images, with no calculation of global image statistics, saliency normalization, or winner-take-all competition. This makes our algorithm not only more efficient, but also more biologically plausible, as long-range connections are scarce in the lower levels of the visual system. Because of the focus on learned statistics from natural scenes, we call our saliency model SUN (Saliency Using Natural statistics).

$O = 1$ denotes the event that the target is present in the image, $L$ denotes the location of the target when $O = 1$, $F$ denotes the local features at location $L$, and $G$ denotes the global features of the image. The global features $G$ represent the scene gist. Their experiments show that the gist of a scene can be quickly determined, and the focus of their work largely concerns how this gist affects eye movements. The first term on the right side of Equation 1 is independent of the target and is defined as bottom-up saliency; Oliva and colleagues approximate this conditional probability distribution using the current image's statistics. The remaining terms on the right side of Equation 1 respectively address the distribution of features for the target, the likely locations for the target, and the probability of the target's presence, all conditioned on the scene gist. As we will see in the Bayesian framework for saliency section, our use of Bayes' rule to derive saliency is reminiscent of this approach. However, the probability of interest in the work of Oliva and colleagues is whether or not a target is present anywhere in the test image, whereas the probability we are concerned with is the probability that a target is present *at each point* in the visual field. In addition, Oliva and colleagues condition all of their probabilities on the values of global features. Conditioning on global features/gist affects the meaning of all terms in Equation 1, and justifies their use of current image statistics for bottom-up saliency. In contrast, SUN focuses on the effects of an organism's prior visual experience.
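For reference, a factorization matching the term-by-term description above follows from Bayes' rule applied to the joint event of target presence and location. The form below is a reconstruction from that description, not a quotation of Equation 1 of Oliva and colleagues, so the exact notation should be read as an assumption:

$$
p(O = 1, L \mid F, G) = \frac{1}{p(F \mid G)} \; p(F \mid O = 1, L, G) \; p(L \mid O = 1, G) \; p(O = 1 \mid G)
$$

Read this way, the first factor, $1 / p(F \mid G)$, does not involve the target and plays the role of bottom-up saliency, while the other three factors are the target's feature distribution, its likely locations, and its probability of presence, each conditioned on the gist $G$.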

$p(F)$, where $F$ is a vector of the visual features observed at a point in the image. The distribution of the features is estimated from a neighborhood of the point, which can be as large as the entire image. When the neighborhood of each point is indeed defined as the entire image of interest, as implemented by Bruce and Tsotsos (2006), the definition of saliency becomes identical to the bottom-up saliency term in Equation 1 from the work of Oliva and colleagues (Oliva et al., 2003; Torralba et al., 2006). It is worth noting, however, that the feature spaces used in the two models are different. Oliva and colleagues use biologically inspired linear filters of different orientations and scales. These filter responses are known to correlate with each other; for example, a vertical bar in the image will activate a filter tuned to vertical bars but will also activate (to a lesser degree) a filter tuned to 45-degree-tilted bars. The joint probability of the entire feature vector is estimated using multivariate Gaussian distributions (Oliva et al., 2003) and later multivariate generalized Gaussian distributions (Torralba et al., 2006). Bruce and Tsotsos (2006), on the other hand, employ features that were learned from natural images using independent component analysis (ICA). These have been shown to resemble the receptive fields of neurons in primary visual cortex (V1), and their responses have the desirable property of sparsity. Furthermore, the features learned are approximately independent, so the joint probability of the features is just the product of each feature's marginal probability, simplifying the probability estimation without making unreasonable independence assumptions.
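Written out, the simplification that approximate independence provides is the following (the component index $i$ is notation introduced here for illustration):

$$
p(F = f) \approx \prod_i p(F_i = f_i)
\quad\Longrightarrow\quad
-\log p(F = f) \approx -\sum_i \log p(F_i = f_i)
$$

Self-information then becomes a sum of one-dimensional terms, each of which can be estimated from a single marginal distribution rather than from a high-dimensional joint density.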

Let $z$ denote a point in the visual field. A point here is loosely defined; in the implementation described in the Implementation section, a point corresponds to a single image pixel. (In other contexts, a point could refer to other things, such as an object; Zhang et al., 2007.) We let the binary random variable $C$ denote whether or not a point belongs to a target class, let the random variable $L$ denote the location (i.e., the pixel coordinates) of a point, and let the random variable $F$ denote the visual features of a point. Saliency of a point $z$ is then defined as $p(C = 1 \mid F = f_z, L = l_z)$, where $f_z$ represents the feature values observed at $z$, and $l_z$ represents the location (pixel coordinates) of $z$. This probability can be calculated using Bayes' rule:

We assume^{1} for simplicity that features and location are independent and conditionally independent given $C = 1$:

$s_z$ and to $\log s_z$, which is given by:

The first term on the right side of this equation, $-\log p(F = f_z)$, depends only on the visual features observed at the point and is independent of any knowledge we have about the target class. In information theory, $-\log p(F = f_z)$ is known as the *self-information* of the random variable $F$ when it takes the value $f_z$. Self-information increases when the probability of a feature decreases; in other words, rarer features are more informative. We have already discussed self-information in the context of previous work, but as we will see later, SUN's use of self-information differs from that of previous approaches.

$p(F = f_z \mid C = 1)$, is a log-likelihood term that favors feature values that are consistent with our knowledge of the target. For example, if we know that the target is green, then the log-likelihood term will be much larger for a green point than for a blue point. This corresponds to the top-down effect when searching for a known target, consistent with the finding that human eye movement patterns during iconic visual search can be accounted for by a maximum likelihood procedure for computing the most likely location of a target (Rao, Zelinsky, Hayhoe, & Ballard, 2002).

$p(C = 1 \mid L = l_z)$, is independent of visual features and reflects any prior knowledge of where the target is likely to appear. It has been shown that if the observer is given a cue of where the target is likely to appear, the observer attends to that location (Posner & Cohen, 1984). For simplicity and fairness of comparison with Bruce and Tsotsos (2006), Gao and Vasconcelos (2007), and Itti et al. (1998), we assume location invariance (no prior information about the locations of potential targets) and omit the location prior; in the Results section, we will further discuss the effects of the location prior.

*pointwise mutual information* between the visual feature and the presence of a target, is a single term that expresses overall saliency. Intuitively, it favors feature values that are more likely in the presence of a target than in a target's absence.
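The steps described in this section can be collected into a short derivation. This is a sketch reconstructed from the definitions and the term-by-term description above, assuming the usual form of Bayes' rule and the two independence assumptions stated in the text; the last line is the decomposition that the text refers to as Equation 8.

$$
\begin{aligned}
s_z &= p(C = 1 \mid F = f_z, L = l_z)
     = \frac{p(F = f_z, L = l_z \mid C = 1)\, p(C = 1)}{p(F = f_z, L = l_z)} \\
    &= \frac{p(F = f_z \mid C = 1)\, p(L = l_z \mid C = 1)\, p(C = 1)}{p(F = f_z)\, p(L = l_z)} \\
    &= \frac{1}{p(F = f_z)} \; p(F = f_z \mid C = 1) \; p(C = 1 \mid L = l_z), \\
\log s_z &= -\log p(F = f_z) + \log p(F = f_z \mid C = 1) + \log p(C = 1 \mid L = l_z).
\end{aligned}
$$

Grouping the first two terms of the last line gives $\log \dfrac{p(F = f_z \mid C = 1)}{p(F = f_z)} = \log \dfrac{p(F = f_z, C = 1)}{p(F = f_z)\, p(C = 1)}$, which is the pointwise mutual information between the features and the presence of a target.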

*free-viewing* condition), the organism's attention should be directed to any *potential* targets in the visual field, despite the fact that the features associated with the target class are unknown. In this case, the log-likelihood term in Equation 8 is unknown, so we omit this term from the calculation of saliency (this can also be thought of as assuming that for an unspecified target, the likelihood distribution is uniform over feature values). Overall saliency then reduces to just the self-information term: $\log s_z = -\log p(F = f_z)$. We take this to be our definition of bottom-up saliency. It implies that the rarer a feature is, the more it will attract our attention.
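A minimal sketch of this definition follows, assuming a toy discrete feature (an orientation label) in place of real filter responses. The function names, the labels, and the training counts are all hypothetical and chosen only for illustration. The point it shows is that the probabilities come from a training collection gathered in advance, not from the image being scored.

```python
import math
from collections import Counter


def fit_feature_probabilities(training_features):
    """Estimate p(F = f) by relative frequency over a training collection."""
    counts = Counter(training_features)
    total = sum(counts.values())
    return {feature: count / total for feature, count in counts.items()}


def bottom_up_saliency(feature, probabilities):
    """Self-information of an observed feature value: -log p(F = f)."""
    return -math.log(probabilities[feature])


# Hypothetical "natural statistics": vertical is common, tilted is rare.
training_features = ["vertical"] * 70 + ["horizontal"] * 25 + ["tilted"] * 5
probabilities = fit_feature_probabilities(training_features)

# A test image in which the tilted feature happens to be the majority.
test_image_features = ["tilted", "tilted", "vertical", "tilted"]
saliency_map = [bottom_up_saliency(f, probabilities) for f in test_image_features]
```

Because the scores depend only on the training distribution, the tilted points remain more salient than the vertical point even though tilted is the more frequent feature within this particular test image.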

$p(F = f_z)$ differs somewhat from how it is often used in the Bayesian framework. Often, the goal of the application of Bayes' rule when working with images is to classify the provided image. In that case, the features are given, and the $-\log p(F = f_z)$ term functions as a (frequently omitted) normalizing constant. When the task is to find the point most likely to be part of a target class, however, $-\log p(F = f_z)$ plays a much more significant role as its value varies over the points of the image. In this case, its role in normalizing the likelihood is more important as it acts to factor in the potential usefulness of each feature to aid in discrimination. Assuming that targets are relatively rare, a target's feature is most useful if that feature is comparatively rare in the background environment, as otherwise the frequency with which that feature appears is likely to be more distracting than useful. As a simple illustration of this, consider that even if you know with absolute certainty that the target is red, i.e., $p(F = \text{red} \mid C = 1) = 1$, that fact is useless if everything else in the world is red as well.
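The red-target illustration can be made concrete with a small numerical sketch. It assumes the location prior is omitted, so that overall saliency is the log-likelihood plus the self-information; the function name and the two example probabilities are hypothetical.

```python
import math


def overall_saliency(p_f_given_target, p_f):
    """log p(F = f | C = 1) - log p(F = f), with the location prior omitted."""
    return math.log(p_f_given_target) - math.log(p_f)


# The target is certainly red, and so is everything else: red carries no information.
all_red_world = overall_saliency(p_f_given_target=1.0, p_f=1.0)

# The target is certainly red, but red is rare in the environment (assumed 1%).
rare_red_world = overall_saliency(p_f_given_target=1.0, p_f=0.01)
```

In the first case the two terms cancel and the score is zero, whereas in the second case the same likelihood yields a large score purely because the feature is rare in the background.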

$p(F = f_z \mid C = 1)$ will vary for different target classes, while $p(F = f_z)$ remains the same regardless of the choice of targets. While there are specific distributions of $p(F = f_z \mid C = 1)$ for which SUN's bottom-up saliency measure would be unhelpful in finding targets, these are special cases that are not likely to hold in general (particularly in the free-viewing condition, where the set of potential targets is largely unknown). That is, minimizing $p(F = f_z)$ will generally advance the goal of increasing the ratio $p(F = f_z \mid C = 1) / p(F = f_z)$, implying that points with rare features should be found “interesting.”

$p(F = f_z)$. Here, a point $z$ corresponds to a pixel in the image. For the remainder of the paper, we will drop the subscript $z$ for notational simplicity. In this algorithm, $F$ is a random vector of filter responses, $F = [F_1, F_2, \ldots]$, where the random variable $F_i$ represents the response of the $i$th filter at a pixel, and $f = [f_1, f_2, \ldots]$ are the values of these filter responses at this pixel location.

$r$, $g$, and $b$ denote the red, green, and blue components of an input image pixel. The intensity ($I$), red/green ($RG$), and blue/yellow ($BY$) channels are calculated as:

^{2}

($x$, $y$) is the location in the filter. These filters are convolved with the intensity and color channels ($I$, $RG$, and $BY$) to produce the filter responses. We use four scales of DoG ($\sigma$ = 4, 8, 16, or 32 pixels) on each of the three channels, leading to 12 feature response maps. The filters are shown in Figure 1, top.
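The following sketch shows how a bank of 12 DoG features arises from four scales and three channels. The kernel form is an assumption: it uses a generic difference of two normalized isotropic Gaussians with a surround-to-center width ratio of 1.6 and a window of three surround standard deviations, none of which is taken from the text. `dog_kernel`, `SURROUND_RATIO`, and `FEATURES` are hypothetical names.

```python
import math

SURROUND_RATIO = 1.6  # assumption: surround Gaussian is 1.6x wider than the center


def gaussian_2d(x, y, sigma):
    """Normalized isotropic 2-D Gaussian evaluated at offset (x, y)."""
    two_var = 2.0 * sigma * sigma
    return math.exp(-(x * x + y * y) / two_var) / (math.pi * two_var)


def dog_kernel(sigma, half_width=None):
    """Difference-of-Gaussians kernel as a list of rows (center minus surround)."""
    if half_width is None:
        # Assumption: truncate at three standard deviations of the surround.
        half_width = int(math.ceil(3 * SURROUND_RATIO * sigma))
    coords = range(-half_width, half_width + 1)
    return [
        [gaussian_2d(x, y, sigma) - gaussian_2d(x, y, SURROUND_RATIO * sigma)
         for x in coords]
        for y in coords
    ]


CHANNELS = ("I", "RG", "BY")   # intensity, red/green, blue/yellow
SCALES = (4, 8, 16, 32)        # sigma in pixels, as listed in the text
FEATURES = [(channel, sigma) for channel in CHANNELS for sigma in SCALES]

kernel = dog_kernel(SCALES[0])
```

Each `(channel, sigma)` pair in `FEATURES` corresponds to one response map, obtained by convolving that channel with the kernel for that scale. Because both Gaussians are normalized, the kernel sums to approximately zero, so uniform regions produce near-zero responses.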

$F_i$, we used an algorithm proposed by Song (2006) to fit a zero-mean generalized Gaussian distribution, also known as an exponential power distribution, to the filter response data:

$\theta$ is the shape parameter, $\sigma$ is the scale parameter, and $f$ is the filter response. This resulted in one shape parameter, $\theta_i$, and one scale parameter, $\sigma_i$, for each of the 12 filters: $i = 1, 2, \ldots, 12$. Figure 1 shows the distributions of the four DoG filter responses on the intensity ($I$) channel across the training set of natural images and the fitted generalized Gaussian distributions. As the figure shows, the generalized Gaussians provide an excellent fit to the data.
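As a sketch of how fitted parameters could be turned into a saliency value, the code below evaluates the log-density of a zero-mean generalized Gaussian in its standard parameterization, $p(f; \sigma, \theta) = \frac{\theta}{2\sigma\,\Gamma(1/\theta)} \exp\left(-\left|f/\sigma\right|^{\theta}\right)$, and sums the per-filter self-information. Two things are assumptions here: that this parameterization matches the one fitted in the text, and that the filter responses are treated as independent so that their log-probabilities add. The parameter values in the example are hypothetical and are not fitted values.

```python
import math


def generalized_gaussian_logpdf(f, sigma, theta):
    """Log-density of a zero-mean generalized Gaussian (scale sigma, shape theta)."""
    log_norm = math.log(theta) - math.log(2.0 * sigma) - math.lgamma(1.0 / theta)
    return log_norm - abs(f / sigma) ** theta


def bottom_up_log_saliency(responses, sigmas, thetas):
    """-sum_i log p(F_i = f_i), assuming independent filter responses."""
    return -sum(
        generalized_gaussian_logpdf(f, s, t)
        for f, s, t in zip(responses, sigmas, thetas)
    )


# Hypothetical parameters for two filters (illustrative only).
sigmas = [1.0, 2.0]
thetas = [0.7, 0.7]

weak = bottom_up_log_saliency([0.1, 0.2], sigmas, thetas)
strong = bottom_up_log_saliency([3.0, 6.0], sigmas, thetas)
```

Up to an additive constant that does not depend on the image, the score is $\sum_i |f_i / \sigma_i|^{\theta_i}$, so large-magnitude responses, which are rare under a sparse zero-mean distribution, receive high saliency. Only the per-filter parameters need to be stored, which keeps the online computation local to each pixel.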

^{3} Figure 2 shows the linear ICA features obtained from the training image patches.

^{4}

^{5} Bruce and Tsotsos (2006), implemented by the original authors,^{6} and Gao and Vasconcelos (2007), implemented by the original authors. The performance of these algorithms evaluated using the measures described above is summarized in Table 1. For the evaluation of each algorithm, the shuffling of the saliency maps is repeated 100 times. Each time, KL divergence is calculated between the histograms of unshuffled saliency and shuffled saliency on human fixations. When calculating the area under the ROC curve, we also use 100 random permutations. The mean and the standard errors are reported in the table.
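The repeated-shuffle KL evaluation can be sketched as follows. This is an illustration under stated assumptions, not the evaluation code used for Table 1: saliency values are assumed to lie in [0, 1], the "shuffled" values are assumed to be drawn at random from a pool of baseline saliency values, and the bin count and smoothing constant are arbitrary. All function names are hypothetical.

```python
import math
import random


def histogram(values, n_bins, lo, hi):
    """Count values into n_bins equal-width bins on [lo, hi]."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        index = min(int((v - lo) / width), n_bins - 1)
        counts[max(index, 0)] += 1
    return counts


def kl_divergence(p_counts, q_counts, eps=1e-9):
    """KL(P || Q) between two histograms, smoothed by eps to avoid log(0)."""
    p_total = sum(p_counts) + eps * len(p_counts)
    q_total = sum(q_counts) + eps * len(q_counts)
    kl = 0.0
    for p_c, q_c in zip(p_counts, q_counts):
        p = (p_c + eps) / p_total
        q = (q_c + eps) / q_total
        kl += p * math.log(p / q)
    return kl


def mean_and_standard_error(samples):
    """Sample mean and standard error of the mean."""
    n = len(samples)
    mean = sum(samples) / n
    variance = sum((s - mean) ** 2 for s in samples) / (n - 1)
    return mean, math.sqrt(variance / n)


def shuffled_kl_scores(fixation_saliency, baseline_pool, n_repeats=100,
                       n_bins=10, seed=0):
    """KL between saliency at fixations and at randomly drawn baseline points."""
    rng = random.Random(seed)
    target = histogram(fixation_saliency, n_bins, 0.0, 1.0)
    scores = []
    for _ in range(n_repeats):
        sample = rng.sample(baseline_pool, len(fixation_saliency))
        scores.append(kl_divergence(target, histogram(sample, n_bins, 0.0, 1.0)))
    return scores
```

Calling `mean_and_standard_error(shuffled_kl_scores(...))` yields the kind of mean and standard error over 100 repetitions that the text reports; a larger mean indicates that saliency at human fixations is distributed differently from saliency at the baseline points.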

$p < 10^{-57}$) and Gao and Vasconcelos' (2007) algorithm ($p < 10^{-14}$), where significance was measured with a two-tailed $t$-test over different random shuffles using the KL metric. Between Method 1 (DoG features) and Method 2 (ICA features), the ICA features work significantly better ($p < 10^{-32}$). There are further advantages to using ICA features: efficient coding has been proposed as one of the fundamental goals of the visual system (Barlow, 1994), and linear ICA has been shown to generate receptive fields akin to those found in primary visual cortex (V1) (Bell & Sejnowski, 1997; Olshausen & Field, 1996). In addition, generating the feature set using natural image statistics means that both the feature set and the distribution over features can be calculated simultaneously. However, it is worth noting that the online computations for Method 1 (using DoG features) take significantly less time since only 12 DoG features are used compared to 362 ICA features in Method 2. There is thus a trade-off between efficiency and performance in our two methods. The results are similar (but less differentiated) using the ROC area metric of Tatler et al. (2005).

$p = 0.0035$) on this data set by the KL metric, and worse by the ROC metric, although in both cases the scores are numerically quite close. This similarity in performance is not surprising, for two reasons. First, since both algorithms construct their feature sets using ICA, the feature sets are qualitatively similar. Second, although SUN uses the statistics learned from a training set of natural images whereas Bruce and Tsotsos (2006) calculate these statistics using only the current test image, the response distribution for a low-level feature on a single image of a complex natural scene will generally be close to overall natural scene statistics. However, the results clearly show that SUN is not penalized by breaking from the standard assumption that saliency is defined by deviation from one's neighbors; indeed, SUN actually performs at the state of the art. In the next section, we'll argue why SUN's use of natural statistics is actually preferable to methods that only use local image statistics.

*search asymmetry*, and this particular example corresponds to findings that “prototypes do not pop out” because the vertical is regarded as a prototypical orientation (Treisman & Gormican, 1988; Treisman & Souther, 1985; Wolfe, 2001).

$v$ is given by $p(V = v) \propto 1/v$. Since longer line segments have lower probability in images of natural scenes, the SUN model implies that longer line segments will be more salient.
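As a one-line check of this implication, assume the proportionality holds over the range of lengths of interest, with a normalizing constant written here as $Z$ (a symbol introduced for illustration):

$$
p(V = v) = \frac{1}{Z\,v}
\quad\Longrightarrow\quad
-\log p(V = v) = \log v + \log Z
$$

Self-information, and hence bottom-up saliency, therefore increases with $\log v$: the longer of two segments is always the more salient one under this distribution, regardless of which is the target and which is the distractor.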

Model | Statistics calculated using | Global operations | Statistics calculated on image |
---|---|---|---|
Itti et al. (1998) | N/A | Sub-map normalization | None |
Bruce and Tsotsos (2006) | Current image | Probability estimation | Once for each image |
Gao and Vasconcelos (2007) | Local region of current image | None | Twice for each pixel |
SUN | Training set of natural images (pre-computed offline) | None | None |

*Psychonomic Science, 10,* 207–208.

*Nature Reviews Neuroscience, 2,* 194–203.

*International Journal of Computer Vision, 45,* 83–105.