It has been suggested that saliency mechanisms play a role in perceptual organization. This work evaluates the plausibility of a recently proposed generic principle for visual saliency: that all saliency decisions are optimal in a decision-theoretic sense. The discriminant saliency hypothesis is combined with the classical assumption that bottom-up saliency is a center-surround process to derive a (decision-theoretic) optimal saliency architecture. Under this architecture, the saliency of each image location is equated to the discriminant power of a set of features with respect to the classification problem that opposes stimuli at center and surround. The optimal saliency detector is derived for various stimulus modalities, including intensity, color, orientation, and motion, and shown to make accurate quantitative predictions of various psychophysics of human saliency for both static and motion stimuli. These include some classical nonlinearities of orientation and motion saliency and a Weber law that governs various types of saliency asymmetries. The discriminant saliency detectors are also applied to various saliency problems of interest in computer vision, including the prediction of human eye fixations on natural scenes, motion-based saliency in the presence of ego-motion, and background subtraction in highly dynamic scenes. In all cases, the discriminant saliency detectors outperform previously proposed methods from both the saliency and the general computer vision literatures.

*saliency map* (Koch & Ullman, 1985), through either the combination of intermediate feature-specific saliency maps (e.g., Itti & Koch, 2001; Itti et al., 1998; Wolfe, 1994), or the direct analysis of feature interactions (e.g., Li, 2002).

*optimal in a decision-theoretic sense*. This hypothesis is denoted as *discriminant saliency* and was first proposed by Gao and Vasconcelos (2005) in a computer vision context. While initially posed as an explanation for top-down saliency, of interest mostly for object recognition, the hypothesis of decision-theoretic optimality is much more general and indeed applicable to any form of center-surround saliency. This has motivated us to test its ability to explain the psychophysics of human saliency. Since these are better documented for the bottom-up neural pathway than for its top-down counterpart, we derive a bottom-up saliency detector that is optimal in a decision-theoretic sense. In particular, we hypothesize that the most salient locations of the visual field are those that enable the discrimination between feature responses in center and surround with smallest expected probability of error. This is referred to as the *discriminant center-surround hypothesis* and, by definition, produces saliency measures that are optimal in a classification sense. We derive optimal mechanisms for a number of saliency problems, ranging from static spatial saliency to motion-based saliency in the presence of ego-motion or even complex dynamic backgrounds. The ability of these mechanisms to both reproduce the classical psychophysics of human saliency and solve saliency problems of interest for computer vision is then evaluated. From the psychophysics point of view, it is shown that, for both static and moving stimuli, discriminant saliency not only explains all qualitative observations (such as pop-out for single feature search, and disregard of feature conjunctions) previously replicated by existing models but also makes quantitative predictions (such as the nonlinear aspects of human saliency), which are beyond their reach. From the computer vision standpoint, it is shown that the saliency algorithms now proposed can predict human eye fixations with greater accuracy than previous approaches and outperform state-of-the-art algorithms for background subtraction. In particular, it is shown that, by simply modifying the probabilistic models employed in the discriminant saliency measure (from well-known models of natural image statistics, to the statistics of simple motion features, to more sophisticated dynamic texture models), it is possible to produce saliency detectors for either static or dynamic stimuli that are insensitive to background image variability due to texture, ego-motion, or scene dynamics.

*optimal in a decision-theoretic sense*, e.g., that have minimum probability of error. This goal is complemented by one of *computational parsimony*, i.e., that the perceptual mechanisms should be as efficient as possible. Discriminant saliency is defined with respect to two classes of stimuli: a class of *stimuli of interest* and a *null* hypothesis, composed of all the stimuli that are not salient. Given these two classes, the locations of the visual field that can be classified, with *lowest expected probability of error*, as containing stimuli of interest are denoted as salient. Mathematically, this is accomplished by (1) defining a binary classification problem that opposes stimuli of interest to the null hypothesis and (2) equating the saliency of each location in the visual field to the discriminant power (with respect to this problem) of the visual features extracted from that location. This definition of saliency is applicable to a broad set of problems. For example, different specifications of stimuli of interest and null hypothesis enable its specialization to both top-down and bottom-up saliency. From a computational standpoint, the search for discriminant features is a well-defined and tractable problem that has been widely studied in the literature. These properties have been exploited by Gao and Vasconcelos (2005) to derive an optimal top-down saliency detector, which equates stimuli of interest to an object class and null hypothesis to all other object classes. In this work, we consider the problem of bottom-up saliency.

$l$, $\mathbf{X}$. The saliency of location $l$ is equated to the power of $\mathbf{X}$ to discriminate between the *center* and *surround* of $l$ based on the distributions of the feature responses estimated from the two regions.

$W_l^0$ and $W_l^1$ are observations from a random process $\mathbf{X}(l) = (X_1(l), \ldots, X_d(l))$, of dimension $d$, drawn conditionally on the state of a hidden variable $Y(l)$. The feature vector observed at location $j$ is denoted as $\mathbf{x}(j) = (x_1(j), \ldots, x_d(j))$. Feature vectors $\mathbf{x}(j)$ such that $j \in W_l^c$, $c \in \{0, 1\}$, are drawn from class $c$ according to conditional densities $P_{\mathbf{X}(l)|Y(l)}(\mathbf{x}|c)$. Vectors drawn with $Y(l) = c$ are referred to as belonging to the *center class* if $c = 1$ and the *surround class* if $c = 0$. The saliency of location $l$, $S(l)$, is equal to the discriminant power of $\mathbf{X}$ for the classification of the observed feature vectors $\mathbf{x}(j)$, $\forall j \in W_l = W_l^0 \cup W_l^1$, into *center* and *surround*. This is quantified by the mutual information between features, $\mathbf{X}$, and class label, $Y$,

$$S(l) = I_l(\mathbf{X}; Y) = \sum_{c} \int P_{\mathbf{X}(l),Y(l)}(\mathbf{x}, c) \log \frac{P_{\mathbf{X}(l),Y(l)}(\mathbf{x}, c)}{P_{\mathbf{X}(l)}(\mathbf{x})\, P_{Y(l)}(c)}\, d\mathbf{x}. \tag{1}$$

The $l$ subscript emphasizes the fact that both the classification problem and the mutual information are defined locally, within $W_l$. The function $S(l)$ is referred to as the *saliency map*. Note that Equation 1 defines the discriminant saliency measure in a very generic sense, independently of the stimulus dimension under consideration or any specific feature sets. In fact, Equation 1 can be applied to any type of stimuli and any type of local features, as long as the probability densities $P_{\mathbf{X}(l)|Y(l)}(\mathbf{x}|c)$ can be estimated from the center and surround neighborhoods. In what follows, we derive the discriminant center-surround saliency for a variety of features, including intensity, color, orientation, motion, and even more complicated dynamic texture models.
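As a rough illustration of Equation 1 for a single scalar feature, the sketch below estimates the mutual information between feature responses and the center/surround label from histograms. The function name `mutual_information`, the bin count, and the synthetic Gaussian responses are assumptions made for illustration; the text does not prescribe a histogram estimator.

```python
import math
import random


def mutual_information(center, surround, n_bins=16):
    """Histogram estimate (in bits) of I(X; Y) for a scalar feature X
    and a binary label Y (1 = center, 0 = surround)."""
    samples = list(center) + list(surround)
    lo, hi = min(samples), max(samples)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant feature

    def histogram(values):
        counts = [0] * n_bins
        for v in values:
            idx = min(int((v - lo) / width), n_bins - 1)
            counts[idx] += 1
        return counts

    counts = {1: histogram(center), 0: histogram(surround)}
    total = len(samples)
    prior = {1: len(center) / total, 0: len(surround) / total}

    mi = 0.0
    for b in range(n_bins):
        p_x = (counts[0][b] + counts[1][b]) / total
        for c in (0, 1):
            p_xc = counts[c][b] / total
            if p_xc > 0.0:
                mi += p_xc * math.log2(p_xc / (p_x * prior[c]))
    return mi


# Hypothetical feature responses: the surround is always N(0, 1).
rng = random.Random(0)
surround = [rng.gauss(0.0, 1.0) for _ in range(800)]
center_same = [rng.gauss(0.0, 1.0) for _ in range(200)]  # same statistics
center_diff = [rng.gauss(0.0, 5.0) for _ in range(200)]  # distinct statistics

s_same = mutual_information(center_same, surround)
s_diff = mutual_information(center_diff, surround)
print(f"center like surround:   {s_same:.3f} bits")
print(f"center unlike surround: {s_diff:.3f} bits")
```

A location whose center responses follow the same distribution as the surround should score near zero, while a center with clearly different statistics should score higher.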

$I$) and four broadly tuned color channels ($R$, $G$, $B$, and $Y$), $r/I$, $g/I$, $b/I$, and $\lfloor x \rfloor_+ = \max(x, 0)$. The four color channels are, in turn, combined into two color opponency channels, $R - G$ for red/green and $B - Y$ for blue/yellow opponency. The two opponency channels, together with the intensity map, are convolved with three Mexican hat wavelet filters, centered at spatial frequencies 0.02, 0.04, and 0.08 cycles/pixel, to generate nine feature channels. The feature space consists of these nine channels, plus a Gabor decomposition of the intensity map, implemented with a dictionary of zero-mean Gabor filters at 3 spatial scales (centered at frequencies of 0.08, 0.16, and 0.32 cycles/pixel) and 4 directions (evenly spread from 0 to $\pi$). Note that, following the tradition of the image processing and computational modeling literatures, we measure all filter frequencies in units of “cycles/pixel (cpp).” For a given set of viewing conditions, these can be converted to the “cycles/degree of visual angle (cpd)” more commonly used in psychophysics. For example, in all psychophysics experiments discussed later, the viewing conditions dictate a conversion rate of 30 pixels/degree of visual angle. In this case, the frequencies of these Gabor filters are equivalent to approximately 2.5, 5, and 10 cpd.
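The bookkeeping of the feature space and the unit conversion can be checked with a short sketch. The names `feature_bank` and `cpp_to_cpd` are hypothetical, and taking the four orientations as 0, π/4, π/2, and 3π/4 (that is, excluding π) is an assumption about how "evenly spread from 0 to π" is meant.

```python
import math

PIXELS_PER_DEGREE = 30.0  # conversion rate quoted for the psychophysics experiments

MEXICAN_HAT_CPP = (0.02, 0.04, 0.08)  # applied to I, R-G, and B-Y
GABOR_CPP = (0.08, 0.16, 0.32)        # applied to the intensity map only
# Assumed orientations: 0, pi/4, pi/2, 3*pi/4.
GABOR_ORIENTATIONS = tuple(k * math.pi / 4 for k in range(4))


def cpp_to_cpd(cycles_per_pixel, pixels_per_degree=PIXELS_PER_DEGREE):
    """Convert cycles/pixel to cycles/degree of visual angle."""
    return cycles_per_pixel * pixels_per_degree


def feature_bank():
    """Enumerate (filter type, input channel, frequency, orientation) tuples."""
    bank = []
    for channel in ("I", "R-G", "B-Y"):
        for freq in MEXICAN_HAT_CPP:
            bank.append(("mexican_hat", channel, freq, None))
    for freq in GABOR_CPP:
        for theta in GABOR_ORIENTATIONS:
            bank.append(("gabor", "I", freq, theta))
    return bank


bank = feature_bank()
print(len(bank), "feature channels")
print([round(cpp_to_cpd(f), 1) for f in GABOR_CPP], "cpd for the Gabor scales")
```

Under these assumptions the bank holds 9 Mexican hat channels plus 12 Gabor channels. At exactly 30 pixels/degree the Gabor frequencies come out at 2.4, 4.8, and 9.6 cpd, which round to the 2.5, 5, and 10 cpd quoted in the text.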

*consistent* patterns of dependence across a very wide range of natural image classes (Buccigrossi & Simoncelli, 1999; Huang & Mumford, 1999). For example, Buccigrossi and Simoncelli (1999) have shown that, when a natural image is subject to a wavelet decomposition, the conditional distribution of any wavelet coefficient, given the state of the co-located coefficient of immediately coarser scale (known as its “parent”), invariably has a bow-tie shape. This implies that, while the coefficients are statistically dependent, their dependencies carry little information about the image class (Buccigrossi & Simoncelli, 1999; Vasconcelos & Vasconcelos, 2004). In the particular case of saliency, feature dependencies are not greatly informative about whether the observed feature vectors originate in the center or the surround. Experimental validation of this hypothesis (Vasconcelos, 2003; Vasconcelos & Vasconcelos, 2004, in press) has shown that, for natural images, Equation 1 is well approximated by the sum of marginal mutual informations between individual features and class label,

$$S(l) = \sum_{i=1}^{d} I_l(X_i; Y). \tag{3}$$

This approximation *does not* assume that the features are independently distributed, but simply that their dependencies are not informative about the class.

$\Gamma(z) = \int_0^\infty e^{-t} t^{z-1}\, dt$, $z > 0$, is the Gamma function, $\alpha$ is a *scale* parameter, and $\beta$ is a *shape* parameter. The parameter $\beta$ controls the decay rate from the peak value and defines a subfamily of the GGD (e.g., the Laplacian family when $\beta = 1$ or the Gaussian family when $\beta = 2$). When the class conditional densities, $P_{X|Y}(x|c)$, and the marginal density, $P_X(x)$, follow a GGD, the mutual information of Equation 3 has a closed form (Do & Vetterli, 2002)

$\mathrm{KL}[p \,\|\, q] = \int p(x) \log \frac{p(x)}{q(x)}\, dx$ is the Kullback–Leibler (K–L) divergence between $p(x)$ and $q(x)$. Hence, the discriminant saliency measure only requires the estimation of the $\alpha$ and $\beta$ parameters for the center and surround windows and the computation of Equations 3, 5, and 6.
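For reference, the quantities named in this passage can be written out under the standard zero-mean GGD parameterization used by Do and Vetterli (2002). This is a sketch under that assumption, and it does not attempt to reproduce the numbering or exact layout of the original equations.

```latex
\begin{align*}
% Zero-mean GGD with scale \alpha and shape \beta
P_X(x; \alpha, \beta) &= \frac{\beta}{2\alpha\,\Gamma(1/\beta)}
  \exp\left\{ -\left( \frac{|x|}{\alpha} \right)^{\beta} \right\} \\[4pt]
% Mutual information as a prior-weighted sum of K-L divergences (exact identity)
I(X; Y) &= \sum_{c} P_Y(c)\,
  \mathrm{KL}\left[ P_{X|Y}(x|c) \,\|\, P_X(x) \right] \\[4pt]
% K-L divergence between two GGDs
\mathrm{KL}\left[ P_X(x; \alpha_1, \beta_1) \,\|\, P_X(x; \alpha_2, \beta_2) \right]
  &= \log\left( \frac{\beta_1 \alpha_2\, \Gamma(1/\beta_2)}
                     {\beta_2 \alpha_1\, \Gamma(1/\beta_1)} \right)
   + \left( \frac{\alpha_1}{\alpha_2} \right)^{\beta_2}
     \frac{\Gamma\left( (\beta_2 + 1)/\beta_1 \right)}{\Gamma(1/\beta_1)}
   - \frac{1}{\beta_1}
\end{align*}
```

With these expressions, each marginal mutual information reduces to two K–L evaluations: one for the center density against the pooled density and one for the surround density against the pooled density.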

$\alpha_c$ and $\beta_c$, $c \in \{0, 1\}$) with conjugate (Gamma) priors, there is a one-to-one mapping between the discriminant saliency detector and a neural network that replicates the standard architecture of V1: a cascade of linear filtering, divisive normalization, quadratic nonlinearity, and spatial pooling. In the implementation presented in this article, we have instead adopted the method of moments for all parameter estimation because it is computationally more efficient on a nonparallel computer. Under the method of moments, $\alpha$ and $\beta$ are estimated through the relationships

$$\sigma^2 = \frac{\alpha^2\, \Gamma(3/\beta)}{\Gamma(1/\beta)}, \qquad \kappa = \frac{\Gamma(1/\beta)\, \Gamma(5/\beta)}{\Gamma^2(3/\beta)},$$

where $\sigma^2$ and $\kappa$ are, respectively, the variance and the kurtosis of $X$.
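A minimal sketch of moment-based GGD fitting and the resulting saliency value for one scalar feature is given below. It assumes the standard GGD moment identities and the Do and Vetterli (2002) K–L closed form. The function names, the bisection search over the shape parameter, its search bounds, and the synthetic data are illustrative choices, not details taken from the text.

```python
import math
import random


def ggd_kurtosis(beta):
    """Kurtosis of a zero-mean GGD: G(1/b) * G(5/b) / G(3/b)^2."""
    return math.exp(math.lgamma(1 / beta) + math.lgamma(5 / beta)
                    - 2 * math.lgamma(3 / beta))


def fit_ggd_moments(samples, beta_lo=0.2, beta_hi=10.0):
    """Method-of-moments estimate (alpha, beta) from variance and kurtosis."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / n
    kurt = sum((s - mean) ** 4 for s in samples) / n / var ** 2
    # Clamp to the range reachable within the (assumed) search bounds.
    kurt = min(max(kurt, ggd_kurtosis(beta_hi)), ggd_kurtosis(beta_lo))
    lo, hi = beta_lo, beta_hi
    for _ in range(60):  # GGD kurtosis decreases with beta, so bisect
        mid = 0.5 * (lo + hi)
        if ggd_kurtosis(mid) > kurt:
            lo = mid
        else:
            hi = mid
    beta = 0.5 * (lo + hi)
    alpha = math.sqrt(var * math.exp(math.lgamma(1 / beta) - math.lgamma(3 / beta)))
    return alpha, beta


def ggd_kl(a1, b1, a2, b2):
    """K-L divergence between GGD(a1, b1) and GGD(a2, b2), in nats."""
    return (math.log(b1 * a2 / (b2 * a1))
            + math.lgamma(1 / b2) - math.lgamma(1 / b1)
            + (a1 / a2) ** b2
            * math.exp(math.lgamma((b2 + 1) / b1) - math.lgamma(1 / b1))
            - 1 / b1)


def saliency(center, surround):
    """Prior-weighted sum of K-L divergences to the pooled GGD fit."""
    pooled = list(center) + list(surround)
    p_center = len(center) / len(pooled)
    a_all, b_all = fit_ggd_moments(pooled)
    total = 0.0
    for prior, region in ((p_center, center), (1 - p_center, surround)):
        a_c, b_c = fit_ggd_moments(region)
        total += prior * ggd_kl(a_c, b_c, a_all, b_all)
    return total


# Hypothetical filter responses for one feature channel.
rng = random.Random(0)
surround = [rng.gauss(0.0, 1.0) for _ in range(2000)]
s_same = saliency([rng.gauss(0.0, 1.0) for _ in range(500)], surround)
s_diff = saliency([rng.gauss(0.0, 4.0) for _ in range(500)], surround)
print(f"matching center: {s_same:.4f} nats, contrasting center: {s_diff:.4f} nats")
```

Summing `saliency` over all feature channels would give the marginal approximation of Equation 3. Fitting the pooled center-plus-surround responses with a single GGD is itself an approximation, since a mixture of two GGDs is not in general a GGD.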

*qualitative* and therefore anecdotal. Given the simplicity of the displays, it is not hard to conceive of other center-surround operations that could produce similar results. To address this problem, we introduce an alternative evaluation strategy, based on the comparison of *quantitative predictions*, made by the saliency detectors, and available human data. It is our belief that quantitative predictions are essential for an *objective* comparison of different saliency principles. We show that this process can be useful, by performing various objective comparisons between discriminant saliency and the popular saliency model of Itti and Koch (2000), whose results were obtained with the MATLAB implementation by Walther and Koch (2006).

*sigmoidal* shape, with lower (upper) threshold $t_l$ ($t_u$).

$t_l \approx 10^\circ$ and $t_u \approx 40^\circ$. These predictions are consistent with the human data. The same experiment was repeated for the model of Itti and Koch (2000) which, as illustrated by Figure 6C, exhibited no quantitative compliance with human performance.

$\alpha_c$, $\beta_c$, $c \in \{0, 1, 2\}$ represent, respectively, the GGD parameters of feature distributions in the surround, the center, and the total (center + surround) regions. Finally, for simplicity, assume that $\beta_0 = \beta_1 = \beta_2 = 1$, in which case Equation 5 becomes

$\sigma_2^2 = P_Y(0)\sigma_0^2 + P_Y(1)\sigma_1^2$.
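The form that the saliency measure takes under this Laplacian assumption can be sketched from the K–L closed form for GGDs (Do & Vetterli, 2002). The derivation below is a reconstruction under the stated assumption $\beta_0 = \beta_1 = \beta_2 = 1$ and the standard GGD variance identity; it is not a reproduction of the original equation.

```latex
\begin{align*}
% K-L divergence between two Laplacians (GGD with \beta = 1)
\mathrm{KL}\left[ P_X(x; \alpha_c, 1) \,\|\, P_X(x; \alpha_2, 1) \right]
  &= \log\frac{\alpha_2}{\alpha_c} + \frac{\alpha_c}{\alpha_2} - 1 \\[4pt]
% For \beta = 1 the variance is \sigma^2 = \alpha^2 \Gamma(3)/\Gamma(1) = 2\alpha^2
\sigma_c^2 = 2\alpha_c^2 \quad &\Rightarrow \quad
  \frac{\alpha_c}{\alpha_2} = \frac{\sigma_c}{\sigma_2} \\[4pt]
% Prior-weighted sum over surround (c = 0) and center (c = 1)
I(X; Y) &= \sum_{c \in \{0, 1\}} P_Y(c)
  \left[ \log\frac{\sigma_2}{\sigma_c} + \frac{\sigma_c}{\sigma_2} - 1 \right]
\end{align*}
```

Because only ratios of standard deviations appear, the result depends on the center and surround responses only through $\sigma_1/\sigma_0$ and the class priors.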

$\sigma$) of the response of each filter is sensitive to stimulus orientation: It reaches its maximum value when the latter is aligned with the filter orientation, dropping significantly as the two orientations diverge. This implies that, when orientation contrast between target and distractors is small, $\sigma_0$ and $\sigma_1$ are close ($\sigma_1/\sigma_0 \approx 1$). As contrast increases, the response in the region whose stimulus orientation is closer to the preferred orientation of the filter becomes dominant, i.e., either $\sigma_1/\sigma_0 \gg 1$ or $\sigma_0/\sigma_1 \gg 1$. It follows that the ratio $\sigma_1/\sigma_0$ is a measure of orientation contrast between center and surround stimuli. Plotting Equation 9 as a function of this ratio, as illustrated in Figure 7, shows that the discriminant saliency measure increases nonlinearly with orientation contrast and exhibits a strong saturation effect (a similar shape was also obtained for $\sigma_0/\sigma_1$ and is omitted). While other factors, such as the facts that $\beta$ is not necessarily 1 and that $\sigma$ itself saturates, also contribute to the nonlinear behavior of saliency, these are smaller effects than that of Figure 7.
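The nonlinear dependence on the ratio of standard deviations can be explored numerically. The sketch below assumes Laplacian ($\beta = 1$) feature distributions, the GGD K–L closed form of Do and Vetterli (2002), and equal class priors by default. The function name `laplacian_saliency` and the prior value are assumptions; the curve is not claimed to match Figure 7 exactly.

```python
import math


def laplacian_saliency(ratio, p_center=0.5):
    """Saliency (nats) as a function of ratio = sigma_1 / sigma_0,
    assuming all three GGD shape parameters equal 1."""
    prior = {0: 1.0 - p_center, 1: p_center}
    sigma = {0: 1.0, 1: ratio}  # only the ratio matters, so fix sigma_0 = 1
    # Variance of the pooled (center + surround) responses.
    sigma_total = math.sqrt(prior[0] * sigma[0] ** 2 + prior[1] * sigma[1] ** 2)
    return sum(
        prior[c] * (math.log(sigma_total / sigma[c]) + sigma[c] / sigma_total - 1.0)
        for c in (0, 1)
    )


for r in (1, 2, 4, 8, 16):
    print(f"sigma_1/sigma_0 = {r:2d}: S = {laplacian_saliency(r):.4f}")
```

Under these assumptions the measure is zero at a ratio of 1, rises slowly for small contrast, and then flattens into roughly logarithmic growth. With equal priors it takes the same value at a ratio and at its reciprocal, in line with the remark that a similar shape is obtained for $\sigma_0/\sigma_1$.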

$x$) and distractor length ($x$). For comparison, Figure 8C presents the corresponding scatter plot for the model of Itti and Koch (2000), which does not replicate human performance.

$\pi/4$, $\pi/2$, and $3\pi/4$) were used, in a total of 12 filters. The standard deviation of the spatial Gaussian was set to 1 and that of the temporal Gaussian to 2. This set of filter parameters was chosen for simplicity; we have not experimented thoroughly with them. We have also only considered the intensity of the input video frames, and all color information was discarded. These intensity maps were convolved with the 12 spatiotemporal filters to produce the feature maps used by the saliency algorithm. Saliency was then computed as in the static case, using Equations 3–6.

*σ*) of this kernel was set to 1° of visual angle (≈30 pixels), which is approximately the radius of the fovea. The “inter-subject” ROC area was then measured by comparing subject fixations to this saliency map and averaging across subjects and images.

$x_t \in \mathbb{R}^n$, and the appearance of a frame $y_t \in \mathbb{R}^m$ is a linear function of the current state vector and observation noise. The system equations are

$$x_{t+1} = A x_t + v_t, \qquad y_t = C x_t + w_t,$$

where $A \in \mathbb{R}^{n \times n}$ is the state transition matrix and $C \in \mathbb{R}^{m \times n}$ is the observation matrix. The state and observation noise are given by $v_t \sim N(0, Q)$ and $w_t \sim N(0, R)$, respectively. Finally, the initial condition is distributed as $x_1 \sim N(\mu, S)$.
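To make the generative model concrete, the sketch below samples frames from a small linear dynamical system of the usual state-space form $x_{t+1} = A x_t + v_t$, $y_t = C x_t + w_t$. The function name `simulate_lds`, the isotropic covariances ($Q = q^2 I$, $R = r^2 I$, $S = s^2 I$), and the toy matrices are simplifying assumptions for illustration only.

```python
import random


def simulate_lds(A, C, q_std, r_std, mu, s_std, n_frames, seed=0):
    """Sample n_frames observations y_t from a linear dynamical system.

    A is n x n, C is m x n (lists of rows). Noise covariances are assumed
    isotropic: Q = q_std^2 I, R = r_std^2 I, S = s_std^2 I.
    """
    rng = random.Random(seed)
    n, m = len(A), len(C)
    # Initial state x_1 ~ N(mu, S).
    x = [mu[i] + rng.gauss(0.0, s_std) for i in range(n)]
    frames = []
    for _ in range(n_frames):
        # Observation: y_t = C x_t + w_t.
        y = [sum(C[i][j] * x[j] for j in range(n)) + rng.gauss(0.0, r_std)
             for i in range(m)]
        frames.append(y)
        # State update: x_{t+1} = A x_t + v_t.
        x = [sum(A[i][j] * x[j] for j in range(n)) + rng.gauss(0.0, q_std)
             for i in range(n)]
    return frames


# Toy example: a slowly rotating 2-D state observed through 3 "pixels".
A = [[0.99, -0.10], [0.10, 0.99]]
C = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
frames = simulate_lds(A, C, q_std=0.05, r_std=0.01, mu=[1.0, 0.0],
                      s_std=0.1, n_frames=5)
for t, y in enumerate(frames, start=1):
    print(t, [round(v, 3) for v in y])
```

In a real dynamic texture model the observation dimension $m$ is the number of pixels in a patch, and $A$, $C$, $Q$, and $R$ are learned from video rather than set by hand.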

$P_{\mathbf{X}(l),Y(l)}(\mathbf{x}, c)$ the probabilistic representation of the center and surround linear dynamic systems. In this case, the discriminant saliency measure becomes a measure of contrast between the compliance of the center and the surround regions with the dynamic texture assumption. Since this assumption tends to be accurate for dynamic natural scenes, but not necessarily for objects, the result is a background subtraction algorithm applicable to complex dynamic scenes.