Research Article  |   June 2008
On the plausibility of the discriminant center-surround hypothesis for visual saliency
Journal of Vision June 2008, Vol.8, 13. doi:10.1167/8.7.13
      Dashan Gao, Vijay Mahadevan, Nuno Vasconcelos; On the plausibility of the discriminant center-surround hypothesis for visual saliency. Journal of Vision 2008;8(7):13. doi: 10.1167/8.7.13.

      © 2016 Association for Research in Vision and Ophthalmology.

Abstract

It has been suggested that saliency mechanisms play a role in perceptual organization. This work evaluates the plausibility of a recently proposed generic principle for visual saliency: that all saliency decisions are optimal in a decision-theoretic sense. The discriminant saliency hypothesis is combined with the classical assumption that bottom-up saliency is a center-surround process to derive a (decision-theoretic) optimal saliency architecture. Under this architecture, the saliency of each image location is equated to the discriminant power of a set of features with respect to the classification problem that opposes stimuli at center and surround. The optimal saliency detector is derived for various stimulus modalities, including intensity, color, orientation, and motion, and shown to make accurate quantitative predictions of various psychophysics of human saliency for both static and motion stimuli. These include some classical nonlinearities of orientation and motion saliency and a Weber law that governs various types of saliency asymmetries. The discriminant saliency detectors are also applied to various saliency problems of interest in computer vision, including the prediction of human eye fixations on natural scenes, motion-based saliency in the presence of ego-motion, and background subtraction in highly dynamic scenes. In all cases, the discriminant saliency detectors outperform previously proposed methods from both the saliency and the general computer vision literatures.

Introduction
An important goal of any perceptual system is to organize the various pieces of visual information that land on the retina. This organization requires both the grouping of distinct pieces into coherent units to be perceived as objects and the segregation of objects from their surroundings (“figure/ground” segregation). Both problems are simplified by a preliminary step of localized processing, known as bottom-up saliency, that highlights the regions of the visual field which most differ from their surround. These saliency mechanisms appear to rely on measures of local contrast (dissimilarity) of elementary features, like intensity, color, or orientation, into which the visual stimulus is first decomposed. It is well known that such contrast measures can reproduce perceptual phenomena such as texture segmentation (Beck, 1966b, 1972; Julesz, 1975, 1984; Olson & Attneave, 1970), target pop-out (Nothdurft, 1991a; Treisman, 1985; Treisman & Gormican, 1988), or even grouping (Beck, 1966a; Sagi & Julesz, 1985). For example, Nothdurft (1992) has shown that upon the brief inspection of a pattern such as that depicted in the leftmost display of Figure 1, subjects report the global percept of a “triangle pointing to the left.” This percept is quite robust to the amount of (random) variability of the distractor bars and to the orientation of the bars that make up the vertices of the triangle. In fact, these bars do not even have to be oriented in the same direction: The triangle percept only requires that they have sufficient orientation contrast with their neighbors. Another example of this type of perceptual grouping, as well as some examples of texture segregation, is shown in Figure 1. Below each display, we present the saliency maps produced by the saliency detector proposed in this work. Clearly, the saliency maps are informative of either the boundary regions or the elements to be grouped. 
Figure 1
 
Four displays (top row) and saliency maps produced by the algorithm proposed in this article (bottom row). These examples show that saliency analysis facilitates aspects of perceptual organization, such as grouping (left two displays) and texture segregation (right two displays).
Computational modeling of saliency
The mechanisms of visual saliency, their neurophysiological basis and psychophysics, have been extensively studied during the last decades. As a result of these studies, it is now well known that saliency mechanisms exist for a number of elementary dimensions of visual stimuli (henceforth denoted as features), including color, orientation, depth, and motion, among others. More recently, there has been an increasing interest in computational models for saliency, in both biological and computer vision. The overwhelming majority of these models are inspired by or aim to replicate known properties of either the psychophysics or physiology of pre-attentive vision (Bruce & Tsotsos, 2006; Harel, Koch, & Perona, 2007; Itti, Koch, & Niebur, 1998; Kienzle, Wichmann, Schölkopf, & Franz, 2007; Li, 2002; Rosenholtz, 1999; Wolfe, 1994). These models all compute a saliency map (Koch & Ullman, 1985), through either the combination of intermediate feature-specific saliency maps (e.g., Itti & Koch, 2001; Itti et al., 1998; Wolfe, 1994), or the direct analysis of feature interactions (e.g., Li, 2002). 
What distinguishes these models is mostly the computational measure of saliency. In what is perhaps the most popular model for bottom-up saliency, Itti et al. (1998) measure contrast as the difference between the stimulus at a location and the stimulus in its neighborhood in a center-surround fashion. This model has been shown to successfully replicate many observations from psychophysics (Itti & Koch, 2000; Parkhurst, Law, & Niebur, 2002; Peters, Iyer, Itti, & Koch, 2005) for both static and motion stimuli and has been applied to the design of computer vision algorithms for robotics and video compression (Itti, 2004; Shic & Scassellati, 2007; Walther & Koch, 2006). In the Guided Search model, Wolfe (1994) has, on the other hand, emphasized the modulation of the bottom-up activation maps by top-down, goal-dependent, knowledge. Li (2002) has argued that saliency maps are a direct product of the pre-attentive computations of primary visual cortex (V1) and implemented a saliency model inspired by the basic properties of the neural structures found in V1. This model has also been shown to reproduce many psychophysical traits of human saliency, establishing a direct link between psychophysics and the physiology of V1. While many of these early saliency models aimed to reproduce various known properties of biological vision, they lacked a formal justification for their image processing steps in terms of a unifying computational principle for saliency. Some more recent models have tried to address this problem by deriving saliency mechanisms as optimal implementations of generic computational principles, such as the maximization of self-information (Bruce & Tsotsos, 2006) or “surprise” (Itti & Baldi, 2005). It is not yet clear how closely these models comply with the classical psychophysics since existing evaluations have been limited to the prediction of human eye fixation data. 
In this work, we study the effectiveness of an alternative, and currently less popular, hypothesis that all saliency decisions are optimal in a decision-theoretic sense. This hypothesis is denoted as discriminant saliency and was first proposed by Gao and Vasconcelos (2005) in a computer vision context. While initially posed as an explanation for top-down saliency, of interest mostly for object recognition, the hypothesis of decision theoretic optimality is much more general and indeed applicable to any form of center-surround saliency. This has motivated us to test its ability to explain the psychophysics of human saliency. Since these are better documented for the bottom-up neural pathway than for its top-down counterpart, we derive a bottom-up saliency detector which is optimal in a decision-theoretic sense. In particular, we hypothesize that the most salient locations of the visual field are those that enable the discrimination between feature responses in center and surround with smallest expected probability of error. This is referred to as the discriminant center-surround hypothesis and, by definition, produces saliency measures that are optimal in a classification sense. We derive optimal mechanisms for a number of saliency problems, ranging from static spatial saliency to motion-based saliency in the presence of ego-motion or even complex dynamic backgrounds. The ability of these mechanisms to both reproduce the classical psychophysics of human saliency and solve saliency problems of interest for computer vision is then evaluated. From the psychophysics point of view, it is shown that, for both static and moving stimuli, discriminant saliency not only explains all qualitative observations (such as pop-out for single feature search, and disregard of feature conjunctions) previously replicated by existing models but also makes quantitative predictions (such as the nonlinear aspects of human saliency), which are beyond their reach. 
From the computer vision standpoint, it is shown that the saliency algorithms now proposed can predict human eye fixations with greater accuracy than previous approaches and outperform state-of-the-art algorithms for background subtraction. In particular, it is shown that, by simply modifying the probabilistic models employed in the discriminant saliency measure—from well known models of natural image statistics, to the statistics of simple motion features, to more sophisticated dynamic texture models—it is possible to produce saliency detectors for either static or dynamic stimuli, which are insensitive to background image variability due to texture, ego-motion, or scene dynamics. 
Discriminant center-surround saliency
Discriminant saliency
Discriminant saliency is rooted in a decision-theoretic interpretation of perception. Under this interpretation, perceptual systems evolve to produce decisions about the state of the surrounding environment that are optimal in a decision-theoretic sense, e.g., that have minimum probability of error. This goal is complemented by one of computational parsimony, i.e., that the perceptual mechanisms should be as efficient as possible. Discriminant saliency is defined with respect to two classes of stimuli: a class of stimuli of interest and a null hypothesis, composed of all the stimuli that are not salient. Given these two classes, the locations of the visual field that can be classified, with lowest expected probability of error, as containing stimuli of interest are denoted as salient. Mathematically, this is accomplished by (1) defining a binary classification problem that opposes stimuli of interest to the null hypothesis and (2) equating the saliency of each location in the visual field to the discriminant power (with respect to this problem) of the visual features extracted from that location. This definition of saliency is applicable to a broad set of problems. For example, different specifications of stimuli of interest and null hypothesis enable its specialization to both top-down and bottom-up saliency. From a computational standpoint, the search for discriminant features is a well-defined and tractable problem that has been widely studied in the literature. These properties have been exploited by Gao and Vasconcelos (2005) to derive an optimal top-down saliency detector, which equates stimuli of interest to an object class and null hypothesis to all other object classes. In this work, we consider the problem of bottom-up saliency. 
Discriminant center-surround saliency
Inspired by the ubiquity of “center-surround” processing in the early stages of biological vision (Cavanaugh, Bair, & Movshon, 2002; Hubel & Wiesel, 1965; Knierim & van Essen, 1992), it is commonly assumed that bottom-up saliency is determined by how distinct the stimuli (features) at each location of the visual field are from the stimuli (features) in its surround. This “center-surround” hypothesis can be naturally formulated as a classification problem, as required by discriminant saliency and illustrated in Figure 2. This consists of defining, at each image location l, 
Figure 2
 
Illustration of discriminant center-surround saliency.
  • stimuli of interest: observations within a neighborhood $W_l^1$ of $l$ (henceforth referred to as the center); and
  • null hypothesis: observations within a surrounding window $W_l^0$ (henceforth referred to as the surround).
All observations are responses, to the visual stimulus, of a predefined set of features X. The saliency of location l is equated to the power of X to discriminate between the center and surround of l based on the distributions of the feature responses estimated from the two regions. 
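The two windows can be sketched as sets of pixel offsets around a location. The text does not fix the shape of the windows, so the circular center and the annular surround below are assumptions made for illustration, and `center_surround_offsets` is a hypothetical helper name.

```python
def center_surround_offsets(center_radius, surround_radius):
    """Split integer pixel offsets around a location into two windows.

    Assumption: the center is a disc of radius center_radius and the
    surround is the ring between center_radius and surround_radius.
    Offsets are returned as (dx, dy) pairs relative to the location.
    """
    center, surround = [], []
    r = surround_radius
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            d2 = dx * dx + dy * dy
            if d2 <= center_radius ** 2:
                center.append((dx, dy))      # class Y = 1
            elif d2 <= surround_radius ** 2:
                surround.append((dx, dy))    # class Y = 0
    return center, surround
```

Feature responses read at the center offsets would play the role of observations with label 1, and those read at the surround offsets the role of observations with label 0.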
Mathematically, the feature responses within the two windows, $W_l^0$ and $W_l^1$, are observations from a random process $X(l) = (X_1(l), \ldots, X_d(l))$, of dimension $d$, drawn conditionally on the state of a hidden variable $Y(l)$. The feature vector observed at location $j$ is denoted as $x(j) = (x_1(j), \ldots, x_d(j))$. Feature vectors $x(j)$ such that $j \in W_l^c$, $c \in \{0, 1\}$, are drawn from class $c$ according to conditional densities $P_{X(l)|Y(l)}(x|c)$. Vectors drawn with $Y(l) = c$ are referred to as belonging to the center class if $c = 1$ and the surround class if $c = 0$. The saliency of location $l$, $S(l)$, is equal to the discriminant power of $X$ for the classification of the observed feature vectors $x(j)$, $\forall j \in W_l = W_l^0 \cup W_l^1$, into center and surround. This is quantified by the mutual information between features, $X$, and class label, $Y$,
$$S(l) = I_l(X;Y) = \sum_c \int p_{X(l),Y(l)}(x,c) \log \frac{p_{X(l),Y(l)}(x,c)}{p_{X(l)}(x)\, p_{Y(l)}(c)}\, dx.$$
(1)
The $l$ subscript emphasizes the fact that both the classification problem and the mutual information are defined locally, within $W_l$. The function $S(l)$ is referred to as the saliency map. Note that Equation 1 defines the discriminant saliency measure in a very generic sense, independently of the stimulus dimension under consideration or any specific feature sets. In fact, Equation 1 can be applied to any type of stimuli and any type of local features, as long as the probability densities $P_{X(l)|Y(l)}(x|c)$ can be estimated from the center and surround neighborhoods. In what follows, we derive the discriminant center-surround saliency for a variety of features, including intensity, color, orientation, motion, and even more complicated dynamic texture models. 
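To make Equation 1 concrete, the following is a minimal sketch for a single scalar feature, with the densities replaced by histograms over the pooled center and surround samples. The histogram estimator, the bin count, and the name `mutual_information` are assumptions chosen for illustration; the article itself uses the parametric models derived in later sections.

```python
import math
from collections import Counter

def mutual_information(center, surround, n_bins=8):
    """Plug-in histogram estimate of I(X;Y) in nats for a scalar feature.

    center   -- feature responses from the center window (Y = 1)
    surround -- feature responses from the surround window (Y = 0)
    """
    samples = [(x, 1) for x in center] + [(x, 0) for x in surround]
    lo = min(x for x, _ in samples)
    hi = max(x for x, _ in samples)
    width = (hi - lo) / n_bins or 1.0  # guard against constant input

    def bin_of(x):
        # The largest value falls into the last bin.
        return min(int((x - lo) / width), n_bins - 1)

    n = len(samples)
    joint = Counter((bin_of(x), c) for x, c in samples)
    count_x, count_y = Counter(), Counter()
    for (b, c), k in joint.items():
        count_x[b] += k
        count_y[c] += k

    # Sum of p(x,c) * log[p(x,c) / (p(x) p(c))] over occupied cells.
    return sum((k / n) * math.log(k * n / (count_x[b] * count_y[c]))
               for (b, c), k in joint.items())
```

When center and surround have the same histogram the estimate is zero, and when they fall in disjoint bins it equals the entropy of the class label, which is the largest value the measure can take.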
Discriminant saliency detection in static imagery
We start by deriving the optimal saliency detector for static stimuli, whose building blocks are illustrated in Figure 3.
Figure 3
 
Bottom-up discriminant saliency detector.
Extraction of visual features
The choice of a specific set of features is not crucial for the proposed saliency detector. We have obtained similar results with various types of wavelet or Gabor decompositions. In this work, we rely on a feature decomposition proposed by Itti and Koch (2000) and loosely inspired by the earliest stages of biological visual processing. This establishes a common ground for comparison with the previous saliency literature. The image to process is first subject to a feature decomposition into an intensity map (I) and four broadly tuned color channels (R, G, B, and Y), 
$$I = (r+g+b)/3, \quad R = \lfloor \tilde{r} - (\tilde{g}+\tilde{b})/2 \rfloor_+, \quad G = \lfloor \tilde{g} - (\tilde{r}+\tilde{b})/2 \rfloor_+, \quad B = \lfloor \tilde{b} - (\tilde{r}+\tilde{g})/2 \rfloor_+, \quad Y = \lfloor (\tilde{r}+\tilde{g})/2 - |\tilde{r}-\tilde{g}|/2 \rfloor_+,$$
(2)
where $\tilde{r} = r/I$, $\tilde{g} = g/I$, $\tilde{b} = b/I$, and $\lfloor x \rfloor_+ = \max(x, 0)$. The four color channels are, in turn, combined into two color opponency channels, RG for red/green and BY for blue/yellow opponency. The two opponency channels, together with the intensity map, are convolved with three Mexican hat wavelet filters, centered at spatial frequencies 0.02, 0.04, and 0.08 cycles/pixel, to generate nine feature channels. The feature space consists of these nine channels, plus a Gabor decomposition of the intensity map, implemented with a dictionary of zero-mean Gabor filters at 3 spatial scales (centered at frequencies of 0.08, 0.16, and 0.32 cycles/pixel) and 4 directions (evenly spread from 0 to π). Note that, following the tradition of the image processing and computational modeling literatures, we measure all filter frequencies in units of “cycles/pixel (cpp).” For a given set of viewing conditions, these can be converted to the “cycles/degree of visual angle (cpd)” more commonly used in psychophysics. For example, in all psychophysics experiments discussed later, the viewing conditions dictate a conversion rate of 30 pixels/degree of visual angle. In this case, the frequencies of these Gabor filters (0.08, 0.16, and 0.32 cpp) correspond to 2.4, 4.8, and 9.6 cpd, i.e., approximately 2.5, 5, and 10 cpd. 
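The per-pixel part of this decomposition can be sketched as follows. The function names are hypothetical, and the handling of black pixels (where the normalization by I is undefined) is an assumption, since the text does not say how that case is treated. The opponency channels and the wavelet and Gabor filtering are not included.

```python
def color_channels(r, g, b):
    """Return (I, R, G, B, Y) of Equation 2 for a single RGB pixel."""
    intensity = (r + g + b) / 3.0
    if intensity == 0:
        # Assumption: a black pixel produces no chromatic response.
        return 0.0, 0.0, 0.0, 0.0, 0.0
    rn, gn, bn = r / intensity, g / intensity, b / intensity

    def rect(v):
        # Half-wave rectification, max(v, 0).
        return max(v, 0.0)

    red = rect(rn - (gn + bn) / 2)
    green = rect(gn - (rn + bn) / 2)
    blue = rect(bn - (rn + gn) / 2)
    yellow = rect((rn + gn) / 2 - abs(rn - gn) / 2)
    return intensity, red, green, blue, yellow

def cpp_to_cpd(cycles_per_pixel, pixels_per_degree=30):
    """Convert cycles/pixel to cycles/degree for given viewing conditions."""
    return cycles_per_pixel * pixels_per_degree
```

For instance, `cpp_to_cpd(0.08)` evaluates to about 2.4 cpd under the 30 pixels/degree viewing conditions quoted in the text.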
Leveraging natural image statistics
The second stage of the detection involves estimating the mutual information of Equation 1, at each image location, for the center-surround classification problem. This is, in general, impractical since it requires density estimates on a potentially high-dimensional feature space. A known statistical property of band-pass natural image features, such as Gabor or wavelet coefficients, can nevertheless be exploited to drastically reduce complexity. This property is that band-pass features exhibit strongly consistent patterns of dependence across a very wide range of natural image classes (Buccigrossi & Simoncelli, 1999; Huang & Mumford, 1999). For example, Buccigrossi and Simoncelli (1999) have shown that, when a natural image is subject to a wavelet decomposition, the conditional distribution of any wavelet coefficient, given the state of the co-located coefficient of immediately coarser scale (known as its “parent”), invariably has a bow-tie shape. This implies that, while the coefficients are statistically dependent, their dependencies carry little information about the image class (Buccigrossi & Simoncelli, 1999; Vasconcelos & Vasconcelos, 2004). In the particular case of saliency, feature dependencies are not greatly informative about whether the observed feature vectors originate in the center or the surround. Experimental validation of this hypothesis (Vasconcelos, 2003; Vasconcelos & Vasconcelos, 2004, in press) has shown that, for natural images, Equation 1 is well approximated by the sum of marginal mutual informations between individual features and class label 
$$S(l) = \sum_{i=1}^{d} I_l(X_i; Y).$$
(3)
This is a sensible compromise between decision theoretic optimality and computational parsimony. Note that this approximation does not assume that the features are independently distributed, but simply that their dependencies are not informative about the class. 
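One way to see why only class-informative dependencies matter is the chain rule for mutual information. The identity below is a standard information-theoretic result, written as a sketch in the notation of Equations 1 and 3; the shorthand $X_{1:i-1} = (X_1, \ldots, X_{i-1})$ is introduced here and is not part of the original text.

```latex
\begin{align}
I_l(X;Y) &= \sum_{i=1}^{d} I_l(X_i; Y \mid X_{1:i-1}) \\
         &= \sum_{i=1}^{d} I_l(X_i; Y)
          + \sum_{i=2}^{d} \left[ I_l(X_i; X_{1:i-1} \mid Y) - I_l(X_i; X_{1:i-1}) \right].
\end{align}
```

The bracketed terms vanish whenever conditioning on the class label leaves the dependence between features unchanged, in which case the sum of marginal mutual informations in Equation 3 equals Equation 1 exactly.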
Since Equation 3 only requires estimates of marginal densities, it has significantly less complexity than Equation 1. This complexity can be further reduced by exploiting the well known fact that marginal densities of band-pass features are accurately modeled by a generalized Gaussian distribution (GGD) (Clarke, 1985; Mallat, 1989; Modestino, 1977), 
$$P_X(x;\alpha,\beta) = \frac{\beta}{2\alpha\Gamma(1/\beta)} \exp\left\{-\left(\frac{|x|}{\alpha}\right)^{\beta}\right\},$$
(4)
where $\Gamma(z) = \int_0^\infty e^{-t} t^{z-1}\, dt$, $z > 0$, is the Gamma function, $\alpha$ is a scale parameter, and $\beta$ is a shape parameter. The parameter $\beta$ controls the decay rate from the peak value and defines a subfamily of the GGD (e.g., the Laplacian family when $\beta = 1$ or the Gaussian family when $\beta = 2$). When the class conditional densities, $P_{X|Y}(x|c)$, and the marginal density, $P_X(x)$, follow a GGD, the mutual information of Equation 3 has a closed form (Do & Vetterli, 2002) 
$$I(X;Y) = \sum_c P_Y(c)\, \mathrm{KL}\left[P_{X|Y}(x|c) \,\|\, P_X(x)\right],$$
(5)
with 
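$$\mathrm{KL}\left[P_X(x;\alpha_1,\beta_1) \,\|\, P_X(x;\alpha_2,\beta_2)\right] = \log\left(\frac{\beta_1 \alpha_2 \Gamma(1/\beta_2)}{\beta_2 \alpha_1 \Gamma(1/\beta_1)}\right) + \left(\frac{\alpha_1}{\alpha_2}\right)^{\beta_2} \frac{\Gamma((\beta_2+1)/\beta_1)}{\Gamma(1/\beta_1)} - \frac{1}{\beta_1},$$
(6)
where $\mathrm{KL}[p \,\|\, q] = \int p(x) \log \frac{p(x)}{q(x)}\, dx$ is the Kullback–Leibler (K–L) divergence between $p(x)$ and $q(x)$. Hence, the discriminant saliency measure only requires the estimation of the $\alpha$ and the $\beta$ parameters for the center and the surround windows and the computation of Equations 3, 5, and 6.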
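Equations 5 and 6 translate almost directly into code. The sketch below evaluates them for one feature, using log-Gamma values for numerical stability. The function names and the way the parameters are passed as `(alpha, beta)` pairs are assumptions for illustration, and the class prior is left to the caller because the text does not specify it.

```python
import math

def ggd_kl(alpha1, beta1, alpha2, beta2):
    """KL[GGD(alpha1, beta1) || GGD(alpha2, beta2)] in nats (Equation 6)."""
    log_term = (math.log(beta1 * alpha2) + math.lgamma(1.0 / beta2)
                - math.log(beta2 * alpha1) - math.lgamma(1.0 / beta1))
    gamma_ratio = math.exp(math.lgamma((beta2 + 1.0) / beta1)
                           - math.lgamma(1.0 / beta1))
    return log_term + (alpha1 / alpha2) ** beta2 * gamma_ratio - 1.0 / beta1

def ggd_saliency(prior_center, center, surround, total):
    """Equation 5 for one feature.

    center, surround, total -- (alpha, beta) pairs fitted to the center
    window, the surround window, and their union, respectively.
    """
    return (prior_center * ggd_kl(*center, *total)
            + (1.0 - prior_center) * ggd_kl(*surround, *total))
```

As a sanity check, for $\beta_1 = \beta_2 = 2$ the expression reduces to the familiar K–L divergence between two zero-mean Gaussians, and it is zero when both arguments share the same parameters.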
Gao and Vasconcelos (in press) have shown that, for maximum a posteriori estimation of the parameters (αc and βc, c ∈ {0, 1}) with conjugate (Gamma) priors, there is a one-to-one mapping between the discriminant saliency detector and a neural network that replicates the standard architecture of V1: a cascade of linear filtering, divisive normalization, quadratic nonlinearity, and spatial pooling. In the implementation presented in this article, we have instead adopted the method of moments for all parameter estimation because it is computationally more efficient on a nonparallel computer. Under the method of moments, α and β are estimated through the relationships  
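$$\sigma^2 = \frac{\alpha^2\, \Gamma(3/\beta)}{\Gamma(1/\beta)} \quad \text{and} \quad \kappa = \frac{\Gamma(1/\beta)\, \Gamma(5/\beta)}{\Gamma^2(3/\beta)},$$
(7)
where $\sigma^2$ and $\kappa$ are, respectively, the variance and the kurtosis of $X$,
$$\sigma^2 = E_X\left[(X - E_X[X])^2\right] \quad \text{and} \quad \kappa = \frac{E_X\left[(X - E_X[X])^4\right]}{\sigma^4}.$$
(8)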
In summary, parameter estimation only requires sample moments of the feature responses within the center and the surround windows and is very efficient. The method of moments has also been shown to produce good fits to natural images (Huang & Mumford, 1999). 
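A minimal sketch of this estimator follows. The kurtosis relation of Equation 7 has no closed-form inverse, so the sketch solves it for β by bisection; the bisection itself, the search interval, the clamping of out-of-range kurtosis values, and the function names are all assumptions, since the article does not describe how the equation is solved.

```python
import math

def ggd_kurtosis(beta):
    """Kurtosis of a GGD with shape beta (right-hand side of Equation 7)."""
    return math.exp(math.lgamma(1.0 / beta) + math.lgamma(5.0 / beta)
                    - 2.0 * math.lgamma(3.0 / beta))

def fit_ggd(samples, beta_lo=0.1, beta_hi=10.0, iters=80):
    """Method-of-moments estimate of (alpha, beta) from feature responses."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    if var == 0:
        raise ValueError("constant samples: GGD parameters are undefined")
    kurt = sum((x - mean) ** 4 for x in samples) / n / var ** 2
    # Assumption: clamp the sample kurtosis to the range the search covers.
    kurt = min(max(kurt, ggd_kurtosis(beta_hi)), ggd_kurtosis(beta_lo))
    lo, hi = beta_lo, beta_hi
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if ggd_kurtosis(mid) > kurt:
            lo = mid  # model kurtosis too high, so beta must be larger
        else:
            hi = mid
    beta = 0.5 * (lo + hi)
    alpha = math.sqrt(var * math.exp(math.lgamma(1.0 / beta)
                                     - math.lgamma(3.0 / beta)))
    return alpha, beta
```

The bisection relies on the kurtosis of the GGD decreasing as β grows: it equals 6 for the Laplacian case (β = 1) and 3 for the Gaussian case (β = 2).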
Consistency with psychophysics
To evaluate the compliance of discriminant saliency with psychophysics, we simulated a number of classical experiments in visual attention (Treisman & Gelade, 1980; Treisman & Gormican, 1988; Nothdurft, 1993). All simulations assumed viewing conditions such that 30 pixels correspond to 1° of visual angle. Following the standard practice in psychophysics, displays consisted of arrays of simple items, which subtended approximately 1°. The proposed saliency detector has two free parameters: the sizes of the center and the surround windows. In all experiments, the radius of the center was set to 1° and that of the surround to six times this value. Preliminary experimentation with these parameters has shown that the saliency results are not significantly affected by variations around the parameter values adopted. To improve intelligibility, the saliency maps shown in this article were subject to smoothing, contrast enhancement (by squaring), and normalization of the saliency value to the interval [0, 1]. This implies that absolute saliency values are not comparable across displays but only within each saliency map. 
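The window sizes and the display post-processing described above can be sketched as follows. The constant and function names are hypothetical. The smoothing step is left out because the text does not specify the smoothing kernel, so only the squaring and the normalization to [0, 1] are shown.

```python
PIXELS_PER_DEGREE = 30
CENTER_RADIUS_PX = 1 * PIXELS_PER_DEGREE    # 1 degree of visual angle
SURROUND_RADIUS_PX = 6 * CENTER_RADIUS_PX   # six times the center radius

def enhance_for_display(saliency_map):
    """Square a saliency map (list of rows) and rescale it to [0, 1]."""
    squared = [[v * v for v in row] for row in saliency_map]
    lo = min(min(row) for row in squared)
    hi = max(max(row) for row in squared)
    span = (hi - lo) or 1.0  # a flat map is mapped to all zeros
    return [[(v - lo) / span for v in row] for row in squared]
```

Because each map is rescaled by its own minimum and maximum, values from two different maps are not on a common scale, which is the reason absolute saliency values cannot be compared across displays.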
We start with a series of displays commonly adopted in the literature to investigate whether saliency detectors reproduce the fundamental properties of human saliency (Itti et al., 1998; Rosenholtz, 1999). This is the case of discriminant saliency, which replicates the percept of pop-out for single feature search (e.g., Figures 4A and 4B), disregard of feature conjunctions (e.g., Figure 4C), and saliency asymmetries for feature presence vs. absence (e.g., Figure 5), in addition to various grouping and segmentation percepts (e.g., Figure 1). Although interesting, this type of evaluation is purely qualitative and therefore anecdotal. Given the simplicity of the displays, it is not hard to conceive of other center-surround operations that could produce similar results. To address this problem, we introduce an alternative evaluation strategy, based on the comparison of quantitative predictions, made by the saliency detectors, and available human data. It is our belief that quantitative predictions are essential for an objective comparison of different saliency principles. We show that this process can be useful, by performing various objective comparisons between discriminant saliency and the popular saliency model of Itti and Koch (2000), whose results were obtained with the MATLAB implementation by Walther and Koch (2006). 
Figure 4
 
Discriminant saliency output (bottom row) for displays (top row) where target and distractors differ in terms of single features (A, orientation; B, color) or (C) feature conjunctions (color and orientation). Brightest regions are most salient. The strong saliency peaks at the targets of panels A and B indicate a strong pop-out effect. The lack of distinguishable saliency variations between the target (fourth line and fourth column) and distractors of panel C indicates that the target does not pop-out.
Figure 5
 
Example of pop-out asymmetry (discriminant saliency maps shown below each display). (Left) A target (“Q”) defined by the presence of a feature that the distractors (“O”) lack produces a strong pop-out effect. (Right) The reverse does not lead to noticeable pop-out.
In the first experiment, we examine the ability of the saliency detectors to predict a well known nonlinearity of human saliency. While it has long been known that local feature contrast affects percepts such as target pop-out and texture segregation, most early studies in the psychophysics of saliency pursued the threshold at which these events occur. Examples include the threshold at which a (previously nonsalient) target pops out (Foster & Ward, 1991; Nothdurft, 1991b), two formerly indistinguishable textures segregate (Julesz, 1981; Landy & Bergen, 1991), or a “serial” visual search becomes “parallel,” or vice versa (Moraglia, 1989; Treisman & Gelade, 1980; Wolfe, Friedman-Hill, Stewart, & O'Connell, 1992). In the context of objective evaluation, these studies are less interesting than a later set, which also measured the saliency of pop-out targets above the detection threshold (Motoyoshi & Nishida, 2001; Nothdurft, 1993; Regan, 1995). In particular, Nothdurft (1993) characterized the saliency of pop-out targets due to orientation contrast by comparing the conspicuousness of orientation-defined targets and luminance-defined ones, using luminance as a reference for relative target salience. He showed that the saliency of a target increases with orientation contrast, but in a nonlinear manner, exhibiting both threshold and saturation effects: (1) there exists a threshold below which the effect of pop-out vanishes and (2) above this threshold saliency increases with contrast, saturating after some point. The overall relationship has a sigmoidal shape, with lower (upper) threshold tl (tu). 
The results of this experiment are illustrated in Figure 6. The figure presents plots of saliency strength as a function of the target orientation contrast. The human data collected by Nothdurft (1993) is presented in Figure 6A, while the predictions of the discriminant saliency detector are shown in Figure 6B. Note that the latter closely predicts the strong threshold and saturation effects of the former, suggesting that tl ≈ 10° and tu ≈ 40°. These predictions are consistent with the human data. The same experiment was repeated for the model of Itti and Koch (2000) which, as illustrated by Figure 6C, exhibited no quantitative compliance with human performance. 
Figure 6
 
The nonlinearity of human saliency responses to orientation contrast (reproduced from Figure 9 of Nothdurft, 1993) (a) is replicated by discriminant saliency (b) but not by the model of Itti and Koch (2000) (c).
The nonlinearity of discriminant saliency is mostly due to the combination of (1) the mutual information underlying the saliency measure and (2) the generalized Gaussian statistics of natural image feature responses. To obtain some intuition about this, consider the saliency computations, Equations 5 and 6, with parameters estimated by the method of moments, Equation 7. Let $\alpha_c$, $\beta_c$, $c \in \{0, 1, 2\}$ represent, respectively, the GGD parameters of feature distributions in the surround, the center, and the total (center + surround) regions, consistent with the class labels defined earlier ($c = 0$ for the surround and $c = 1$ for the center). Finally, for simplicity, assume that $\beta_0 = \beta_1 = \beta_2 = 1$, in which case Equation 5 becomes
$$I(X;Y) = \frac{P_Y(0)\,\sigma_0 + P_Y(1)\,\sigma_1}{\sigma_2} + \log \sigma_2 - P_Y(0) \log \sigma_0 - P_Y(1) \log \sigma_1 - 1,$$
(9)
with $\sigma_2^2 = P_Y(0)\,\sigma_0^2 + P_Y(1)\,\sigma_1^2$.
For Gabor decompositions, the standard deviation ($\sigma$) of the response of each filter is sensitive to stimulus orientation: It reaches its maximum value when the latter is aligned with the filter orientation, dropping significantly as the two orientations diverge. This implies that, when orientation contrast between target and distractors is small, $\sigma_0$ and $\sigma_1$ are close ($\sigma_1/\sigma_0 \approx 1$). As contrast increases, the response in the region whose stimulus orientation is closer to the preferred orientation of the filter becomes dominant, i.e., either $\sigma_1/\sigma_0 \gg 1$ or $\sigma_0/\sigma_1 \gg 1$. It follows that the ratio $\sigma_1/\sigma_0$ is a measure of orientation contrast between center and surround stimuli. Plotting Equation 9 as a function of this ratio, as illustrated in Figure 7, shows that the discriminant saliency measure increases nonlinearly with orientation contrast and exhibits a strong saturation effect (a similar shape was also obtained for $\sigma_0/\sigma_1$ and is omitted). While other factors, such as the facts that $\beta$ is not necessarily 1 and that $\sigma$ itself saturates, also contribute to the nonlinear behavior of saliency, these are smaller effects than that of Figure 7.
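Equation 9 is easy to evaluate numerically. The sketch below fixes the surround standard deviation to 1, which loses no generality because the expression depends on the two standard deviations only through their ratio. The function name and the default of equal class priors are assumptions, since the text does not state the priors used for Figure 7.

```python
import math

def laplacian_saliency(sigma_ratio, prior_center=0.5):
    """Equation 9 as a function of sigma_1 / sigma_0 (all betas equal to 1)."""
    p1, p0 = prior_center, 1.0 - prior_center
    s0, s1 = 1.0, sigma_ratio
    s2 = math.sqrt(p0 * s0 ** 2 + p1 * s1 ** 2)
    return ((p0 * s0 + p1 * s1) / s2 + math.log(s2)
            - p0 * math.log(s0) - p1 * math.log(s1) - 1.0)
```

The value is zero when the ratio is 1, since center and surround are then indistinguishable, and it grows as the ratio moves away from 1 in either direction. Evaluating the function on a grid of ratios gives a curve of the kind the text describes for Figure 7, up to the choice of priors.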
Figure 7
 
Mutual information between feature responses and class label, as a function of the ratio between the standard deviations of the former in the center and the surround windows, σ1/σ0.
A second experiment addressed the ability of the saliency detectors to make quantitative predictions regarding classical saliency asymmetries: While the presence in the target of some feature absent from the distractors produces pop-out, the reverse (pop-out due to the absence, in the target, of a distractor feature) does not hold (Treisman & Gormican, 1988). The qualitative results of Figure 5 show that discriminant saliency has the ability to reproduce these asymmetries. We investigated if it could also make objective predictions for the strength of this asymmetry. For this, we relied on data collected in visual search experiments (Treisman & Gormican, 1988), which showed that asymmetries occur not only for the existence and the absence of a feature but also for quantitatively weaker and stronger responses along one feature dimension. In fact, through a series of experiments involving displays in which the target differs from distractors only in terms of length, Treisman and Gormican (1988) showed that the asymmetries follow Weber's law. Figure 8A presents one example of the displays used in this experiment, where the target (a vertical bar at the center of the display) has a different length from the distractors (a set of vertical bars). The discriminant saliency detector was applied to these displays, and the results are presented in Figure 8B. The figure shows the saliency predictions obtained at the target location, across the set of displays, as a scatter plot. The dashed line shows the best fit to Weber's law: Target saliency is approximately linear in the ratio between the difference of target/distractor length (Δx) and distractor length (x). For comparison, Figure 8C presents the corresponding scatter plot for the model of Itti and Koch (2000), which does not replicate human performance. 
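The Weber's law fit mentioned above amounts to an ordinary least-squares line through the points (Δx/x, saliency). A minimal sketch follows; the function name is hypothetical, the use of the absolute length difference for Δx is an assumption, and no data from the article is reproduced here.

```python
def weber_fit(distractor_lengths, target_lengths, saliencies):
    """Least-squares line: saliency ~ slope * (dx / x) + intercept.

    dx is taken as |target length - distractor length| (an assumption)
    and x as the distractor length of the same display.
    """
    ratios = [abs(t - x) / x
              for x, t in zip(distractor_lengths, target_lengths)]
    n = len(ratios)
    mean_r = sum(ratios) / n
    mean_s = sum(saliencies) / n
    sxx = sum((r - mean_r) ** 2 for r in ratios)
    sxy = sum((r - mean_r) * (s - mean_s)
              for r, s in zip(ratios, saliencies))
    slope = sxy / sxx
    return slope, mean_s - slope * mean_r
```

Displays with different absolute lengths but the same ratio Δx/x land on the same abscissa, which is what makes a good linear fit in this variable a statement about Weber's law.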
Figure 8
 
An example display (a) and performance of saliency detectors (discriminant saliency (b) and the model of Itti and Koch, 2000 (c)) on Treisman's Weber's law experiment.
Motion saliency
An important property of human saliency is its ubiquity: Saliency mechanisms have been observed for various cues, including orientation, color, texture, and motion (Nothdurft, 1991a; Treisman & Gelade, 1980). It has also been suggested that orientation and motion saliency could be encoded by similar mechanisms (Ivry & Cohen, 1992; Nothdurft, 1993). Since Equation 1 can be applied to any type of stimuli and features, discriminant saliency can, in principle, replicate this ubiquity. In this section, we verify this hypothesis by deriving the discriminant saliency detector for motion stimuli and by providing evidence of its ability to predict human psychophysics.
Motion-based discriminant saliency detector
To compute motion information from video sequences, we adopt the spatiotemporal filtering approach of Adelson and Bergen (1985) and Heeger (1988). Spatiotemporal filtering is a biologically plausible mechanism for motion estimation and has been shown to comply with the physiology and the psychophysics of the early stages of the visual cortex (Adelson & Bergen, 1985). Since spatiotemporal orientation is equivalent to velocity, a set of 3-D Gabor (spatiotemporal) filters, each tuned to a specific orientation in space and time, is used to extract the motion energy associated with different velocities. The algorithmic implementation of the spatiotemporal filters used in this work was based on the separable spatiotemporal filters of Heeger (1988). We considered only one spatial scale, and the spatial frequency of each Gabor filter was fixed to 0.25 cycles/pixel. Three temporal scales (temporal frequencies of 0 and ±0.25 cycles/frame) and 4 spatial orientations (0, π/4, π/2, and 3π/4) were used, for a total of 12 filters. The standard deviation of the spatial Gaussian was set to 1 and that of the temporal Gaussian to 2. This set of filter parameters was chosen for simplicity; we have not experimented thoroughly with them. We also considered only the intensity of the input video frames; all color information was discarded. These intensity maps were convolved with the 12 spatiotemporal filters to produce the feature maps used by the saliency algorithm. Saliency was then computed as in the static case, using Equations 3–6.
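The filter bank described above can be sketched as follows. This is an illustrative construction under stated assumptions, not the implementation used in this work: it builds each 3-D Gabor directly rather than through Heeger's separable filters, the kernel half-widths (`half_s`, `half_t`) are arbitrary choices, motion energy is taken as the squared magnitude of the complex response, and the convolution is circular (FFT based). All function names are hypothetical.

```python
import numpy as np

def gabor_3d(theta, f_t, f_s=0.25, sigma_s=1.0, sigma_t=2.0, half_s=3, half_t=6):
    """Complex 3-D Gabor tuned to spatial orientation theta and temporal frequency f_t.

    Array axes are (t, y, x). half_s and half_t are assumed kernel half-widths.
    """
    t, y, x = np.meshgrid(
        np.arange(-half_t, half_t + 1),
        np.arange(-half_s, half_s + 1),
        np.arange(-half_s, half_s + 1),
        indexing="ij",
    )
    envelope = np.exp(-(x**2 + y**2) / (2 * sigma_s**2) - t**2 / (2 * sigma_t**2))
    phase = 2 * np.pi * (f_s * (x * np.cos(theta) + y * np.sin(theta)) + f_t * t)
    kernel = envelope * np.exp(1j * phase)
    return kernel / np.abs(kernel).sum()

def filter_bank():
    """3 temporal frequencies x 4 spatial orientations = 12 filters."""
    thetas = [0, np.pi / 4, np.pi / 2, 3 * np.pi / 4]
    temporal_freqs = [0.0, 0.25, -0.25]
    return [gabor_3d(th, ft) for ft in temporal_freqs for th in thetas]

def motion_energy(video, kernel):
    """Squared magnitude of the filter response (circular convolution via FFT).

    video has axes (t, y, x) and must be at least as large as the kernel.
    The output is spatially and temporally shifted by the kernel half-widths.
    """
    axes = (0, 1, 2)
    spectrum = np.fft.fftn(video, axes=axes) * np.fft.fftn(kernel, s=video.shape, axes=axes)
    return np.abs(np.fft.ifftn(spectrum, axes=axes)) ** 2
```

Applying `motion_energy` with each of the 12 kernels to an intensity video yields 12 feature maps of the kind that the saliency computation takes as input.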
Consistency with psychophysics of motion perception
To evaluate the compliance of the discriminant saliency detector with the psychophysics of human motion saliency (Ivry & Cohen, 1992; Nothdurft, 1993), we start with some qualitative observations (all motion stimuli sequences in the experiments were generated using the Psychtoolbox; Brainard, 1997). Ivry and Cohen (1992) showed that search asymmetries also hold for moving stimuli. For example, searching for a fast-moving target among slowly moving distractors is easier than the reverse. We applied the motion-based discriminant saliency detector to a set of sequences used to demonstrate the asymmetries of motion pop-out (Ivry & Cohen, 1992), with the results illustrated in Figure 9. The figure presents quiver plots of the motion stimuli, under the two conditions, and one frame of the resulting discriminant saliency map. The conspicuous saliency peak at the target in Figure 9A shows a strong pop-out effect when the target speed is greater than that of the distractors. No noticeable pop-out effect is observed in Figure 9B, where the distractor speed is greater than that of the target. This shows that the discriminant saliency detector can replicate the asymmetries of motion saliency. 
Figure 9
 
Discriminant saliency detector output for (a) a fast-moving target among slowly moving distractors and (b) a slowly moving target among fast-moving distractors. Top row shows quiver plots of the stimuli (the direction of motion is specified by the arrow, whose length indicates the speed), and bottom row plots the corresponding saliency maps.
As was the case for static stimuli, we complemented this qualitative observation with a quantitative analysis of the saliency predictions made by the discriminant detector. Nothdurft (1993) found that human saliency responses to motion are very similar to those observed for orientation: The perception of saliency of moving targets increases nonlinearly with motion contrast and shows significant saturation and threshold effects. To test the compliance of discriminant saliency with this nonlinearity, we applied it to the motion displays of Nothdurft (1993). An example is shown in Figure 10A, where Figure 10B shows a plot of the human saliency data, reproduced from the original figure of Nothdurft (1993), and Figure 10C presents the predictions made by discriminant saliency. The two plots are very similar, both exhibiting threshold and saturation effects. 
Figure 10
 
The nonlinearity of human saliency responses to motion contrast (reproduced from Figure 9 of Nothdurft, 1993) (b) is replicated by discriminant saliency (c). A quiver plot of one instance of the motion display used in the experiment (with background contrast (bg) = 0; target contrast (tg) = 60) is illustrated in panel a. The direction of motion is specified by the arrow, whose length indicates the speed.
Applications in computer vision
The ability of discriminant saliency to make accurate predictions of the psychophysics of human saliency, for both static and motion stimuli, encouraged us to examine its performance as a solution for computer vision problems. We considered the problems of predicting human eye fixations, detecting salient moving objects in the presence of ego-motion, and background subtraction from highly dynamic scenes. In all cases, the output of the discriminant saliency detector was compared to either human performance or state-of-the-art solutions from the computer vision literature. 
Prediction of eye fixations on natural scenes
We started by testing the ability of the static discriminant saliency detector to predict the location of human eye fixations. For this, we compared the discriminant saliency maps obtained from a collection of natural images to the eye fixation locations recorded from human subjects in a free-viewing task. The eye-fixation data were collected by Bruce and Tsotsos (2006), from 20 subjects and 120 different natural color images depicting urban scenes (both indoor and outdoor). The images were presented in 1024 × 768 pixel format on a 21-in. CRT color monitor. The monitor was positioned at a viewing distance of 75 cm; consequently, each image subtended 32° horizontally and 24° vertically, i.e., approximately 30 pixels per degree of visual angle. Each image was presented to each subject for 4 seconds, in random order, with a mask inserted between consecutive presentations. Subjects were given no instructions, and there were no predefined initial fixations. A standard non-head-mounted gaze tracking device (Eye-gaze Response Interface Computer Aid (ERICA) workstation) was used to record the eye movements. All participants had normal or corrected-to-normal vision.
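As a quick check of the stated resolution, dividing the image size in pixels by the visual angle it subtends gives the same value along both axes:

```latex
\[
\frac{1024\ \text{pixels}}{32^\circ} \;=\; \frac{768\ \text{pixels}}{24^\circ} \;=\; 32\ \text{pixels per degree},
\]
```

which is consistent with the approximate figure of 30 pixels per degree quoted in the text.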
To measure prediction accuracy, saliency maps were first quantized into a binary mask that classified each image location as either a fixation or nonfixation (Tatler, Baddeley, & Gilchrist, 2005). Using the measured human fixations as ground truth, a receiver operating characteristic (ROC) curve was generated by varying the quantization threshold. Perfect prediction corresponds to an ROC area (area under the ROC curve) of 1, while chance performance occurs at an area of 0.5. Since the metric makes use of all saliency information in both the human fixations and the saliency detector output, it has been adopted in various recent studies (Bruce & Tsotsos, 2006; Harel et al., 2007; Kienzle et al., 2007). The predictions of discriminant saliency were compared to those of the methods of Itti and Koch (2000) and Bruce and Tsotsos (2006). As an absolute benchmark, we also computed the “inter-subject” ROC area (Harel et al., 2007), which measures fixation consistency between human subjects. For each subject, a “human saliency map” was derived from the fixations of all other subjects, by convolving these fixations with a circular 2-D Gaussian kernel. The standard deviation (σ) of this kernel was set to 1° of visual angle (≈30 pixels), which is approximately the radius of the fovea. The “inter-subject” ROC area was then measured by comparing subject fixations to this saliency map and averaging across subjects and images. 
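The evaluation procedure above can be sketched in a few lines. This is a minimal illustration, assuming NumPy only: the helper names are hypothetical, the Gaussian is truncated at three standard deviations (an assumption), fixations are given as (row, column) pixel coordinates, and the ROC area is obtained with the trapezoidal rule over a fixed number of thresholds.

```python
import numpy as np

def gaussian_kernel_1d(sigma, truncate=3.0):
    """Normalized 1-D Gaussian, truncated at truncate * sigma (assumed cutoff)."""
    radius = int(truncate * sigma + 0.5)
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-offsets**2 / (2.0 * sigma**2))
    return kernel / kernel.sum()

def human_saliency_map(fixations, shape, sigma=30.0):
    """Blur a fixation-count image with a circular 2-D Gaussian (sigma in pixels).

    The 2-D Gaussian is applied separably; each image side must be at least
    as long as the kernel (2 * 3 * sigma + 1 pixels).
    """
    counts = np.zeros(shape, dtype=float)
    for row, col in fixations:
        counts[row, col] += 1.0
    k = gaussian_kernel_1d(sigma)
    blurred = np.apply_along_axis(np.convolve, 0, counts, k, mode="same")
    return np.apply_along_axis(np.convolve, 1, blurred, k, mode="same")

def roc_area(saliency, fixation_mask, n_thresholds=100):
    """Area under the ROC curve traced by sweeping a threshold over the map."""
    s = saliency.ravel()
    pos = fixation_mask.ravel().astype(bool)
    thresholds = np.linspace(s.max(), s.min(), n_thresholds)
    tpr = np.array([(s[pos] >= th).mean() for th in thresholds])
    fpr = np.array([(s[~pos] >= th).mean() for th in thresholds])
    tpr = np.concatenate(([0.0], tpr))  # start the curve at (0, 0)
    fpr = np.concatenate(([0.0], fpr))
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
```

For the inter-subject benchmark, `human_saliency_map` would be built from the fixations of all other subjects and scored with `roc_area` against the held-out subject's fixation mask.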
Table 1 presents average ROC areas for all detectors, across the entire image set, as well as the “inter-subject” ROC area. Discriminant saliency achieves the highest ROC area of the three saliency detectors (0.7694, versus 0.7547 for Bruce and Tsotsos, 2006, and 0.7287 for Itti and Koch, 2000). Nevertheless, its advantage over the other two detectors is not as significant as that observed in the quantitative psychophysics simulations of the previous sections. This is because performance on the fixation task does not depend on the accuracy of the saliency measure as critically as performance in the psychophysics experiments. There are two reasons for this. The first is that the eye fixation experiment is more qualitative: All that is required for good performance is that the saliency peaks have the correct ordering within each saliency map. In contrast, good performance on the psychophysics simulations, e.g., the nonlinearity study, requires a precise match between the simulated curve of saliency vs. orientation contrast and that of humans. The second is that the ROC areas of Table 1 are averaged across all fixations. This makes the human ground truth unreliable, since it is unlikely that late fixations are driven by bottom-up saliency. On the contrary, as the subjects start to recognize the scenes, it is expected that they will use top-down cues to decide where to look next. This has been pointed out in the literature: For example, Tatler et al. (2005) suggest that the first few fixations are more likely to be driven by bottom-up mechanisms than the remaining ones.
Table 1
 
ROC areas for different saliency models with respect to all human fixations.
| Saliency model | Discriminant | Itti and Koch (2000) | Bruce and Tsotsos (2006) | Inter-subject |
| --- | --- | --- | --- | --- |
| ROC area | 0.7694 | 0.7287 | 0.7547 | 0.8766 |
To probe deeper into this issue, we studied in greater detail the relationship between saliency maps and the subjects' first two fixations. Figure 11 presents the ROC areas of the three detectors as a function of the “inter-subject” ROC area, for these fixations. Again, discriminant saliency exhibited the strongest correlation with human performance, at all levels of inter-subject consistency. More importantly, the gains of discriminant saliency were largest when inter-subject consistency was strongest. In this region, the performance of discriminant saliency (0.85) was close to 90% of that of humans (0.95), while the other two detectors only achieved close to 85% (0.81). 
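The percentages quoted above follow directly from the ratios of the ROC areas in the region of highest inter-subject consistency:

```latex
\[
\frac{0.85}{0.95} \approx 0.89
\qquad\text{and}\qquad
\frac{0.81}{0.95} \approx 0.85 ,
\]
```

that is, roughly 90% of human performance for discriminant saliency and roughly 85% for the other two detectors.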
Figure 11
 
Average ROC area, as a function of inter-subject ROC area, for the saliency algorithms discussed in the text.
Discriminant saliency on motion fields
Motion saliency is important for various computer vision applications. For example, a robot could benefit from a motion saliency module to identify objects approaching it. However, motion saliency is not trivial to implement when there is ego-motion. If the robot itself is moving, the optical flow due to the moving objects is easily confounded with that caused by background variation due to the robot's motion. This is illustrated by Figure 12, which shows several frames (top row) from a video sequence shot with a moving camera. The sequence depicts a leopard running in a grassland. The camera motion introduces significant variability in the background, making the detection of foreground motion (the leopard) a difficult task. This can be confirmed by analyzing the saliency predictions of algorithms previously proposed in the literature. One example is the “surprise” model of Itti and Baldi (2005) (results were generated using the iLab Neuromorphic Vision Toolkit available from http://ilab.usc.edu/toolkit). Although it is one of the best saliency detectors for these types of sequences, the “surprise” maps generated by this algorithm (bottom row of the figure) frequently assign more saliency to the motion of the background than to that of the leopard.
Figure 12
 
Saliency in the presence of ego-motion. (A–D) Representative frames from a video sequence shot with a moving camera, (E–H) the saliency map produced by the motion-based discriminant saliency detector, and (I–L) the “surprise” maps by the model of Itti and Baldi (2005) (Movie clip).
The saliency maps produced by motion-based discriminant saliency are shown in the middle row of the figure. They are clearly superior to those produced by the surprise model, disregarding the background and concentrating all saliency on the animal's body. This example shows that motion-based discriminant saliency is very robust to the presence of ego-motion. This is because discriminant saliency is based on a measure of motion contrast. While there is variability in the background optical flow (due to a combination of camera motion and a mostly static scene), it is usually much smaller than the variability of the object's optical flow (especially for nonrigid objects). Hence, the object region has larger motion contrast and is deemed more salient. This is similar to the grouping examples of Figure 1, where feature contrast plays an important role in grouping and segmentation percepts.
Discriminant saliency for dynamic scenes
One further source of complexity is the possibility that the scene is itself dynamic, e.g., a background consisting of water waves, or tree leaves moving with the wind. In this case, the variability of background optical flow can be larger than that of the object optical flow, for any object. This problem is so complex that, even though background subtraction is a classic problem in computer vision, there has been relatively little progress for these types of scenes (e.g., for a review, see Sheikh and Shah, 2005). In order to capture the motion patterns characteristic of these backgrounds, it is necessary to rely on reasonably sophisticated probabilistic models, such as the dynamic texture model (Doretto, Chiuso, Wu, & Soatto, 2003). A dynamic texture (DT) is an autoregressive, generative model for video. It models the spatial component of the video and the underlying temporal dynamics as two stochastic processes. The video is represented as a time-evolving state process $x_t \in \mathbb{R}^n$, and the appearance of a frame $y_t \in \mathbb{R}^m$ is a linear function of the current state vector and observation noise. The system equations are
\[
\begin{aligned}
x_t &= A x_{t-1} + v_t \\
y_t &= C x_t + w_t,
\end{aligned}
\tag{10}
\]
where $A \in \mathbb{R}^{n \times n}$ is the state transition matrix and $C \in \mathbb{R}^{m \times n}$ is the observation matrix. The state and observation noise are given by $v_t \sim \mathcal{N}(0, Q)$ and $w_t \sim \mathcal{N}(0, R)$, respectively. Finally, the initial condition is distributed as $x_1 \sim \mathcal{N}(\mu, S)$.
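To make the generative reading of Equation 10 concrete, the following sketch draws a short sequence from a dynamic texture with given parameters. It is an illustration only, assuming NumPy: the function name is hypothetical, frames are treated as flattened vectors of length m, and sampling full-covariance observation noise this way is practical only for small m. Parameter learning (Doretto et al., 2003) is not shown.

```python
import numpy as np

def sample_dynamic_texture(A, C, Q, R, mu, S, n_frames, rng=None):
    """Sample states x_t and frames y_t from the linear dynamical system of Equation 10."""
    rng = np.random.default_rng() if rng is None else rng
    n, m = A.shape[0], C.shape[0]
    states = np.zeros((n_frames, n))
    frames = np.zeros((n_frames, m))
    x = rng.multivariate_normal(mu, S)  # initial condition x_1 ~ N(mu, S)
    for t in range(n_frames):
        if t > 0:
            # state update: x_t = A x_{t-1} + v_t, with v_t ~ N(0, Q)
            x = A @ x + rng.multivariate_normal(np.zeros(n), Q)
        states[t] = x
        # observation: y_t = C x_t + w_t, with w_t ~ N(0, R)
        frames[t] = C @ x + rng.multivariate_normal(np.zeros(m), R)
    return states, frames
```

Each row of `frames` can be reshaped to the image dimensions to view the synthesized sequence.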
Due to the probabilistic nature of the dynamic texture model, it can be easily incorporated into a center-surround discriminant saliency detector. Given a sequence of images, the parameters of the dynamic texture are learned for the center and the surround regions at each image location, using algorithms discussed by Doretto et al. (2003) and Chan and Vasconcelos (2008). Saliency is then computed with the mutual information of Equation 3, using as $P_{X(l),Y(l)}(x, c)$ the probabilistic representation of the center and surround linear dynamic systems. In this case, the discriminant saliency measure becomes a measure of contrast between the compliance of the center and the surround regions with the dynamic texture assumption. Since this assumption tends to be accurate for dynamic natural scenes, but not necessarily for objects, the result is a background subtraction algorithm applicable to complex dynamic scenes.
This can be seen in Figures 14, 15, and 16, which depict the saliency maps produced by the dynamic texture-based discriminant saliency (DTDS) detector for three video sequences. The first (water bottle from Zhong and Sclaroff, 2003) depicts a bottle floating in water in rain and is shown in Figure 14A–D. The second sequence, Surfer, containing a surfer moving in water, is shown in Figure 15A–D. This sequence is more challenging, as the water surface displays a lower frequency sweeping wave interspersed with high frequency components due to turbulent wakes (created by the surfer and crest of the sweeping wave). The third, Cyclists (Figure 16A–D), shows a pair of cyclists moving across a field. The resolution of the clip is poor, and there is considerable background movement, making it difficult to extract the foreground reliably. We compared the output of the DTDS detector with a state-of-the-art background subtraction algorithm from computer vision, based on a Gaussian mixture model (GMM) (Stauffer & Grimson, 1999; Zivkovic, 2004), as well as the “surprise” model (Itti & Baldi, 2005).
Figures 14E–H, 15E–H, and 16E–H show the saliency maps produced by the DTDS detector for the three sequences. The DTDS detector performs well in all cases, detecting the foreground objects while ignoring the movement in the background. As can be seen in Figures 14I–L and 14M–P, Figures 15I–L and 15M–P, and Figures 16I–L and 16M–P, the foreground detection of the other methods is very noisy and cannot adapt to the highly dynamic nature of the background. The “surprise” maps of the early frames are especially noisy, since a training phase is required to learn the model parameters, a limitation that does not affect DTDS. Highly stochastic spatiotemporal stimuli, such as the sweeping wave crest or the very fast moving background field, create serious difficulties for both the GMM and the surprise detector. Unlike the saliency maps of DTDS, the resulting saliency maps contain substantial energy in regions of the background, sometimes completely missing the foreground objects. These saliency maps would be difficult to analyze by subsequent vision modules (e.g., object tracking). To produce a quantitative comparison, the saliency maps were thresholded at a range of values. The results were compared with manually annotated ground-truth foreground masks, and an ROC curve was produced for each algorithm. The results are shown in Figure 13. DTDS clearly outperforms both the GMM-based background model and the “surprise” model (Figures 14, 15, and 16).
Figure 13
 
Performance of background subtraction algorithms on (a) water bottle, (b) surfer, and (c) cyclists.
Figure 14
 
Results on bottle: (A–D) original; (E–H) DTDS; (I–L) surprise; and (M–P) GMM model (Movie clip).
Figure 15
 
Results on surfer: (A–D) original; (E–H) DTDS; (I–L) surprise; and (M–P) GMM model (Movie clip).
Figure 16
 
Results on cyclists: (A–D) original; (E–H) DTDS; (I–L) surprise; and (M–P) GMM model (Movie clip).
Conclusion
In this work, we have evaluated the plausibility of a recently proposed hypothesis for bottom-up saliency: that it is the result of optimal decision making, under constraints of computational parsimony. It was shown that this hypothesis can be applied to various stimulus modalities, and optimal saliency detectors were derived for intensity, color, orientation, and motion. These detectors were shown to replicate quantitative psychophysics aspects of human saliency for both static and moving stimuli. Application of the detectors to problems of interest in computer vision, including the prediction of human eye fixations on natural scenes, motion-based saliency in the presence of ego-motion, and background subtraction in highly dynamic scenes, also revealed better performance than existing solutions to these problems. 
Supplementary Materials
Supplementary Movie 1
Supplementary Movie 2
Supplementary Movie 3
Supplementary Movie 4
Acknowledgments
The authors thank Neil Bruce for kindly sharing the eye fixation data and saliency predictions of Bruce and Tsotsos (2006). This research was supported by NSF awards IIS-0448609 and IIS-0534985. 
Commercial relationships: none. 
Corresponding author: Dashan Gao. 
Email: dgao@ucsd.edu. 
Address: 9500 Gilman Dr. Mail Code 0409, La Jolla, CA 92037-0409. 
References
Adelson, E. H., & Bergen, J. R. (1985). Spatiotemporal energy models for the perception of motion. Journal of the Optical Society of America A, Optics and Image Science, 2, 284–299.
Beck, J. (1966a). Effect of orientation and of shape similarity on perceptual grouping. Perception & Psychophysics, 1, 300–302.
Beck, J. (1966b). Perceptual grouping produced by changes in orientation and shape. Science, 154, 538–540.
Beck, J. (1972). Similarity grouping and peripheral discriminability under uncertainty. American Journal of Psychology, 85, 1–19.
Brainard, D. H. (1997). The Psychophysics Toolbox. Spatial Vision, 10, 433–436.
Bruce, N., & Tsotsos, J. (2006). Saliency based on information maximization. In Y. Weiss, B. Schölkopf, & J. Platt (Eds.), Advances in neural information processing systems 18 (pp. 155–162). Cambridge, MA: MIT Press.
Buccigrossi, R. W., & Simoncelli, E. P. (1999). Image compression via joint statistical characterization in the wavelet domain. IEEE Transactions on Image Processing, 8, 1688–1701.
Cavanaugh, J. R., Bair, W., & Movshon, J. A. (2002). Nature and interaction of signals from the receptive field center and surround in macaque V1 neurons. Journal of Neurophysiology, 88, 2530–2546.
Chan, A. B., & Vasconcelos, N. (2008). Modeling, clustering, and segmenting video with mixtures of dynamic textures. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30, 909–926.
Clarke, R. (2008). Transform coding of images. San Diego, CA: Academic Press.
Do, M. N., & Vetterli, M. (2002). Wavelet-based texture retrieval using generalized Gaussian density and Kullback-Leibler distance. IEEE Transactions on Image Processing, 11, 146–158.
Doretto, G., Chiuso, A., Wu, Y. N., & Soatto, S. (2003). Dynamic textures. International Journal of Computer Vision, 51, 91–109.
Foster, D. H., & Ward, P. A. (1991). Asymmetries in oriented-line detection indicate two orthogonal filters in early vision. Proceedings of the Royal Society B: Biological Sciences, 243, 75–81.
Gao, D., & Vasconcelos, N. (2005). Discriminant saliency for visual recognition from cluttered scenes. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems 17 (pp. 481–488). Cambridge, MA: MIT Press.
Gao, D., & Vasconcelos, N. (in press). Decision-theoretic saliency: Computational principle, biological plausibility, and implications for neurophysiology and psychophysics. Neural Computation.
Harel, J., Koch, C., & Perona, P. (2007). Graph-based visual saliency. In B. Schölkopf, J. Platt, & T. Hoffman (Eds.), Advances in neural information processing systems 19 (pp. 545–552). Cambridge, MA: MIT Press.
Heeger, D. (1988). Optical flow from spatiotemporal filters. International Journal of Computer Vision, 1, 279–302.
Huang, J., & Mumford, D. (1999). Statistics of natural images and models. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 541–547). Ft. Collins, CO, USA: IEEE Computer Society.
Hubel, D. H., & Wiesel, T. N. (1965). Receptive fields and functional architecture in two nonstriate visual areas (18 and 19) of the cat. Journal of Neurophysiology, 28, 229–289.
Itti, L. (2004). Automatic foveation for video compression using a neurobiological model of visual attention. IEEE Transactions on Image Processing, 13, 1304–1318.
Itti, L., & Baldi, P. (2005). A principled approach to detecting surprising events in video. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 631–637). San Diego, CA.
Itti, L., & Koch, C. (2000). A saliency-based search mechanism for overt and covert shifts of visual attention. Vision Research, 40, 1489–1506.
Itti, L., & Koch, C. (2001). Computational modelling of visual attention. Nature Reviews Neuroscience, 2, 194–203.
Itti, L., Koch, C., & Niebur, E. (1998). A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 1254–1259.
Ivry, R. B., & Cohen, A. (1992). Asymmetry in visual search for targets defined by differences in movement speed. Journal of Experimental Psychology: Human Perception and Performance, 18, 1045–1057.
Julesz, B. (1975). Experiments in the visual perception of texture. Scientific American, 232, 34–43.
Julesz, B. (1981). A theory of preattentive texture discrimination based on first-order statistics of textons. Biological Cybernetics, 41, 131–138.
Julesz, B. (1984). A brief outline of the texton theory of human vision. Trends in Neuroscience, 7, 41–45.
Kienzle, W., Wichmann, F. A., Schölkopf, B., & Franz, M. O. (2007). A nonparametric approach to bottom-up visual saliency. In B. Schölkopf, J. Platt, & T. Hoffman (Eds.), Advances in neural information processing systems 19 (pp. 689–696). Cambridge, MA: MIT Press.
Knierim, J. J., & van Essen, D. C. (1992). Neuronal responses to static texture patterns in area V1 of the alert macaque monkey. Journal of Neurophysiology, 67, 961–980.
Koch, C., & Ullman, S. (1985). Shifts in selective visual attention: Towards the underlying neural circuitry. Human Neurobiology, 4, 219–227.
Landy, M. S., & Bergen, J. R. (1991). Texture segregation and orientation gradient. Vision Research, 31, 679–691.
Li, Z. (2002). A saliency map in primary visual cortex. Trends in Cognitive Sciences, 6, 9–16.
Mallat, S. G. (1989). A theory for multiresolution signal decomposition: The wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11, 674–693.
Modestino, J. W. (1977). Adaptive nonparametric detection techniques. In P. Papantoni-Kazakos & D. Kazakos (Eds.), Nonparametric methods in communications (pp. 29–65). New York: Marcel Dekker.
Moraglia, G. (1989). Display organization and the detection of horizontal line segments. Perception & Psychophysics, 45, 265–272.
Motoyoshi, I., & Nishida, S. (2001). Visual response saturation to orientation contrast in the perception of texture boundary. Journal of the Optical Society of America A, Optics, Image Science, and Vision, 18, 2209–2219.
Nothdurft, H. C. (1991a). Texture segmentation and pop-out from orientation contrast. Vision Research, 31, 1073–1078.
Nothdurft, H. C. (1991b). The role of local contrast in pop-out of orientation, motion and color. Investigative Ophthalmology & Visual Science, 32, 714.
Nothdurft, H. C. (1992). Feature analysis and the role of similarity in preattentive vision. Perception & Psychophysics, 52, 355–375.
Nothdurft, H. C. (1993). The conspicuousness of orientation and motion contrast. Spatial Vision, 7, 341–363.
Olson, R. K., & Attneave, F. (1970). What variables produce similarity grouping? American Journal of Psychology, 83, 1–21.
Parkhurst, D., Law, K., & Niebur, E. (2002). Modeling the role of salience in the allocation of overt visual attention. Vision Research, 42, 107–123.
Peters, R. J., Iyer, A., Itti, L., & Koch, C. (2005). Components of bottom-up gaze allocation in natural images. Vision Research, 45, 2397–2416.
Regan, D. (1995). Orientation discrimination for bars defined by orientation texture. Perception, 24, 1131–1138.
Rosenholtz, R. (1999). A simple saliency model predicts a number of motion popout phenomena. Vision Research, 39, 3157–3163.
Sagi, D., & Julesz, B. (1985). “Where” and “what” in vision. Science, 228, 1217–1219.
Sheikh, Y., & Shah, M. (2005). Bayesian modeling of dynamic scenes for object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27, 1778–1792.
Shic, F., & Scassellati, B. (2007). A behavioral analysis of computational models of visual attention. International Journal of Computer Vision, 73, 159–177.
Stauffer, C., & Grimson, W. (1999). Adaptive background mixture models for real-time tracking. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 246–252).
Tatler, B. W., Baddeley, R. J., & Gilchrist, I. D. M. (2005). Visual correlates of fixation selection: Effects of scale and time. Vision Research, 45, 643–659.
Treisman, A. (1985). Preattentive processing in vision. Computer Vision, Graphics, & Image Processing, 31, 156–177.
Treisman, A. M., & Gelade, G. (1980). A feature-integration theory of attention. Cognitive Psychology, 12, 97–136.
Treisman, A., & Gormican, S. (1988). Feature analysis in early vision: Evidence from search asymmetries. Psychological Review, 95, 15–48.
Vasconcelos, M., & Vasconcelos, N. (in press). Natural image statistics and low complexity feature selection. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Vasconcelos, N. (2003). Feature selection by maximum marginal diversity: Optimality and implications for visual recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (vol. 1, pp. 762–769).
Vasconcelos, N., & Vasconcelos, M. (2004). Scalable discriminant feature selection for image retrieval and recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (vol. 2, pp. 770–775).
Walther, D., & Koch, C. (2006). Modeling attention to salient proto-objects. Neural Networks, 19, 1395–1407.
Wolfe, J. M. (1994). Guided search 2.0: A revised model of visual search. Psychonomic Bulletin & Review, 1, 202–238.
Wolfe, J. M., Friedman-Hill, S. R., Stewart, M. I., & O'Connell, K. M. (1992). The role of categorization in visual search for orientation. Journal of Experimental Psychology: Human Perception and Performance, 18, 34–49.
Zhong, J., & Sclaroff, S. (2003). Segmenting foreground objects from a dynamic textured background via a robust Kalman filter. Proceedings of IEEE International Conference on Computer Vision (vol. 1, pp. 44).
Zivkovic, Z. (2004). Improved adaptive Gaussian mixture model for background subtraction. Proceedings of International Conference on Pattern Recognition (vol. 2, pp. 28–31).
Figure 1
 
Four displays (top row) and saliency maps produced by the algorithm proposed in this article (bottom row). These examples show that saliency analysis facilitates aspects of perceptual organization, such as grouping (left two displays) and texture segregation (right two displays).
Figure 2
 
Illustration of discriminant center-surround saliency.
Figure 3
 
Bottom-up discriminant saliency detector.
Figure 4
 
Discriminant saliency output (bottom row) for displays (top row) where target and distractors differ in terms of single features (A, orientation; B, color) or (C) feature conjunctions (color and orientation). Brightest regions are most salient. The strong saliency peaks at the targets of panels A and B indicate a strong pop-out effect. The lack of distinguishable saliency variations between the target (fourth line and fourth column) and distractors of panel C indicates that the target does not pop out.
Figure 5
 
Example of pop-out asymmetry (discriminant saliency maps shown below each display). (Left) A target (“Q”) defined by the presence of a feature that the distractors (“O”) lack produces a strong pop-out effect. (Right) The reverse does not lead to noticeable pop-out.
Figure 6
 
The nonlinearity of human saliency responses to orientation contrast (reproduced from Figure 9 of Nothdurft, 1993) (a) is replicated by discriminant saliency (b) but not by the model of Itti and Koch (2000) (c).
Figure 7
 
Mutual information between feature responses and class label, as a function of the ratio between the variances of the former in the center and the surround windows, σ1/σ0.
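A curve of this kind can be sketched numerically. The following is a minimal illustration only, assuming zero-mean Gaussian feature responses in both windows and equal class priors; the article's own response model may differ, and the names `gaussian_pdf` and `mutual_information` are hypothetical. The function takes standard deviations, so a ratio of variances corresponds to the square of the ratio passed in.

```python
import numpy as np

def gaussian_pdf(x, sigma):
    """Zero-mean Gaussian density (assumed response model, for illustration)."""
    return np.exp(-0.5 * (x / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def mutual_information(sigma_center, sigma_surround, prior_center=0.5, n=200001):
    """I(X; Y) in bits between a response X and a binary center/surround label Y.

    Uses I(X; Y) = sum_c P(c) * KL(p(x|c) || p(x)), evaluated by a Riemann
    sum on a grid wide enough to cover both class-conditional densities.
    """
    span = 12.0 * max(sigma_center, sigma_surround)
    x = np.linspace(-span, span, n)
    dx = x[1] - x[0]
    p1 = gaussian_pdf(x, sigma_center)     # responses in the center window
    p0 = gaussian_pdf(x, sigma_surround)   # responses in the surround window
    mix = prior_center * p1 + (1.0 - prior_center) * p0  # marginal p(x)

    def kl_to_mixture(p):
        mask = p > 0  # skip grid points where the density underflows to zero
        return np.sum(p[mask] * np.log2(p[mask] / mix[mask])) * dx

    return (prior_center * kl_to_mixture(p1)
            + (1.0 - prior_center) * kl_to_mixture(p0))

if __name__ == "__main__":
    for ratio in (0.25, 0.5, 1.0, 2.0, 4.0):
        print(ratio, mutual_information(ratio, 1.0))
```

Under these assumptions the value depends only on the ratio of the two spreads, is zero when they are equal, and cannot exceed 1 bit for a binary label. The grid resolution is adequate for moderate ratios only; very large or very small ratios would need a finer grid.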
Figure 8
 
An example display (a) and performance of saliency detectors (discriminant saliency (b) and the model of Itti and Koch, 2000 (c)) on Treisman's Weber's law experiment.
Figure 9
 
Discriminant saliency detector output for (a) a fast-moving target among slowly moving distractors and (b) a slowly moving target among fast-moving distractors. The top row shows quiver plots of the stimuli (the direction of motion is specified by the arrow, whose length indicates the speed), and the bottom row shows the corresponding saliency maps.
Figure 10
 
The nonlinearity of human saliency responses to motion contrast (reproduced from Figure 9 of Nothdurft, 1993) (b) is replicated by discriminant saliency (c). A quiver plot of one instance of the motion display used in the experiment (with background contrast (bg) = 0; target contrast (tg) = 60) is illustrated in panel a. The direction of motion is specified by the arrow, whose length indicates the speed.
Figure 11
 
Average ROC area, as a function of inter-subject ROC area, for the saliency algorithms discussed in the text.
Figure 12
 
Saliency in the presence of ego-motion. (A–D) Representative frames from a video sequence shot with a moving camera, (E–H) the saliency maps produced by the motion-based discriminant saliency detector, and (I–L) the “surprise” maps produced by the model of Itti and Baldi (2005) (Movie clip).
Figure 13
 
Performance of background subtraction algorithms on (a) water bottle, (b) surfer, and (c) cyclists.
Figure 14
 
Results on bottle: (A–D) original; (E–H) DTDS; (I–L) surprise; and (M–P) GMM model (Movie clip).
Figure 15
 
Results on surfer: (A–D) original; (E–H) DTDS; (I–L) surprise; and (M–P) GMM model (Movie clip).
Figure 16
 
Results on cyclists: (A–D) original; (E–H) DTDS; (I–L) surprise; and (M–P) GMM model (Movie clip).
Table 1
 
ROC areas for different saliency models with respect to all human fixations.
Saliency model | Discriminant | Itti and Koch (2000) | Bruce and Tsotsos (2006) | Inter-subject
ROC area       | 0.7694       | 0.7287               | 0.7547                   | 0.8766
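For readers unfamiliar with the metric, one common way to compute an ROC area for a saliency map is sketched below. It is an illustration under stated assumptions, not the evaluation protocol used in the article: the saliency values are treated as scores for classifying each location as fixated or not, and the area is computed as the probability that a fixated location outscores a non-fixated one. The function name `roc_area` and the binary fixation mask are hypothetical.

```python
import numpy as np

def roc_area(saliency, fixated):
    """Area under the ROC curve for saliency values used as classifier scores.

    saliency: array of saliency values, one per image location.
    fixated:  same-shaped array, nonzero where a fixation landed (assumed input).
    """
    scores = np.asarray(saliency, dtype=float).ravel()
    labels = np.asarray(fixated, dtype=bool).ravel()
    pos, neg = scores[labels], scores[~labels]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need at least one fixated and one non-fixated location")
    # Rank-statistic form of the ROC area: count fixated/non-fixated pairs in
    # which the fixated location has the higher score; ties count one half.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (pos.size * neg.size)

if __name__ == "__main__":
    toy_map = [[0.9, 0.1], [0.2, 0.8]]
    toy_fixations = [[1, 0], [0, 1]]
    print(roc_area(toy_map, toy_fixations))
```

A value of 0.5 corresponds to chance and 1.0 to perfect separation. The pairwise comparison uses memory proportional to the number of fixated locations times the number of non-fixated locations, so this form suits small maps; a sort-based rank computation would be preferable for full-resolution images.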
Supplementary Movie 1
Supplementary Movie 2
Supplementary Movie 3
Supplementary Movie 4