We present a novel unified framework for both static and space-time saliency detection. Our method is a bottom-up approach that computes so-called local regression kernels (i.e., local descriptors) from the given image (or video), which measure the likeness of a pixel (or voxel) to its surroundings. Visual saliency is then computed using this "self-resemblance" measure. The framework results in a saliency map where each pixel (or voxel) indicates the statistical likelihood of saliency of a feature matrix given its surrounding feature matrices. As a similarity measure, matrix cosine similarity (a generalization of cosine similarity) is employed. State-of-the-art performance is demonstrated on commonly used human eye-fixation data (static scenes (N. Bruce & J. Tsotsos, 2006) and dynamic scenes (L. Itti & P. Baldi, 2006)) and on some psychological patterns.

*local steering kernels* and *space-time local steering kernels*, which capture local data structure exceedingly well. Our approach is motivated by a probabilistic framework based on a nonparametric estimate of the likelihood of saliency. As we describe below, this boils down to the local calculation of a "self-resemblance" map, which measures the similarity of a feature matrix at a pixel of interest to its neighboring feature matrices.

*p*(**f**), where **f** is a local visual feature vector (i.e., derived from independent component analysis (ICA) performed on a large sample of small RGB patches in the image). The probability density function is estimated based on a Gaussian kernel density estimate in a neural circuit.

**f**_G represents a global feature that summarizes the appearance of the scene, and they approximated this conditional probability density function by fitting it to a multivariate exponential distribution. Zhang et al. (2008) also proposed saliency detection using natural statistics (SUN) based on a similar Bayesian framework to estimate the probability of a target at every location. They also claimed that their saliency measure emerges from the use of Shannon's self-information under certain assumptions. They used ICA features, as similarly done in Bruce and Tsotsos (2006), but their method differs from Bruce and Tsotsos (2006) in that natural image statistics were applied to determine the density function of the ICA features. Itti and Baldi (2006) proposed so-called "Bayesian Surprise" and extended it to the video case (Itti & Baldi, 2005). They measured the KL-divergence between a prior and a posterior distribution as a measure of saliency.

*y*_i denote whether a pixel position **x**_i = [*x*_1, *x*_2]_i^T is salient or not as follows:

*i* = 1, …, *M*, and *M* is the total number of pixels in the image. Motivated by the approach in Zhang et al. (2008) and Oliva et al. (2003), we define saliency at pixel position **x**_i as the posterior probability *Pr*(*y*_i = 1 ∣ **F**) as follows:

**F**_i = [**f**_i^1, …, **f**_i^L] at a pixel of interest **x**_i (what we call a center feature) contains a set of feature vectors (**f**_i) in a local neighborhood, where *L* is the number of features in that neighborhood. (Note that if *L* = 1, we use a single feature vector. Using a feature matrix consisting of a set of feature vectors provides more discriminative power than using a single feature vector, as also pointed out in Wu and Nevatia, 2007 and Bregonzio, Gong, & Xiang, 2009.) In turn, the larger collection of features **F** = [**F**_1, …, **F**_N] is a matrix containing features not only from the center, but also from a surrounding region (what we call a center + surround region; see Figure 2). *N* is the number of feature matrices in the center + surround region. Using Bayes' theorem, Equation 2 can be written as
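Written out, this Bayes step takes the standard form (a reconstruction consistent with the uniformity assumptions invoked just below; the original displayed equation is not reproduced here):

```latex
S_i = \Pr(y_i = 1 \mid \mathbf{F})
    = \frac{p(\mathbf{F} \mid y_i = 1)\,\Pr(y_i = 1)}{p(\mathbf{F})}
```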

*p*(**F**) are uniform over features, the saliency we defined boils down to the conditional probability density *p*(**F** ∣ *y*_i = 1).

*p*(**F** ∣ *y*_i = 1), we need to estimate it. It is worth noting that Gao et al. (2008) and Zhang et al. (2008) fit the marginal density of local feature vectors *p*(**f**) to a generalized Gaussian distribution. In this paper, however, we approximate the conditional density function *p*(**F** ∣ *y*_i = 1) based on nonparametric kernel density estimation, which will be explained in detail in the Saliency by self-resemblance section.

*p*(**F** ∣ *y*_i = 1) using nonparametric kernel density estimation (see Figure 6). 3) While Itti and Baldi (2006) computed, as a measure of saliency, the KL-divergence between a prior and a posterior distribution, we explicitly estimate the likelihood function directly using nonparametric kernel density estimation. 4) Our space-time saliency detection method does not require explicit motion estimation. 5) The proposed unified framework can handle both static and space-time saliency detection. Figure 1 shows an overview of our proposed framework for saliency detection. To summarize the operation of the overall algorithm, we first compute the normalized local steering kernels (space-time local steering kernels) from the given image (video) I and vectorize them as **f**'s. Then, we identify the feature matrix **F**_i centered at a pixel of interest **x**_i and a set of feature matrices **F**_j in a center + surround region, and compute the self-resemblance measure (see Equations 16 and 17). The final saliency map is given as a density map, as shown in Figure 1. A shorter version of this paper (available online from http://users.soe.ucsc.edu/~milanfar/publications) can be found in the proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1st International Workshop on Visual Scene Understanding (ViSU09) (Seo & Milanfar, 2009b).

*l* ∈ {1, …, *P*}, where *P* is the number of pixels in a local window, and *h* is a global smoothing parameter (set to 1 and fixed for all the experiments). The matrix **C**_l is a covariance matrix estimated from a collection of spatial gradient vectors within the local analysis window around a position **x**_l = [*x*_1, *x*_2]_l^T. More specifically, the covariance matrix **C**_l can first be naively estimated as **J**_l^T **J**_l, where the columns of **J**_l collect the first derivatives *z*_{x1}(·) and *z*_{x2}(·) along the *x*_1- and *x*_2-axes. For the sake of robustness, we compute a more stable estimate of **C**_l by invoking the singular value decomposition (SVD) of **J**_l with regularization as (Takeda et al., 2007; Seo & Milanfar, 2009a)

*λ*′ and *λ*″ are parameters (set to 1 and 10^−7, respectively, and fixed for all the experiments) that dampen the noise effect and keep the denominators of the *a*_q's from being zero, and *α* is a parameter (set to 0.008 and fixed for all the experiments) that restricts *γ*. The singular values (*s*_1, *s*_2) and the singular vectors (**v**_1, **v**_2) are given by the compact SVD of **J**_l = **U**_l **S**_l **V**_l^T = **U**_l diag[*s*_1, *s*_2]_l [**v**_1, **v**_2]_l^T. Figure 3 illustrates how covariance matrices and LSK values are computed in an edge region.
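As a concrete illustration, the regularized covariance estimate described above can be sketched in a few lines of NumPy. This follows the steering-kernel form of Takeda et al. (2007) with the parameter values quoted in the text (λ′ = 1, λ″ = 10^−7, α = 0.008); the function name and array layout are our own illustrative choices, not the authors' code.

```python
import numpy as np

def steering_covariance(grad_x1, grad_x2, lam1=1.0, lam2=1e-7, alpha=0.008):
    """Regularized 2x2 covariance C_l from gradients in a local window.

    grad_x1, grad_x2: length-P arrays of first derivatives along the
    x1- and x2-axes. lam1 = lambda', lam2 = lambda'' in the text.
    """
    J = np.column_stack([grad_x1, grad_x2])           # P x 2 gradient matrix J_l
    P = J.shape[0]
    _, s, Vt = np.linalg.svd(J, full_matrices=False)  # compact SVD: s1 >= s2
    s1, s2 = s
    a1 = (s1 + lam1) / (s2 + lam1)                    # elongation along dominant direction
    a2 = (s2 + lam1) / (s1 + lam1)
    gamma = ((s1 * s2 + lam2) / P) ** alpha           # overall scaling term
    v1, v2 = Vt[0], Vt[1]
    # C_l = gamma * (a1^2 v1 v1^T + a2^2 v2 v2^T): symmetric positive definite
    return gamma * (a1**2 * np.outer(v1, v1) + a2**2 * np.outer(v2, v2))
```

Because of the *λ*″ term, the estimate stays positive definite even in flat regions where the gradients nearly vanish.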

**x**_l = [*x*_1, *x*_2, *t*]_l^T, where *x*_1 and *x*_2 are the spatial coordinates and *t* is the temporal coordinate. The approach is fundamentally the same as in 2-D. Again, the covariance matrix **C**_l can be naively estimated as **J**_l^T **J**_l, with

where *z*_{x1}(·), *z*_{x2}(·), and *z*_t(·) are the first derivatives along the *x*_1-, *x*_2-, and *t*-axes, and *P* is the total number of samples in a *space-time* local analysis window (or cube) around a sample position **x**_i. As in the 2-D case, **C**_l is estimated by invoking the singular value decomposition (SVD) of **J**_l with regularization as (Takeda et al., 2009):

*λ*′ and *λ*″ are parameters that dampen the noise effect and keep *γ* and the denominators of the *a*_q's from being zero. As mentioned earlier, the singular values (*s*_1, *s*_2, and *s*_3) and the singular vectors (**v**_1, **v**_2, and **v**_3) are given by the compact SVD of **J**_l = **U**_l **S**_l **V**_l^T = **U**_l diag[*s*_1, *s*_2, *s*_3]_l [**v**_1, **v**_2, **v**_3]_l^T.

**C**_l modifies the shape and size of the local kernel in a way that robustly encodes the space-time local geometric structures present in the video (see Figure 4b for an example). Similarly to the 2-D case, 3-D LSKs are formed as follows: the values of *K* are based on the covariance matrices **C**_l along with their spatial locations **x**_l. Intuitively, the **C**_l's in the local analysis window are similar to one another in a flat region, so only the spatial locations affect the kernel shape, which looks more or less symmetric or isotropic there. In an edge region, on the other hand, the kernel size and shape depend on both **C**_l and its spatial location **x**_l in the local window. Thus, the kernel takes on high values along the edge while the remaining kernel values are near zero. For a more in-depth analysis of local steering kernels, we refer the interested reader to Takeda et al. (2007) and Takeda et al. (2009).

**x**_i, we will essentially be using (a normalized version of) the function *K*(**x**_l − **x**_i). To be more specific, the local steering kernel function *K*(**x**_l − **x**_i) is calculated at every pixel location and normalized as follows:

**C**_l, LSK features are robust against signal uncertainty such as the presence of noise. In addition, the normalized version of LSKs provides a certain invariance to illumination changes, as shown in Figure 5.
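To make the construction concrete, here is a small sketch of how one LSK could be computed and normalized from a given covariance matrix. We assume the common Gaussian steering form K(**x**_l − **x**_i) ∝ √det(**C**) · exp(−(**x**_l − **x**_i)^T **C** (**x**_l − **x**_i) / 2*h*²) from Takeda et al. (2007); for simplicity a single covariance **C** is used for the whole window, whereas the full method uses one **C**_l per pixel.

```python
import numpy as np

def normalized_lsk(C, h=1.0, radius=1):
    """Normalized LSK weights W over a (2r+1) x (2r+1) window.

    C: 2x2 steering covariance; h: global smoothing parameter (1 in the text).
    The final division makes the weights sum to one, which is what gives the
    features their partial invariance to illumination changes.
    """
    size = 2 * radius + 1
    K = np.empty((size, size))
    for a, dx1 in enumerate(range(-radius, radius + 1)):
        for b, dx2 in enumerate(range(-radius, radius + 1)):
            d = np.array([dx1, dx2], dtype=float)  # displacement x_l - x_i
            K[a, b] = np.sqrt(np.linalg.det(C)) * np.exp(-(d @ C @ d) / (2.0 * h**2))
    return K / K.sum()
```

Column-stacking the resulting window of weights *W* yields the feature vector **f** used in the feature matrices.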

**F**_i by using the **f**'s, which are vectorized versions of the *W*'s. More specifically, we collect **f**_i^j in a local window (say, 3 × 3) centered at the pixel of interest **x**_i, where *j* = 1, …, 9. Then, in a larger window (say, 5 × 5) also centered at **x**_i, the center + surround feature matrices **F** = [**F**_1, …, **F**_25] are obtained. In the following section, we explain how we nonparametrically estimate the conditional probability density *p*(**F** ∣ *y*_i = 1) discussed in the Overview of the proposed approach section.

**x**_i is measured using the conditional density of the feature matrix at that position: *S*_i = *p*(**F** ∣ *y*_i = 1). Hence, the task at hand is to estimate *p*(**F** ∣ *y*_i = 1) over *i* = 1, …, *M*. In general, the Parzen density estimator is a simple and generally accurate nonparametric density estimation method (Silverman, 1986). However, in higher dimensions and with an expected long-tailed distribution, the Parzen density estimator with an isotropic kernel is not the most appropriate tool (Bengio, Larochelle, & Vincent, 2005; Brox, Rosenhahn, & Cremers, 2007; Vincent & Bengio, 2003). As explained earlier, the LSK features tend to generically come from long-tailed distributions, and as such, there are generally no tight clusters in the feature space. When we estimate a probability density at a particular feature point, for instance **F**_i = [**f**_i^1, …, **f**_i^L] (where *L* is the number of vectorized LSKs (**f**'s) employed in the feature matrix), the isotropic kernel centered on that feature point will spread its density mass equally along all the feature space directions, thus giving too much emphasis to irrelevant regions of space and too little along the manifold. Earlier studies (Bengio et al., 2005; Brox et al., 2007; Vincent & Bengio, 2003) also pointed out this problem. This motivates us to use *a locally data-adaptive kernel density estimator*. We define the conditional probability density *p*(**F** ∣ *y*_i = 1) at **x**_i as the center value of a normalized adaptive kernel (weight function) *G*(·) computed in the center + surround region as follows:

The weight function *G*_i in Equation 13 can be defined by using the concept of matrix cosine similarity (Seo & Milanfar, 2009a) as follows:

where **F**_i = [**f**_i^1, …, **f**_i^L] and **F**_j = [**f**_j^1, …, **f**_j^L], ∥·∥_F is the Frobenius norm, and *σ* is a parameter (set to 0.07 and fixed for all the experiments) controlling the fall-off of weights.

Here, *ρ*(**F**_i, **F**_j) is the "Matrix Cosine Similarity (MCS)" between the two feature matrices **F**_i, **F**_j, defined as the "Frobenius inner product" between the two normalized matrices: *ρ*(**F**_i, **F**_j) = 〈**F**_i, **F**_j〉_F = trace(**F**_i^T **F**_j) / (∥**F**_i∥_F ∥**F**_j∥_F). This can be rewritten as a weighted sum of the vector cosine similarities *ρ*(**f**_i, **f**_j) between each pair of corresponding feature vectors (i.e., columns) in **F**_i, **F**_j as follows:

**F**_i, **F**_j. This measure not only generalizes the cosine similarity, but also overcomes the disadvantages of the conventional Euclidean distance, which is sensitive to outliers. (The measure can be efficiently computed by column-stacking the matrices **F**_i, **F**_j and simply computing the cosine similarity between the two long column vectors.) By inserting Equation 14 into Equation 13, *S*_i can be rewritten as follows:
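The equivalence noted in the parenthetical above is easy to verify numerically: the trace form of the MCS equals the ordinary cosine similarity of the column-stacked matrices. A minimal sketch (function names are ours):

```python
import numpy as np

def mcs(Fi, Fj):
    """Matrix cosine similarity: Frobenius inner product of normalized matrices."""
    return np.trace(Fi.T @ Fj) / (np.linalg.norm(Fi) * np.linalg.norm(Fj))

def mcs_stacked(Fi, Fj):
    """Same quantity via column-stacking: cosine similarity of two long vectors."""
    a, b = Fi.ravel(), Fj.ravel()
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
```

Both return 1 when **F**_i = **F**_j, so the self-resemblance weight exp((−1 + *ρ*)/*σ*²) peaks at the center feature.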

*G*_i looks like in various regions of a natural image. As shown in Figure 6, the saliency at **x**_i (that is, *S*_i = *p*(**F** ∣ *y*_i = 1)) can be explained by the peak value of the normalized weight function *G*_i, which contains contributions from all the surrounding feature matrices. In other words, *p*(**F** ∣ *y*_i = 1) reveals how salient **F**_i is given all of the features **F**_j in its neighborhood.

*c*_1, *c*_2, *c*_3 as **F**_i^{c1}, **F**_i^{c2}, **F**_i^{c3}, as shown in Figure 7. By collecting them into a larger matrix **F̃**_i = [**F**_i^{c1}, **F**_i^{c2}, **F**_i^{c3}], we can apply the matrix cosine similarity between **F̃**_i and **F̃**_j. The saliency map from the color channels can then be analogously defined as follows:
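In other words, the three per-channel feature matrices are concatenated before the similarity is computed, so cross-channel information is handled by a single MCS rather than by fusing three separate saliency maps. A sketch under that reading (names are illustrative):

```python
import numpy as np

def color_mcs(Fi_channels, Fj_channels):
    """MCS between collected color feature matrices.

    Fi_channels, Fj_channels: lists of the three per-channel feature
    matrices [F^c1, F^c2, F^c3]; they are stacked side by side into one
    larger matrix per pixel before a single matrix cosine similarity
    is taken.
    """
    Fi = np.hstack(Fi_channels)
    Fj = np.hstack(Fj_channels)
    return np.trace(Fi.T @ Fj) / (np.linalg.norm(Fi) * np.linalg.norm(Fj))
```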

Input: *I*: input image or video; *P*: size of the local steering kernel (LSK) or 3-D LSK window; *h*: global smoothing parameter for the LSK; *L*: number of LSKs (or 3-D LSKs) used in the feature matrix; *N*: size of the center + surround region for computing self-resemblance; *σ*: parameter controlling the fall-off of weights for computing self-resemblance.

Stage 1: Compute features
  if *I* is an image, compute the normalized LSK *W*_i and vectorize it to **f**_i, where *i* = 1, …, *M*;
  else, compute the normalized 3-D LSK *W*_i and vectorize it to **f**_i, where *i* = 1, …, *M*.

Stage 2: Compute self-resemblance
  for *i* = 1, …, *M*:
    if *I* is a grayscale image (or video), identify feature matrices **F**_i, **F**_j in a local neighborhood and compute
      *S*_i = 1 / Σ_{j=1}^{N} exp((−1 + *ρ*(**F**_i, **F**_j)) / *σ*²);
    else, identify feature matrices **F̃**_i = [**F**_i^{c1}, **F**_i^{c2}, **F**_i^{c3}] and **F̃**_j = [**F**_j^{c1}, **F**_j^{c2}, **F**_j^{c3}] in a local neighborhood from the three color channels and compute
      *S*_i = 1 / Σ_{j=1}^{N} exp((−1 + *ρ*(**F̃**_i, **F̃**_j)) / *σ*²).

Output: saliency map *S*_i, *i* = 1, …, *M*.
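Stage 2 of the algorithm above can be sketched end-to-end in NumPy. Here each pixel carries one column-stacked feature matrix (a d-vector), *ρ* is the column-stacked cosine similarity, and the center + surround region is a (2r+1) × (2r+1) window; boundary pixels simply use a truncated window. This is our own illustrative sketch, not the authors' code.

```python
import numpy as np

def self_resemblance_map(features, sigma=0.07, radius=3):
    """Saliency by self-resemblance on a grid of feature matrices.

    features: array of shape (H, W, d) holding one column-stacked feature
    matrix per pixel. radius defines the center + surround window, so
    N = (2*radius + 1)**2 for interior pixels.
    """
    H, W, d = features.shape
    norms = np.linalg.norm(features, axis=2)
    S = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            # center + surround window, truncated at the image border
            i0, i1 = max(0, i - radius), min(H, i + radius + 1)
            j0, j1 = max(0, j - radius), min(W, j + radius + 1)
            nbr = features[i0:i1, j0:j1].reshape(-1, d)
            nnorm = norms[i0:i1, j0:j1].ravel()
            # rho: cosine similarity between center feature and each neighbor
            rho = nbr @ features[i, j] / (nnorm * norms[i, j] + 1e-12)
            # S_i = 1 / sum_j exp((-1 + rho_ij) / sigma^2)
            S[i, j] = 1.0 / np.sum(np.exp((-1.0 + rho) / sigma**2))
    return S
```

A pixel whose feature resembles many of its neighbors accumulates a large sum and hence low saliency; an outlier feature contributes essentially only its self-term, driving *S*_i toward 1.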

*I* to an appropriate coarse scale (64 × 64). (Changing the scale leads to a different saliency map. Assume that we use a 3 × 3 LSK and a 5 × 5 local analysis window for **F**. If the visual search is performed at a fine scale, finer detail will be captured as salient, whereas at a coarser scale, larger objects will be considered salient. As expected, computing the saliency map at a finer scale takes longer. In fact, we tried combining saliency maps from multiple scales, but this did not improve performance even at the expense of time complexity. This brings up an interesting question worth considering for future research: namely, what is the optimal resolution for saliency detection? Clearly, higher resolution images do not imply better saliency maps.) We then compute LSKs of size 3 × 3 as features and generate feature matrices **F**_i in a 5 × 5 local neighborhood. The number of LSKs used in the feature matrix **F**_i is set to 9. For all the experiments, the smoothing parameter *h* for computing the LSK was set to 1 and the fall-off parameter *σ* for computing self-resemblance was set to 0.07. We obtained an overall saliency map by using the CIE L*a*b* color space throughout all the experiments. A typical run takes about 1 second at scale (64 × 64) on an Intel Pentium 4, 2.66 GHz Core 2 PC with 2 GB RAM.

*Hit Rate* (HR) and the *False Alarm Rate* (FAR) as follows:

*O* is a proto-objects map and *k* is the image index. From Table 1, we observe that our method overall outperforms Hou and Zhang's method (downloadable from http://bcmi.sjtu.edu.cn/~houxiaodi/) (Hou & Zhang, 2008b) and Itti's method (downloadable from http://www.saliencytoolbox.net/) (Itti et al., 1998).

*I* to a coarse spatial scale (64 × 64) in order to reduce the time complexity (we do not downsample the video in the time domain). We then compute 3-D LSKs of size 3 × 3 × 3 as features and generate feature matrices **F**_i in a 3 × 3 × 7 local space-time neighborhood. The number of 3-D LSKs used in the feature matrix **F**_i is set to 1 for time efficiency. The procedure for detecting space-time proto-objects and the rest of the parameters remain the same as in the 2-D case. A typical run of space-time saliency detection takes about 52 seconds on 50 frames of video at spatial scale (64 × 64) on an Intel Pentium 4, 2.66 GHz Core 2 PC with 2 GB RAM.

| Model | KL (SE) | ROC (SE) |
| --- | --- | --- |
| Itti et al. (1998) | 0.1130 (0.0011) | 0.6146 (0.0008) |
| Bruce and Tsotsos (2006) | 0.2029 (0.0017) | 0.6727 (0.0008) |
| Gao et al. (2008) | 0.1535 (0.0016) | 0.6395 (0.0007) |
| Zhang et al. (2008) | 0.2097 (0.0016) | 0.6570 (0.0008) |
| Hou and Zhang (2008b) | 0.2511 (0.0019) | 0.6823 (0.0007) |
| Our method | 0.2779 (0.0020) | 0.6896 (0.0007) |

*N*: the size of the center + surround region for computing self-resemblance; 2) *P*: the size of the LSK; and 3) *L*: the number of LSKs used in the feature matrix. As shown in Figure 16, it turns out that as we increase *N*, the overall performance improves, while increasing *P* and *L* rather deteriorates it. Overall, the best performance was achieved with the choice of *P* = 3 × 3 = 9, *L* = 3 × 3 = 9, and *N* = 7 × 7 = 49, at the expense of increased runtime.

*P*_f(1 − *P*_d) → exp(−*α* ·), where *α* is a constant. As for *h*, *σ*, and *λ*, these parameters are mostly set and fixed for all the experiments. Our model is somewhat similar to Gao et al. (2008) in the sense that a center-surround notion is used to compute saliency. One of the most important factors that makes the proposed method more effective is the use of LSKs as features: LSKs can capture local geometric structure exceedingly well even in the presence of signal uncertainty. In addition, unlike standard fusion methods, which linearly and directly combine saliency maps computed from each color channel, we used the matrix cosine similarity to combine information from the three color channels. Our comprehensive experimental results indicate that the self-resemblance measure derived from a locally data-adaptive kernel density estimator is much more effective and simpler than other existing methods and does not require any training. Although our method is built entirely on computational principles, the resulting model structure exhibits considerable agreement with the fixation behavior of the human visual system. With very good features like LSKs, the center-surround model is arguably an effective computational model of how the human visual system works.

*local steering kernels*, and by using nonparametric kernel density estimation based on *Matrix Cosine Similarity* (MCS). The proposed method can automatically detect salient objects in a given image and salient moving objects in videos. It is practically appealing because it is nonparametric, fast, and robust to uncertainty in the data. Experiments on challenging sets of real-world human fixation data (both images and videos) demonstrated that the proposed saliency detection method achieves a high degree of accuracy and improves upon state-of-the-art methods. Due to its robustness to noise and other systemic perturbations, we also expect the present framework to be quite effective in other applications such as image quality assessment, background subtraction in dynamic scenes, and video summarization.