In this section, we propose a measure of saliency at a pixel of interest based on observations of the dissimilarity between a center patch around the pixel and its nearby patches (see Figure 2). Let us denote by $\rho_i$ the similarity between a patch centered at a pixel of interest and its $i$-th neighboring patch. Then, the dissimilarity $y_i$ is measured as a decreasing function of $\rho_i$:

$$y_i = g(\rho_i), \qquad g(\cdot) \text{ decreasing},$$

a simple example being $y_i = 1 - \rho_i$.
The similarity function $\rho$ can be measured in a variety of ways (Rubner, Tomasi, & Guibas, 2000; Seo & Milanfar, 2009; Swain & Ballard, 1991), for instance, using the matrix cosine similarity between visual features computed in the two patches (Seo & Milanfar, 2009, 2010), that is, the Frobenius inner product of the two feature matrices normalized by the product of their Frobenius norms. In our experiments, we use the LARK (locally adaptive regression kernel) features as defined in Takeda, Farsiu, and Milanfar (2007), which have been shown to be robust to noise and other distortions. A more detailed description of these features is given in Takeda et al. (2007) and Takeda, Milanfar, Protter, and Elad (2009). We note that the effectiveness of LARK as a visual descriptor has led to its use for object and action detection and recognition, even in the presence of significant noise (Seo & Milanfar, 2009, 2010).

From an estimation theory point of view, we assume that each observation
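$y_i$ is in essence a measurement of the true saliency, corrupted by some error. This observation model can be written as

$$y_i = s(x_j) + \eta_i, \qquad i = 1, \ldots, N,$$

where $\eta_i$ is noise and $s(x_j)$ is the saliency at the pixel of interest $x_j$. Given these observations, we assume a locally constant model of saliency and estimate the expected saliency at pixel $x_j$ by solving the weighted least squares problem

$$\hat{s}(x_j) = \arg\min_{s} \sum_{i=1}^{N} \left[ y_i - s \right]^2 K(y_i - y_r) = \frac{\sum_{i=1}^{N} K(y_i - y_r)\, y_i}{\sum_{i=1}^{N} K(y_i - y_r)},$$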
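where $y_r$ is a reference observation. We choose

$$y_r = \min_{i = 1, \ldots, N} y_i,$$

where $i = 1, \ldots, N$ ranges over a neighborhood of $x_j$. As such, $y_r$ corresponds to the neighboring patch most similar to the patch at $x_j$. Depending on the difference between this reference observation $y_r$ and each observation $y_i$, the kernel function $K(\cdot)$ gives higher or lower weight to each observation as follows:

$$w_i = K(y_i - y_r) = \exp\left\{ -\frac{(y_i - y_r)^2}{h^2} \right\}.$$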
Therefore, the weight function assigns higher weight to similar patch pairs than to dissimilar ones. The rationale behind this weighting is to make it hard to declare saliency: even when the most similar patch pairs dominate the average, the aggregated dissimilarity of a truly salient region should still be high. Put another way, we do not easily allow any region to be declared salient, and thus we reduce the likelihood of false alarms. We set the weight of the reference observation itself to $w_r = \max_{i \neq r} w_i$. This setting avoids excessive weighting of the reference observation, whose zero difference $y_r - y_r = 0$ would otherwise give it the largest weight in the average. The parameter $h$ controls the decay of the weights and is determined empirically for best performance.
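The estimation procedure above can be sketched in a few lines of Python with NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the feature matrices stand in for LARK descriptors (which are not computed here), the dissimilarity is assumed to be $y_i = 1 - \rho_i$, and the kernel is assumed to be Gaussian, $K(d) = \exp(-d^2/h^2)$. All function names are hypothetical.

```python
import numpy as np


def matrix_cosine_similarity(f1: np.ndarray, f2: np.ndarray) -> float:
    """Frobenius inner product of two feature matrices, normalized by their norms."""
    num = float(np.sum(f1 * f2))
    den = float(np.linalg.norm(f1) * np.linalg.norm(f2))  # Frobenius norms
    return num / den if den > 0 else 0.0


def saliency_from_dissimilarities(y: np.ndarray, h: float) -> float:
    """Kernel-weighted least-squares estimate of the saliency at one pixel.

    y holds the N dissimilarities y_i between the center patch and its neighbors.
    Assumption: Gaussian kernel K(d) = exp(-d^2 / h^2).
    """
    y = np.asarray(y, dtype=float)
    r = int(np.argmin(y))                    # reference: the most similar neighbor
    w = np.exp(-((y - y[r]) ** 2) / h**2)    # w_i = K(y_i - y_r)
    if y.size > 1:
        # The reference would otherwise receive K(0), the largest weight;
        # cap it at the largest weight among the other observations.
        w[r] = np.max(np.delete(w, r))
    # Closed-form minimizer of sum_i w_i * (y_i - s)^2: the weighted mean.
    return float(np.sum(w * y) / np.sum(w))


def saliency_at_pixel(center: np.ndarray, neighbors: list, h: float) -> float:
    """Hypothetical helper: compare center-patch features with each neighbor's."""
    rho = np.array([matrix_cosine_similarity(center, f) for f in neighbors])
    y = 1.0 - rho                            # assumed decreasing map g(rho) = 1 - rho
    return saliency_from_dissimilarities(y, h)
```

For example, `saliency_from_dissimilarities(np.array([0.2, 0.3, 0.9]), h=0.5)` returns about 0.29, below the plain mean of about 0.47, because the outlying dissimilarity 0.9 receives little weight relative to the two observations close to the reference. A region is therefore scored as salient only when all of its neighbors are dissimilar, and smaller values of `h` strengthen this effect.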