**Abstract**

The human visual system possesses the remarkable ability to pick out salient objects in images. Even more impressive is its ability to do the very same in the presence of disturbances; in particular, the ability persists despite noise, poor weather, and other impediments to perfect vision. Meanwhile, noise can significantly degrade the accuracy of automated computational saliency detection algorithms. In this article, we set out to remedy this shortcoming. Existing computational saliency models generally assume that the given image is clean, and a fundamental and explicit treatment of saliency in noisy images is missing from the literature. Here we propose a novel and statistically sound method for estimating saliency based on a nonparametric regression framework; we also investigate the stability of saliency models for noisy images and analyze how state-of-the-art computational models respond to noisy visual stimuli. The proposed model of saliency at a pixel of interest is a data-dependent weighted average of *dissimilarities* between a center patch around that pixel and other patches. To further enhance the accuracy in predicting human fixations and the stability to noise, we adopt a global and multiscale approach by extending the local analysis window to the entire input image and, further, to multiple scaled copies of the image. Our method consistently outperforms six other state-of-the-art models (Bruce & Tsotsos, 2009; Garcia-Diaz, Fdez-Vidal, Pardo, & Dosil, 2012; Goferman, Zelnik-Manor, & Tal, 2010; Hou & Zhang, 2007; Seo & Milanfar, 2009; Zhang, Tong, & Marks, 2008) in both noise-free and noisy cases.

*the saliency map*) representing visual saliency in that image. This saliency map has been useful in many applications, such as object detection (Rosin, 2009; Rutishauser, Walther, Koch, & Perona, 2004; Seo & Milanfar, 2010; Zhicheng & Itti, 2011), image quality assessment (Ma & Zhang, 2008; Ninassi, Le Meur, Le Callet, & Barba, 2007), action detection (Seo & Milanfar, 2009), and more.

*dissimilarities* between a center patch of the region and other patches (Figure 2). Once we have measured these dissimilarities, the problem of interest is how to aggregate them to obtain an estimate of the underlying saliency of that region. We look at this problem from an estimation theory point of view and propose a novel and statistically sound saliency model. We assume that each observed dissimilarity has an underlying true value, which is measured with uncertainty. Given these noisy observations, we estimate the underlying saliency by solving a local data-dependent weighted least squares problem. As we will see in the next section, this results in an aggregation of the dissimilarities with weights depending on a kernel function to be specified. We define the kernel function so that it gives higher weight to similar patch pairs than to dissimilar patch pairs. Giving higher weights to more similar patch pairs may seem counterintuitive at first, but this choice ensures that only truly salient objects are declared so, sparing us from too many false declarations of saliency. The proposed estimate of saliency at pixel $\mathbf{x}_j$ is defined as

$$S(\mathbf{x}_j) = \sum_{i=1}^{N} w_{ij}\, y_i,$$

where $y_i$ and $w_{ij}$ are the observed dissimilarity (to be defined shortly in the next section) and the weight for the $ij$-th patch pair, respectively.
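As a concrete sketch, the weighted-average estimate above can be written in a few lines. This is a hypothetical helper, not the authors' code; the dissimilarities and weights are assumed already computed and nonnegative:

```python
import numpy as np

def estimate_saliency(y, w):
    """Saliency as a weighted average of dissimilarities.

    y -- observed dissimilarities y_i for the patch pairs around pixel x_j
    w -- nonnegative weights w_ij (normalized internally, so any scale works)
    """
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    return float(np.sum(w * y) / np.sum(w))

# A center patch that is dissimilar to every neighbor aggregates to a
# high saliency value regardless of the weighting.
print(estimate_saliency([0.9, 0.8, 0.95], [1.0, 1.0, 1.0]))
```

With uniform weights this reduces to the plain mean; the data-dependent weights described next change the balance toward the most similar patch pairs.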

*by*

$$S(\mathbf{x}_j) = \frac{1}{\sum_{i=1}^{N} \exp(\rho_i/\tau)} = \frac{1}{\sum_{i=1}^{N} 1/y_i},$$

where $y_i = \exp(-\rho_i/\tau)$ and $\rho_i$ is the cosine *similarity* between visual features extracted from the center patch around the pixel $\mathbf{x}_j$ and its $i$-th nearby patch. This saliency model is (to within a constant) the harmonic mean of the dissimilarities $y_i$.

*as*

$$S(\mathbf{x}_j) = 1 - \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N} y_i\right),$$

where $y_i$ is the dissimilarity measure between a center patch around the pixel $\mathbf{x}_j$ and any other patch observed in the test image. This saliency model is the arithmetic mean of the $y_i$'s. Besides the use of the exponential, the important difference as compared with our approach is that they use constant weights $w_{ij} = 1/N$ for the aggregation of dissimilarities, whereas we use data-dependent weights.
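To make the contrast between the constant-weight aggregations concrete, here is a toy comparison on made-up dissimilarity values. The harmonic mean is pulled down sharply by a single similar patch pair, while the arithmetic mean is not:

```python
import numpy as np

# Toy dissimilarities: one patch pair is very similar (y = 0.1),
# the other two are dissimilar.
y = np.array([0.1, 0.5, 0.9])

arithmetic = y.mean()                # constant weights w_ij = 1/N
harmonic = len(y) / np.sum(1.0 / y)  # harmonic mean of the y_i's

print(arithmetic, harmonic)  # the harmonic mean is far below the arithmetic
```

The data-dependent weighting proposed in this article sits between these extremes: it emphasizes the most similar patch pairs, like the harmonic mean, but does so through an explicit kernel whose decay can be controlled.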

*arithmetic* aggregation based on kernel regression.

$\rho_i$ denotes the similarity between a patch centered at a pixel of interest and its $i$-th neighboring patch. Then, the *dissimilarity* is measured as a decreasing function of $\rho_i$ as follows:

$$y_i = \exp(-\rho_i).$$

The similarity function $\rho$ can be measured in a variety of ways (Rubner, Tomasi, & Guibas, 2000; Seo & Milanfar, 2009; Swain & Ballard, 1991), for instance, using the matrix cosine similarity between visual features computed in the two patches (Seo & Milanfar, 2009, 2010). For our experiments, we use the LARK features defined in Takeda, Farsiu, and Milanfar (2007), which have been shown to be robust to the presence of noise and other distortions. A detailed description of these features is given in Takeda et al. (2007) and Takeda, Milanfar, Protter, and Elad (2009). We note that the effectiveness of LARK as a visual descriptor has led to its use for object and action detection and recognition, even in the presence of significant noise (Seo & Milanfar, 2009, 2010).

From an estimation theory point of view, we assume that each observation $y_i$ is in essence a measurement of the true saliency, but measured with some error. This observation model can be posed as

$$y_i = S(\mathbf{x}_j) + \eta_i,$$

where $\eta_i$ is noise. Given these observations, we assume a locally constant model of saliency and estimate the expected saliency at pixel $\mathbf{x}_j$ by solving the weighted least squares problem

$$\hat{S}(\mathbf{x}_j) = \arg\min_{S} \sum_{i=1}^{N} K(y_r - y_i)\,(y_i - S)^2,$$

where $y_r$ is a reference observation. We choose $y_r = \min_i y_i$, where $i = 1, \ldots, N$ ranges in a neighborhood of $j$. As such, $y_r$ corresponds to the patch most similar to the patch at $j$. Depending on the difference between this reference observation $y_r$ and each observation $y_i$, the kernel function $K(\cdot)$ gives higher or lower weight to each observation as follows:

$$w_{ij} = \frac{K(y_r - y_i)}{\sum_{i=1}^{N} K(y_r - y_i)}.$$

Therefore, the weight function gives higher weight to similar patch pairs than to dissimilar patch pairs. The rationale behind this way of weighting is to avoid declaring saliency too easily; that is, the aggregation of dissimilarities for a truly salient region should remain high even when we put more weight on the most similar patch pairs. Put yet another way, we do not easily allow any region to be declared salient, and thus we reduce the likelihood of false alarms. We set the weight of the reference observation itself to $w_r = \max_{i \neq r} w_i$. This setting avoids excessive weighting of the reference observation in the average. The parameter $h$ controls the decay of the weights and is determined empirically for best performance.
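A minimal sketch of this weighting scheme follows. We assume a Gaussian kernel $K(t) = \exp(-t^2/h^2)$ and a toy value of $h$; the exact kernel shape and bandwidth are our illustrative choices, not values taken from the text:

```python
import numpy as np

def kernel_weights(y, h=0.5):
    """Data-dependent weights w_ij = K(y_r - y_i) with a Gaussian kernel.

    The reference y_r is the smallest dissimilarity (the most similar
    patch pair), and its own weight is capped at the largest of the
    other weights, as described in the text.
    """
    y = np.asarray(y, dtype=float)
    r = int(np.argmin(y))                    # reference observation y_r
    w = np.exp(-((y[r] - y) ** 2) / h ** 2)  # higher weight near y_r
    others = np.delete(w, r)
    if others.size:
        w[r] = others.max()                  # w_r = max_{i != r} w_i
    return w / w.sum()                       # normalize to sum to one

def saliency(y, h=0.5):
    """Weighted least squares solution: weighted mean of dissimilarities."""
    y = np.asarray(y, dtype=float)
    return float(np.sum(kernel_weights(y, h) * y))

# A mixed region with one very similar patch pair is pulled well below
# the plain mean, making a false saliency declaration less likely.
print(saliency([0.1, 0.5, 0.9]), np.mean([0.1, 0.5, 0.9]))
```

For a uniformly dissimilar region (all $y_i$ large and close together), the weights are nearly uniform and the estimate stays high, which is exactly the behavior the text argues for.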

$R = \{r_1, r_2, \ldots, r_M\}$ denotes the multiple scales applied to the input image (the horizontal axis in Figure 3); at each scale $r_m$, where $1 \le m \le M$, we compute the dissimilarity of the center patch relative to all patches observed in the images whose scales are $R_q = \{r_m, r_m/2, r_m/4\}$ (the vertical axis in Figure 3). Consequently, for $M$ scales, $M$ saliency maps are computed and resized to the original image size by bilinear interpolation. The resulting multiple saliency maps are then combined into one by simple averaging. Figure 4 demonstrates the difference between the saliency maps obtained at different scales: although the fine-scale result detects details such as textures and edges, the coarse-scale result detects global features. Note that we fixed the size of the patch at each scale $r_m$, as the yellow rectangle shown in Figure 3 illustrates.

$y_i = e^{-\rho(p_i, p_j)}$, where $p_i$ is the $i$-th patch observed across the multiscale pyramid and $p_j$ is the center patch at pixel $\mathbf{x}_j$. Therefore, we can rewrite the saliency equation at each scale $r_m$ as follows:

$$S_m(\mathbf{x}_j) = \sum_{i=1}^{N} w_{ij}\, y_i.$$

The saliency at pixel $\mathbf{x}_j$ is taken as the mean of its saliency across all scales:

$$S(\mathbf{x}_j) = \frac{1}{M} \sum_{m=1}^{M} S_m(\mathbf{x}_j).$$

In the next section, we first evaluate our saliency model for clean images against six existing saliency models (Bruce & Tsotsos, 2009; Garcia-Diaz et al., 2012; Goferman et al., 2010; Hou & Zhang, 2007; Seo & Milanfar, 2009; Zhang et al., 2008) and then investigate the stability of our saliency model for noisy images. We also examine the effect of the global and multiscale approach on overall performance.

$\rho$ in Equation 4 is computed using the matrix cosine similarity between the LARK features, as in the model of Seo and Milanfar (2009). We sample patches of 7 × 7 pixels with 50% overlap from the multiple scaled images. We use three scales ($M = 3$), $R = \{1.0, 0.6, 0.4\}$, and the smallest scale allowed in $R_q$ is 20% of the original size, as in Goferman et al. (2010).
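For illustration, sampling 7 × 7 patches with 50% overlap amounts to a regular grid whose stride is about half the patch size; the exact rounding convention below is our assumption:

```python
import numpy as np

def sample_patches(img, size=7, overlap=0.5):
    """Sample size x size patches on a regular grid with the given overlap."""
    step = max(1, int(round(size * (1 - overlap))))  # 7 px at 50% -> step 4
    h, w = img.shape
    patches = [img[y:y + size, x:x + size]
               for y in range(0, h - size + 1, step)
               for x in range(0, w - size + 1, step)]
    return np.stack(patches)
```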

| Model | AUC (SE) | SCC |
|---|---|---|
| Proposed method | 0.713 (0.0007) | 0.386 |
| Garcia-Diaz et al. (2012) | 0.714 (0.0008) | 0.362 |
| Seo and Milanfar (2009) | 0.696 (0.0007) | 0.346 |
| Goferman et al. (2010) | 0.686 (0.0008) | 0.405 |
| Hou and Zhang (2007) | 0.672 (0.0007) | 0.317 |
| Bruce and Tsotsos (2009) | 0.672 (0.0007) | 0.424 |
| Zhang et al. (2008) | 0.639 (0.0007) | 0.243 |

| Model | AUC (noise std 0) | AUC (noise std 5) | AUC (noise std 10) |
|---|---|---|---|
| Proposed method | 0.713 | 0.713 | 0.709 |
| Garcia-Diaz et al. (2012) | 0.714 | 0.713 | 0.710 |
| Seo and Milanfar (2009) | 0.696 | 0.695 | 0.693 |
| Goferman et al. (2010) | 0.686 | 0.686 | 0.681 |
| Hou and Zhang (2007) | 0.672 | 0.670 | 0.668 |
| Bruce and Tsotsos (2009) | 0.672 | 0.672 | 0.672 |
| Zhang et al. (2008) | 0.639 | 0.638 | 0.638 |

$\sigma^2$, which equals 0.01, 0.05, 0.1, or 0.2 (the intensity value of each pixel ranges from 0 to 1). The saliency maps computed from the noisy images are compared with the human fixations through the same procedure. One may be concerned that the human fixations used in this evaluation were recorded from noise-free images rather than from the corrupted images. However, we focus on investigating the sensitivity of computational models of visual attention subjected to visual degradations rather than on evaluating the performance in predicting human fixation data on noisy images. Therefore, we use the same human fixations to see whether the computational models achieve the same performance as in the noise-free case. Also, to the best of our knowledge, there is no publicly available fixation database for noisy images, so we resorted instead to analyzing how state-of-the-art computational models respond to noisy visual stimuli.
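The corruption step can be reproduced as below; clipping back to [0, 1] after adding noise is our assumption about the evaluation protocol, since the text only states the intensity range:

```python
import numpy as np

def add_gaussian_noise(img, var, seed=None):
    """Corrupt an image with intensities in [0, 1] by zero-mean Gaussian
    noise of variance var (e.g., 0.01, 0.05, 0.1, or 0.2)."""
    rng = np.random.default_rng(seed)
    noisy = img + rng.normal(0.0, np.sqrt(var), size=img.shape)
    return np.clip(noisy, 0.0, 1.0)  # keep intensities in the valid range
```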

*and* noisy cases simultaneously and improves on competing models over a large range of noise strengths.

*Journal of Vision*, 9(3):5, 1–24, http://www.journalofvision.org/content/9/3/5, doi:10.1167/9.3.5.

*Interdisciplinary Applied Mathematics*, vol. 34. Berlin: Springer-Verlag.

*Journal of Vision*, 8(7):13, 1–18, http://www.journalofvision.org/content/8/7/13, doi:10.1167/8.7.13.

*Image and Vision Computing*, 30, 51–64.

*IEEE International Conference on Computer Vision and Pattern Recognition*, pp. 2376–2383.

*Proceedings of IEEE Conference on Computer Vision and Pattern Recognition*, pp. 1–8.

*IEEE Transactions on Pattern Analysis and Machine Intelligence*, 20, 1254–1259.

*SPIE Conference on Computational Imaging X*, 82960U.

*Human Neurobiology*, 4, 219–227.

*IEEE International Conference on Image Processing*, pp. 3285–3288.

*Advanced Intelligent Computing Theories and Applications. With Aspects of Theoretical and Methodological Issues*, 5226, 1124–1133.

*IEEE International Conference on Image Processing*, pp. II-169–II-172.

*Spatial Vision*, 16, 125–154.

*Pattern Recognition*, 42, 2363–2371.

*International Journal of Computer Vision*, 40, 99–121.

*IEEE Conference on Computer Vision and Pattern Recognition*, 2, II-37–II-44.

*Journal of Vision*, 9(12):15, 1–27, http://www.journalofvision.org/content/9/12/15, doi:10.1167/9.12.15.

*IEEE Transactions on Pattern Analysis and Machine Intelligence*, 32, 1688–1704.

*International Journal of Computer Vision*, 7, 11–32.

*IEEE Transactions on Image Processing*, 16, 349–366.

*IEEE Transactions on Image Processing*, 18, 1958–1975.

*Vision Research*, 45, 643–659.

*Neural Networks*, 19, 1395–1407.

*Journal of Vision*, 8(7):32, 1–20, http://www.journalofvision.org/content/8/7/32, doi:10.1167/8.7.32.

*IEEE Transactions on Image Processing*, 20, 2017–2029.