Several related proposals define the saliency of visual content in similar terms. We have deferred discussion of these models to this point so that specific reference may be made to results appearing in this paper. A central element of our proposal, as mentioned briefly, is that under the assumption of a sparse representation, estimating the likelihood of local content as characterized by a sparse ensemble of cells reduces to a computationally feasible problem of many one-dimensional likelihood estimates. This point was the focus of Bruce (2004), which also suggested that this computation might be performed for any arbitrary definition of context. In Bruce (2004), qualitative results of this measure were presented for a definition of context based on the entirety of a single natural image under consideration, and for a definition based on ecological statistics in which a large set of natural images forms the likelihood estimate. Zhang, Tong, Marks, Shan, and Cottrell (
2008) have presented an analysis of the relationship between this latter definition and locations fixated by human observers. Their results show performance comparable to that obtained when the estimate is based on the image in question or on a local surround region. However, such a definition precludes context-specific determination of saliency and thus will not produce any of the behaviors associated with the various psychophysics paradigms we have considered. Zhang et al. describe a few behaviors that they suggest a context-specific model of saliency fails to capture, such as the asymmetric performance observed in visual search for a bar oriented 5 degrees from vertical among many vertical bars versus a vertical bar among many bars oriented 5 degrees from vertical, and they suggest that a likelihood based on natural image statistics is necessary to account for this effect. This statement, however, involves a significant oversight. An encoding based on ICA is optimal with respect to the statistics of the natural environment, and therefore some representation of natural image statistics is inherent in the context-specific model through the definition of the receptive fields themselves. One therefore also observes this specific asymmetry in our model: the units that respond most strongly to content oriented 5 degrees from vertical also respond to a vertically oriented edge, but the converse is not the case (or the response is weaker), as the nature of the coding dictates a smaller orientation bandwidth for vertical edges. Combined with a suppressive surround, this results in the observed performance asymmetry. The same may be said of the novelty of stimuli (e.g., Shen & Reingold, 2001; Wang, Cavanagh, & Green, 1994), assuming that familiarity with a specific character set may have as a consequence a more efficient neural representation. It is also interesting to note that, as receptive field properties (and image statistics) vary with position in the visual field, behavior in tasks for which performance is anisotropic with respect to location in the visual field might also be explained by AIM. This, however, is difficult from an implementation perspective, requiring different cell types for different locations in the visual field and an explicit model of dependencies between different cell types. There are a few additional points of interest that appear in the work of Zhang et al., which are discussed at the end of this section.

A definition that is closer to the former definition appearing in Bruce (
2004), in which the likelihood estimate is based on the content of the entirety of a single image under consideration, appears in Torralba, Oliva, Castelhano, and Henderson (
2006). In Torralba et al. (
2006), the focus is on object recognition and how context may guide fixations in the search for a specific object. They propose the following definition:
$P(O = 1, X \mid L, G)$, where $O = 1$ indicates that the object $O$ in question is present, $X$ is the location within the scene, and $L$ and $G$ are the local and global features, respectively. Via Bayes rule and excluding certain terms that appear in reformulating this definition, one arrives at an expression for saliency $S(X) = \frac{1}{P(L \mid G)}\, p(X \mid O = 1, G)$. While the focus of Torralba et al. (
2006) is on how context informs saliency in an object recognition task, through the location likelihood conditioned on the global statistics for instances in which the object appears, the formulation also yields a term that is the inverse of the likelihood of some set of local features conditioned on the global features. Torralba et al. (2006) propose that image structure is captured on the basis of global image features. These global features consist of a coarse spatial quantization of the image, and the features themselves pool content across many feature channels for each spatial location. Given this formulation, evaluation of
$P(L \mid G)$ directly is infeasible. For this reason, an estimate of $P(L \mid G)$ is computed on the basis of the joint likelihood of a vector of local features based on a model of the distribution of said features over the entire scene. The likelihood $P(L \mid G)$ is fit to a multivariate power exponential distribution, with assumptions on the form of the distribution allowing an estimate of the joint likelihood of a local feature vector. Aside from the most obvious difference between the proposal put forth in this paper and that of Torralba et al. (2006), namely computation of saliency based on local context as mediated by surround suppression versus global receptive fields, a few comments may be made regarding the relationship between the two proposals. A significant point is that in considering the joint likelihood of a set of features, one once again fails to predict a variety of the behaviors observed in the psychophysics examples. For example, the independence assumption is central to some of the psychophysics behaviors discussed, such as the distinction between pop-out and conjunction search, or the feature presence/absence asymmetry. Secondly, it is unclear how the computational machinery proposed to achieve this estimate will scale with the number of features considered, and it is likely that a local surround contains an insufficient number of samples for the required covariance matrix estimate. Therefore, it is once again the case that this proposal does not correspond to behavior observed psychophysically and also seems to prohibit computation of a measure of information in which the context is local. It should be noted that this quantity is not the main focus of the proposal of Torralba et al. (
2006) but does serve as a useful point of contrast with the proposal at hand, and it highlights the importance of sparsity for likelihood estimation when the data contributing to such an estimate are limited.

It is interesting to note that the circuitry required to implement AIM is consistent with the behavior of local surround suppression, with the implication that surround suppression may subserve saliency computation, in line with recent suggestions (Petrov & McKee, 2006). There are in fact several considerations pertaining to the form of a local surround-based density estimate that mirror the findings of Petrov and McKee. Specifically, suppression in the surround comes from features matching the effective stimulus for the cell under consideration, is spatially isotropic, is a function of relative contrast, is prominent in the periphery and absent in the fovea, and has a spatial extent that does not scale with spatial frequency. It is also interesting to note that suppression of this type is observed for virtually all types of features (Shen, Xu, & Li, 2007). It is important to note that AIM is the sole proposal that is consistent with the entire range of psychophysical results considered, and that it has a strong neural correlate in its relationship to behavior observed in the recent surround suppression literature.
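
The computational point made earlier in this section, that an independence (sparse coding) assumption reduces a joint likelihood to many one-dimensional estimates, may be illustrated with a minimal sketch. It is not the implementation used in this paper. The function names, the histogram estimator with add-one smoothing, the bin count, and the synthetic Laplacian coefficients standing in for sparse cell responses are all assumptions made for illustration only.

```python
import numpy as np


def marginal_likelihoods(center, surround, bins=32):
    """Per-feature likelihood of `center` under 1-D histograms of `surround`.

    center   : (d,) coefficient vector at the location of interest
    surround : (n, d) coefficient vectors sampled from the chosen context
    Hypothetical helper: one histogram per feature, no joint density.
    """
    n, d = surround.shape
    probs = np.empty(d)
    for k in range(d):
        column = surround[:, k]
        lo = min(column.min(), center[k])
        hi = max(column.max(), center[k])
        if hi == lo:  # degenerate feature: widen the range artificially
            hi = lo + 1.0
        edges = np.linspace(lo, hi, bins + 1)
        counts, _ = np.histogram(column, bins=edges)
        # Index of the bin that contains center[k]; clamp the right edge.
        idx = min(np.searchsorted(edges, center[k], side="right") - 1, bins - 1)
        # Add-one smoothing (an assumption) avoids log(0) for empty bins.
        probs[k] = (counts[idx] + 1.0) / (n + bins)
    return probs


def self_information(center, surround, bins=32):
    """-log of the product of marginals, i.e. a sum of d one-dimensional terms."""
    return float(-np.sum(np.log(marginal_likelihoods(center, surround, bins))))


# Synthetic stand-in for sparse responses: d features, n surround samples.
rng = np.random.default_rng(0)
d, n = 25, 200
surround = rng.laplace(size=(n, d))

typical = np.zeros(d)      # near the mode of every marginal
unusual = np.full(d, 8.0)  # far in the tail of every marginal

print("typical:", self_information(typical, surround))
print("unusual:", self_information(unusual, surround))

# A full covariance estimate over the same features would need
# d * (d + 1) / 2 free parameters, here more than the n available samples.
print("covariance parameters:", d * (d + 1) // 2, "samples:", n)
```

Under these assumptions, each of the 25 features needs only a one-dimensional histogram over the 200 surround samples, whereas a joint model with a full covariance matrix would require 325 free parameters from the same 200 samples. This is the sense in which a local surround may provide too few samples for a joint estimate.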