A proposal for saliency computation within the visual cortex is put forth based on the premise that localized saliency computation serves to maximize information sampled from one's environment. The model is built entirely on computational constraints but nevertheless results in an architecture with cells and connectivity reminiscent of that appearing in the visual cortex. It is demonstrated that a variety of visual search behaviors appear as emergent properties of the model and therefore basic principles of coding and information transmission. Experimental results demonstrate greater efficacy in predicting fixation patterns across two different data sets as compared with competing models.

*master map of locations*. The basic structure of the model is that various basic features are extracted from the scene. The distinct feature representations are then merged into a single topographical representation of saliency. In later work this representation has been termed a saliency map, accompanied by a selection process that, loosely speaking, selects the largest peak in this representation, and the *spotlight* of attention moves to the location of this peak (Koch & Ullman, 1985). In this context, the combined pooling of the basic feature maps is referred to as the saliency map, and saliency refers to the output of an operation that combines some basic set of features into a solitary representation.
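The selection scheme just described can be sketched in a few lines. The following is a minimal illustration rather than the Koch and Ullman circuit itself: feature maps are pooled into a single map, the largest peak is selected, and a crude inhibition of return suppresses the winner so the spotlight can move on. All names and parameters here are illustrative.

```python
import numpy as np

def combine_feature_maps(feature_maps):
    """Pool normalized feature maps into a single saliency map."""
    return sum((m - m.min()) / (np.ptp(m) + 1e-12) for m in feature_maps)

def select_fixations(saliency, n_fixations=3, ior_radius=2):
    """Winner-take-all selection with a crude inhibition of return."""
    s = saliency.astype(float).copy()
    fixations = []
    for _ in range(n_fixations):
        y, x = np.unravel_index(np.argmax(s), s.shape)
        fixations.append((y, x))
        # suppress a square around the winner so the spotlight moves on
        s[max(0, y - ior_radius):y + ior_radius + 1,
          max(0, x - ior_radius):x + ior_radius + 1] = -np.inf
    return fixations
```

Applied to a pair of maps sharing a single peak, the first selected location is that peak and subsequent selections move elsewhere.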

*why* the operations involved in the model have the structure that is observed and, specifically, what the overall architecture translates into, in a principled quantitative manner, with respect to its relationship to the incoming stimulus. As such, little is offered in the way of an explanation of the design principles behind the observed behavior and the structure of the system.

*why* certain components implicated in visual saliency computation behave as they do, and it also presents a novel model for visual saliency computation built on a first-principles information theoretic formulation dubbed Attention based on Information Maximization (AIM). This comprises a principled explanation for the behavioral manifestations of AIM, and the contributions of this paper include:

- A computational framework for visual saliency built on first principles. Although AIM is built entirely on computational constraints, the resulting model structure exhibits considerable agreement with the organization of the human visual system.
- A definition of visual saliency in which there is an implicit definition of context. That is, the proposed definition of visual salience is not based solely on the response of cells within a local region but on the relationship between the response of cells within a local region and cells in the surrounding region. This includes a discussion of the role that context plays in the behavior of related models.
- Consideration of the impact of principles underlying neural coding on the determination of visual saliency and visual search behavior. This includes a demonstration that a variety of visual search behaviors may be seen as emergent properties of principles underlying neural coding combined with information seeking as a visual sampling strategy.
- A demonstration that the resulting definition of visual saliency exhibits greater agreement with fixational eye movement data than existing efforts.

“Consider the very simple situation presented in Figure 1. With a modicum of effort, the reader may be able to see this as an ink bottle on the corner of a desk. Let us suppose that the background is a uniformly white wall, that the desk is a uniform brown, and that the bottle is completely black. The visual stimulation from these objects is highly redundant in the sense that portions of the field are highly predictable from other portions. In order to demonstrate this fact and its perceptual significance, we may employ a variant of the “guessing game” technique with which Shannon has studied the redundancy of printed English. We may divide the picture into arbitrarily small elements, which we “transmit” to a subject (S) in a cumulative sequence, having him guess at the color of each successive element until he is correct…. If the picture is divided into 50 rows and 80 columns, as indicated, our S will guess at each of 4,000 cells as many times as necessary to determine which of the three colors it has. If his error score is significantly less than chance [2/3 × 4,000 + 1/2(2/3 × 4,000) = 4,000], it is evident that the picture is to some degree redundant. Actually, he may be expected to guess his way through Figure 1 with only 15 or 20 errors.”
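Attneave's guessing game is easy to simulate. The sketch below builds a three-tone 50 × 80 scene loosely modeled on the description above (the layout is invented, not the actual Figure 1) and scores a naive guesser who always predicts the previous cell's color in raster order; on such a redundant image the error count falls far below the chance rate of roughly 2/3 × 4,000.

```python
import numpy as np

def make_scene(rows=50, cols=80):
    """Three-tone scene loosely like Attneave's description: wall, desk, bottle."""
    img = np.zeros((rows, cols), dtype=int)  # 0 = white wall
    img[30:, :] = 1                          # 1 = brown desk
    img[15:30, 35:45] = 2                    # 2 = black bottle
    return img

def guessing_errors(img):
    """First-guess errors for a guesser who predicts the previous
    cell's color; a redundant image yields far fewer errors than
    the chance rate of about two thirds of the cells."""
    flat = img.ravel()
    return int(np.sum(flat[1:] != flat[:-1]))
```

Errors occur only at color transitions, so the score directly measures how predictable each cell is from its predecessor.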

*i, j* in the scene, the response of various learned filters with properties reminiscent of V1 cortical cells is computed. This stage may be thought of as measuring the response of various cortical cells coding for content at each individual spatial location and corresponds roughly to Gabor-like cells that respond to oriented structure within a specific spatial frequency band, along with color opponent cells. This yields, for each local neighborhood of the scene, a set of coefficients *C*_{i,j} that may be assumed mutually independent. Operations involved in this stage are depicted in Figure 3b; a description of these operations, along with discussion of the assumption of mutual independence, follows the overview description of the model. More specific details may also be found in 1.

*C*_{i,j,k} (with *i, j* corresponding to the position of the local neighborhood) of the image is characterized by several coefficients *a*_{k} corresponding to the various basis filters that code for that location. Let us consider one of these coefficients that, choosing an arbitrary example, might correspond to the presence of edge content at a specific orientation and spatial frequency at that location. In a larger region *S*_{i,j,k} surrounding the location in question, one also has, for each spatial location in the surround, a single coefficient corresponding to this same filter type. Considering all spatial locations in the surround, the coefficients corresponding to the filter in question form a distribution (based on a non-parametric or histogram density estimate) that may be used to predict the likelihood of the response of the coefficient in question for *C*_{i,j,k}. For computational parsimony, the definition of surround in the simulations shown is such that each pixel in the image contributes equally to the density estimate, which is performed with a 1000-bin histogram; the number of bins is chosen to lie in a range where the likelihood estimate is insensitive to a change in the number of bins. That said, the proposal is amenable to computation based on a local surround, and results concerning the quantitative evaluation are included based on such a definition. It is worth noting that in the presence of the sort of parallel hardware with which the brain is equipped, the computation of a likelihood estimate based on the local surround is highly efficient. For more discussion related to this issue, the reader may refer to the section on related literature and 1.
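A histogram likelihood of this kind is straightforward to compute. The sketch below, under the parsimonious assumption above that the surround is the entire image, estimates a 1000-bin density over one filter's coefficient map and reads off a per-location likelihood; the function name is ours, not the paper's.

```python
import numpy as np

def coefficient_likelihood(coeffs, n_bins=1000):
    """Histogram density estimate over one filter's coefficients.

    `coeffs` holds the response of a single filter type at every
    spatial location (here the surround is the whole image, with each
    pixel contributing equally). Returns the likelihood of each
    response under the distribution formed by all responses."""
    hist, edges = np.histogram(coeffs, bins=n_bins, density=True)
    width = edges[1] - edges[0]
    idx = np.clip(np.digitize(coeffs, edges) - 1, 0, n_bins - 1)
    return hist[idx] * width  # probability mass of each response's bin
```

A rare coefficient value lands in a sparsely populated bin and so receives a low likelihood, which is exactly what drives the self-information computation that follows.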

*p*(*x*) is then given by −log(*p*(*x*)). Note that this is equivalent to the sum of the self-information of the individual cell responses. The resulting information map depicts the saliency attributed to each spatial location based on the Shannon information associated with the joint likelihood of all filters in the cortical column. An additional point of interest is that the depiction in what appears as a saliency map can then be thought of as the average Shannon self-information of cells across a cortical column corresponding to content appearing at each spatial location. It should be noted, however, that the saliency-related computation takes place at the level of a single cell, which is an important consideration in addressing different architectures concerning how attentional selection is achieved; this issue is considered in the Discussion section.

*C*_{k}, multiplication of the local pixel matrix with the unmixing matrix produces a set of coefficients corresponding to the relative contribution of the various basis functions in representing the local neighborhood. These coefficients may be thought of as the responses of V1-like cells across a cortical column corresponding to the location in question.
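This projection is a single matrix multiplication. In the sketch below the basis is random rather than learned from natural image patches, so both matrices are stand-ins for the real ICA mixing/unmixing pair, and the basis count of 54 is illustrative; only the 31 × 31 × 3 neighborhood size follows the setup described in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for a learned ICA basis: in AIM the mixing
# matrix comes from training on natural image patches; here it is
# random, purely to show the shape of the computation.
patch_dim, n_basis = 31 * 31 * 3, 54            # RGB neighborhood; basis count is illustrative
A = rng.standard_normal((patch_dim, n_basis))   # mixing matrix (columns = basis functions)
W = np.linalg.pinv(A)                           # unmixing matrix

patch = rng.standard_normal(patch_dim)  # a flattened local pixel neighborhood
coeffs = W @ patch                      # responses across a "cortical column"
```

Because W is the pseudoinverse of A, unmixing inverts mixing exactly for any coefficient vector in the span of the basis.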

*f* noise and that inhibition of return proceeds according to a very coarse representation of past fixations. An important element of this work lies in showing that target search appears to operate by maximizing the information about the location of the target in its choice of fixations. Another effort that leans more toward a stimulus-driven approach, in the sense that there is no specific target, is that of Renninger, Verghese, and Coughlan (2007). The task involved determining whether the silhouette of a particular shape matched a subsequently presented silhouette. Eye movements were tracked during the presentation to observe the strategy underlying the selection of fixations. Renninger et al. demonstrate that the selection of fixation points proceeds according to a strategy of minimizing local uncertainty, which equates to maximizing information if information is equated with local entropy. This will typically correspond to regions of the shape silhouette that contain several edgelets of various orientations. In agreement with the work of Najemnik and Geisler, it was found that there is little benefit to the optimal integration of information across successive fixations. Mechanisms for gain control at the level of a single neuron have also been observed that correspond to a strategy based on information maximization (Brenner, Bialek, & de Ruyter van Steveninck, 2000). Although the proposal put forth in this paper is distinct from a description that involves sequences of fixations, the search for a specific target, or specific task conditions, it is nevertheless encouraging that there do appear to be mechanisms at play in visual search that serve to maximize some measure of information in sampling; the findings of these studies may be viewed as complementary to our proposal rather than conflicting.

*σ* is chosen to approximate the drop-off in visual acuity moving peripherally from the center of the fovea, based on the viewing distance of participants in the experiment. The density map then comprises a measure of the extent to which each pixel of the image is sampled on average by a human observer, based on the observed fixations. This affords a representation whose similarity to a saliency map may be assessed at a glance. Quantitative performance evaluation follows the procedure of Tatler et al. (2005): the saliency maps produced by each algorithm are treated as binary classifiers for fixated versus non-fixated pixel locations. Sweeping over several thresholds for each saliency map yields an ROC curve for each algorithm, and an overall quantitative performance score is given by the area under the ROC curve. For a further explanation of this method, refer to Tatler et al. (2005).
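The scoring just described reduces to computing an ROC area. The sketch below uses the rank-sum (Mann–Whitney) identity, which is equivalent to sweeping thresholds and integrating the ROC curve when there are no tied saliency values; it is our minimal reimplementation, not the evaluation code used in the paper.

```python
import numpy as np

def roc_auc(saliency, fixation_mask):
    """Area under the ROC curve for a saliency map treated as a binary
    classifier of fixated vs. non-fixated pixels (Tatler et al., 2005
    style scoring)."""
    pos = saliency[fixation_mask]
    neg = saliency[~fixation_mask]
    # rank-sum formulation of AUC; assumes no ties between pos and neg
    ranks = np.argsort(np.argsort(np.concatenate([pos, neg])))[:len(pos)] + 1
    return (ranks.sum() - len(pos) * (len(pos) + 1) / 2) / (len(pos) * len(neg))
```

A map whose fixated pixels all outrank the non-fixated ones scores 1.0; a map with the opposite ordering scores 0.0, and chance performance sits at 0.5.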

*Warmer* colors are more salient; this scale is used in all examples, scaled between the maximum and minimum saliency values across all examples within an experiment. As can be seen in Figure 8, the target saliency relative to distractor saliency is very high in the first two cases, but target saliency is indistinguishable from that of the distractors in the third case, suggesting no guidance toward the target item and hence requiring a visit of items in serial order. Thus, the distinction between a serial and a parallel search is an emergent property of assuming a sparse representation and saliency based on information maximization. Since the learned feature dimensions are mutually independent, the likelihood is computed independently for uncorrelated feature domains, implying an unlikely stimulus for a singleton defined by a single feature dimension but roughly equal likelihood for a target defined by a conjunction. This behavior, seen through the eyes of AIM, is then a property of a system that seeks to model redundancy in natural visual content and, in doing so, to overcome the computational complexity of probability density estimation. An additional example of a conjunction search is featured in Figure 9: the small, rotated, and red 5's are easily spotted, but finding the 2 requires further effort. It is worth noting that this account of visual search has been revised to some extent, with more recent experiments demonstrating an entire continuum of search slopes ranging from very inefficient to very efficient (Wolfe, 1998). This consideration is also supported by AIM, as more complex stimuli that give rise to a distributed representation may yield very different ratios of target versus distractor saliency.
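The serial/parallel distinction can be illustrated with a toy display. In the sketch below (our own toy analogue, not the AIM implementation), each item carries two discrete features, per-feature likelihoods are estimated from the display itself, and item saliency is the summed self-information: a feature singleton stands out, while a conjunction target does not.

```python
import numpy as np
from collections import Counter

def item_saliency(items):
    """Summed self-information of each item under independent
    per-feature likelihoods estimated over the whole display."""
    n = len(items)
    counts = [Counter(f) for f in zip(*items)]  # one Counter per feature dimension
    return np.array([-sum(np.log(counts[d][v] / n)
                          for d, v in enumerate(item)) for item in items])

# Feature singleton: one red item among greens -> the target pops out.
singleton = [("red", "vertical")] + [("green", "vertical")] * 9
s = item_saliency(singleton)

# Conjunction: red-vertical target among red-horizontal and
# green-vertical distractors -> each feature value is common.
conjunction = ([("red", "vertical")]
               + [("red", "horizontal")] * 5
               + [("green", "vertical")] * 4)
c = item_saliency(conjunction)
```

In the singleton display the target (index 0) is by far the most salient item; in the conjunction display its saliency does not exceed that of the distractors, mirroring the serial-search prediction.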

*ecological statistics* in which a large set of natural images forms the likelihood estimate. Zhang, Tong, Marks, Shan, and Cottrell (2008) have presented an analysis of the relationship between this latter definition and locations fixated by human observers. The results they present show performance comparable to results for which the estimation is based on the image in question or on a local surround region. However, such a definition precludes the possibility of context-specific determination of saliency and thus will not produce any of the behaviors associated with the various psychophysics paradigms we have considered. There are a few behaviors that Zhang et al. describe which, they suggest, a context-specific model of saliency fails to capture, such as the asymmetric performance observed in a visual search for a bar oriented 5 degrees from vertical among many vertical bars versus a bar oriented vertically among many bars oriented 5 degrees from vertical, with the suggestion that a likelihood based on natural image statistics is necessary to account for this effect. There is, however, a significant oversight associated with this statement. An encoding based on ICA is optimal with respect to the statistics of the natural environment, and therefore some representation of natural image statistics in general is inherent in the context-specific model in the definition of the receptive fields themselves. Therefore one also observes this specific asymmetry in the context of our model, as the units that respond most strongly to content oriented 5 degrees from vertical also respond to a vertically oriented edge, but the converse is not the case (or the response is weaker), as the nature of the coding dictates a smaller orientation bandwidth for vertical edges. Combined with a suppressive surround, this results in the observed performance asymmetry.
The same may be said of stimulus novelty (e.g., Shen & Reingold, 2001; Wang, Cavanagh, & Green, 1994), assuming familiarity with a specific character set has as a consequence a more efficient neural representation. It is also interesting to note that, as receptive field properties (and image statistics) vary with position in the visual field, behavior in tasks for which performance is anisotropic with respect to location in the visual field might also be explained by AIM. This, however, is a difficult issue from an implementation perspective, requiring different cell types for different locations in the visual field and an explicit model of dependencies between different cell types. There are a few additional points of interest that appear in the work of Zhang et al., which are discussed at the end of this section. A definition closer to the former definition appearing in Bruce (2004), in which the likelihood estimate is based on the content of the entirety of the single image under consideration, appears in Torralba, Oliva, Castelhano, and Henderson (2006). In Torralba et al. (2006), the focus is on object recognition and how context may guide fixations in the search for a specific object. They propose the following definition:

*P*(*O* = 1, *X* ∣ *L, G*), where *O* = 1 indicates that the object *O* in question is present, *X* is the location within the scene, and *L* and *G* are the local and global features, respectively. Via Bayes' rule, and excluding certain terms that appear in reformulating this definition, one arrives at an expression for saliency *S*(*x*) = 1/*p*(*L* ∣ *G*) · *p*(*X* ∣ *O* = 1, *G*). While the focus of Torralba et al. (2006) is on how context informs saliency within an object recognition task, given by the location likelihood conditioned on the global statistics for instances in which the object appears, the formulation also results in a term that is the inverse function of the likelihood of some set of local features conditioned on the global features. In the model of Torralba et al. (2006), image structure is captured on the basis of global image features. These global features consist of a coarse spatial quantization of the image, and the features themselves pool content across many feature channels for each spatial location. Given this formulation, evaluation of *P*(*L* ∣ *G*) directly is infeasible. For this reason, an estimate of *P*(*L* ∣ *G*) is computed on the basis of the joint likelihood of a vector of local features based on a model of the distribution of said features over the entire scene. The likelihood *P*(*L* ∣ *G*) is fit to a multivariate power exponential distribution, with assumptions on the form of the distribution allowing an estimate of the joint likelihood of a local feature vector. Aside from the most obvious difference between the proposal put forth in this paper and that appearing in Torralba et al. (2006) (that being computation of saliency based on local context as mediated by surround suppression, versus global receptive fields), there are a few comments that may be made in regard to the relationship to the proposal put forth in this paper. A significant point is that in considering the joint likelihood of a set of features, one once again fails to predict a variety of the behaviors observed in the psychophysics examples. For example, the independence assumption is central to some of the psychophysics behaviors discussed, such as the distinction between a pop-out and a conjunction search, or the feature presence/absence asymmetry. Secondly, it is unclear how the computational machinery proposed to achieve this estimate will scale with the number of features considered, and it is likely that a local surround contains an insufficient number of samples for the required covariance matrix estimate. Therefore, it is once again the case that this proposal does not correspond to behavior observed psychophysically and also seems to prohibit computation of a measure of information in which the context is local. It should be noted that this quantity is not the main focus of the proposal of Torralba et al. (2006) but does serve as a useful point of contrast with the proposal at hand and highlights the importance of sparsity for likelihood estimation in a case where the data contributing to such an estimate are limited.
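The sample-complexity point above is easy to make concrete. Under the assumption of, say, 54 feature dimensions but only 30 surround samples (both numbers illustrative), the sample covariance needed for any joint parametric fit is necessarily singular, while per-dimension estimates from the same samples remain well defined:

```python
import numpy as np

rng = np.random.default_rng(1)

d, n = 54, 30  # more feature dimensions than surround samples (illustrative)
samples = rng.standard_normal((n, d))

# With n < d the sample covariance is rank deficient, so a joint
# parametric fit (e.g., multivariate power exponential) over a small
# local surround is ill-posed...
cov = np.cov(samples, rowvar=False)
rank = np.linalg.matrix_rank(cov)

# ...whereas d independent one-dimensional estimates remain perfectly
# well defined from the very same samples.
marginal_var = samples.var(axis=0)
```

This is the sense in which sparsity (independence across features) rescues density estimation when the data contributing to the estimate are limited.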
It is interesting to note that the circuitry required to implement AIM is consistent with the behavior of local surround suppression, with the implication that surround suppression may subserve saliency computation, in line with recent suggestions (Petrov & McKee, 2006). There are in fact several considerations pertaining to the form of a local surround-based density estimate that mirror the findings of Petrov and McKee. Specifically, suppression in the surround comes from features matching the effective stimulus for the cell under consideration, is spatially isotropic, is a function of relative contrast, is prominent in the periphery and absent in the fovea, and does not scale in spatial extent with spatial frequency. It is also interesting to note that suppression of this type is observed for virtually all types of features (Shen, Xu, & Li, 2007). It is important to note that AIM is the sole proposal that is consistent with the entire range of psychophysical results considered and has a strong neural correlate in its relationship to behavior observed in the recent surround suppression literature.

*forest before trees* priority in visual perception appears to be general to virtually any category of stimulus, including the perception of words preceding that of letters (Johnston & McClelland, 1974) and scene categories being perceived more readily than objects (Biederman, Rabinowitz, Glass, & Stacy, 1974), in addition to a more general global precedence effect as demonstrated by Navon (1977). As a whole, the behavioral studies that observe early access to general abstract quantities prior to more specific simple properties such as location seem to support an attentional architecture that consists of a hierarchical selection mechanism with higher visual areas orchestrating the overall selection process. Further evidence arrives in the form of studies that observe pop-out of high-level features such as depth from shading (Ramachandran, 1988), facial expressions (Ohman, Flykt, & Esteves, 2001), 3D features (Enns & Rensink, 1990), perceptual groups (Bravo & Blake, 1990), surface planes (He & Nakayama, 1992), and parts and wholes (Wolfe, Friedman-Hill, & Bilsky, 1994). As mentioned, the important property that many of these features may share is an efficient cortical representation. Furthermore, pop-out of simple features may be observed for features that occupy regions far greater than the receptive field size of cells in early visual areas. It is unclear, then, how a pooled representation in the form of a saliency map mediating spatial selection can explain these behaviors unless one assumes that it comprises a pooled representation of activity from virtually every visual area. The only requirement on the neurons involved is sparsity, and it may be assumed that such computation acts throughout the visual cortex, with localized saliency computation observed at every layer of the visual hierarchy, in line with more general models of visual attention (Desimone & Duncan, 1995; Tsotsos et al., 1995).
There also exists considerable neurophysiological support in favor of this type of selection architecture. In particular, the response of cells in early visual areas appears to be affected by attention at a relatively late time course relative to higher visual areas (Martínez et al., 1999; Nobre et al., 1997; Roelfsema, Lamme, & Spekreijse, 1998), and furthermore the early involvement of higher visual areas in attention-related processing is consistent with accounts of object-based attention (Tipper & Behrmann, 1996; Somers, Dale, Seiffert, & Tootell, 1999).

*basic* features, an important consideration is the scale at which analysis is performed with regard to conclusions that emerge from the proposal. It is evident that varying the extent of surround suppression might have some effect on the pop-out observed in a case such as that appearing in Figure 9. Under the assumption of a hierarchical representation in which features are represented at each layer with increasing receptive field and surround extent, one has a definition that is less sensitive to scale (for example, in embedding AIM within the hierarchical selective attention architecture of Tsotsos et al., 1995). It is also worth noting that in order to explain a result such as that of Enns and Rensink (1990), whereby targets defined by unique 3D structure pop out, or that of Ramachandran (1988), whereby shape defined by shading results in pop-out, the global definition proposed by Torralba et al. (2006) would require a summary representation of more complex types of features, such as 3D structure or shape from shading, based on global receptive fields. These considerations raise questions for any definition of saliency in which the determination is based on global scene statistics. The case is even stronger if one considers pop-out effects associated with faces, although there remains some contention that demonstrations of pop-out effects associated with faces are a result of confounds associated with simpler features (Hershler & Hochstein, 2005, 2006). As a whole, a hierarchical representation of salience based on a local judgment of information is the only account that appears to be consistent with the entire range of effects described.

*missing link* in observing pop-out behaviors that appear within models that posit a distributed strategy for attentional selection, a subset of attention models for which favorable evidence is mounting. The proposal is shown to agree with a broad range of psychophysical results and allows the additional possibility of simulating apparently high-level pop-out behaviors. Finally, the model demonstrates considerable efficacy in explaining fixation data for two qualitatively different data sets, demonstrating the plausibility of a sampling strategy based on information seeking as put forth in this paper.

*a*_{i,j,k}. For each type of cell *k* centered at location *i, j*, one may then proceed to estimate the likelihood of *a*_{i,j,k} based on the response of other cells of type *k* within the surround.

*i, j* pixel location in the image, the 31 × 31 neighborhood *C*_{k} centered at *i, j* is projected onto the learned basis, yielding a set of basis coefficients *a*_{i,j,k} for the basis functions *B*_{i,j,k} corresponding to the response of various cells whose receptive field center is at *i, j*. In order to evaluate the likelihood of any given *p*(*B*_{i,j,k} = *a*_{i,j,k}), it is necessary to observe the values taken on by *B*_{u,v,k}, whereby *u, v* define cells that surround *i, j*, giving its context. For computational parsimony, as in Bruce and Tsotsos (2006), we have considered *u, v* over the entire scene, where each pixel location contributes equally to the likelihood estimate. It is possible to derive more local estimates, albeit with more computation required in the absence of the sort of parallel processing hardware with which the brain is equipped. This yields, for each *B*_{u,v,k}, a distribution of firing rates taken on by cells of type *k*. Computation is based on a 1000-bin histogram density estimate. It is worth noting that although computation based on a local surround is much more computationally intensive, requiring a density estimate for every spatial location, this process is very efficient when performed with the sort of parallel computation the brain performs. Although the computation of self-information for each *C*_{k} based on a local definition of the surround *S*_{k} becomes quite cumbersome for a 31 × 31 central receptive field, we also carried out a series of experiments for a 21 × 21 receptive field size, with ICA components learned via the Jade algorithm preserving 95% variance, and performed an estimate for each *C*_{k} with a local definition of *S*_{k}. An exhaustive determination of performance based on local center and surround regions proves computationally prohibitive. Thus, we have selected a single sensible choice of these parameters for analysis, motivated by biological observations. The specific choice of parameters for this analysis is based on the data appearing in Figure 7 of Petrov and McKee (2006), corresponding to a drop-off in surround modulation by a factor of approximately 200 over approximately 5 degrees of visual angle, with this drop-off fit to a Gaussian. In this condition, the density estimation is based on a Gaussian kernel with standard deviation of 0.01. These conditions yield an ROC score of 0.762 ± 0.0085 for the same experimental conditions reflected in Figure 5. As a basis for comparison, a global histogram density estimate for a 21 × 21 window size yields an ROC score of 0.768 ± 0.0086. It is perhaps therefore reasonable to assume that a saliency determination for a local surround estimate based on a 31 × 31 neighborhood might produce a score similar to the determination appearing in Figure 5. In all cases, these scores are significantly greater than those produced by the algorithm put forth in Itti et al. (1998). There are also two additional salient points that might be made with respect to the preceding discussion. First, it is quite likely that with some investigation of different surround extents, richer (e.g., non-Gaussian) models of the surround drop-off, or indeed a multi-scale approach, one might improve performance. This, however, would require a lengthy computational evaluation and may yield little in terms of contributing to the theory of AIM as a model of human saliency determination.
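A local surround estimate of the kind just evaluated can be sketched directly. The following brute-force version is our own sketch (the paper's implementation differs): each surround coefficient is weighted by a Gaussian of spatial distance, in the spirit of the Petrov and McKee drop-off, and a Gaussian kernel of standard deviation 0.01 is applied in coefficient value; coefficients are assumed pre-normalized to [0, 1] so that this bandwidth is meaningful.

```python
import numpy as np

def local_likelihood(coeff_map, sigma_space=20.0, sigma_kernel=0.01):
    """Likelihood of each coefficient under a density estimated from a
    Gaussian-weighted local surround.

    Brute force O(N^2) over pixel pairs -- fine for a sketch,
    prohibitive at scale without parallel hardware."""
    h, w = coeff_map.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pos = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)
    vals = coeff_map.ravel()
    # spatial weights: surround contribution drops off as a Gaussian
    d2 = ((pos[:, None, :] - pos[None, :, :]) ** 2).sum(-1)
    w_space = np.exp(-d2 / (2 * sigma_space ** 2))
    np.fill_diagonal(w_space, 0.0)  # a cell does not predict itself
    w_space /= w_space.sum(axis=1, keepdims=True)
    # Gaussian kernel density in coefficient value
    k = np.exp(-((vals[:, None] - vals[None, :]) ** 2) / (2 * sigma_kernel ** 2))
    return (w_space * k).sum(axis=1).reshape(h, w)
```

A coefficient unlike its spatial neighbors receives a near-zero likelihood, and hence high self-information, which is the behavior the surround-suppression account requires.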
Second, it is worth noting that the number of samples available for an estimate based on a local surround is quite impoverished relative to a more global surround, making independent neural responses increasingly important and also raising the question of whether the distribution for a region as local as that involved in surround suppression can be adequately fit by a power exponential distribution (as in Torralba et al., 2006). Given simple local circuitry (e.g., Bruce & Tsotsos, 2006), it is possible to compute an estimate of the distribution without the need to assume a specific parametric form.

*C*_{k}, we are interested in the quantity *p*(*B*_{i,j} = *a*_{i,j}) evaluated for all *k* = 1…*N* features, or more specifically *p*(*B*_{i,j,1} = *a*_{i,j,1}, …, *B*_{i,j,N} = *a*_{i,j,N}). Owing to the independence assumption afforded by a sparse representation (ICA), we can instead evaluate the computationally tractable quantity ∏_{k=1…N} *p*(*B*_{i,j,k} = *a*_{i,j,k}). The self-information attributed to *C*_{k} is then given by −log(∏_{k=1…N} *p*(*B*_{i,j,k} = *a*_{i,j,k})), which may also be computed as −∑_{k=1…N} log(*p*(*B*_{i,j,k} = *a*_{i,j,k})). The resulting self-information map is convolved with a Gaussian envelope, with the fit corresponding to the observed drop-off in visual acuity, affording a sense of how much total information is gained in making a saccade to a target location. Note that this has little effect on the resulting scores derived from the quantitative assessment but accounts for clustering or center-of-mass types of saccade targeting (Coren & Hoenig, 1972; Shuren, Jacobs, & Heilman, 1997) and allows direct qualitative comparison with the experimental density maps.
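Putting the pieces together, this final computation amounts to a per-location sum of −log likelihoods followed by a Gaussian blur standing in for the acuity drop-off. A compact sketch (array shapes, σ, and names are illustrative assumptions, not the paper's parameters):

```python
import numpy as np

def gaussian_kernel(sigma, radius=None):
    """Normalized 1-D Gaussian kernel."""
    radius = radius or int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    g = np.exp(-x ** 2 / (2 * sigma ** 2))
    return g / g.sum()

def aim_map(likelihoods, sigma=8.0):
    """Per-location self-information summed across independent filters,
    then blurred with a separable Gaussian standing in for the foveal
    acuity drop-off. `likelihoods` has shape (n_filters, H, W)."""
    info = -np.log(likelihoods + 1e-12).sum(axis=0)
    g = gaussian_kernel(sigma)
    # separable 'same'-size convolution: rows, then columns
    blurred = np.apply_along_axis(lambda r: np.convolve(r, g, mode="same"), 1, info)
    blurred = np.apply_along_axis(lambda c: np.convolve(c, g, mode="same"), 0, blurred)
    return blurred
```

The blur leaves an isolated low-likelihood location as the map's peak while spreading its mass, which is what permits the center-of-mass saccade-targeting comparison described above.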