Inspired by the primate visual system, computational saliency models decompose visual input into a set of feature maps across spatial scales in a number of pre-specified channels. The outputs of these feature maps are summed to yield the final saliency map. Here we use a least-squares technique to learn the weights associated with these maps from subjects freely fixating natural scenes drawn from four recent eye-tracking data sets. Depending on the data set, the weights can be quite different, with the face and orientation channels usually more important than the color and intensity channels. Inter-subject differences are negligible. We also model a bias toward fixating at the center of images and consider both time-varying and constant factors that contribute to this bias. To compensate for the inadequacy of the standard method of judging performance (area under the ROC curve), we use two additional metrics to assess performance comprehensively. Although our model retains the basic structure of the standard saliency model, it outperforms several state-of-the-art saliency algorithms. Furthermore, the simple structure makes the results applicable to numerous studies in psychophysics and physiology and leads to an extremely easy implementation for real-world applications.

Following the *Feature Integration Theory* of Treisman and Gelade (1980) and the proposal by Koch and Ullman (1985) for a map in the primate visual system that encodes the extent to which any location in the field of view is conspicuous or salient, based on bottom-up, task-independent factors, a series of ever more refined algorithms has been designed to predict where subjects will fixate in synthetic or natural scenes (Einhäuser, Spain, & Perona, 2008; Foulsham & Underwood, 2008; Itti, Koch, & Niebur, 1998; Oliva, Torralba, Castelhano, & Henderson, 2003; Parkhurst, Law, & Niebur, 2002; Walther, Serre, Poggio, & Koch, 2005). In these models (Itti & Koch, 2000; Itti et al., 1998; Parkhurst et al., 2002), low-level attributes such as color, intensity, and orientation are combined to yield maps through center–surround filtering at numerous spatial scales. Subsequently, Einhäuser et al. (2006) and Krieger, Rentschler, Hauske, Schill, and Zetzsche (2000) suggested incorporating higher order statistics to fill some of the gaps between the predictive powers of current saliency map models. One way of doing this is by adding more semantic feature channels such as faces or text into the saliency map, which significantly improves the accuracy of prediction (Cerf, Frady, & Koch, 2009; Einhäuser et al., 2008). The extent to which such bottom-up, task-independent saliency maps predict human fixational eye movements under free-viewing conditions remains under active investigation (Donk & Zoest, 2008; Foulsham & Underwood, 2008; Masciocchi, Mihalas, Parkhurst, & Niebur, 2009). Bottom-up saliency has also been adopted (Chikkerur, Serre, Tan, & Poggio, 2010; Navalpakkam & Itti, 2005; Rutishauser & Koch, 2007) to mimic top-down searches. Here, however, we consider only task-independent scrutiny of images, as might occur when people are gazing at a scene without looking for anything in particular.

*Linear summation* of feature channels into the final saliency map remains the norm (Cerf et al., 2009; Harel, Koch, & Perona, 2007; Itti & Baldi, 2006; Itti et al., 1998). Linear summation has some psychophysical support (Nothdurft, 2000) and is simple to apply. However, Koene and Zhaoping (2007) and Li (2002) have raised psychophysical arguments against linear summation strategies. In addition, prior work (Itti, 2005; Peters, Iyer, Itti, & Koch, 2005) has recognized that different features contribute with different strengths to perceptual salience. Here we investigate the importance of different bottom-up features in driving gaze allocation, including inter-subject variability, by learning an optimal set of feature weights using a constrained linear least-squares algorithm, and we perform a quantitative analysis on four recent eye movement data sets (Bruce & Tsotsos, 2009; Cerf et al., 2009; Judd, Ehinger, Durand, & Torralba, 2009; Subramanian, Katti, Sebe, Kankanhalli, & Chua, 2010).

In the *FIFA data set* (Cerf et al., 2009), fixation data were collected from 8 subjects performing a 2-s-long free-viewing task on 180 color natural images (28° × 21°). They were asked to rate, on a scale of 1 through 10, how interesting each image was. Scenes were indoor and outdoor still images in color. Images include faces in various skin colors, age groups, genders, positions, and sizes.

The *Toronto database* (Bruce & Tsotsos, 2009) contains data from 11 subjects viewing 120 color images of outdoor and indoor scenes. Participants were given no particular instructions except to observe the images (32° × 24°), 4 s each. One distinction between this data set and the FIFA data set (Cerf et al., 2009) is that a large portion of the images here do not contain particular regions of interest, while in the FIFA data set most images contain very salient regions (e.g., faces or salient nonface objects).

The *MIT database* (Judd et al., 2009) is the largest one in the community. It includes 1003 images collected from *Flickr* and *LabelMe*. Eye movement data were recorded from 15 users who free-viewed these images (36° × 27°) for 3 s. A memory test motivated subjects to pay attention to the images: they looked at 100 images and needed to indicate which ones they had seen before.

The *NUS database* recently published by Subramanian et al. (2010) includes 758 images containing semantically affective objects/scenes such as expressive faces, nudes, unpleasant concepts, and interactive actions. Images are from *Flickr*, *Photo.net*, *Google*, and the *emotion-evoking IAPS* (Lang, Bradley, & Cuthbert, 2008). In total, 75 subjects free-viewed (26° × 19°) part of the image set for 5 s each (each image was viewed by an average of 25 subjects).

For subject *i* viewing image *j*, assuming that each fixation gives rise to a Gaussian-distributed activity, all gaze data are represented as the recorded fixations convolved with an isotropic Gaussian kernel $K_G$ as

$$\alpha \sum_{k=1}^{f} K_G\!\left(\frac{\mathbf{x} - \mathbf{x}_k}{h}\right),$$

where $\mathbf{x}$ denotes the 2D image coordinates, $\mathbf{x}_k$ represents the image coordinates of the *k*th fixation, and *f* is the number of fixations. The bandwidth of the kernel, *h*, is set to approximate the size of the fovea, and *α* normalizes the map. An example of a fixation map is shown in Figure 2b. Note that the first fixation of each image is not used as it is always the center of the image.
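As an illustration of the fixation map just described, the following is a minimal sketch rather than the authors' implementation. It assumes pixel coordinates on a small grid, a Gaussian kernel of bandwidth `h`, and a normalization in which *α* rescales the map to sum to one; the function name `fixation_map` and the toy fixations are hypothetical.

```python
import math

def fixation_map(fixations, width, height, h):
    """Sum an isotropic Gaussian kernel of bandwidth h over fixations, then normalize."""
    grid = [[0.0] * width for _ in range(height)]
    for fx, fy in fixations:
        for y in range(height):
            for x in range(width):
                d2 = (x - fx) ** 2 + (y - fy) ** 2
                grid[y][x] += math.exp(-d2 / (2.0 * h * h))
    total = sum(map(sum, grid))
    # Assumed normalization: alpha = 1 / total, so the map sums to one.
    return [[v / total for v in row] for row in grid]

# Toy scanpath (x, y); the first fixation is the image center and is dropped,
# as the text describes.
recorded = [(4, 4), (2, 3), (6, 5)]
fmap = fixation_map(recorded[1:], width=9, height=9, h=1.5)
```

The resulting map peaks at the two retained fixations and falls off smoothly around them.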

*c* = {2, 3, 4}, surround level *s* = *c* + *δ*, where *δ* = {2, 3}). A single *conspicuity map* for each of the color, intensity, and orientation feature channels is built through across-scale addition of the center–surround difference maps and is represented at scale 4 (Figure 2c). For the face channel, the conspicuity map is generated by running the Viola and Jones (2001) face detector. Although different from early visual features such as color, intensity, and orientation, faces attract attention strongly and rapidly, independent of task; therefore, the face channel is also considered part of the bottom-up saliency pathway (Cerf et al., 2009).
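The center and surround levels quoted above determine how many center–surround difference maps enter each conspicuity map. The short sketch below only enumerates the (center, surround) scale pairs implied by *c* = {2, 3, 4} and *δ* = {2, 3}; the variable names are illustrative.

```python
centers = [2, 3, 4]   # center levels c
deltas = [2, 3]       # offsets delta, so that s = c + delta

# One center-surround difference map per (c, s) pair.
pairs = [(c, c + d) for c in centers for d in deltas]
print(pairs)  # → [(2, 4), (2, 5), (3, 5), (3, 6), (4, 6), (4, 7)]
```

Under these settings, six difference maps per feature are added across scales to form the conspicuity map.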

For each image location $\mathbf{x}$, the values of the color, intensity, orientation, and face conspicuity maps at this particular location are extracted and stacked to form the sample vector $\mathbf{v}(\mathbf{x}) = [C(\mathbf{x})\ I(\mathbf{x})\ O(\mathbf{x})\ F(\mathbf{x})]^T$.

Let $\mathbf{C}$, $\mathbf{I}$, $\mathbf{O}$, and $\mathbf{F}$ be the stacked vectors of the color, intensity, orientation, and face values at all image locations, and let us denote $\mathbf{V} = [\mathbf{C}\ \mathbf{I}\ \mathbf{O}\ \mathbf{F}]$, $\mathbf{M}_{\text{fix}}$ as the vectorized fixation map (the recorded fixations convolved with an isotropic Gaussian kernel), and $\mathbf{w} = [w_C\ w_I\ w_O\ w_F]^T$ as the weights of the feature channels. The objective function is

$$\arg\min_{\mathbf{w}} \left\| \mathbf{V}\mathbf{w} - \mathbf{M}_{\text{fix}} \right\|^2 \quad \text{subject to } \mathbf{w} \geq 0.$$
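To make the weight-learning step concrete, here is a minimal sketch of a constrained least-squares fit. It assumes the constraint is nonnegativity of the weights and uses plain projected gradient descent as a stand-in solver; the text does not say which solver the authors used. The function name `nnls_projected_gradient` and the toy matrix `V` are hypothetical.

```python
def nnls_projected_gradient(V, m, iters=2000):
    """Minimize ||V w - m||^2 subject to w >= 0 by projected gradient descent."""
    n = len(V[0])
    # Normal-equation terms: G = V^T V and b = V^T m.
    G = [[sum(row[i] * row[j] for row in V) for j in range(n)] for i in range(n)]
    b = [sum(row[i] * mi for row, mi in zip(V, m)) for i in range(n)]
    # trace(G) bounds the largest eigenvalue of G, so 1 / trace(G) is a safe step.
    step = 1.0 / sum(G[i][i] for i in range(n))
    w = [0.0] * n
    for _ in range(iters):
        grad = [sum(G[i][j] * w[j] for j in range(n)) - b[i] for i in range(n)]
        # Gradient step, then projection onto the nonnegative orthant.
        w = [max(0.0, wi - step * gi) for wi, gi in zip(w, grad)]
    return w

# Toy data: six "pixels", four channels (color, intensity, orientation, face).
V = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0],
     [0, 0, 0, 1], [1, 1, 0, 0], [0, 0, 1, 1]]
w_true = [0.1, 0.1, 0.3, 0.5]
m_fix = [sum(v * wt for v, wt in zip(row, w_true)) for row in V]
w = nnls_projected_gradient(V, m_fix)
```

On this noise-free toy problem the recovered `w` matches `w_true`; with real fixation maps the fit is only approximate, and the weights can be rescaled afterward if a sum-to-one convention is wanted.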

- We model any time-dependent center bias using a 2D Gaussian filter centered at the current fixation as $N(\mathbf{c}_t, \Sigma_f)$. Here $\mathbf{c}_t$ is the location of the current fixation that changes with time, and $\Sigma_f = \begin{pmatrix} \sigma_f^2 & 0 \\ 0 & \sigma_f^2 \end{pmatrix}$, where $\sigma_f$ denotes a space constant and is fixed *a priori*. Note that although the mean of the distribution changes with time, the standard deviation reflects inherent biological properties and we set it as a constant during the viewing process (see the two small black circles in Figure 4a).
- We model any time-independent center bias (due, for instance, to the straight-ahead position, the tendency to center the eyeball within its orbit, and the tendency to look at the screen center due to strategic advantages) via a 2D Gaussian centered at the screen center as $N(\mathbf{0}, \Sigma_h)$ (see the large black circle in Figure 4a). Since the product of Gaussian functions is again a Gaussian function, a single Gaussian here is equivalent to modeling each factor using a Gaussian and then multiplying them for the compound effect. As before, $\Sigma_h = \begin{pmatrix} \sigma_h^2 & 0 \\ 0 & \sigma_h^2 \end{pmatrix}$, where $\sigma_h$ is set *a priori*.
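The two factors in the list above are both isotropic Gaussians, so their product is (up to a scale factor) another isotropic Gaussian whose mean and variance have a closed form. The sketch below computes them under that assumption; `saccade_gaussian` and the numeric values of `sigma_h` and `sigma_f` are illustrative, not values from the paper.

```python
def saccade_gaussian(c_t, sigma_h, sigma_f):
    """Mean and variance of N(0, sigma_h^2 I) * N(c_t, sigma_f^2 I), up to scale."""
    var_h, var_f = sigma_h ** 2, sigma_f ** 2
    # Precisions add under multiplication of Gaussians.
    var = var_h * var_f / (var_h + var_f)
    # The mean is pulled from the current fixation toward the screen center.
    shrink = var_h / (var_h + var_f)
    mean = (shrink * c_t[0], shrink * c_t[1])
    return mean, var

mean, var = saccade_gaussian(c_t=(4.0, -2.0), sigma_h=3.0, sigma_f=4.0)
```

With these toy values the mean lands at 36% of the way from the screen center to the current fixation, which shows how the time-independent factor pulls the next saccade toward the center.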

We use symbols with a hat (^) to represent distributions and those with a tilde (∼) to denote instances of variables.

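Given the current (time *t*) fixation position $\tilde{X}_t$, the two Gaussian factors just described multiply to produce the center bias effect; therefore, the distribution of the (*t* + 1)th saccade (from the *t*th location to the (*t* + 1)th location) is

$$\hat{S}_{t+1} \sim N(\mathbf{0}, \Sigma_h) \cdot N(\tilde{X}_t, \Sigma_f), \tag{4}$$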

where $\tilde{X}_0 = \mathbf{0}$ since the eye movement starts at the center of the screen. In this and the next subsection, the subscript 0 denotes the initial fixation, which is generally not used for analysis as it is the center of the screen. The subscript *t* refers to the *t*th fixation starting from the fixation following the initial one.

The sequence of fixation distributions, $\{\hat{X}_t\}_{t=1,2,\ldots}$, follows a Gaussian process.

Denoting the fixation distribution at time *t* as $\hat{X}_t$, the fixation distribution at time *t* + 1, $\hat{X}_{t+1}$, can be written as the integral of saccade distributions over all possible locations $\tilde{X}_t$, weighted by the probability of generating each location $\tilde{X}_t$ from $\hat{X}_t$.

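Thus, $\hat{X}_{t+1} \sim N(\mathbf{0}, \Sigma_h) \cdot (\hat{X}_t * N(\mathbf{0}, \Sigma_f))$ (see the Appendix for the derivation) and $\hat{X}_1 = \hat{S}_1$. Since the convolution of two Gaussian functions is another Gaussian function, as is the product of two Gaussian functions, we have

$$\hat{X}_t \sim N(\mathbf{0}, \Sigma_t), \tag{5a}$$

$$\Sigma_1 = \left( \Sigma_h^{-1} + \Sigma_f^{-1} \right)^{-1}, \qquad \Sigma_{t+1} = \left( \Sigma_h^{-1} + (\Sigma_t + \Sigma_f)^{-1} \right)^{-1}. \tag{5b}$$

That is, the fixation distributions are Gaussian with mean $\mathbf{0}$. Further, we prove that their covariance matrix (Equation 5b) converges. Formally, we denote $\{t\}_{t=1,2,\ldots}$ as the sequence of successive fixations. The covariance matrices of the distributions at these fixations are $\{\Sigma_t\}_{t=1,2,\ldots}$, where $\Sigma_t$ is defined in Equation 5b. We prove (see the Appendix) the following.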

**Theorem 2.1.** *The sequence* $\{\Sigma_t\}_{t=1,2,\ldots}$ *is convergent.*
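A quick numerical check can make the theorem tangible. The sketch below assumes isotropic covariances, so that each $\Sigma_t$ reduces to a scalar variance: convolution with the saccade kernel adds $\sigma_f^2$, and multiplication by the center prior combines precisions. The function name `variance_sequence` and the values 9 and 16 are hypothetical; the closed-form limit is the positive root of $s^2 + \sigma_f^2 s - \sigma_h^2 \sigma_f^2 = 0$.

```python
import math

def variance_sequence(var_h, var_f, steps):
    """Iterate the scalar variance of successive fixation distributions."""
    seq = [var_h * var_f / (var_h + var_f)]      # first fixation: product of the two factors
    for _ in range(steps - 1):
        s = seq[-1] + var_f                      # convolution adds variances
        seq.append(var_h * s / (var_h + s))      # product combines precisions
    return seq

var_h, var_f = 9.0, 16.0
seq = variance_sequence(var_h, var_f, steps=50)

# Fixed point of the recursion (positive root of s^2 + var_f * s - var_h * var_f = 0).
limit = (-var_f + math.sqrt(var_f ** 2 + 4.0 * var_h * var_f)) / 2.0
```

For these values the sequence rises from 5.76 toward a limit of about 6.42 and always stays below `var_h`, matching the bounded, increasing behavior that the convergence argument relies on.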

*zero* indicates no such correspondence. Unlike the NSS, which focuses on the saliency values along the scanpath, EMD (Rubner et al., 2000) captures the global discrepancy between two distributions. Intuitively, given two distributions, EMD measures the least amount of work needed to move one distribution so that it maps onto the other. It is computed through linear programming and accommodates distribution alignments well. A larger EMD indicates a larger overall discrepancy between the two distributions.
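To illustrate the "least amount of work" intuition behind EMD, the sketch below uses the one-dimensional special case, where the optimal transport cost has a closed form (the running sum of differences between the two histograms). This is a simplification for illustration only: for 2D saliency and fixation maps the metric is computed through linear programming, as stated above. The name `emd_1d` and the toy histograms are hypothetical.

```python
def emd_1d(p, q):
    """Earth mover's distance between two 1D histograms with equal total mass."""
    work, carried = 0.0, 0.0
    for pi, qi in zip(p, q):
        carried += pi - qi    # surplus mass that must be shifted to later bins
        work += abs(carried)  # shifting it by one bin costs |carried|
    return work

saliency = [0.0, 0.5, 0.5, 0.0]
fixation = [0.0, 0.0, 0.5, 0.5]
print(emd_1d(saliency, fixation))  # → 1.0
```

Here half of the mass has to travel two bins, giving a cost of 1.0; identical histograms give 0.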

*n* − 1 subjects, iterating over all *n* subjects and averaging the result. These AUC values are 78.6% for the FIFA data set, 87.8% for the Toronto data set, 90.8% for the MIT data set, and 85.7% for the NUS data set. In general, we express the performance of saliency algorithms in terms of normalized AUC (nAUC) values, which are the AUC values obtained with the saliency algorithm divided by the ideal AUC. A strong saliency model should have an nAUC value close to 1, a large NSS, and a small EMD value.
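The normalization described above can be sketched in a few lines. The AUC here is computed as the probability that a fixated sample receives a higher saliency value than a non-fixated one, which equals the area under the ROC curve; the sample values are made up, and only the ideal AUC of 0.786 (FIFA data set) is taken from the text.

```python
def auc(pos, neg):
    """Area under the ROC curve: P(fixated score > non-fixated score), ties count 0.5."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def normalized_auc(model_auc, ideal_auc):
    """nAUC: model AUC divided by the inter-subject (ideal) AUC."""
    return model_auc / ideal_auc

fixated = [0.9, 0.6, 0.2]           # hypothetical saliency at fixated locations
non_fixated = [0.7, 0.5, 0.3, 0.1]  # hypothetical saliency at non-fixated locations
model_auc = auc(fixated, non_fixated)
nauc = normalized_auc(model_auc, ideal_auc=0.786)
```

Dividing by the ideal AUC makes scores comparable across data sets whose inter-subject agreement differs.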

Metric | Equal weights, without CBM | Equal weights, with CBM | Optimal weights, without CBM | Optimal weights, with CBM |
---|---|---|---|---|
nAUC | 0.828 | 0.943 | 0.834 | 0.948 |
NSS | 0.872 | 1.49 | 0.920 | 1.54 |
EMD | 4.85 | 3.09 | 4.50 | 2.90 |

*prior* together with the usual features consistently shows improved performance. (3) Saliency decreases with time, consistent with the finding (Mannan, Kennard, & Husain, 2009) that initial fixations are more driven by bottom-up features than later ones.

The mean of the weights is $[\bar{w}_C\ \bar{w}_I\ \bar{w}_O\ \bar{w}_F]^T = [0.109\ 0.072\ 0.278\ 0.541]^T$, and the standard deviation is $[\sigma_C\ \sigma_I\ \sigma_O\ \sigma_F]^T = [0.028\ 0.022\ 0.039\ 0.054]^T$. We use the trained weights to build subject-specific saliency models, and the model performance is reported in the 6th and 8th columns of Table 4. Again, the improvement compared to the model trained on the population data is marginal. For a performance summary of 7 models (the aforementioned 5 models (Figure 7) and 2 subject-specific ones (the 6th and 8th columns)), see Table 4.

Metric | Centered Gaussian | Equal weights, without CBM | Equal weights, with CBM | Optimal weights, without CBM (general) | Optimal weights, without CBM (subject-specific) | Optimal weights, with CBM (general) | Optimal weights, with CBM (subject-specific) |
---|---|---|---|---|---|---|---|
nAUC | 0.869 | 0.776 | 0.899 | 0.792 | 0.795 | 0.910 | 0.912 |
NSS | 1.07 | 0.635 | 1.19 | 0.725 | 0.744 | 1.24 | 1.25 |
EMD | 3.56 | 4.73 | 3.04 | 4.53 | 4.49 | 2.88 | 2.86 |
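For readers who want to see how learned weights enter the final map, the sketch below applies the mean channel weights quoted in the text ([0.109 0.072 0.278 0.541] for color, intensity, orientation, and face) as a linear summation at a single location. The conspicuity values passed in are invented for illustration, and `saliency_at` is a hypothetical helper.

```python
# Mean channel weights as quoted in the text.
weights = {"color": 0.109, "intensity": 0.072, "orientation": 0.278, "face": 0.541}

def saliency_at(conspicuity):
    """Weighted linear sum of the four conspicuity values at one location."""
    return sum(weights[ch] * conspicuity[ch] for ch in weights)

# Hypothetical conspicuity values at a location covered by a detected face.
s = saliency_at({"color": 0.2, "intensity": 0.1, "orientation": 0.5, "face": 1.0})
```

Because the face weight is the largest, a location with a strong face response dominates the sum even when the low-level channels are weak.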

Metric | Centered Gaussian | Equal weights, without CBM | Equal weights, with CBM | Optimal weights, without CBM | Optimal weights, with CBM |
---|---|---|---|---|---|
nAUC | 0.904 | 0.793 | 0.922 | 0.829 | 0.938 |
NSS | 1.06 | 0.706 | 1.15 | 0.858 | 1.28 |
EMD | 3.20 | 4.85 | 3.04 | 4.55 | 2.97 |

*prior* information and can be combined with top-down knowledge (Chikkerur et al., 2010; Kollmorgen, Nortmann, Schröder, & König, 2010; Navalpakkam & Itti, 2005; Rutishauser & Koch, 2007; Underwood & Foulsham, 2006) to infer task-specific optimal weights.

*ad hoc* and effective single kernel center model, we derive a theoretical basis that justifies the approximation of the dynamic Gaussian process by a single kernel. In addition, our model of center bias is not restricted to laboratory settings. It could apply to any combination of possible causes of the center bias. For example, in real-world scenarios where the subjects are allowed to move their heads, other contributions such as high-level strategic advantages, the drop in visual sensitivity in the periphery, and motor bias combine to produce the center bias in the way our model explains, though the Gaussian variance is larger than in laboratory settings.

$\hat{X}_{t+1} \sim N(\mathbf{0}, \Sigma_h) \cdot (\hat{X}_t * N(\mathbf{0}, \Sigma_f))$

Symbols with a hat (^) represent distributions and those with a tilde (∼) denote instances of variables.

Given the current fixation position $\tilde{X}_t$, the (*t* + 1)th saccade distribution is $\hat{S}_{t+1} \sim N(\mathbf{0}, \Sigma_h) \cdot N(\tilde{X}_t, \Sigma_f)$ (Equation 4). Thus, given $\tilde{X}_t$, the probability of the next fixation at $\tilde{X}_{t+1}$ is given by $[N(\mathbf{0}, \Sigma_h) \cdot N(\tilde{X}_t, \Sigma_f)](\tilde{X}_{t+1})$ (in this derivation, we use $[\cdot]$ to denote a distribution and $[\cdot](\cdot)$ as the value of the distribution at a specific point).

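Given the fixation distribution $\hat{X}_t$, the probability of fixating at a particular location $\tilde{X}_t$ is $\hat{X}_t(\tilde{X}_t)$. Integrating the saccade distributions over all possible locations $\tilde{X}_t$ yields

$$
\begin{aligned}
\hat{X}_{t+1}(\tilde{X}_{t+1}) &\propto \int [N(\mathbf{0}, \Sigma_h) \cdot N(\tilde{X}_t, \Sigma_f)](\tilde{X}_{t+1}) \, \hat{X}_t(\tilde{X}_t) \, d\tilde{X}_t \\
&= [N(\mathbf{0}, \Sigma_h)](\tilde{X}_{t+1}) \int [N(\mathbf{0}, \Sigma_f)](\tilde{X}_{t+1} - \tilde{X}_t) \, \hat{X}_t(\tilde{X}_t) \, d\tilde{X}_t \\
&= [N(\mathbf{0}, \Sigma_h) \cdot (\hat{X}_t * N(\mathbf{0}, \Sigma_f))](\tilde{X}_{t+1}),
\end{aligned}
$$

where the second line uses $[N(\tilde{X}_t, \Sigma_f)](\tilde{X}_{t+1}) = [N(\mathbf{0}, \Sigma_f)](\tilde{X}_{t+1} - \tilde{X}_t)$ and the remaining integral is the convolution $\hat{X}_t * N(\mathbf{0}, \Sigma_f)$.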

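*Proof*. Since the *t*th covariance matrix $\Sigma_t$ is given by

$$\Sigma_t = \begin{pmatrix} \sigma_t^2 & 0 \\ 0 & \sigma_t^2 \end{pmatrix},$$

the convergence of $\{\Sigma_t\}_{t=1,2,\ldots}$ is equivalent to the convergence of $\{\sigma_t^2\}_{t=1,2,\ldots}$. To prove the convergence of this sequence, it suffices to show that it is both upper bounded and strictly monotonically increasing. First, recalling that $\hat{X}_{t+1} \sim N(\mathbf{0}, \Sigma_h) \cdot (\hat{X}_t * N(\mathbf{0}, \Sigma_f))$ and using Equations A3, A4a, and A4b, we obtain

$$\sigma_1^2 = \frac{\sigma_h^2 \sigma_f^2}{\sigma_h^2 + \sigma_f^2}, \tag{A5a}$$

$$\sigma_{t+1}^2 = \frac{\sigma_h^2 (\sigma_t^2 + \sigma_f^2)}{\sigma_h^2 + \sigma_t^2 + \sigma_f^2}. \tag{A5b}$$

Each right-hand side is strictly smaller than $\sigma_h^2$, so $\sigma_t^2 < \sigma_h^2$ for $t = 1, 2, \ldots$; therefore, $\{\sigma_t^2\}_{t=1,2,\ldots}$ is upper bounded by $\sigma_h^2$.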

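Let *Q*(*t*) be the statement that $\sigma_{t+1}^2 > \sigma_t^2$. With the update $\sigma_{t+1}^2 = \sigma_h^2 (\sigma_t^2 + \sigma_f^2) / (\sigma_h^2 + \sigma_t^2 + \sigma_f^2)$, this is equivalent to

$$\frac{\sigma_h^2 (\sigma_t^2 + \sigma_f^2)}{\sigma_h^2 + \sigma_t^2 + \sigma_f^2} > \sigma_t^2. \tag{A6}$$

*Basic step.* To prove *Q*(1), simply substitute Equations A5a and A5b into Equation A6: with $\sigma_1^2 = \sigma_h^2 \sigma_f^2 / (\sigma_h^2 + \sigma_f^2)$, the inequality reduces to $(\sigma_h^2 + \sigma_f^2)^2 > \sigma_f^2 (2\sigma_h^2 + \sigma_f^2)$, which always holds, so $\sigma_2^2 > \sigma_1^2$.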

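*Inductive step.* This time we assume *Q*(*t* − 1), i.e., $\sigma_t^2 > \sigma_{t-1}^2$, and prove *Q*(*t*). Writing the update as $\sigma_{t+1}^2 = g(\sigma_t^2)$ with $g(s) = \sigma_h^2 (s + \sigma_f^2) / (\sigma_h^2 + s + \sigma_f^2)$, the function $g$ is strictly increasing because $g'(s) = \sigma_h^4 / (\sigma_h^2 + s + \sigma_f^2)^2 > 0$. Hence $\sigma_t^2 > \sigma_{t-1}^2$ gives $\sigma_{t+1}^2 = g(\sigma_t^2) > g(\sigma_{t-1}^2) = \sigma_t^2$, which is *Q*(*t*) and completes the proof.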