The human visual system is foveated, that is, outside the central visual field resolution and acuity drop rapidly. Nonetheless much of a visual scene is perceived after only a few saccadic eye movements, suggesting an effective strategy for selecting saccade targets. It has been known for some time that local image structure at saccade targets influences the selection process. However, the question of what the most relevant visual features are is still under debate. Here we show that center-surround patterns emerge as the optimal solution for predicting saccade targets from their local image structure. The resulting model, a one-layer feed-forward network, is surprisingly simple compared to previously suggested models which assume much more complex computations such as multi-scale processing and multiple feature channels. Nevertheless, our model is equally predictive. Furthermore, our findings are consistent with neurophysiological hardware in the superior colliculus. Bottom-up visual saliency may thus not be computed cortically as has been thought previously.

*perceptive fields*. Perceptive fields are analogous to receptive fields but at the psychophysical level (Jung & Spillmann, 1970; Neri & Levi, 2006; Wichmann, Graf, Simoncelli, Bülthoff, & Schölkopf, 2005).

$f(\mathbf{x}) = \mathbf{w}^{\top}\mathbf{x}$ to eye movement data, i.e., such that $f(\mathbf{x})$ describes the visual saliency of a local image region whose visual appearance is represented by a vector $\mathbf{x} \in \mathbb{R}^{n}$ (here, $\mathbf{x}$ holds the pixel luminances from an image region). The fitted weights $\mathbf{w}$ represent exactly the optimal stimulus of the model (here, a luminance pattern), and can therefore be interpreted as the characteristic visual pattern that drives visual saliency.

$\mathbf{w}$ of a fitted linear model) is also a horizontal edge. The authors then concluded that during the search, the saccadic targeting system was driven by the identified edge pattern. In this work we want to arrive at a similar, but more general result, namely we want to identify the characteristic luminance patterns for a *free-viewing* task on *natural images*. In other words, we want to find characteristic patterns that drive bottom-up visual saliency.

$\mathbf{x}_0$ and to its inverse $-\mathbf{x}_0$, since $f(\mathbf{x}_0) = -f(-\mathbf{x}_0)$. This property is extremely restrictive. As an example, if we merely extend the target in the visual search task from Rajashekar et al. (2006) from a horizontal edge to that same edge with both polarities allowed (i.e., also upside down), a linear model would not be valid anymore, since its output on a horizontal edge is exactly the negative of the output on the same edge upside down. In practice, if a linear model is fitted to such data, the complementary data samples will essentially cancel each other, resulting in an unstructured, or at least very noisy classification image.
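The cancellation effect can be illustrated with a minimal sketch. It assumes a toy 4 × 4 horizontal-edge patch and estimates the classification image as the mean target patch minus the mean non-target patch. This is a common linear estimator and not necessarily the procedure used by Rajashekar et al. (2006). All names and values are hypothetical.

```python
import numpy as np

def classification_image(targets, non_targets):
    """Linear estimate: mean target patch minus mean non-target patch."""
    return targets.mean(axis=0) - non_targets.mean(axis=0)

# Toy 4x4 horizontal edge (bright top half, dark bottom half), flattened.
edge = np.concatenate([np.ones(8), -np.ones(8)])

rng = np.random.default_rng(0)
non_targets = rng.normal(0.0, 0.1, size=(200, 16))  # unstructured patches

# Case 1: only one polarity is a target, so the edge is recovered.
single = np.tile(edge, (100, 1))
w_single = classification_image(single, non_targets)

# Case 2: both polarities are targets in equal numbers, so they cancel.
both = np.vstack([np.tile(edge, (50, 1)), np.tile(-edge, (50, 1))])
w_both = classification_image(both, non_targets)

def f_linear(x):
    """Linear model output; odd by construction: f(-x) = -f(x)."""
    return w_single @ x
```

In the second case the estimated pattern contains only the residual noise of the non-target mean, which is the unstructured classification image described above.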

$f(\mathbf{x}) = \sum_{i} \alpha_i \varphi_i(\mathbf{x})$, where $\mathbf{x}$ is an image patch and $\varphi_i$ are nonlinear basis functions. In this model, the fitted parameters are the weights $\alpha_i$. An advantage of this approach to nonlinearity is that the model is still linear in the fitted parameters ($\alpha_i$), and yet implements a nonlinear relationship through its nonlinear basis functions $\varphi_i(\mathbf{x})$.

$\varphi_i(\mathbf{x}) = \exp(-\gamma \lVert \mathbf{x}_i - \mathbf{x} \rVert^2)$, centered at examples of recorded salient and non-salient image regions $\mathbf{x}_i$. In this case, our nonparametric model takes the form of a *support vector machine* (Schölkopf & Smola, 2002). This particular choice of nonlinearity brings two advantages. First, the resulting model is very general in the sense that it is able to capture stimulus-response relationships of any order. Second, Gaussian radial basis functions satisfy a positive definiteness property (see Extracting perceptive fields section), which means that the resulting nonlinear analysis is indeed a straightforward generalization of the traditional linear approach. More precisely, we show below that due to this property the concept of the optimal stimulus being just the weight vector $\mathbf{w}$ in a dot-product with the input ($\mathbf{w}^{\top}\mathbf{x}$) directly translates to the nonlinear case.
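As a minimal sketch of such a model, the following evaluates a weighted sum of Gaussian radial basis functions on toy two-dimensional "patches". The centers, weights, and $\gamma$ are hypothetical values chosen for illustration, not fitted quantities.

```python
import numpy as np

def rbf_features(X_centers, x, gamma):
    """phi_i(x) = exp(-gamma * ||x_i - x||^2) for every center x_i."""
    sq_dist = np.sum((X_centers - x) ** 2, axis=1)
    return np.exp(-gamma * sq_dist)

def saliency(x, X_centers, alpha, gamma):
    """f(x) = sum_i alpha_i * phi_i(x): linear in alpha, nonlinear in x."""
    return alpha @ rbf_features(X_centers, x, gamma)

# Hypothetical toy data: two "salient" centers of opposite polarity and
# one "non-salient" center at the origin.
X_centers = np.array([[1.0, -1.0], [-1.0, 1.0], [0.0, 0.0]])
alpha = np.array([1.0, 1.0, -1.0])
gamma = 1.0

x0 = np.array([1.0, -1.0])
f_pos = saliency(x0, X_centers, alpha, gamma)    # response to a pattern
f_neg = saliency(-x0, X_centers, alpha, gamma)   # response to its inverse
```

Unlike a linear model, this toy model responds positively to both a pattern and its inverse, so opposite polarities no longer cancel.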

$\mathbf{x}_i$ of salient (white) and non-salient (black) image regions, i.e. two-dimensional vectors that describe the local luminance values at locations in natural scenes where people did (white) or did not (black) look in our experiment. The initial step of our analysis consists of fitting a nonlinear real-valued function $f(\mathbf{x})$ to the data, such that $f(\mathbf{x})$ takes on larger values if $\mathbf{x}$ is salient, and smaller (negative) values otherwise. In this paper, we use a nonparametric model for $f$ taking the form of a sum of weighted Gaussian bumps $\varphi_i(\mathbf{x}) = \exp(-\gamma \lVert \mathbf{x}_i - \mathbf{x} \rVert^2)$. The fitted model is illustrated in panel (b) by the black level curves and shaded background. Note that there are four extremal points in this example (denoted by the red plus signs in panel (c)), namely two maxima and two minima. The image patches at these locations correspond to the optimal stimuli of the fitted model $f$ (and hopefully of the underlying system as well), since it is at these locations where the value of $f$, the saliency, is either extremely high or low. The key step in our analysis is that from the fitted nonlinear model $f$, the optimal stimuli can be determined via gradient search (red zigzag line). In this paper we refer to the optimal stimuli as the *nonlinear perceptive fields*, stressing the fact that these luminance patterns are similar to receptive fields, but stem from a psychophysical experiment, not from neurophysiological recordings. The cartoon here is realistic in the sense that $f$ has two maxima and two minima. This is also true for our actual data set; the four perceptive fields are shown in Figure 3. Our analysis concludes with the proposition of a simple bottom-up saliency model based on this result (Figure 4), i.e., a radial basis function network with only four basis functions, centered on the perceptive fields (red level curves in Figure 1, panel d). We show that this simple model, being purely data-driven, is as predictive as the more complex models based on “biologically plausible” intuition.

*SD*) degrees. We classified all eye movements with a speed above 26.8 degrees per second (>3 pixels per sample) as saccades. Saccade targets were extracted from the images at the median position of consecutive scene samples between two saccades.
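The velocity criterion and the median rule can be sketched as follows, using the threshold of 3 pixels per sample quoted above. The function name and the handling of boundaries are assumptions: the sample that lands after a large jump is dropped, and the runs before the first and after the last saccade are also treated as fixations.

```python
import numpy as np

def saccade_targets(gaze_px, threshold_px=3.0):
    """Median gaze position of each run of samples between saccades.

    gaze_px: (N, 2) array of gaze positions in pixels, one row per sample.
    A sample-to-sample displacement above threshold_px counts as saccadic.
    """
    speed = np.linalg.norm(np.diff(gaze_px, axis=0), axis=1)  # px per sample
    is_saccade = np.concatenate([[False], speed > threshold_px])
    targets, run = [], []
    for pos, sacc in zip(gaze_px, is_saccade):
        if sacc:
            if run:  # a saccade closes the current run of fixation samples
                targets.append(np.median(run, axis=0))
            run = []
        else:
            run.append(pos)
    if run:  # trailing run after the last saccade
        targets.append(np.median(run, axis=0))
    return np.array(targets)
```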

*SD*) degrees. The mean saccade length was 7.0 (±5.3 *SD*) degrees. Fixations lasted for 250 (±121 *SD*) milliseconds.

$\mathbf{x}_i$. A label variable $y_i \in \{1, -1\}$ was associated with every patch, denoting target or non-target, respectively.

$m = 24{,}370$ patches) using the support vector algorithm (Schölkopf & Smola, 2002), which minimizes the regularized risk

$\alpha_i$. The first term in Equation 2 denotes the data fit. It is zero, indicating a perfect fit, whenever $y_i f(\mathbf{x}_i) \geq 1$. It attempts to push $f(\mathbf{x}_i)$ to values $\geq 1$ if $y_i = 1$, and to values $\leq -1$ if $y_i = -1$. If successful, this will result in a *margin* of separation between the two classes, with $f$ taking values in $[-1, 1]$. The number of points falling inside this margin will depend on the strength of the regularization, measured by $\lVert f \rVert^2$. The smaller $\lVert f \rVert^2$, the smoother the solution $f$. The tradeoff between data fit and smoothness is controlled by the parameter $\lambda$. The model is nonparametric, and its descriptive power grows with $m$, the number of data points it is fitted to. In fact, with the choice of Gaussian radial basis functions (Equation 1), it is sufficiently flexible to fit any smooth stimulus-response relationship in the data (Steinwart, 2001). Figures 1a and 1b illustrate a fitted model with Gaussian radial basis functions.
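The two terms of the regularized risk can be sketched numerically. The form used below, mean hinge loss plus $\lambda \lVert f \rVert^2$ with $\lVert f \rVert^2 = \boldsymbol{\alpha}^{\top} K \boldsymbol{\alpha}$ for a kernel expansion, is an assumption consistent with the description of the two terms. The normalization of the data-fit term in particular is a guess, and the function names are hypothetical.

```python
import numpy as np

def gaussian_kernel_matrix(X, gamma):
    """K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))

def regularized_risk(alpha, K, y, lam):
    """Assumed form: mean hinge loss + lam * alpha^T K alpha."""
    f = K @ alpha                          # f(x_j) = sum_i alpha_i k(x_i, x_j)
    hinge = np.maximum(0.0, 1.0 - y * f)   # zero whenever y_j * f(x_j) >= 1
    return hinge.mean() + lam * alpha @ K @ alpha

# Two toy patches, one target (+1) and one non-target (-1).
X = np.array([[0.0, 0.0], [3.0, 0.0]])
y = np.array([1.0, -1.0])
K = gaussian_kernel_matrix(X, gamma=1.0)

risk_zero = regularized_risk(np.zeros(2), K, y, lam=1.0)            # no fit at all
risk_fit = regularized_risk(np.array([1.0, -1.0]), K, y, lam=0.0)   # near-perfect fit
```

With all weights at zero every sample incurs a hinge loss of one. With weights of matching sign the data-fit term nearly vanishes, and a positive $\lambda$ would then penalize the size of the weights.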

$f \rVert^2$, a convenient choice is the one employed by support vector machines. By means of a nonlinear mapping induced by the *kernel* $\exp(-\gamma \lVert \mathbf{x}_i - \mathbf{x} \rVert^2)$, it represents the function $f$ as a vector in a high-dimensional space, and then uses the squared length of that vector as a regularizer. Moreover, the decision function in that space is linear, and the problem of finding it can be reduced to a so-called *quadratic program*. A support vector machine is but one example of a *kernel method*, a class of methods which have recently also gained popularity as models in psychology (Jäkel, Schölkopf, & Wichmann, 2007). They all deal with nonlinearity by employing kernels that correspond to dot products in high-dimensional spaces, allowing for the construction of geometric algorithms in such spaces that correspond to nonlinear methods in the input domain.

$\alpha_i$, there are three design parameters that have to be set: $\gamma$, $\lambda$, and the patch size $d$. These were determined by maximizing cross-validation estimates of the model's accuracy, using an eightfold, image-wise split of the training set. We conducted an exhaustive search on an 11 × 9 × 13 grid with the grid points equally spaced on a log scale such that $d = 0.47, \ldots, 27$ degrees, $\gamma = 5 \cdot 10^{-5}, \ldots, 5 \cdot 10^{3}$, and $\lambda = 10^{-3}, \ldots, 10^{4}$, resulting in the optimal values $\lambda = 1$, $\sigma = 1$, $d = 5.4$ degrees. Performance was relatively stable with respect to changes of $d$ in the range from 2.5 to 8.1 degrees, and changes of $\lambda$ and $\gamma$ up to a factor of 3 and 10, respectively. Note that finding the optimal weights $\alpha_i$ for a given set of the three design parameters is a convex problem and therefore has no local minima. As a result, an exhaustive grid search optimizes all parameters in our model globally and jointly.
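The search procedure can be sketched as below, under stated assumptions. The assignment of 11, 9, and 13 grid points to $d$, $\gamma$, and $\lambda$ follows the order in which the ranges are listed and is a guess. `score_fn` is a hypothetical callback that would fit the model on the training patches and return validation accuracy, and the round-robin assignment of images to folds is only one possible image-wise split.

```python
import itertools
import numpy as np

# Log-spaced grids over the ranges quoted in the text (axis sizes assumed).
d_grid = np.geomspace(0.47, 27.0, 11)       # patch size in degrees
gamma_grid = np.geomspace(5e-5, 5e3, 9)
lambda_grid = np.geomspace(1e-3, 1e4, 13)
grid = list(itertools.product(d_grid, gamma_grid, lambda_grid))

def image_wise_folds(image_ids, n_folds=8):
    """Assign each patch to a fold via its source image, so patches from
    one image never appear in both training and validation sets."""
    unique_ids = np.unique(image_ids)
    fold_of_image = {img: k % n_folds for k, img in enumerate(unique_ids)}
    return np.array([fold_of_image[img] for img in image_ids])

def grid_search(score_fn, image_ids):
    """Return the grid point with the best mean cross-validated score.

    score_fn(params, train_mask, val_mask) is a hypothetical callback.
    """
    folds = image_wise_folds(image_ids)
    best_params, best_score = None, -np.inf
    for params in grid:
        scores = [score_fn(params, folds != k, folds == k)
                  for k in np.unique(folds)]
        if np.mean(scores) > best_score:
            best_params, best_score = params, float(np.mean(scores))
    return best_params, best_score
```

Because the inner fit is convex for each grid point, the only non-convex part of the procedure is the outer enumeration, which the exhaustive loop covers completely.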

*kernel methods*. In that framework the type of basis functions $\varphi_i(\mathbf{x}) = \exp(-\gamma \lVert \mathbf{x}_i - \mathbf{x} \rVert^2)$ is referred to as the *kernel* $k(\mathbf{x}_i, \mathbf{x})$. An essential feature of kernel methods is that suitable kernels (such as the Gaussian radial basis function employed in our model) must satisfy a positive definiteness property (Schölkopf & Smola, 2002), in which case it can be shown that

$f(\mathbf{x}) = \sum_{i=1}^{m} \alpha_i k(\mathbf{x}_i, \mathbf{x})$ is nonlinear in its input $\mathbf{x}$, the theoretical and practical benefits of linear methods are retained.

*preimage problem* in kernel methods (Schölkopf et al., 1999). Due to Equation 3, the fitted kernel model $f(\mathbf{x})$ is linear in the implied feature space

$\sum_{i=1}^{m} \alpha_i \Phi(\mathbf{x}_i)$. Thus, in

$f$. In order to visualize $\Psi$, we exploit the fact that the feature mapping $\Phi$ maps image patches to vectors in

$\mathbf{z} = \Phi^{-1}(\Psi)$ corresponding to the receptive field. Since not every vector in

$\mathbf{z}$ is defined as the patch whose image in

$\mathbf{z} = \arg\min_{\mathbf{x}} \lVert \Psi - \Phi(\mathbf{x}) \rVert^2$. In case of a Gaussian radial basis kernel this amounts to solving

$f(\mathbf{x})$ (see Equation 4). This not only provides an alternative interpretation of $\mathbf{z}$, but shows that we can solve the optimization problem (Equation 5), without having to compute the dot product in the (potentially high dimensional) feature space

$\mathbf{z}$ is in general not unique, i.e., there can be multiple perceptive fields. For illustration, Figure 1c shows the optimal stimuli as red pluses.

$f$ in Equation 1 defines a smooth function, and, since the Gaussian radial basis function is bounded, so is $f$, and hence all minima and maxima exist. Initial values for the gradient search were random patches with pixels drawn from a normal distribution with zero mean and standard deviation 0.11, the mean value in the training data. As mentioned above, the result of the gradient search is not unique. Thus, in order to find all perceptive fields, we solved the optimization problem many times with different initial values. This could be intractable, since $f$ could have a large number of extremal points. In our case, however, we found that this was not a problem. After running the search 1,000 times, we found only 4 distinct solutions. This was verified by clustering the 1,000 optima using $k$-means. The number of clusters $k$ was found by increasing $k$ until the clusters were stable. Interestingly, the clusters for both minima and maxima were already highly concentrated for $k = 2$, i.e., within each cluster, the average variance of a pixel was less than 0.03% of the pixel variance of its center patch. This result did not change if initial values were random natural patches (standard deviation 0.11) or the training examples $\mathbf{x}_i$.
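The multi-start search can be sketched on a toy two-dimensional model. Fixed-step gradient ascent and a greedy distance-based merge are stand-ins chosen for brevity. The text does not specify the optimizer, and the clustering reported there used $k$-means. The centers, weights, $\gamma$, step size, and iteration count are hypothetical.

```python
import numpy as np

def f_and_grad(x, centers, alpha, gamma):
    """Value and gradient of f(x) = sum_i alpha_i exp(-gamma ||x_i - x||^2)."""
    diff = centers - x                                  # rows are x_i - x
    phi = np.exp(-gamma * np.sum(diff ** 2, axis=1))
    value = alpha @ phi
    grad = 2.0 * gamma * (alpha * phi) @ diff           # sum of bump gradients
    return value, grad

def gradient_search(x0, centers, alpha, gamma, sign=1.0, step=0.1, n_iter=2000):
    """Fixed-step gradient ascent (sign=+1) or descent (sign=-1) from x0."""
    x = np.array(x0, dtype=float)
    for _ in range(n_iter):
        _, grad = f_and_grad(x, centers, alpha, gamma)
        x += sign * step * grad
    return x

def distinct_solutions(solutions, tol=1e-2):
    """Greedy merge of solutions closer than tol (a stand-in for k-means)."""
    distinct = []
    for s in solutions:
        if all(np.linalg.norm(s - d) > tol for d in distinct):
            distinct.append(s)
    return distinct

# Toy model: two excitatory and two inhibitory bumps (hypothetical values).
centers = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
alpha = np.array([1.0, 1.0, -1.0, -1.0])
gamma = 0.5

rng = np.random.default_rng(0)
starts = rng.uniform(-1.5, 1.5, size=(20, 2))
maxima = distinct_solutions(
    [gradient_search(s, centers, alpha, gamma, +1.0) for s in starts])
minima = distinct_solutions(
    [gradient_search(s, centers, alpha, gamma, -1.0) for s in starts])
```

Runs that start in nearly flat regions or close to a saddle point may not converge within a fixed iteration budget, so in practice a convergence check on the gradient norm would be added before merging solutions.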

$\varphi_i = \exp(-\gamma \lVert \mathbf{z}_i - \mathbf{x} \rVert^2)$, centered at the patterns $\mathbf{z}_1, \ldots, \mathbf{z}_4$ (Figures 3a–3d). The network weights $\beta_i$ were fitted by optimizing the same objective as before (Equation 2), using the optimal values for $\gamma$, $\lambda$, and $d$ reported above. This yielded $\beta_1 = 0.94$ and $\beta_2 = 1.70$ for the excitatory units and $\beta_3 = -1.93$, $\beta_4 = -1.82$ for the inhibitory units. Figure 1d illustrates this procedure.
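A minimal sketch of the resulting four-unit network is given below. The weights $\beta_i$ are the values reported above. The 3 × 3 patterns standing in for $\mathbf{z}_1, \ldots, \mathbf{z}_4$ and the value of $\gamma$ are toy placeholders, not the fitted perceptive fields.

```python
import numpy as np

BETA = np.array([0.94, 1.70, -1.93, -1.82])  # weights reported in the text

def network_saliency(x, Z, gamma, beta=BETA):
    """s(x) = sum_i beta_i * exp(-gamma * ||z_i - x||^2) over four units."""
    sq_dist = np.sum((Z - x) ** 2, axis=1)
    return beta @ np.exp(-gamma * sq_dist)

# Toy 3x3 placeholders: on-center and off-center patterns for the
# excitatory units, horizontal and vertical ramps for the inhibitory units.
on_center = np.full((3, 3), -1.0 / 8.0)
on_center[1, 1] = 1.0
ramp_h = np.tile(np.array([-1.0, 0.0, 1.0]), (3, 1))
ramp_v = ramp_h.T
Z = np.stack([on_center, -on_center, ramp_h, ramp_v]).reshape(4, -1)

gamma = 1.0  # placeholder value
s_excitatory = network_saliency(Z[0], Z, gamma)  # patch equal to unit 1
s_inhibitory = network_saliency(Z[2], Z, gamma)  # patch equal to unit 3
```

A patch that coincides with an excitatory pattern yields a positive output, and a patch that coincides with an inhibitory pattern yields a negative one.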

^{−4} *SEM*) versus 0.095 (±$5.4 \cdot 10^{-4}$ *SEM*). The relevance of RMS contrast has been a well-known result at least since Reinagel and Zador's work (Reinagel & Zador, 1999). In contrast, finding characteristic differences in the spatial structure of the patches is a much harder problem, as Figure 2 suggests. This difficulty does not change if the two sets are compared in terms of their principal components (Rajashekar et al., 2002) or their independent components (Bell & Sejnowski, 1997; Olshausen & Field, 1996) (see Supplementary Figure 1).

*SD*) degrees. In addition, different spatial scales for the true features were tested. We found that the perceptive fields either showed the true underlying structure, or no structure at all if the “measurement” noise was too high (above roughly twice the estimated true measurement noise level). This indicates that our method does not generate spurious structure, regardless of the level of measurement uncertainty or the scale of the true feature. In particular, the perceptive fields computed from edge data were never center-surround or vice versa (see Supplementary Figure 3). In addition, this experiment shows that the frequency components of the center-surround patterns in Figure 3 are not significantly affected by the measurement noise: while the uncertainty in the position measurements (standard deviation 0.4 deg) suggests that no frequencies above about 1 cpd can be resolved, the passband of our center-surround patches is one order of magnitude below this limit (around 0.15 cpd), and hence unlikely to be influenced by this effect. Furthermore, the center-surround perceptive fields in Supplementary Figure 3 (8% feature size, 100% noise) have a passband at roughly double that frequency (0.3 cpd), and are still correctly identified by our method.

*SD*) degrees, fixations lasted for 243 (±118 *SD*) milliseconds on average. Reassuringly, this also yielded only two excitatory perceptive fields with center-surround structure (on- and off-center), despite the fact that the local structure in that data set is governed by somewhat different features, e.g. long and sharp edges (see Supplementary Figure 4).

*saliency* was originally introduced for allocation of both covert and overt visual attention (Koch & Ullman, 1985), it has become common practice to use it as a quantity monotonically related to the “probability of looking somewhere” (Henderson, 2003; Itti & Koch, 2001). It is in that sense that we use the term here.

$\varphi_i = \exp(-\gamma \lVert \mathbf{z}_i - \mathbf{x} \rVert^2)$, centered at the perceptive field patterns $\mathbf{z}_1, \ldots, \mathbf{z}_4$ (Figures 3a–3d). The weights $\beta_i$ were fit to the data to maximize predictivity (see Methods section).

$\varphi_i(\mathbf{x})$ are not only linear filters corresponding to relevant subspaces. Rather, they define excitatory ($\beta_i > 0$) or inhibitory ($\beta_i < 0$) *regions* in the space of image patches. A connection to linear-nonlinear-linear models can be made, however, by expanding the square in the radial basis function $\lVert \mathbf{z}_i - \mathbf{x} \rVert^2 = \mathbf{z}_i^{\top}\mathbf{z}_i + \mathbf{x}^{\top}\mathbf{x} - 2\mathbf{z}_i^{\top}\mathbf{x}$. Here, $\mathbf{z}_i^{\top}\mathbf{z}_i$ is a constant, $\mathbf{x}^{\top}\mathbf{x}$ is the signal energy of the input patch $\mathbf{x}$, and $-\mathbf{z}_i^{\top}\mathbf{x}$ is a linear filter. Thus, we can write the radial basis units as $\exp(a_i \mathbf{z}_i^{\top}\mathbf{x} + b)$, with a positive constant $a_i$ and an offset $b$ which depends only on the signal energy ($b$ acts akin to a contrast gain-control mechanism trading off pure contrast and local image structure). In particular, for any fixed energy, the perceptive fields in our model indeed behave like linear filters, followed by an exponential nonlinearity.
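Written out step by step, the expansion makes the constants explicit. The identification $a_i = 2\gamma$ and the split of the offset into a patch-energy term and a unit-specific constant follow from the algebra. They are not stated in this form in the text.

```latex
\begin{aligned}
\varphi_i(\mathbf{x})
  &= \exp\!\left(-\gamma \lVert \mathbf{z}_i - \mathbf{x} \rVert^2\right) \\
  &= \exp\!\left(-\gamma\,\mathbf{z}_i^{\top}\mathbf{z}_i
                 - \gamma\,\mathbf{x}^{\top}\mathbf{x}
                 + 2\gamma\,\mathbf{z}_i^{\top}\mathbf{x}\right) \\
  &= \exp\!\left(a_i\,\mathbf{z}_i^{\top}\mathbf{x} + b\right),
  \qquad a_i = 2\gamma > 0, \quad
  b = -\gamma\left(\mathbf{x}^{\top}\mathbf{x} + \mathbf{z}_i^{\top}\mathbf{z}_i\right).
\end{aligned}
```

For patches of fixed energy $\mathbf{x}^{\top}\mathbf{x}$, the offset $b$ is constant for each unit, so the unit output is a monotonic function of the linear filter response $\mathbf{z}_i^{\top}\mathbf{x}$.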

*SEM*). We also tested the saliency model by Itti et al. (1998) on our data set and found its performance to be 0.62 (±0.022 *SEM*). Furthermore, tested on the office stimuli of our second control experiment, our model, while trained on natural images, still led to 0.62 (±0.010 *SEM*), whereas the model by Itti et al. yielded 0.57 (±0.024 *SEM*). Two important conclusions can be drawn from these results: first, our model is at least as predictive on natural scenes as the best existing models. Second, even if we disregard the admittedly not dramatic differences in predictivity, the models differ substantially in terms of complexity. The model by Itti et al. implements contrast and orientation features at multiple scales with lateral inhibition, while our model uses merely four features at a single scale within a simple feed-forward network. The good performance of our model on office scenes, on which the model was not trained, indicates that our model does not overfit. Rather, due to its simplicity, it seems to be more robust than Itti et al.'s model, yielding stable results if the stimulus type varies.

$s(\mathbf{x})$. The most and least salient 100 patches are shown in Figure 5. Patches in the left panel (a) are the most salient; those shown in the center panel (b) are the least salient.

*SEM*) in the 100 most salient patches and 0.044 (±0.001 *SEM*) in the 100 least salient patches. The second observation is that RMS contrast alone should not be equated with visual saliency. To illustrate this, the 100 least salient patches from panel (b) are plotted again in the right panel (c) of Figure 5, this time with their RMS contrast adjusted to that of the most salient patches in panel (a). This shows that the structure of the most salient patches tends to be of medium spatial frequency and localized at the patch centers. The structure of the least salient stimuli, on the other hand, is more ramp-like, i.e., not localized at the centers, but at the edges or corners of the patches, and has stronger low-frequency components, as shown in panel (d). Note that this behavior is not surprising, but reflects the structural differences between the excitatory and inhibitory perceptive fields in the model (Figure 4). In summary, we arrive at a similar conclusion as Krieger et al. (2000), who analyzed higher-order structure in saccade targets. They concluded that “the saccadic selection system avoids image regions which are dominated by a single orientated structure. Instead it selects regions containing different orientations, like occlusions, corners, etc.” (p. 208, first paragraph).

*result from the data* as those patterns which maximize predictivity: our perceptive fields are the optimal predictors for saccade targets. This is in contrast to previous studies which either assumed relevant structure by the choice of image features (Gabor or others), or used linear identification methods (Rajashekar et al., 2006; Tavassoli, van der Linde, Bovik, & Cormack, 2007).

*“Although the SC is best known for its role in the motor control of saccades, it appears to serve a more general function related to evaluating possible targets, defining the goal for orienting movements, and in updating the representation of the goal as the movement is executed. (p. 1450)”*. Thus the collicular pathway for eye-movement generation is actively involved in fixation target selection and, very likely, in the regulation of saccade initiation. The fact that our psychophysical perceptive fields not only resemble physiological receptive fields but also match important size and tuning properties of SC cells may therefore be more than a coincidence, and may be taken as evidence for a role of the SC in bottom-up visual saliency computations. We speculate that a substantial part of bottom-up saliency computations might be carried out sub-cortically, perhaps directly in the superior colliculus. Many previous models explicitly or tacitly (by the choice of oriented filters) assumed that visual saliency is computed in visual cortex. Our results suggest that bottom-up saliency-driven eye movements may be controlled and executed via a fast pathway involving the SC and that cognitively controlled top-down eye movements may be computed cortically.

- is extremely simple compared to previously suggested models;
- predicts human saccade targets in natural scenes at least as well as previously suggested models;
- generalizes to a novel image set better than previously suggested models;
- is free of prior assumptions regarding the shape, scale, or number of filters;
- can be implemented with optimal filters resembling those in the SC in shape, size and spatial frequency tuning, suggesting that bottom-up visual saliency may be computed sub-cortically in SC.

*Nature, 381*, 607–609.