Most natural scenes are too complex to be perceived instantaneously in their entirety. Observers therefore have to select parts of them and process these parts sequentially. We study how this selection and prioritization process is performed by humans at two different levels. One is the overt attention mechanism of saccadic eye movements in a free-viewing paradigm. The second is a conscious decision process in which we asked observers which points in a scene they considered the most interesting. We find in a very large participant population (more than one thousand) that observers largely agree on which points they consider interesting. Their selections are also correlated with the eye movement patterns of different subjects. Both are correlated with predictions of a purely bottom–up saliency map model. Thus, bottom–up saliency influences cognitive processes as far removed from the sensory periphery as the conscious choice of what an observer considers interesting.

*SD* = 9.8 years). All demographic information was self-reported.

*p*-values we found in the randomization tests were usually well below this level.

*F*(4, 3204) = 134.13, *MSE* = 8.39, *p* < .001. Paired-samples *t*-tests showed that first selections were the slowest, second selections were the fastest, and subsequent selections showed a steady increase in reaction time, all *p* < .01. One explanation for this pattern is that participants initially viewed the entire scene for an extended time to determine the most interesting location, as well as to locate the next few interesting regions before making their first selection. Consequently, the choice to make the subsequent selections was facilitated and the time shortened.

*number* of interest points close to others; we show that these numbers are significantly higher than would be expected in the absence of clustering. These assessments of clustering were chosen since they are similar to those used in previous work in this field. The fourth, and final, method we used is a standard k-means algorithm.

*x*, *y*) coordinate of each interest point location, and we determined its distance from every other interest point in the same image. We then calculated the fraction of interest points that were separated by a given number of pixels. We used bins of 10 pixels, i.e., we determined the fraction of interest points that were between 1 and 10 pixels from another interest point, then 11 to 20 pixels away, and so on up to 50 pixels (we chose this maximal distance since beyond it the clustering effects begin to be counteracted by the compensation that has to occur at sufficiently large distances).
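The distance binning described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `distance_bin_fractions` is hypothetical, and the fraction is computed over unordered point pairs, which is one reading of "fraction of interest points separated by a given number of pixels".

```python
import numpy as np

def distance_bin_fractions(points, bin_width=10, max_dist=50):
    """Fraction of point pairs whose separation falls in each distance bin.

    Bins are (0, 10], (10, 20], ..., (40, 50] pixels by default.
    Assumption: fractions are taken over all unordered pairs in one image.
    """
    pts = np.asarray(points, dtype=float)
    if len(pts) < 2:
        raise ValueError("need at least two points")
    # Pairwise Euclidean distances between all points of one image.
    diff = pts[:, None, :] - pts[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Keep each unordered pair once (upper triangle without the diagonal).
    pair_dist = dist[np.triu_indices(len(pts), k=1)]
    # Discard coincident points and pairs beyond the maximal distance.
    valid = (pair_dist > 0) & (pair_dist <= max_dist)
    # Bin index 0 holds (0, bin_width], index 1 the next bin, and so on.
    idx = np.ceil(pair_dist[valid] / bin_width).astype(int) - 1
    counts = np.bincount(idx, minlength=max_dist // bin_width)
    return counts / len(pair_dist)

# Three collinear points: two pairs are 10 px apart, one pair is 20 px apart.
fractions = distance_bin_fractions([(0, 0), (0, 10), (0, 20)])
```

Running the same function on a shuffled set of points would give the baseline against which clustering is judged.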

*F*(1, 96) = 710.04, *MSE* = 0.297, *p* < .001, which was due to more clustering in the actual than shuffled data set, as well as a main effect of distance, *F*(4, 384) = 65.70, *MSE* = 0.123, *p* < .001. Furthermore, there was a main effect of image type, *F*(3, 96) = 4.41, *MSE* = 0.324, *p* < .01. There was also an interaction between distance and data set, *F*(4, 384) = 177.75, *MSE* = 0.116, *p* < .001, as well as an interaction between distance and image type, *F*(12, 384) = 11.69, *MSE* = 0.123, *p* < .001. All of these effects were qualified by a Distance × Data Set × Image Type interaction, *F*(12, 384) = 14.44, *MSE* = 0.116, *p* < .001. Overall, the main finding is that for all image types interest point selections were closer together than what was predicted by the shuffled data set.

*F*(1, 96) = 1044.88, *MSE* = 101.60, *p* < .001, showing more clustering in the actual versus shuffled data set, a main effect of image type, *F*(3, 96) = 7.98, *MSE* = 114.97, *p* < .001, as well as a Data Type × Image Type interaction, *F*(3, 96) = 8.65, *MSE* = 101.60, *p* < .001. Independent samples *t*-tests showed that the interaction was based on there being more clustering in the building (*t*(48) = 2.35, *SE* = 4.07, *p* < .05; *t*(48) = 4.40, *SE* = 3.38, *p* < .001) and home interior (*t*(48) = 3.29, *SE* = 3.92, *p* < .05; *t*(48) = 5.69, *SE* = 3.19, *p* < .001) image types, compared to the fractal and landscape image types, respectively. Buildings and home interiors did not differ from each other, *p* > .10, nor did fractals and landscapes, *p* > .25. However, the main finding is that for all image types a larger fraction of interest points were part of a cluster than would be predicted by the shuffled data set.

*x*, *y*) locations, equal to the average number of interest points per actual image. The mean maximum silhouette value for participants' interest point selections for the 100 images was 0.587 (*SD* = 0.061). The mean maximum silhouette value for the random distribution was 0.415 (*SD* = 0.005). For each image, the best mean silhouette value for participants' data was above the 95th percentile of the random data, suggesting that for every image a greater degree of clustering was observed than expected by chance. Furthermore, the number of clusters corresponding to the maximum silhouette value was also different. For the selections made by the participants, the average number of clusters which best fit the data was 10.92 (*SD* = 3.10), while for the random distribution it was 2.17 (*SD* = 0.55). Once again, these results strongly suggest that the interest point selections were best described as clustering around several independent locations. Next, we examined whether clustering differed as a function of image type. A one-way ANOVA on the silhouette values was significant, *F*(3, 96) = 7.91, *MSE* = 0.003, *p* < .001. Pairwise comparisons revealed that the difference was caused by smaller values for the landscapes compared to buildings, *t*(48) = 2.18, *SE* = 0.84, *p* < .05, and home interiors, *t*(48) = 1.70, *SE* = 0.91, *p* < .1, consistent with the previous clustering results.
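The k-means and silhouette analysis can be illustrated with a short sketch. It is not the authors' implementation. The text only says "a standard k-means algorithm", so the farthest-point initialization, the range of cluster counts tried, and all function names here are assumptions made for the example.

```python
import numpy as np

def kmeans_labels(points, k, n_iter=100):
    """Lloyd's k-means. Farthest-point initialization is an assumption."""
    pts = np.asarray(points, dtype=float)
    centers = [pts[0]]
    for _ in range(1, k):
        d = np.linalg.norm(pts[:, None, :] - np.array(centers)[None, :, :], axis=-1)
        # Next center: the point farthest from all centers chosen so far.
        centers.append(pts[d.min(axis=1).argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        d = np.linalg.norm(pts[:, None, :] - centers[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        # Recompute centers; keep the old center if a cluster is empty.
        new_centers = np.array([
            pts[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def mean_silhouette(points, labels):
    """Mean silhouette value; singleton clusters contribute 0."""
    pts = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    clusters = np.unique(labels)
    scores = np.zeros(len(pts))
    for i in range(len(pts)):
        same = labels == labels[i]
        if same.sum() == 1:
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = dist[i, same].sum() / (same.sum() - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()

def best_cluster_count(points, k_values):
    """Return (k, silhouette) for the k with the maximum mean silhouette."""
    results = {}
    for k in k_values:
        labels = kmeans_labels(points, k)
        if len(np.unique(labels)) > 1:
            results[k] = mean_silhouette(points, labels)
    best_k = max(results, key=results.get)
    return best_k, results[best_k]
```

Applying `best_cluster_count` to the interest points of one image and to an equally sized set of random locations would give the two quantities compared in the text: the maximum silhouette value and the corresponding number of clusters.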

*x*, *y*) coordinates of all interest point selections for each image were determined, and the saliency values from that image's SM were extracted separately for the five selections. These values formed the actual distribution.

*t*-tests tested whether the chance-adjusted saliency was greater than zero for the first five selections. All differences were significant, all *p* < .001, showing that for all selections participants selected areas higher in saliency than would be expected by chance. Next, a repeated measures ANOVA was conducted with selection number (1, 2, 3, 4, 5) as a within-subjects variable and image type as a between-subjects variable. The only reliable effect was a main effect of selection number, *F*(4, 384) = 12.83, *MSE* = 70.403, *p* < .001, as earlier selections showed a higher chance-adjusted saliency value than later selections. Overall, these results show that participants do select salient locations as being interesting more so than would be expected by chance, and that earlier selections are more influenced by image saliency than later selections.

*P* and *Q*, both of dimension *M*, *N*, as *P* = *IM* and *Q* = *SM* in Equation 1) divided by the square root of the product of the autocorrelations,

*p*-value of .05. The mean value of the actual distribution was 0.368 (*SD* = 0.141). Overall, 61 out of the 100 comparisons were above the random sampling distribution curve at the 95th percentile, suggesting that over half of the cross-correlation values between the IMs and SMs are greater than what would be expected by chance (see Figure 8B). If we assume sampling from a Bernoulli process, with 61 out of 100 instances that had a probability of 0.05 (corresponding to 61 out of 100 images being above the 95th percentile), we would obtain this result with a probability of 5 × 10^{−53}.
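The Bernoulli-process calculation can be checked with a few lines of Python. This is a sketch using the binomial probability mass function; the function names are illustrative, and whether the reported figure is the probability of exactly 61 successes or of 61 or more is not stated in the text, so both are shown.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def binomial_upper_tail(k, n, p):
    """Probability of k or more successes in n independent trials."""
    return sum(binomial_pmf(j, n, p) for j in range(k, n + 1))

# 61 of 100 images above the 95th percentile, each with chance probability 0.05.
exact = binomial_pmf(61, 100, 0.05)
at_least = binomial_upper_tail(61, 100, 0.05)
```

Both quantities come out on the order of 10^{−53}, in line with the value reported above; the upper tail is only slightly larger than the exact-count probability because the terms fall off rapidly.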

*SD* = 21), and there was a mean of 12.89 (*SD* = 3.11) fixations per trial.

*x*, *y*) coordinates of the participants' first 10 fixation locations. These values formed the actual distribution. The SMs were identical to the ones used in Experiment 1. The chance sampling distribution was also created in the same way as Experiment 1. That is, the values of the SM for each image were extracted from the fixation locations from all other images. Finally, to determine whether participants fixated more salient regions than would be expected by chance, we calculated the difference between the mean of the actual distribution and the mean of the chance sampling distribution for each image, the chance-adjusted saliency for fixation locations.
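As a rough illustration of the chance-adjusted saliency just described, the sketch below reads an image's saliency map at its own fixation locations (the actual distribution) and at the fixation locations recorded on all other images (the chance distribution), then takes the difference of the means. The function names, the list-based data layout, and the assumption that all images share the same pixel dimensions are choices made for this example only.

```python
import numpy as np

def values_at(sal_map, points):
    """Read map values at (x, y) pixel coordinates (rounded to integers)."""
    pts = np.rint(np.asarray(points, dtype=float)).astype(int)
    return sal_map[pts[:, 1], pts[:, 0]]  # row index = y, column index = x

def chance_adjusted_saliency(saliency_maps, locations, image_index):
    """Mean saliency at an image's own locations minus the chance mean.

    saliency_maps: list of 2-D arrays, one per image (same shape assumed).
    locations:     list of (x, y) location lists, one per image.
    """
    sm = saliency_maps[image_index]
    actual = values_at(sm, locations[image_index])
    # Chance: sample this image's map at locations chosen on the other images.
    others = np.vstack([
        np.asarray(loc, dtype=float)
        for j, loc in enumerate(locations) if j != image_index
    ])
    chance = values_at(sm, others)
    return actual.mean() - chance.mean()

# Toy example: image 0 has one salient pixel, and it is the one fixated.
sm0 = np.zeros((4, 4)); sm0[2, 1] = 1.0
sm1 = np.full((4, 4), 0.5)
fix = [[(1, 2)], [(0, 0), (3, 3)]]
adjusted = chance_adjusted_saliency([sm0, sm1], fix, 0)
```

Using locations from other images as the baseline keeps any general spatial bias, such as a preference for the image center, in both distributions, so a positive difference reflects the content of the specific image.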

*t*-tests tested whether the chance-adjusted saliency was greater than zero for the first 10 fixations. All differences were significant, all *p* < .001, showing that for the first 10 fixations participants fixate areas higher in saliency than would be expected by chance. Next, a repeated measures ANOVA was conducted with fixation number (1–10) as a within-subjects variable and image type as a between-subjects variable. The only reliable effect was a main effect of fixation number, *F*(4, 384) = 3.50, *MSE* = 25.783, *p* < .001, as the first fixation had a lower chance-adjusted saliency value than all other fixations. The main effect of image type did approach significance, *F*(3, 96) = 2.62, *MSE* = 293.246, *p* = .055, and pairwise comparisons showed that the chance-adjusted saliency value was smaller for home interiors compared to buildings and fractals. Overall, our results show that for our image set participants do fixate regions of high saliency above chance.

*SD*), was placed around each fixation location. Instead of the fixed weight used for interest maps, for the computation of fixation maps the Gaussian was weighted proportionally to the length of that fixation. Thus, longer fixations received a higher total value in the FM than shorter fixations. Then, the total values of each map were normalized so that the sum of all values for a given map equaled unity, and the overall mean subtracted.
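The construction of a fixation map described above might look like the following sketch. The Gaussian width is not recoverable from the text here, so `sigma` is left as a free parameter, and the function name and argument layout are hypothetical.

```python
import numpy as np

def fixation_map(shape, fixations, durations, sigma):
    """Duration-weighted sum of Gaussians, normalized and mean-subtracted.

    shape:     (height, width) of the image in pixels.
    fixations: iterable of (x, y) fixation locations.
    durations: fixation durations, used as Gaussian weights.
    sigma:     Gaussian standard deviation in pixels (value is an assumption).
    """
    height, width = shape
    yy, xx = np.mgrid[0:height, 0:width]
    fm = np.zeros(shape, dtype=float)
    for (x, y), dur in zip(fixations, durations):
        # Longer fixations contribute a proportionally larger Gaussian.
        fm += dur * np.exp(-((xx - x) ** 2 + (yy - y) ** 2) / (2.0 * sigma ** 2))
    fm /= fm.sum()          # total of the map equals unity
    return fm - fm.mean()   # then subtract the overall mean

fm = fixation_map((50, 60), [(10, 20), (40, 30)], [100, 300], sigma=5)
```

After the mean is subtracted the map sums to zero, which is convenient for the cross-correlation analysis because a uniform map then correlates with nothing.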

*SD* = 0.145). Overall, 33 out of the 100 actual comparisons were above the random cross-correlation distribution curve at the 95th percentile, suggesting that about one third of the cross-correlation values between fixation and saliency maps were higher than expected by chance (see Figure 9B). The probability of obtaining this result by chance, again computed from the binomial distribution as in the Comparing interest points and image saliency section, is 1 × 10^{−18}.
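The map comparison behind these counts can be sketched as a normalized cross-correlation at zero shift: the sum of the element-wise product of two maps divided by the square root of the product of their autocorrelations. The code below is an illustration under that reading. The function names are hypothetical, and how the random cross-correlation distribution is generated is left to the caller because it is not specified in this passage.

```python
import numpy as np

def normalized_cross_correlation(p, q):
    """Zero-shift cross-correlation of two equally sized maps, in [-1, 1]."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    if p.shape != q.shape:
        raise ValueError("maps must have the same shape")
    # Square root of the product of the two autocorrelations at zero shift.
    denom = np.sqrt((p * p).sum() * (q * q).sum())
    if denom == 0:
        raise ValueError("at least one map is identically zero")
    return (p * q).sum() / denom

def count_above_percentile(actual_values, random_values, percentile=95):
    """Number of actual values above a percentile of the random distribution."""
    threshold = np.percentile(random_values, percentile)
    return int((np.asarray(actual_values) > threshold).sum())

# A zero-mean map correlates perfectly with itself and inversely with its negative.
m = np.array([[1.0, -1.0], [-1.0, 1.0]])
same = normalized_cross_correlation(m, m)
opposite = normalized_cross_correlation(m, -m)
```

The count returned by `count_above_percentile` is the quantity that enters the binomial calculation, for example 33 out of 100 for fixation and saliency maps.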

*t*-tests tested whether the chance-adjusted interest value was greater than zero for the first 10 fixations. All differences were significant, all *p* < .001, showing that for the first 10 fixations participants fixate areas higher in interest value than would be expected by chance. Next, a repeated measures ANOVA was conducted with fixation number (1–10) as a within-subjects variable and image type as a between-subjects variable. A main effect of fixation number was found, *F*(4, 864) = 11.22, *MSE* = 48.295, *p* < .001, as interest values were higher for fixations after the first. The main effect of image type was marginally significant, *F*(3, 96) = 2.18, *MSE* = 296.770, *p* = .095. The Fixation Number × Image Type interaction was also significant, *F*(27, 864) = 2.00, *MSE* = 48.295, *p* < .01. This interaction could be understood by comparing the interest values for the first and second fixations for fractals to all other image types. For fractals, the chance-adjusted interest value for the first two fixations was not significantly different (paired samples *t*-test, *t*(24) = 0.294, *SE* = 2.25, *p* = .771). For the other three image types, the chance-adjusted interest value was significantly higher for the second compared to the first fixation, all *p* < .01. Overall, these results show that participants in Experiment 2 were likely to fixate image regions that participants in Experiment 1 found interesting.

*SD* = 0.113). Overall, 98 out of the 100 actual cross-correlations were larger than the value at the 95th percentile of the random sampling distribution (see Figure 10B), suggesting that practically all of the images had a higher cross-correlation between their corresponding interest and fixation maps than would be predicted by chance. The probability of obtaining this result by chance as computed from the binomial distribution, as in the Comparing interest points and image saliency section, is 1 × 10^{−124}.

*t*(99) = 3.07, *SE* = 0.63, *p* < .01). A closer look at each image category separately reveals that this holds also individually for the three natural image types (landscapes, interiors, buildings; *t*-test, all *p* < .05) though not for fractals. Note that this is not the case for the raw (non-adjusted) saliency values which were found to be not significantly different between first and second fixations, a result that holds for each of the image categories separately as well as when they are collapsed (*t*-tests, all *p* > .15). Furthermore, in both studies the chance-adjusted saliency for fractals is found to be greater for the first fixation than for natural scenes (collapsed over the three classes of natural scenes, *t*(98) = 2.14, *SE* = 1.51, *p* < .05). This is consistent with an interpretation in which the top–down effects gain in importance as additional information from previous fixations is processed, and also with the intuitively plausible assumption that top–down effects play a greater role in the semantically “meaningful” natural scenes (landscapes, buildings, interiors) than in fractals. Therefore, the purely bottom–up saliency map model performs better on earlier than later fixations, and better on fractals than on natural scenes.

*covert* attention have used *overt* attention, i.e., eye movements (e.g., Parkhurst et al., 2002), demonstrating that bottom–up scene content influences attentional selection. In the present work, we go one step further away from the sensory periphery and investigate whether the purely bottom–up predictions of the saliency map model have predictive value even for consciously made decisions about what constitutes an “interesting” area of an image. Given the high level of abstraction of this concept (whose precise definition is deliberately left to the individual human observers), it was expected that inter-individual differences would be substantial. We therefore developed a novel, Internet-based approach that allowed us to use a very large population of observers (Experiment 1). Together with results from a more traditional study of eye movements (Experiment 2) on the same image set, we can construct maps for bottom–up saliency (SM), fixations (FM), and conscious selection of interest (IM). For the sake of illustration, we show one example image together with all three maps derived from this image (Figure 11). Our major results, to be discussed below, are the relationships between these three maps, and the quantitative determination of the inter-individual differences between points selected as interesting.

*not* idiosyncratic but reflects something inherently significant about certain areas of scenes. This is even more remarkable considering the differences in viewing angle, monitor quality, and other uncontrolled factors that could have led to vastly different responses between participants who responded on their personal computers via the Internet. These factors are nearly certain to have added variability and “noise” to the data so that the true degree of agreement between observers is likely even higher than what we measured. Significant consistency between observers was also found in a recent study (Cerf, Cleary, Peters, Einhäuser, & Koch, 2007) where observers rated the saliency of a whole image relative to other images of the same category (rather than parts of an image relative to other parts of the same image, as in the present study). Significant inter-observer consistency was also observed when the same image set was tested again a year later when participants reported not having explicit memory of the specific images they saw. These results support our hypothesis that it is objective criteria, rather than idiosyncratic decisions, that determine what people find subjectively interesting.

*p* < .001) for all three of these relations (Figures 8, 9, and 10).

*other* individuals).

*per se*, in the sense that the information is part of the image and does not reflect an idiosyncratic preference or task goal. For instance, the features of a car in an image would be the same whether or not the person is searching for a car, but these features are part of the image itself and could be pre-selected to aid in the detection of that object if that was the observer's goal. Thus, attention in natural scenes, as measured by interest point selections, may be guided by an intermediate stage between bottom–up and top–down information, which is important for the formation of stable object representations. Furthermore, recent physiological results show that figure–ground segregation does not depend on selective attention (Qiu, Sugihara, & von der Heydt, 2007). A theoretical model for a neural substrate of this and other mechanisms of intermediate vision has been suggested recently in a computational model (Craft, Schütze, Niebur, & von der Heydt, 2007).