How do we decide which objects in a visual scene are more interesting? While intuition may point toward high-level object recognition and cognitive processes, here we investigate the contribution of a much simpler process: low-level visual saliency. We used the *LabelMe* database (24,863 photographs with 74,454 manually outlined objects) to evaluate how often interesting objects were among the few most salient locations predicted by a computational model of bottom-up attention. In 43% of all images, the model's predicted most salient location fell within a labeled region (chance: 21%). Furthermore, in 76% of the images (chance: 43%), one or more of the top three salient locations fell on an outlined object, with performance leveling off after six predicted locations. The bottom-up attention model has no notion of object or of semantic relevance. Hence, our results indicate that selecting interesting objects in a scene is largely constrained by low-level visual properties rather than solely determined by higher cognitive processes.

We define *interesting* objects or image regions as those which, among all items present in a digital photograph, people choose to label when given a fairly unconstrained image annotation task (details below). The assumption that people choose to label interesting objects rests simply on the fact that there must be some motivation for labeling one region (whether an object or not) over another.

The *LabelMe* database was created for people to annotate objects in scenes (Figure 1). The scenes are submitted by various contributors and depict many indoor and outdoor locations; images include outdoor scenes of cities around the world, office buildings, parking lots, indoor offices and houses, country scenes, and many more. Examples of scenes and associated object outlines from the database are shown in Figure 1 and in Russell et al. (2005). As can be seen, the labeled objects range from being in plain view and well lit to being partly occluded, low contrast, or distorted. Anyone can contribute new images to this shared repository, manually trace the outlines of scene elements using simple computer-based tracing tools, and associate semantic labels with the outlined elements (e.g., a car, a dog). The database and associated tracing and labeling tools are freely available on the World Wide Web. Their original purpose is to collect ground-truth data about objects in complex natural scenes and to train computational object recognition algorithms. Note that the only criteria given for labeling are to make “nice labels” (i.e., outlining whole objects somewhat precisely, as opposed to just drawing a rectangular bounding box), and contributors certainly are not instructed to find or to label objects which are more salient. However, it can be argued that a selection bias might exist toward submitting and labeling objects that are inherently salient compared with the domain of all possible objects. Whether such a saliency-driven bias indeed exists is the main research question posed by the present study: by examining a large number of scenes, we attempt to detect this bias and to determine whether it arises from top-down or bottom-up processes. In addition, the images were viewed on various computers with various display sizes and color properties.
As a result, this data set provides a good indication of what people find “generically interesting,” in the absence of a particular task, in uncontrolled conditions, under no time pressure, and outside the laboratory. The flip side is that the data set is highly heterogeneous and possibly very noisy, which might mask the effects being investigated here. At the time of this writing, 74,454 objects have been annotated by several human annotators in 24,863 scenes. In this paper, we propose to use this massive collective data set to approximate which scene elements may be more generically interesting to a human observer: presumably, the scene elements which were outlined by the annotators were more interesting than the many other scene elements which were not. The high heterogeneity and inherently uncontrolled nature of this data set regarding image types, resolution, annotators, image display conditions, tracing tools used, etc., is such that any bias robust enough to be detected in the present study could reasonably be expected to generalize to other data sets.

Given the origins of the *LabelMe* project, it can be inferred that most images are labeled by computer science researchers interested in vision. As a result, the data set likely contains objects and scenes that would be difficult for various vision algorithms to recognize.

The *LabelMe* database at the time of testing consisted of a total of 24,863 labeled scenes, of which 7,719 were single-shot static scenes and 17,144 were images from sequence (video) scenes. Image size varied from 130 × 120 to 4,423 × 3,504 pixels for the static scenes (*M* = 1,549.90, *SD* = 1,321.48 × *M* = 898.23, *SD* = 680.06) and from 160 × 120 to 1,024 × 768 pixels for the sequence scenes (*M* = 681.28, *SD* = 156.90 × *M* = 458.72, *SD* = 106.09). Within all these scenes, 74,454 objects were labeled, from 1 to 87 per image (*M* = 8.90, *SD* = 11.30). Each labeled object occupied between <0.01% and 99.82% of the total image area (*M* = 7.32, *SD* = 9.95), and the union of all labeled objects in each image occupied between <0.01% and 99.99% of the total image area (*M* = 20.82, *SD* = 0.12). On average, about 20% of each image was labeled. Sometimes object outlines overlapped within an image (e.g., a desk was outlined, and also a computer on that desk), which slightly complicated our analysis (see below). Lastly, the order in which people chose to label the objects is captured in the data set; this is later used as a measure of how salient the first labeled objects were. Examples of scenes and associated object outlines from the database are shown in Figure 1.

For each image in the *LabelMe* data set, a saliency map was computed according to the algorithm proposed by Itti et al. (Itti & Koch, 2000; Itti et al., 1998). The saliency map was then inspected algorithmically in several ways to reveal whether or not it preferentially highlighted the labeled objects over other image regions. The results were then compared to chance to indicate how difficult the task of selecting labeled objects in the data set was.
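The core per-image test can be sketched in a few lines. The helper name `first_hit` and the toy arrays below are ours, not from the paper's code; a real run would use an Itti-Koch saliency implementation and the polygon masks derived from the *LabelMe* annotations:

```python
import numpy as np

def first_hit(saliency_map, label_mask):
    """Return True if the most salient location falls inside any labeled region.

    saliency_map : 2-D float array (e.g., output of an Itti-Koch model)
    label_mask   : 2-D bool array, True wherever an object outline covers the pixel
    """
    # argmax gives a flat index; unravel it back to (row, col) coordinates.
    y, x = np.unravel_index(np.argmax(saliency_map), saliency_map.shape)
    return bool(label_mask[y, x])

# Toy example: a single bright spot inside the labeled region counts as a hit.
sal = np.zeros((10, 10))
sal[3, 4] = 1.0                      # peak saliency at (row 3, col 4)
mask = np.zeros((10, 10), dtype=bool)
mask[2:5, 3:6] = True                # one labeled object
print(first_hit(sal, mask))          # True
```

Repeating this test over all images and averaging the boolean outcomes yields the hit rates reported below.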

If only a small fraction of each image's area is covered by labeled objects (as is the case in the *LabelMe* data set), then choosing a point at random would have a low probability of scoring a hit. Assuming a uniform probability of picking a random location anywhere in the image, the chance probability of a hit is given by the ratio of the area of the union of all labeled regions to the area of the entire image. It is this ratio which we computed for every image and which we report. As a control, we confirmed that identical values (to within 0.1%) were obtained by picking 100 uniformly distributed random locations in each scene and testing whether each belonged to a labeled object. In addition, a random map was created to obtain the chance results for the multiple-location experiments; this process is explained later in the paper.
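The chance computation and its Monte Carlo control can be illustrated on a synthetic label mask (the mask and the sample count here are illustrative; the paper's control used 100 random points per image):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical labeled-region mask: union of all object outlines in one image.
mask = np.zeros((100, 100), dtype=bool)
mask[20:40, 10:60] = True            # labeled objects cover 20*50 = 1,000 pixels

# Analytic chance probability: labeled area divided by total image area.
p_analytic = mask.mean()             # 0.10 for this mask

# Monte Carlo control: sample uniform random locations, count the hits.
ys = rng.integers(0, mask.shape[0], size=100_000)
xs = rng.integers(0, mask.shape[1], size=100_000)
p_mc = mask[ys, xs].mean()

print(p_analytic, round(p_mc, 3))    # the two estimates agree closely
```

With enough samples the Monte Carlo estimate converges to the area ratio, which is why the paper's two chance computations agreed to within 0.1%.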

To assess statistical significance, a binomial test was used to compute a *z* score. This test was chosen due to the binary nature of the data: each image yields either a hit or a miss, so the *z* score indicates whether the observed probability of a hit in a given trial exceeds chance. Since the number *N* of images is much greater than 10, the normal approximation to the binomial distribution was used:

$$z = \frac{X - p}{\sqrt{P(1 - P)/N}},$$

where *X* is the hit rate that the saliency map obtained, *p* is the hit rate obtained by chance, *P* is the probability of hitting an object in a given image under the null hypothesis, and *N* is the number of images.
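As a sketch, the normal-approximation *z* test can be computed as below (the function name `binomial_z` is ours). Plugging in the reported hit and chance rates for the sequence scenes (44.45% vs. 18.73%, *N* = 17,144) closely reproduces the reported *z* = 86.32, which suggests this is the form of the test used:

```python
import math

def binomial_z(x_hat, p_chance, n):
    """One-sample z score: is the hit rate x_hat above the chance rate p_chance
    over n independent images?

    Uses the normal approximation to the binomial distribution, valid for large n.
    """
    se = math.sqrt(p_chance * (1.0 - p_chance) / n)
    return (x_hat - p_chance) / se

# Sequence scenes: 44.45% hits vs. 18.73% chance over 17,144 images.
print(round(binomial_z(0.4445, 0.1873, 17_144), 2))  # close to the reported z = 86.32
```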

A Welch's *t* test (Welch, 1947) was also used to test statistical significance when the saliency map was used to detect all of the objects in a given scene. This test is similar in nature to Student's *t* test, but it accounts for the fact that the two sets of data have different variances:

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{s_1^2/N_1 + s_2^2/N_2}},$$

where $\bar{X}_1$ and $\bar{X}_2$ are the means of the saliency map results and chance results, respectively, $s_1^2$ and $s_2^2$ are the variances of the saliency map results and chance, respectively, and $N_1 = N_2$ is the number of images, and where the degrees of freedom were calculated as

$$\mathrm{df} = \frac{\left(s_1^2/N_1 + s_2^2/N_2\right)^2}{\frac{\left(s_1^2/N_1\right)^2}{N_1 - 1} + \frac{\left(s_2^2/N_2\right)^2}{N_2 - 1}}.$$
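A minimal sketch of Welch's *t* statistic and the Welch-Satterthwaite degrees of freedom; the numeric values below are illustrative, not taken from the paper:

```python
import math

def welch_t(mean1, var1, n1, mean2, var2, n2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    se2_1, se2_2 = var1 / n1, var2 / n2          # squared standard errors
    t = (mean1 - mean2) / math.sqrt(se2_1 + se2_2)
    df = (se2_1 + se2_2) ** 2 / (
        se2_1 ** 2 / (n1 - 1) + se2_2 ** 2 / (n2 - 1)
    )
    return t, df

# Illustrative comparison of a "saliency" group vs. a "chance" group with
# unequal variances (made-up numbers for demonstration only):
t, df = welch_t(0.60, 0.04, 1000, 0.35, 0.09, 1000)
print(round(t, 2), round(df, 1))
```

Note that the degrees of freedom are generally non-integer and smaller than $N_1 + N_2 - 2$, which is why the *t* values reported below carry unusual df figures.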

This analysis was first applied to all scenes of the *LabelMe* data set. We found that the saliency map was highly significantly above chance at preferentially attending to labeled objects. Indeed, the hit rate (the percentage of images where the location of maximum saliency, and hence the first attended location, was within a labeled object) was 43.29%, 40.07%, and 44.45% for all the scenes, the static scenes, and the sequences, respectively. These values were about twice chance, computed as *M* = 20.82%, *SD* = 0.08%; *M* = 25.46%, *SD* = 0.07%; and *M* = 18.73%, *SD* = 0.12% for all, static, and sequences, respectively. A binomial test indicated that the hit rates for the saliency map were statistically significantly above chance with *z* = 87.27, *p* = 0.001; *z* = 30.74, *p* = 0.001; and *z* = 86.32, *p* = 0.001 for all, static, and sequences, respectively.

We then repeated the analysis over the *LabelMe* data set with the knowledge of the overall object label bias image. We found that the saliency map was again highly significantly above chance at preferentially attending to labeled objects. Indeed, the hit rate (the percentage of images where the location of maximum saliency, and hence the first attended location, was within a labeled object) was 55.33% for all images, while chance was computed as *M* = 26.40%, *SD* = 2.65%. A binomial test indicated that the hit rate for the saliency map was statistically significantly above chance with *z* = 105.80, *p* = 0.001. It is worth noting that even if we compare the unbiased version of the saliency map with the biased randomized model, the results are still significantly above chance: the saliency model still picks a labeled object 43.29% of the time versus 26.40% for biased-random, a significant advantage for saliency (binomial test, *z* = 60.41, *p* = 0.001).

The *LabelMe* database contains substantially more sequence images than static images; nevertheless, the results were similar for both subsets, so our conclusions drawn from the entire data set are generally applicable to static scenes, sequence scenes, and both combined.

Binomial tests again indicated that the hit rates were statistically significantly above chance, with *z* = 99.44, *p* = 0.001; *z* = 44.88, *p* = 0.001; and *z* = 89.65, *p* = 0.001 for all, static, and sequences, respectively.

When the saliency map was used to detect all of the labeled objects in each scene, it detected more than *M* = 60.07%, *SD* = 0.26% of all labeled objects in all the scenes, more than *M* = 72.25%, *SD* = 0.29% in the sequence scenes, and over *M* = 33.01%, *SD* = 0.38% in the static scenes. The values of chance computed were *M* = 35.21%, *SD* = 0.26%; *M* = 42.11%, *SD* = 0.33%; and *M* = 19.89%, *SD* = 0.32% for all, sequences, and static scenes, respectively. A Welch's *t* test indicated that these values are statistically significantly above chance with *t*(49,723) = 67.99, *p* = 0.001; *t*(33,729) = 68.23, *p* = 0.001; and *t*(15,000) = 28.18, *p* = 0.001 for all, static, and sequences, respectively.

In the *LabelMe* data set, users were not given a search task and were free to label anything they wanted. This could have partly removed the influence of contextual information, although some contextual influence likely remained: people expect certain objects to exist in a particular scene, which might bias them toward those objects. In addition, other factors could have affected the selection of one object over another. For example, a less complex object might be labeled because it is easier to outline than a more complex object that might be more salient (e.g., a dull ball vs. a salient human), or an object might be labeled because of a more sophisticated measure of subjective interest (e.g., deriving from object recognition or higher-level mental processes). However, our further results examining several successive shifts of the region of interest (ROI) indicate that, by considering just the top three most salient locations, one would hit at least one labeled object in 76% of all images. This is a remarkable figure given the simplicity of the saliency mechanism implemented here.
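The top-three analysis can be sketched as follows; the square suppression window here is a crude stand-in for the winner-take-all and inhibition-of-return mechanism of the actual model, and all names and toy values are ours:

```python
import numpy as np

def topk_hit(saliency_map, label_mask, k=3, suppress=5):
    """Check whether any of the top-k salient locations falls on a labeled object.

    After each pick, a square neighborhood of radius `suppress` pixels is zeroed
    out so that the next argmax selects a different part of the map.
    """
    sal = saliency_map.copy()
    for _ in range(k):
        y, x = np.unravel_index(np.argmax(sal), sal.shape)
        if label_mask[y, x]:
            return True
        sal[max(0, y - suppress):y + suppress + 1,
            max(0, x - suppress):x + suppress + 1] = 0
    return False

# Toy example: the global peak misses the object, but the second peak hits it.
sal = np.zeros((50, 50))
sal[10, 10] = 1.0                     # most salient point, outside the object
sal[30, 30] = 0.8                     # second peak, inside the object
mask = np.zeros((50, 50), dtype=bool)
mask[25:35, 25:35] = True
print(topk_hit(sal, mask, k=3))       # True
```

Averaging this boolean over all images, for increasing k, yields the curve that levels off after about six predicted locations.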

An open question is whether a similar saliency-driven bias also governs which images are submitted in the first place, in *LabelMe* and in other data sets. This could, for example, be investigated by comparing the distribution of absolute peak salience values in all *LabelMe* images to those in a broader, representative, and unbiased sample of all images of the world. Obviously, a difficulty here is in gathering such a representative sample (e.g., using cameras placed at randomly drawn GPS coordinates and mounted on randomly controlled motorized pan/tilt heads).

Finally, the saliency model used here was originally developed to predict the eye movements of human observers, a task quite different from the free annotation of the *LabelMe* images evaluated in this paper. The individuals who labeled the images had no time constraints and hence could first look at every location they liked and then decide which ones to label. Furthermore, their task was to label as many objects in the images as possible, which had nothing to do with saliency or attention. Nevertheless, our results demonstrate that, as most people labeled only objects that they thought were of interest, they tended to label objects whose locations were predicted simply by their salient properties, that is, objects they would likely have fixated first.