Abstract
People spend a significant amount of their time freely viewing the world in the absence of a task. The dominant class of models attempting to explain this free-viewing behavior computes saliency, a measure of local feature contrast in an image, to obtain a strictly bottom-up attention priority map. Our contention is that the directionality of attention control may be exactly opposite: free viewing may be guided by a top-down control process that we refer to as multiple-object search. Unlike standard search, in which there is typically only a single target, multiple-object search distributes the target goal over several objects, thereby diluting the contribution of any one and creating a diffuse object-priority signal. To compute this signal we borrowed computer vision methods for localizing a trained object class in an image by backpropagating activity from a high layer of a deep network to lower layers closer to the pixel space. Several object-localization methods exist, but we chose STNet (Biparva & Tsotsos, 2017) because it is inspired by the brain’s attention mechanism. Using STNet we computed an object localization map for each of the 1000 ImageNet categories, which we averaged to create one top-down objectness map. We evaluated our method by predicting the free-viewing fixations in the MIT-ICCV dataset of 1003 scenes. For each scene, the location of maximum object-map activity was selected for fixation, followed by spatial inhibition and the iterative selection of the next most active location until six-fixation scanpaths were obtained. We also obtained scanpath predictions from several bottom-up saliency models. Using vector similarity for scanpath comparison, we found that predictions from objectness maps were as good as those from saliency maps, with the best predictions obtained by combining the two. This suggests that top-down attention control signals originating from learned object categories may influence even ostensibly task-free viewing behavior.
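The scanpath procedure described above (pick the maximum of the priority map, inhibit that region, repeat) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the disk-shaped inhibition and its radius are assumptions introduced here for concreteness.

```python
import numpy as np

def predict_scanpath(priority_map, n_fixations=6, inhibition_radius=20):
    """Winner-take-all fixation selection with spatial inhibition of return.

    Repeatedly selects the (row, col) location of maximum map activity,
    then suppresses a disk around it (hypothetical radius) so the next
    most active location is selected on the following iteration.
    """
    m = np.asarray(priority_map, dtype=float).copy()
    h, w = m.shape
    yy, xx = np.mgrid[0:h, 0:w]
    fixations = []
    for _ in range(n_fixations):
        y, x = np.unravel_index(np.argmax(m), m.shape)
        fixations.append((int(y), int(x)))
        # Spatial inhibition: remove the selected region from competition.
        m[(yy - y) ** 2 + (xx - x) ** 2 <= inhibition_radius ** 2] = -np.inf
    return fixations
```

The same routine applies whether the input is the averaged top-down objectness map, a bottom-up saliency map, or a combination of the two.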