Abstract
Three sources of guidance have been proposed to explain the deployment of attention during visual search tasks. (1) Saliency reflects the capture of attention by regions of an image that differ from their surroundings in low-level features (e.g., Itti & Koch, 2000). (2) Attention may also be guided towards image regions that look like the search target (Wolfe, 2007); for example, attention may be directed towards red objects when searching for a red target. (3) The context of a scene is also likely to guide attention: in the real world, objects are constrained to appear in particular locations (for example, cars appear on streets), so attention may be guided to these locations during search (Torralba et al., 2007).
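As a sketch of how these three terms might be formalized (in the spirit of Torralba et al.'s Bayesian framework; the factorization below is illustrative, not this study's exact definition), the probability of fixating location x in image I can be written as the product of a saliency term, a target-feature term, and a context term:

\[
p(x \mid I) \;\propto\; \underbrace{\frac{1}{p(f_x)}}_{\text{saliency}} \cdot \underbrace{p(f_x \mid \text{target})}_{\text{target features}} \cdot \underbrace{p(x \mid \text{scene})}_{\text{context}},
\]

where f_x denotes the local image features at x: rare features are salient, features resembling the target raise the second term, and the scene prior concentrates probability in plausible target regions (e.g., street level for pedestrians).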
We attempted to predict human search fixations using computational models of the three sources of guidance (saliency, target features, and scene context) in a large database of human fixation data (14 observers searching for pedestrians in 912 outdoor scenes). When tested individually, each model performed above chance, but scene context provided the best prediction of human fixation locations. A combined model incorporating all three sources of guidance outperformed each of the single-source models, with performance driven predominantly by the context model. The combined model performed at 94% of the level of human agreement in the search task, as measured by the area under the ROC curve.
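A hedged sketch of how such a combination and its ROC evaluation might be implemented (the multiplicative blending rule, the weights, and all names below are illustrative assumptions, not the study's actual code): fixated pixels are treated as positives, all other pixels as negatives, and the combined map's values serve as the classifier score.

import numpy as np
from sklearn.metrics import roc_auc_score

def normalize(m):
    """Rescale a guidance map to [0, 1]."""
    m = m.astype(float)
    return (m - m.min()) / (m.max() - m.min() + 1e-12)

def combined_map(saliency, target_features, context, weights=(1.0, 1.0, 1.0)):
    """Weighted multiplicative blend of three guidance maps
    (one plausible combination rule; the paper's exact rule may differ)."""
    s, t, c = (normalize(m) for m in (saliency, target_features, context))
    w_s, w_t, w_c = weights
    return normalize(s**w_s * t**w_t * c**w_c)

def fixation_auc(pred_map, fixations):
    """Area under the ROC curve: map values at fixated pixels are
    positives, values at all other pixels are negatives."""
    labels = np.zeros(pred_map.size, dtype=int)
    rows, cols = zip(*fixations)  # fixations: list of (row, col) pairs
    flat = np.ravel_multi_index((np.array(rows), np.array(cols)), pred_map.shape)
    labels[flat] = 1
    return roc_auc_score(labels, pred_map.ravel())

# Hypothetical usage with random stand-in maps and fixations:
rng = np.random.default_rng(0)
h, w = 64, 96
sal, tgt, ctx = rng.random((h, w)), rng.random((h, w)), rng.random((h, w))
fixs = [(rng.integers(h), rng.integers(w)) for _ in range(10)]
print(fixation_auc(combined_map(sal, tgt, ctx), fixs))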
We compared the performance of the three-source model of search guidance to an empirically derived model of scene context. For this comparison, a “context oracle” was created by asking human observers to specify the scene region where a target was most likely to appear. This context oracle predicted human fixations as well as the three-source computational model did. We discuss the implications of these results for future models of visual search.
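The context oracle itself admits a simple approximation (a sketch under an assumed annotation format: one binary mask per observer marking the region where the target is most likely to appear; the function name is hypothetical):

import numpy as np

def context_oracle(annotation_masks):
    """Average per-observer binary region masks (1 = 'target likely here')
    into a soft prediction map; stronger inter-observer overlap yields
    higher predicted guidance at that location."""
    stack = np.stack([m.astype(float) for m in annotation_masks])
    return stack.mean(axis=0)

The resulting map can then be scored against fixations with the same ROC procedure used for the computational models.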
Funded by NSF CAREER awards (0546262) to A.O. and (0747120) to A.T. B.H.S. is funded by an NSF Graduate Fellowship.