Abstract
Predicting human eye movements during search in complex natural images is an important part of understanding visual processing. Previous work has shown that several mechanisms influence the deployment of attention and eye movements: bottom-up cues (Itti & Koch 2001) as well as top-down feature-based cues (Wolfe 1994, Navalpakkam & Itti 2005) and contextual cues (Torralba 2003). How the visual system combines these cues and what neural circuits underlie this combination remain largely unknown.
Here, we consider a Bayesian framework to study the contributions of bottom-up cues and top-down (feature-based and contextual) cues during visual search. We implemented a computational model of visual attention that combines feature-based and contextual priors. The model relies on a realistic population of shape-based units in intermediate areas of the ventral stream (Serre et al. 2005), whose responses are modulated by prefrontal and parietal cortex. The resulting posterior probabilities over target locations are combined within parietal areas to generate a task-specific attentional map.
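To make this combination concrete, one illustrative way to write it (the notation below is ours and only a sketch, not necessarily the model's exact formulation; x denotes a candidate image location, I the local shape-based feature responses, G the global scene context, and O the sought target, e.g. a car or a pedestrian) is as a factorized posterior:

\[
  p(x \mid O, I, G) \;\propto\; \underbrace{p(I \mid x, O)}_{\text{feature-based cue}} \;\cdot\; \underbrace{p(x \mid O, G)}_{\text{contextual cue}}
\]

Under this reading, a purely bottom-up model corresponds to dropping the dependence on the target O and the scene context G, which is the comparison we turn to below.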
We compared the model's predictions with human psychophysics. Using a database of street-scene images, we instructed subjects to count either the cars or the pedestrians in each scene while we tracked their eye movements. We found a surprisingly high level of agreement among subjects' initial fixations, despite the presence of multiple targets. We also found that the pattern of fixations was highly task-dependent, suggesting that top-down, task-dependent cues played a larger role than bottom-up, task-independent cues.
We found that our proposed model, which combines bottom-up and top-down cues, predicts fixations more accurately than alternative models, including purely bottom-up approaches. Moreover, neither feature-based nor contextual mechanisms alone could account for human fixations. Our results thus suggest that human subjects efficiently use all available prior information when searching for objects in complex visual scenes.