A central issue in vision is the relative importance of bottom-up features versus top-down agendas in the control of gaze. Recently, a new approach to studying this question has become available: graphic models that simulate human visual capabilities can be used to develop synthetic models of visuo-motor behavior. Our specific model consists of a graphic humanoid walker navigating a cluttered sidewalk in an urban setting. The model uses a reinforcement learning protocol to direct gaze in the service of three subtasks: sidewalk centering, obstacle avoidance, and litter pickup. To test this model, we compared its performance with that of human walkers who completed the same navigation task by walking in the virtual environment. Gaze was tracked by an ASL tracker inserted in the optical pathway of a V-8 binocular HMD. Gaze locations overlaid on the video display were scored by human analysts. Video frames containing fixations were also analyzed by the Itti-Koch feature-based algorithm, which computes saliency and ranks five possible gaze locations. Human gaze locations were highly correlated with the predictions of the task model. To make the model comparable to the subjects' performance, its parameters were adjusted so that the model and the human subjects spent approximately the same time on the three subtasks. We then compared the gaze locations selected by the humans with those selected by the model and by the Itti-Koch saliency map. In a conservative analysis, fewer than 10% of the bottom-up gaze locations matched. If the criterion is broadened to include the object nearest the gaze location selected by the bottom-up model, still only 40% of the locations matched. In contrast, over 90% of the gaze locations could be interpreted as subserving one of the three subtasks. The large difference between the two models' predictions arises because vision must be used functionally in the navigation task. This result is consistent with other studies that use natural environments.
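As a concrete illustration of the matching analysis, the sketch below shows one way the conservative and broadened criteria could be computed from frame-by-frame gaze data. It is not the scoring procedure actually used (which relied on human analysts); the function names, the 2-degree tolerance, and the (label, center) object representation are assumptions introduced for illustration only.

    import numpy as np

    def match_rate(human_fixations, model_predictions, radius_deg=2.0):
        # Conservative criterion (assumed tolerance): a frame counts as a match
        # only if the model's predicted gaze point falls within radius_deg
        # visual degrees of the human fixation on that frame.
        human = np.asarray(human_fixations, dtype=float)
        model = np.asarray(model_predictions, dtype=float)
        dists = np.linalg.norm(human - model, axis=1)
        return float(np.mean(dists <= radius_deg))

    def nearest_object_match_rate(human_fixated_objects, model_predictions,
                                  objects_per_frame):
        # Broadened criterion: a frame counts as a match if the scene object
        # nearest the model's predicted gaze point is the object the human
        # fixated. Each frame's objects are given as [(label, (x, y)), ...].
        frames = list(zip(human_fixated_objects, model_predictions,
                          objects_per_frame))
        hits = 0
        for fixated, pred, objects in frames:
            labels = [label for label, _ in objects]
            centers = np.asarray([center for _, center in objects], dtype=float)
            dists = np.linalg.norm(centers - np.asarray(pred, dtype=float), axis=1)
            hits += int(labels[int(np.argmin(dists))] == fixated)
        return hits / max(len(frames), 1)

Under this reading, the conservative rate would correspond to the under-10% figure for the saliency model, the nearest-object rate to the 40% figure, and the same analysis applied to the task model's predictions to the over-90% figure reported above.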
Supported by NCRR grant RR-09283