Abstract
Decades of research highlight the importance of bottom-up, stimulus-driven guidance (e.g., saliency) and top-down, user-driven factors (e.g., target surface features, prior knowledge of likely target locations) in visual search. While these factors can be experimentally manipulated in simple abstract search arrays, it has been difficult to empirically derive unique predictions for distinct top-down factors in real-world scenes. As a first step toward addressing this issue, we developed two new approaches based on convolutional neural network models. The first extends the class activation mapping (CAM) approach (Zhou, Khosla, Lapedriza, Oliva, & Torralba, 2015) to compute “Average-CAM” maps, which capture variability in the diagnostic value of scene regions for predicting scene category membership. The second approach, which we term “Patch-Match”, maps the relatedness of scene regions to category-level target activations (e.g., any clock) and visual search target template activations (e.g., a specific clock). We demonstrate the value of these approaches by showing that the resulting maps explain unique variance, beyond that explained by saliency models, in human gaze patterns during visual search for real-world targets embedded in natural scenes.
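To make the Average-CAM idea concrete, the following is a minimal sketch, not the paper's exact procedure: it assumes a pretrained, CAM-compatible backbone (ResNet-18, whose final convolutional features feed global average pooling and a linear classifier, as required by Zhou et al.'s method) and interprets "Average-CAM" as per-image CAMs normalized and averaged across exemplars of one scene category. The function names and normalization choices are illustrative assumptions.

```python
# Hedged sketch of "Average-CAM": average class activation maps over
# category exemplars to estimate how diagnostic each scene region is.
# Backbone choice and normalization are assumptions, not the paper's spec.
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()

def cam(image, class_idx):
    """Class activation map for one preprocessed image tensor (1, 3, H, W)."""
    # Run the backbone up to the last conv block to get spatial feature maps.
    feats = torch.nn.Sequential(*list(model.children())[:-2])(image)  # (1, C, h, w)
    # CAM = class-specific weighted sum of the final conv feature maps.
    weights = model.fc.weight[class_idx]                              # (C,)
    cam_map = torch.einsum('c,chw->hw', weights, feats[0])
    return F.relu(cam_map)

def average_cam(images, class_idx):
    """Average per-image normalized CAMs over exemplars of one category."""
    maps = []
    with torch.no_grad():
        for img in images:
            m = cam(img, class_idx)
            maps.append(m / (m.max() + 1e-8))  # normalize each map to [0, 1]
    return torch.stack(maps).mean(dim=0)
```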
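Likewise, a hypothetical reading of the "Patch-Match" approach: embed scene patches with the same pretrained CNN and score each patch's cosine similarity to a target activation vector, where that vector is either a specific template embedding or an average over category exemplars (category-level target). Patch size, stride, and the `embed` helper are assumed for illustration.

```python
# Hedged sketch of "Patch-Match": similarity of scene regions to a
# category-level or template-specific target activation vector.
import torch
from torchvision import models

backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()
encoder = torch.nn.Sequential(*list(backbone.children())[:-1])  # keep GAP, drop fc

def embed(img):
    """L2-normalized feature vector for an image tensor (1, 3, H, W)."""
    with torch.no_grad():
        v = encoder(img).flatten(1)  # (1, 512)
    return torch.nn.functional.normalize(v, dim=1)

def patch_match(scene, target_vec, patch=64, stride=32):
    """Score each scene region's relatedness to a target embedding.
    target_vec: a specific template embedding, or the mean embedding
    over category exemplars for a category-level target map."""
    _, _, H, W = scene.shape
    rows = (H - patch) // stride + 1
    cols = (W - patch) // stride + 1
    sim = torch.zeros(rows, cols)
    for i in range(rows):
        for j in range(cols):
            crop = scene[:, :, i*stride:i*stride+patch,
                               j*stride:j*stride+patch]
            sim[i, j] = (embed(crop) * target_vec).sum()  # cosine similarity
    return sim
```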