Yupei Chen, Gregory Zelinsky; A CNN Model of "Objectness" Predicts Fixations During Free Viewing. Journal of Vision 2018;18(10):314. doi: https://doi.org/10.1167/18.10.314.
© ARVO (1962-2015); The Authors (2016-present)
Although object representations are known to guide attention during goal-directed behaviors, their importance in free scene viewing, a task often assumed to be minimally goal-directed, is debated, and bottom-up saliency-based models still dominate for this task. Here we introduce a top-down object-based model of fixation prediction during scene viewing. Our premise is that the same visual object representations learned and used to control goal-directed behavior are also used to guide attention to objects even during free viewing. What distinguishes our model from other object-based models of attention is that it is image-computable, meaning that it needs no hand labeling of a scene's objects. An image is input to a state-of-the-art CNN pre-trained for object classification using 1000 object categories from ImageNet. An activity visualization method (Grad-CAM, Selvaraju et al., 2016) is then used to localize regions of activation in this image corresponding to each of the 1000 categories for which the CNN was trained, thereby creating 1000 object-category-specific priority maps. Summing and normalizing these maps produces a single priority map reflecting the general "objectness" of locations in an image. Applying this procedure to images from the MIT-ICCV dataset and comparing the model predictions to that dataset's gaze fixation ground truth, we found that the mean AUC score for our model was as good as or better than those for comparable saliency models. By showing that an image-computable, object-based model can predict fixations during scene viewing, we provide a viable alternative to bottom-up saliency models, as well as a computational method for quantitatively distinguishing attention guidance to the objects in a scene from attention to lower-level feature contrast.
Lastly, because our model of free viewing is essentially a simultaneous hybrid search for 1000 target-object categories, the tasks of free viewing and visual search are finally unified under a single theoretical framework.
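The map-combination and scoring steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the per-category Grad-CAM maps have already been computed elsewhere and are supplied as an array, and the function names `objectness_map` and `fixation_auc`, the min-max normalization, and the pairwise AUC variant are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np


def objectness_map(category_maps):
    """Sum per-category priority maps and rescale the result to [0, 1].

    category_maps: array-like of shape (n_categories, height, width),
    e.g. one Grad-CAM map per object category (computed elsewhere).
    """
    maps = np.asarray(category_maps, dtype=float)
    summed = maps.sum(axis=0)
    span = summed.max() - summed.min()
    if span == 0:
        # A flat map carries no priority information.
        return np.zeros_like(summed)
    # Assumption: min-max normalization; the abstract only says "normalizing".
    return (summed - summed.min()) / span


def fixation_auc(priority, fixation_mask):
    """AUC for separating fixated from non-fixated pixels by priority.

    Assumption: a simple pairwise AUC in which every non-fixated pixel is a
    negative and ties count as half. Published saliency benchmarks use
    several AUC variants that differ in how negatives are sampled.
    """
    priority = np.asarray(priority, dtype=float)
    mask = np.asarray(fixation_mask, dtype=bool)
    pos = priority[mask]
    neg = priority[~mask]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need at least one fixated and one non-fixated pixel")
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (pos.size * neg.size)


if __name__ == "__main__":
    # Toy stand-in for the 1000 category maps: three 2x2 maps.
    toy_maps = [
        [[0, 1], [0, 0]],
        [[0, 2], [1, 0]],
        [[0, 1], [1, 0]],
    ]
    combined = objectness_map(toy_maps)
    fixated = [[False, True], [False, False]]
    print(combined)
    print(fixation_auc(combined, fixated))
```

In this toy case the single fixated pixel coincides with the peak of the combined map. With real data the input would hold one map per category for each image, and the score would be averaged over images to obtain a mean AUC.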
Meeting abstract presented at VSS 2018