Abstract
Humans acquire visual information about the external environment by moving their eyes sequentially from one location to another. Previous studies showed that salient locations frequently attract human gaze (Itti & Koch, 1998), but more recent evidence suggests that higher-order image features predict gaze frequency (Kümmerer et al., 2016) and temporal characteristics (Kümmerer et al., 2017; Akamatsu & Miyawaki, 2018) better than classical saliency theory does. However, it remains unclear whether higher-order image features per se serve as strong gaze attractors, because previous experiments used natural scene images and the results could therefore be influenced by semantic information from object categories and scene contexts. To resolve this issue, we designed a new experiment using “feature images” that contain image features of a pre-specified order while suppressing object-categorical and scene-contextual information. The feature images were artificially generated, via gradient ascent optimization of image pixel values, so that they selectively maximized the response of a specific layer of a pre-trained deep convolutional neural network (DCNN). Subjects’ eye movements were recorded while they observed a pair of feature images, each of which corresponded to a different DCNN layer. Results showed that feature images corresponding to a higher DCNN layer (higher-layer feature images) attracted the gaze more frequently than simultaneously presented lower-layer feature images, and the gaze frequency increased progressively with the DCNN layer to which the feature images were tied. Control analyses confirmed that higher-layer feature images did not possess higher saliency, so classical saliency theory is unlikely to explain the observed gaze frequency bias. These results suggest that higher-order image features serve as a significant gaze attractor, independently of the semantic information embedded in natural scenes.
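The feature-image generation procedure described above (gradient ascent on pixel values to maximize a target layer's response) can be illustrated with a minimal sketch. This is not the authors' implementation: it replaces the pre-trained DCNN with a hypothetical one-layer toy model (random filter bank plus ReLU) so that the gradient can be written analytically, and all names and parameters are illustrative assumptions.

```python
import numpy as np

# Toy stand-in for one DCNN layer: random filters + ReLU.
# (The actual study used a specific layer of a pre-trained DCNN.)
rng = np.random.default_rng(0)
n_pixels = 64                              # flattened toy "image"
n_units = 16                               # units in the target layer
W = rng.normal(size=(n_units, n_pixels))   # hypothetical layer weights

def layer_response(x):
    """Mean ReLU activation of the target layer for image x."""
    return np.maximum(W @ x, 0.0).mean()

def response_gradient(x):
    """Analytic gradient of the mean ReLU activation w.r.t. pixels."""
    active = (W @ x > 0).astype(float)     # ReLU derivative per unit
    return W.T @ active / n_units

# Activation maximization: start from a near-noise image and
# perform gradient ascent on the pixel values themselves.
x = rng.normal(scale=0.01, size=n_pixels)
initial = layer_response(x)
for _ in range(100):
    x += 0.1 * response_gradient(x)        # fixed step size (assumed)
final = layer_response(x)
```

After optimization, `final` exceeds `initial`: the image has been reshaped to drive the chosen layer. In the actual experiment the same principle, applied to different layers of a deep network, yields images carrying lower- or higher-order features.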
Acknowledgement: JST PRESTO (JPMJPR1778), JSPS KAKENHI (17H01755)