Charlotte A. Leferink, Dirk B. Walther; Saliency Map Predictions of DeepGaze II are Influenced by the Convolutional Neural Network Texture Bias. Journal of Vision 2020;20(11):963. doi: https://doi.org/10.1167/jov.20.11.963.
© ARVO (1962-2015); The Authors (2016-present)
Though line drawings depict only the edges of objects, without the texture or colour typically present in the natural environment, humans recognize scenes depicted in line drawings just as well as scenes in colour photographs. It has recently been shown that most convolutional neural networks (CNNs) rely more on texture than on the edges and shapes of the objects depicted in an image. But what are the effects of these model constraints on modelling visual attention?
Here we show that, like humans, a leading CNN-based model of spatial attention, DeepGaze II, generalizes well between photographs and edge-extracted images. Seemingly innocuous low-level changes, however, such as reversing the contrast polarity of the edge-extracted images, cause vastly different predictions by the attention model. This is not the case for human observers, since contrast polarity reversal preserves the structure and global properties of the objects within an image. These results provide further evidence that CNNs rely on texture-based visual information to generate their predictions.
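The contrast polarity manipulation described above can be sketched minimally: for an 8-bit edge-extracted image, reversing polarity simply inverts pixel intensities, turning dark lines on a white background into bright lines on a dark background while leaving edge locations untouched. This is an illustrative sketch, not the authors' preprocessing pipeline, which the abstract does not specify.

```python
import numpy as np

def reverse_contrast_polarity(edge_map):
    """Invert an 8-bit edge-extracted image: dark lines on a white
    background become bright lines on a black background. Edge
    locations are unchanged; only contrast polarity flips."""
    return 255 - edge_map

# Toy 3x3 "line drawing": a dark vertical line (0) on white (255).
drawing = np.array([[255, 0, 255],
                    [255, 0, 255],
                    [255, 0, 255]], dtype=np.uint8)

reversed_drawing = reverse_contrast_polarity(drawing)
# The line is now bright (255) on a dark (0) background.
```

Because the transformation is a pointwise intensity flip, any representation built on edge geometry (as human scene recognition appears to be) is unaffected, whereas texture-sensitive filters can respond very differently.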
To further explore these questions, we recorded eye movements of participants viewing edge-extracted versions of the images from the MIT1003 dataset, which serves as ground truth for DeepGaze II. Comparing gaze maps predicted from line drawings and from photographs with human fixations on line drawings and photographs showed that both humans and the model generalize well between these drastically different representations of the scenes. Changing the contrast polarity of the drawings, on the other hand, drastically changed the predicted gaze maps. Our unique eye-movement data set and analysis procedures allow us to further probe the limitations of the CNN texture bias, and to investigate the capacity of CNNs to learn global shape representations as they apply to directing visual attention.
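Comparing a model's predicted gaze map with human fixations is typically done with standard saliency benchmark metrics. As one hedged illustration (the abstract does not name the metric used), the Pearson correlation coefficient (CC) between a predicted saliency map and a smoothed human fixation map can be computed as:

```python
import numpy as np

def correlation_coefficient(pred, fix):
    """Pearson CC between a predicted saliency map and a (smoothed)
    human fixation map -- one common saliency evaluation metric.
    Both inputs are 2-D arrays of the same shape; maps are
    standardized before taking the mean of the pointwise product."""
    p = (pred - pred.mean()) / (pred.std() + 1e-12)
    f = (fix - fix.mean()) / (fix.std() + 1e-12)
    return float((p * f).mean())

# Toy example: a prediction compared against itself scores ~1,
# while an unrelated random map scores near 0.
rng = np.random.default_rng(0)
saliency = rng.random((32, 32))
self_cc = correlation_coefficient(saliency, saliency)
```

CC is symmetric and invariant to affine rescaling of either map, which makes it convenient when model outputs and fixation-density maps live on different scales; benchmark suites usually report it alongside metrics such as AUC and NSS.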