Abstract
An important aspect of vision, the control of eye movements in scene viewing, is intensely debated, with many studies suggesting that people look at scene regions rich in meaning. A recent proposal suggests that the distribution of meaning can be quantified by ‘Meaning Maps’ (MMs). To create MMs, images are segmented into partially overlapping patches, which are rated for their meaningfulness by multiple observers. These ratings are then combined into a smooth distribution over the image. If MMs capture the distribution of meaning, and if the deployment of eye movements in humans is guided by meaning, two predictions arise: first, MMs should be better predictors of gaze position than saliency models, which use image features rather than meaning to predict fixations; second, differences in eye movements that result from changes in meaning should be reflected in equivalent differences in MMs. Here, we tested these predictions. Results show that MMs performed better than the simplest saliency model (GBVS), were similar to a more advanced model (AWS), and were outperformed by DeepGaze II, a model that uses features from a deep neural network. These data suggest that, like saliency models, MMs might not measure meaning but instead index the distribution of image features. Using the SCEGRAM database, we tested this notion directly by comparing scenes containing consistent object-context relationships with identical images in which one object was contextually inconsistent, thus changing its meaning (e.g., a kitchen with a mug swapped for a toilet roll). As in previous studies, regions containing inconsistencies attracted more fixations from observers than the same regions in consistent scenes. Crucially, however, MMs of the modified scenes did not attribute more ‘meaning’ to these regions. DeepGaze II exhibited the same insensitivity to meaning. Both methods are thus unable to capture changes in the deployment of eye movements induced by changes in an image’s meaning.
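To make the patch-rating procedure sketched above concrete, the following is a minimal illustration (in Python with NumPy/SciPy) of one way per-patch meaningfulness ratings could be combined into a smooth, image-sized map. It is not the authors' implementation: the patch layout, the averaging of overlapping patches, and the Gaussian smoothing parameter are illustrative assumptions only.

```python
# Illustrative sketch (not the original MM pipeline): combine per-patch
# meaningfulness ratings into a smooth, image-sized "meaning map".
# Patch size, grid layout, and smoothing sigma are arbitrary example values.
import numpy as np
from scipy.ndimage import gaussian_filter

def build_meaning_map(image_shape, patches, sigma=30.0):
    """
    image_shape : (height, width) of the scene image in pixels.
    patches     : iterable of (row, col, size, mean_rating) tuples, where
                  (row, col) is the top-left corner of a square patch and
                  mean_rating is the rating averaged across observers.
    sigma       : Gaussian smoothing (in pixels) applied to the combined map.
    """
    ratings = np.zeros(image_shape, dtype=float)   # summed ratings per pixel
    coverage = np.zeros(image_shape, dtype=float)  # number of patches covering each pixel

    for row, col, size, mean_rating in patches:
        ratings[row:row + size, col:col + size] += mean_rating
        coverage[row:row + size, col:col + size] += 1.0

    # Average the ratings of overlapping patches, then smooth into a
    # continuous distribution over the image.
    meaning = np.divide(ratings, coverage,
                        out=np.zeros_like(ratings), where=coverage > 0)
    meaning = gaussian_filter(meaning, sigma=sigma)

    # Normalise so the map sums to 1, i.e. a distribution over image locations.
    total = meaning.sum()
    return meaning / total if total > 0 else meaning
```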