Abstract
Current computational models of visual salience accurately predict the distribution of fixations on isolated visual stimuli. It is not known, however, whether the global salience of a stimulus, that is, its effectiveness in the competition for attention with other stimuli, is a function of the local salience or an independent measure. Further, do task and familiarity with the competing images influence eye movements? Here, we investigated the direction of the first saccade to characterize and analyze the global visual salience of competing stimuli. Participants freely observed pairs of images while eye movements were recorded. The pairs balanced the combinations of new and already seen images, as well as task and task-free trials. Then, we trained a logistic regression model that accurately predicted the location—left or right image—of the first fixation for each stimulus pair, while also accounting for the influence of task, familiarity, and lateral bias. The coefficients of the model provided a reliable measure of global salience, which we contrasted with two distinct local salience models, GBVS and Deep Gaze. The lack of correlation of the behavioral data with the former and the small correlation with the latter indicate that global salience cannot be explained by the feature-driven local salience of images. Further, the influence of task and familiarity was rather small, and we reproduced the previously reported left-sided bias. In summary, we showed that natural stimuli have an intrinsic global salience related to the initial gaze direction of human observers, independent of the local salience and little influenced by task and familiarity.
The guidance of eye movements in visual behavior is a dominant necessity for navigation and interaction with the environment (
Liversedge & Findlay, 2000;
Geisler & Cormack, 2011;
König et al., 2016), also reflecting the individual personality (
Rauthmann et al., 2012). We constantly have to decide where to look next and which regions of interest to explore, in order to process and interpret relevant information of a scene (
Ramos Gameiro et al., 2017). As a consequence, investigating eye movement behavior has become a major field in many research areas (
Kowler, 2011;
Kaspar, 2013;
König et al., 2016).
In this regard, a number of studies have shown that visual behavior is controlled by three major mechanisms: bottom-up, top-down, and spatial biases (
Desimone & Duncan, 1995;
Egeth & Yantis, 1997;
Kastner & Ungerleider, 2000;
Corbetta & Shulman, 2002;
Connor et al., 2004;
Tatler & Vincent, 2009;
Kollmorgen et al., 2010;
Ossandón et al., 2014). Bottom-up factors describe features of the observed image, which attract eye fixations, involving primary contrasts, such as color, luminance, brightness, and saturation (
Itti et al., 1998;
Reinagel & Zador, 1999;
Baddeley & Tatler, 2006). Hence, bottom-up factors are typically based on the sensory input. In contrast, top-down factors comprise internal states of the observer (
Connor et al., 2004;
Kaspar, 2013). That is, eye movement behavior is also guided by specific characteristics, such as personal motivation, specific search tasks, and emotions (
Wadlinger & Isaacowitz, 2006;
Einhäuser et al., 2008;
Henderson et al., 2009;
Rauthmann et al., 2012;
Kaspar & König, 2012). Finally, spatial properties of the image, such as the image size, and motor constraints of the visual system in the brain may affect eye movement behavior (
Ramos Gameiro et al., 2017,
2018). As a result, spatial properties and motor constraints then lead to specific bias effects, such as the central bias in natural static images (
Tatler, 2007). Thus, investigating visual behavior necessarily implies an examination of bottom-up and top-down factors as well as spatial biases.
Based on these three mechanisms—bottom-up, top-down, and spatial biases—guiding visual behavior,
Koch and Ullman (1987) first proposed a method to highlight salient points in static image scenes. Whereas this model was purely conceptual,
Niebur and Koch (1996) later developed an actual implementation of salience maps. This was the first prominent proposal of topographically organized feature maps that guide visual attention. Salience maps describe these topographic representations of an image scene, revealing where people will most likely look while observing the respective scene (
Itti et al., 1998;
Itti & Koch, 2001). That is, salience maps can be interpreted as a prediction of the distribution of eye movements on images. Usually, salience maps include only bottom-up image features, predicting eye fixations on image regions with primary contrasts in color changes, saturation, luminance, or brightness, among others (
Itti et al., 1998;
Itti and Koch, 2001). However, in their first implementation,
Niebur and Koch (1996) also tried to include top-down factors to build up salience maps and thus predict where people will most likely look in image scenes. Current state-of-the-art computational salience models are artificial neural networks pretrained on large data sets for visual object recognition and subsequently tuned to predict fixations, as is the case of Deep Gaze II (
Kümmerer et al., 2016). Such models do not rely only on bottom-up features anymore but also incorporate higher-level features learned on object recognition tasks. Still, despite the better performance on salience benchmarks, deep-net-based models seem to fail at predicting the salience driven by low-level features (
Kümmerer et al., 2017).
Salience maps provide a highly accurate and robust method to predict human eye movement behavior on static images by relying on local features to determine which parts of an image are most salient (
Niebur & Koch, 1996;
Itti et al., 1998;
Itti & Koch, 2001;
Kowler, 2011). However, these methods do not provide any information about the salience of the image as a whole, which may depend both on local properties and on the overall semantic and contextual information of the image. Such global salience is of great relevance when an observer is faced with two or more independent visual stimuli in one context. These combinations describe situations when several stimuli compete with each other with regard to their individual semantic content, despite being in the same overall context. Such cases appear frequently in real life, for instance, when two billboards hang next to each other in a mall or when several windows are open on a computer screen or a monitor in an intensive care unit, to name a few examples. Thus, by placing two or more independent image contexts side by side, as described in the previous examples, classical salience maps may well predict eye movement behavior within each of the individual images as a closed system, but they will most likely fail to predict visual behavior across the whole scene involving all images. Specifically, they will fail at answering the question: Which stimulus is most likely to attract the observers’ visual attention?
In this study, our primary hypothesis is (H1) that it is possible to measure and calculate the global salience of natural images. That is, the likelihood of a visual stimulus to attract the first fixation of a human observer, when it is presented in competition alongside another stimulus, can be systematically modeled. In the experiment presented here, participants were confronted with stimuli containing two individual natural images—one on the left and one on the right side of the screen—at the same time. The set of images used to build our stimuli consisted of urban, indoor and nature scenes; closeups of human faces; and scenes with people in a social context. During the observation of the image pairs, we recorded the participants’ eye movements. Specifically, to characterize the global salience, we were interested in the direction—left or right—of the initial saccade the participant made after the stimulus onset. For further analysis, we also collected all binary saccade decisions on all the image pairs presented to the participants. We used the behavioral data collected from the participants to train a logistic regression model that successfully predicts the location of the first fixation for a given pair of images. This allowed us to use the coefficients of the model to characterize the likelihood of each image to attract the first fixation, relative to the other images in the set. In general, images that were fixated more often were ranked higher than other images. Hence, we computed a unique “attraction score” for each image that we denote “global salience,” which depends on the individual contextual information of the image as a whole.
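To make the modeling idea concrete, the following minimal sketch fits a logistic regression on signed image indicators and reads the fitted coefficients as global salience scores. This is not the exact pipeline of the study: the +1 (image on the right) / −1 (image on the left) coding and the convention y = 1 for a rightward first saccade are assumptions for illustration, and the behavioral outcomes are replaced by random placeholders.

```python
# Minimal sketch of the global salience model (illustrative, not the study's code).
import numpy as np
from sklearn.linear_model import LogisticRegression

n_images, n_trials = 200, 9800
rng = np.random.default_rng(0)

# Toy trials: indices of the left and right image, plus a placeholder outcome.
left = rng.integers(0, n_images, n_trials)
right = (left + rng.integers(1, n_images, n_trials)) % n_images  # ensure left != right

X = np.zeros((n_trials, n_images))
X[np.arange(n_trials), right] = 1.0    # assumed coding: +1 for the image on the right
X[np.arange(n_trials), left] = -1.0    # assumed coding: -1 for the image on the left
y = rng.integers(0, 2, n_trials)       # placeholder: 1 = first saccade to the right

model = LogisticRegression(max_iter=1000).fit(X, y)
global_salience = model.coef_.ravel()          # one score per image
ranking = np.argsort(global_salience)[::-1]    # images most likely to attract the first saccade
```

With this signed coding, the probability of a rightward first saccade in a trial is a logistic function of the difference between the coefficients of the right and left images, so a larger coefficient means a higher likelihood of attracting the first fixation regardless of presentation side.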
We also analyzed the local salience properties of the individual images and compared them to the global salience. We hereby claimed that the global salience cannot be explained by the feature-driven salience maps. Formally, we hypothesize that (H2): Natural images have a specific global salience, independent of their local salience properties, that characterizes their likelihood to attract the first fixation of human observers, when presented alongside another competing stimulus. A larger global salience leads to a higher attraction of initial eye movements.
In order to properly calculate the global salience, we accounted for general effects of visual behavior in stimuli with two paired images. Previous studies have shown that humans tend to exhibit a left bias in scanning visual stimuli.
Barton et al. (2006) showed that subjects looking at faces make longer fixations on the eye on their left side, even if the faces were inverted, and the effect was later confirmed and extended to dogs and monkeys (
Barton et al., 2006;
Guo et al., 2009). For an extensive review about spatial biases, see the work by
Ossandón et al. (2014), where the authors presented evidence of a marked initial left bias in right-handers but not in left-handers, regardless of their habitual reading direction. In sum, there is a large body of evidence of lateral asymmetry in viewing behavior, although the specific sources are yet to be fully confirmed. With respect to our study, we hypothesize that (H3): Presenting images in horizontal pairs leads to a general spatial bias in favor of the image on the left side.
In addition to the general left bias, in half of the trials of the experimental sessions, one of the images had already been seen by the participant in a previous trial, while the other was new. The participants also had to indicate which of the images was new or old. Thus, we also addressed the questions of whether the familiarity with one of the images or the task has any effect on the visual behavior and thus on the global salience of the images. Do images that show the task-relevant scene attract more initial saccades? Likewise, are novel images more likely to attract the first fixation? This challenge sheds some light on central-peripheral interaction in visual processing.
Guo (2007), for instance, showed that during face processing, humans indeed rely on top-down information in scanning images. However,
Açık et al. (2010) proposed that young adults usually rely on bottom-up rather than top-down information during visual search. In this regard, we thus hypothesize that (H4): Task relevance and familiarity of images will not lead to a higher probability of being fixated first. In order to account for any spatial bias effects that could influence the global salience model, we added coefficients to the logistic regression algorithm that could potentially capture lateral, familiarity, and task effects. This not only makes the model more accurate but also allows us to analyze the influence of these effects. Furthermore, the location of the images in the experiments was randomized across trials and participants.
Finally, in order to better understand the properties of the global salience of competing stimuli, we also analyzed the exploration time of each image. In this regard, we hypothesize the following (H5): Images with larger global salience will be explored longer than images with low global salience.
We presented the stimuli on a 32-in. widescreen Samsung monitor with a native resolution of 3,840 × 2,160 pixels. For eye movement recordings, we used a stationary EyeLink 1000 eye tracker (SR Research Ltd., Ottawa, Ontario, Canada) providing binocular recordings with one head camera and two eye cameras at a sampling rate of 500 Hz.
Participants were seated in a darkened room at a distance of 80 cm from the monitor, resulting in 80.4 pixels per visual degree in the center of the monitor. We did not stabilize the participant’s head with a headrest but verbally instructed the participants not to make head movements during the experiment. This facilitated comfortable conditions for the participants. However, the eye tracker constantly recorded four edge markers on the screen with the head camera, in order to correct for small head movements. This guaranteed stable gaze recordings based on eye movements, independent of residual involuntary head movements.
The eye tracker measured binocular eye movements. For calibration of the eye-tracking camera, each participant had to fixate on 13 black circles (size 0.5°) that appeared consecutively at different screen locations. The calibration was validated afterward by calculating the drift error for each point. The calibration was repeated until the system reached an average accuracy of <0.5° for both eyes of the participant.
The experiment consisted of 200 trials divided into four blocks, at the beginning of which the eye-tracking system was recalibrated. The blocks were designed such that each had a different combination of task and image novelty:
- • Block 1 consisted of 25 trials formed by 50 distinct, novel images (new/new). This block was task-free, that is, participants were guided to freely observe the stimuli (Figure 1c).
- • Block 2 consisted of 75 trials, each formed by one new image and one of the previously seen images (new/old or old/new). In this block, the participants were guided to freely observe the stimuli and, additionally, they were asked to indicate the new image of the pair after the stimulus offset (Figure 1d).
- • Block 3 consisted of 75 trials, each formed by one new image and one of the previously seen images (new/old or old/new). In this block, the participants were asked to indicate the old image of the pair.
- • Block 4 consisted of 25 trials formed by 50 previously seen images (old/old). Like Block 1, this block was also task-free.
The decision in Blocks 2 and 3 was indicated by either pressing the left (task-relevant image is on the left side) or right (task relevant-image is on the right side) arrow button on a computer keyboard.
The image pairs were formed by randomly sampling from the set of 200 images, but some constraints were set in order to satisfy the characteristics of each block and keep a balance in the number of times each image was seen by the participant. The sampling process was as follows: In Block 1, 50 images were randomly sampled to form the 25 pairs. In Blocks 2 and 3, in order to construct the new/old and old/new pairs, the new image was randomly sampled from the set of remaining unseen images and the old image was randomly sampled from the set of previously seen images, with two additional constraints: It must have been shown only once before and not in the previous five trials. Finally, in Block 4, a set of exactly 50 images that had been shown only once remained. These were used to randomly sample the remaining 25 trials. In all blocks, after sampling the two images, the left/right configuration was also randomly chosen with probability 0.5.
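The constrained sampling described above can be sketched as follows. This is one possible implementation under our reading of the constraints, with illustrative function and variable names; it is not the code used to run the experiment.

```python
# Sketch of the per-participant pair sampling (illustrative implementation).
import random

def sample_session(n_images=200, seed=None):
    rng = random.Random(seed)
    images = list(range(n_images))
    rng.shuffle(images)
    unseen = images[:]      # images not yet shown to this participant
    seen_once = []          # images shown exactly once
    trials = []             # list of (left, right) image indices

    def make_pair(a, b):
        # Left/right configuration chosen at random with probability 0.5.
        return (a, b) if rng.random() < 0.5 else (b, a)

    # Block 1: 25 new/new trials from 50 unseen images.
    for _ in range(25):
        a, b = unseen.pop(), unseen.pop()
        seen_once += [a, b]
        trials.append(make_pair(a, b))

    # Blocks 2 and 3: 150 new/old trials.
    for _ in range(150):
        new = unseen.pop()
        recent = {img for pair in trials[-5:] for img in pair}
        candidates = [img for img in seen_once if img not in recent]
        old = rng.choice(candidates)
        seen_once.remove(old)       # now seen twice, never sampled again
        seen_once.append(new)
        trials.append(make_pair(new, old))

    # Block 4: 25 old/old trials from the 50 images still seen only once.
    rng.shuffle(seen_once)
    for _ in range(25):
        a, b = seen_once.pop(), seen_once.pop()
        trials.append(make_pair(a, b))

    return trials

trials = sample_session(seed=0)
assert len(trials) == 200   # each participant completes 200 trials
```

Under this scheme every image ends up being presented exactly twice per participant, which matches the balance described in the text.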
The sampling process was different for each participant, that is, they saw different sets of pairs from the 40,000 different pairs and in different order. This aimed at reducing the predictability of the process while satisfying the experimental constraints. Overall, we collected data from 9,800 pairs, some of which might have been repeated across participants. However, note that each participant saw each image exactly twice; therefore, the frequency of presentation of the images was balanced across the whole experiment. As we will see in the following section, the amount of data was enough to fit the computational model.
In all cases, the presentation time for each stimulus was 3 s and it was always preceded by a blank, gray screen with a white, central fixation dot. The stimulus was displayed only after the participant fixated the central dot.
The majority of our analyses focused on the first fixation. As a preprocessing stage, we discarded the fixations (a) due to anticipatory saccades; (b) shorter than 50 ms or longer than \(\mu_{dur} + 2\sigma_{dur}\) ms, where \(\mu_{dur}\) = 198 ms and \(\sigma_{dur}\) = 90 ms are the mean and standard deviation of all fixation durations, respectively; and (c) located outside any of the two images. The discarded fixations were less than 4% of the total.
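As an illustration, this preprocessing could be implemented as in the sketch below. The record format and field names are hypothetical; only the filtering criteria follow the description above.

```python
# Sketch of the fixation filtering step (hypothetical record format).
import numpy as np

def filter_fixations(fixations):
    """Discard anticipatory fixations, fixations with atypical durations,
    and fixations landing outside both images.

    `fixations` is assumed to be a list of dicts with keys 'duration' (ms),
    'anticipatory' (bool), and 'on_image' ('left', 'right', or None).
    In practice, the duration thresholds would be computed over all recorded
    fixations of the experiment, as described in the text.
    """
    durations = np.array([f["duration"] for f in fixations], dtype=float)
    upper = durations.mean() + 2 * durations.std()   # mu_dur + 2 * sigma_dur

    return [
        f for f in fixations
        if not f["anticipatory"]
        and 50 <= f["duration"] <= upper
        and f["on_image"] in ("left", "right")
    ]
```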
We were interested not only in modeling the likelihood of every image receiving the first fixation, but also in the contribution of other aspects of the experiment, namely, the effect of having to perform a small task when observing the pair of images and the familiarity with one of the two images. More specifically, we were interested in answering the following questions: Do light task demands, such as having to determine which image is new or old, influence the direction of the first saccade? Also, when presented together, are unseen stimuli more likely to receive the initial saccade than previously observed stimuli, or vice versa?
We addressed these questions by adding new features to the model that capture these characteristics of the experimental setup. These features were assigned coefficients that, after training, indicate the magnitude of the contributions of the effects. In particular, we added the following feature columns to every row i of the design matrix:
- • t(i): 1 if the target of the task (select the new/old image) was on the right at trial i, −1 if it was on the left, 0 if there was no task.
- • f(i): 1 if, at trial i, the image on the right had already been shown in a previous trial (familiar) while the image on the left was still unseen; −1 if the familiar image was on the left; 0 if both images were new or both were familiar.
Not only did these new features enable new elements for the analysis, but they also added more representational power to the model, which could potentially learn better coefficients to describe the global salience of each image. In this line, we added one more feature to the model to capture one important aspect of visual exploration: the lateral bias. Although a single intercept term in the argument of the logistic function (\(\mathbf{w}^{T}\mathbf{x} + b\)) would capture most of the lateral bias, since the outcome \(\mathbf{y}\) describes exactly the lateral direction, left or right, of the first saccade, we instead added subject-specific features to model the fact that the trials were generated by different subjects with an individual lateral bias. This was done by adding K = 49 (number of participants) features \(s_{k}^{(i)}\), with value 1 if trial i was performed by subject k and 0 otherwise. Altogether, the final design matrix X′ extends the design matrix X defined in Equation 4 as follows:
\begin{equation}
X^{\prime } = \left[\begin{array}{@{}ccc|c|c|ccc@{}} x_{1}^{(1)} & \dots & x_{M}^{(1)} & t^{(1)} & f^{(1)} & s_{1}^{(1)} & \dots & s_{K}^{(1)} \\
x_{1}^{(2)} & \dots & x_{M}^{(2)} & t^{(2)} & f^{(2)} & s_{1}^{(2)} & \dots & s_{K}^{(2)} \\
\vdots & \ddots & \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
x_{1}^{(N)} & \dots & x_{M}^{(N)} & t^{(N)} & f^{(N)} & s_{1}^{(N)} & \dots & s_{K}^{(N)} \\
\end{array}\right]
\end{equation}
Note that the leftmost block of
X′ is identical to
X (defined in
Equation 4). While the shape of
X is 9,800 × 200,
X′ is a 9,800 × 251 matrix, since 200 + 1 + 1 + 49 = 251.
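A compact sketch of how the extended design matrix X′ could be assembled is given below. The per-trial record format is hypothetical; the task, familiarity, and subject codings follow the definitions above, and the ±1 image coding is assumed for illustration.

```python
# Sketch of the assembly of X' = [image columns | t | f | subject columns].
import numpy as np

def build_design_matrix(trials, n_images=200, n_subjects=49):
    """Each trial is assumed to be a dict with keys 'left', 'right' (image ids),
    'task' and 'familiarity' (already coded as +1 / -1 / 0 as in the text),
    and 'subject' (participant index 0..K-1). The record format is illustrative."""
    n = len(trials)
    X = np.zeros((n, n_images + 1 + 1 + n_subjects))
    for i, tr in enumerate(trials):
        X[i, tr["right"]] = 1.0                    # image shown on the right
        X[i, tr["left"]] = -1.0                    # image shown on the left
        X[i, n_images] = tr["task"]                # t^(i)
        X[i, n_images + 1] = tr["familiarity"]     # f^(i)
        X[i, n_images + 2 + tr["subject"]] = 1.0   # s_k^(i)
    return X

example = [{"left": 3, "right": 17, "task": 1, "familiarity": -1, "subject": 0}]
print(build_design_matrix(example).shape)   # (1, 251): 200 + 1 + 1 + 49 columns
```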
Next, we investigated the effect of the familiarity with one of the images and of the task of selecting the already seen or unseen image, which the participants had to perform in Blocks 2 and 3 of the experiment, respectively. In particular, we were interested in finding out whether there is a tendency to direct the initial saccade toward the task-relevant images or toward the new images, for instance. In our fourth hypothesis (H4), we stated that the task and familiarity should have little or no influence on the initial saccade. For that purpose, we first performed a 2 × 2 (task: select new, select old × fixated image: new image, old image) repeated-measures ANOVA (Greenhouse-Geisser corrected). The results revealed no significant effects (all
F ≤ 1.936; all
p ≥ .170, all
\(\eta _p^2\) ≤ .039) (
Figure 10). Thus, the provided tasks did not bias the initial saccade decision to target one of the two presented images. Nevertheless, we found that participants correctly identified 91.43% of the new images in Block 2 and 91.16% of the old images in Block 3. Hence, the task performance was highly above chance (50%) and the participants were accurate in identifying the new and old images, respectively.
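For illustration, a comparable repeated-measures ANOVA can be computed with statsmodels as sketched below on toy data. Note that AnovaRM does not apply the Greenhouse-Geisser correction reported above, and the column names and values are placeholders rather than the study's data.

```python
# Sketch of a 2 x 2 repeated-measures ANOVA (toy data, illustrative column names).
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(1)
rows = []
for subject in range(49):
    for task in ("select_new", "select_old"):
        for fixated in ("new", "old"):
            # One cell mean per subject x task x first-fixated image.
            rows.append({"subject": subject, "task": task, "fixated": fixated,
                         "prop_first_fix": rng.normal(0.5, 0.05)})
df = pd.DataFrame(rows)

result = AnovaRM(df, depvar="prop_first_fix", subject="subject",
                 within=["task", "fixated"]).fit()
print(result.anova_table)   # F values, degrees of freedom, and p values
```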
Also in this case, the same conclusion can be extracted from the coefficients learned by the model to capture the task and familiarity effects, which are −0.04 and −0.10, respectively; that is, both are very small, with a slightly larger magnitude for familiarity.
Taken together, spatial properties influenced the initial saccade in favor of fixating left-sided images first. Although task performance was very high, neither the task nor the familiarity with one of the images had an influence on the direction of the first fixation after stimulus onset. These results fully support our third and fourth hypotheses.
In our fifth hypothesis (H5), we stated that images with higher global image salience lead to a longer exploration time than images with lower global salience. We thus calculated the relative dwell time on each image, left and right, for each trial. As an initial step, similar to the analysis of the initial saccade, we analyzed the potential effect of the spatial image location as well as the task and familiarity relevance on the exploration time.
With respect to the spatial image location, a 4 × 2 (block: 1, 2, 3, 4 × image side: left, right) repeated-measures ANOVA (Greenhouse-Geisser corrected) revealed a significant main effect according to the block,
F(2.368, 113.668) = 12.066,
p < .001,
\(\eta _p^2\) = .201, but no further effects (all
F ≤ 2.232; all
p ≥ .109, all
\(\eta _p^2\) ≤ .044). Thus, the total time of exploration did not depend on the spatial location of the images, as also shown in
Figure 11.
With respect to the task relevance (recall: Block 2—select new image; Block 3—select old image), we calculated a 2 × 2 (task: select new, select old × fixated image: new image, old image) repeated-measures ANOVA (Greenhouse-Geisser corrected). The results revealed a significant main effect according to the task,
F(1, 48) = 4.298,
p < .050,
\(\eta _p^2\) = .082, and fixated image
F(1, 48) = 64.524,
p < .001,
\(\eta _p^2\) = .573, as well as an interaction between task and fixated image,
F(1, 48) = 36.728,
p < .001,
\(\eta _p^2\) = .433. As shown by
Figure 12, our results showed that, in general, participants tended to spend more time exploring new rather than previously seen images. Furthermore, this effect was noticeably larger in Block 2, where the task was to select the new image, than in Block 3 (select old image).
Consequently, we found that the spatial location of images did not affect the total time of exploration. Instead, the task and familiarity had a considerable impact on the exploration time, revealing that new images were explored for a longer time than previously seen ones.
For our main analysis regarding the interaction between exploration time and global image salience, we then contrasted the global salience score learned for each image with its respective dwell time averaged over all trials and subjects. The results revealed a significant positive correlation, indicating that images with larger global image salience led to a more intense exploration (
Figure 13). Thus, global image salience describes not only a measure of which image attracts initial eye movements, but is also connected to longer exploration time, suggesting that global salience may describe the relative engagement of images.
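A minimal sketch of this comparison is shown below, using a Pearson correlation on placeholder arrays; in the actual analysis, the scores come from the fitted model and the dwell times from the eye-tracking data, and the choice of correlation coefficient here is illustrative.

```python
# Sketch: correlate per-image global salience scores with mean dwell times.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
global_salience = rng.normal(size=200)                            # placeholder scores
mean_dwell = 0.5 + 0.05 * global_salience + rng.normal(0, 0.05, 200)  # placeholder dwell times

r, p = stats.pearsonr(global_salience, mean_dwell)
print(f"Pearson r = {r:.2f}, p = {p:.3g}")
```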
Taken together, our results suggest that the task and familiarity (but not the spatial location of images) influenced the exploration time, with higher dwell times on unseen images, particularly in combination with the task of selecting the new image. Note, however, that regarding the effects of task, our findings are restricted to the specific task assigned in our experiments, that is, selecting which image is new or old. The effect of task on visual attention is an active field in visual perception, and the results of multiple contributions should be considered together to draw robust conclusions. Finally, we also found that images with higher global salience correspondingly led to a longer time of exploration. These results fully support our fifth hypothesis.
We have presented a computational model trained on the saccadic behavior of participants freely looking at pairs of competing stimuli, which is able to learn a robust score for each image, related to its likelihood of attracting the first fixation. This fully supports our first hypothesis, and we refer to this property of natural images as the global visual salience.
The computational model consists of a logistic regression classifier, trained with the behavioral data of 49 participants who were presented 200 pairs of images. In order to reliably assess the performance of the model, we carried out a careful 25-fold cross-evaluation, with disjoint sets of participants for training, validating, and testing. Given a pair of images from the set of 200, the model predicted the direction of the first saccade with 82% accuracy and 0.88 area under the receiver operating characteristic curve.
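A simplified sketch of participant-wise evaluation is shown below. It uses scikit-learn's GroupKFold instead of the exact 25-fold train/validation/test protocol, omits the subject-specific bias columns (which are undefined for held-out participants), and runs on random placeholder data, so it only illustrates the evaluation scheme rather than reproducing the reported numbers.

```python
# Sketch of group-wise cross-validation by participant (placeholder data).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(3)
X = rng.normal(size=(9800, 202))        # 200 image columns + task + familiarity
y = rng.integers(0, 2, 9800)            # placeholder first-saccade directions
groups = np.repeat(np.arange(49), 200)  # participant id for each trial

accs, aucs = [], []
for train_idx, test_idx in GroupKFold(n_splits=25).split(X, y, groups):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    prob = clf.predict_proba(X[test_idx])[:, 1]
    accs.append(accuracy_score(y[test_idx], (prob > 0.5).astype(int)))
    aucs.append(roc_auc_score(y[test_idx], prob))

print(f"accuracy = {np.mean(accs):.2f}, AUC = {np.mean(aucs):.2f}")
```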
Throughout the article, we have analyzed the general lateral bias toward the left image (H3), as well as other possible influences such as the familiarity with one of the images and the effect of a simple task (H4). Moreover, we have analyzed the relationship of our proposed global salience with the local salience properties of the individual images (H2). Finally, we have also studied the total exploration time of each image in the eye-tracking experiment and compared it to the global salience, which is based upon the first fixation (H5).
Regarding the lateral bias, we found that participants tended to look more frequently toward the image on the left. Such left bias is typical in visual behavior and has been found in many previous studies (
Barton et al., 2006;
Guo et al., 2009;
Calen Walshe & Nuthmann, 2014;
Ossandón et al., 2014). However, most of these studies presented only single images per stimulus. In this regard, it has been argued that cultural factors of the Western population who mostly take part in the research experiments may lead to a semantic processing of natural visual stimuli similar to the reading direction, that is, from left to right (
Spalek & Hammad, 2005;
Afsari et al., 2016).
In our study, about 63% of the first fixations landed on the left image. However, we also observed a high variability across participants, successfully captured by our computational model. In contrast, we showed that the given task in certain trials did not influence initial saccade behavior. Participants equally distributed the target location of saccades on the presented images, regardless of familiarity and task relevance. Consequently, the spatial location of an image affected saccade behavior, whereas the task as well as familiarity had no influence.
Importantly, we found that global salience, that is, the likelihood of an image attracting the first fixation when presented next to another competing image, is independent of the low-level local salience properties of the respective images. The location of the first fixations made by the participants in the study did not correlate with the GBVS salience maps of the images, nor was the saccadic choice—left or right—explained by the difference in GBVS salience mass. Hence, our results provide new insights into the understanding of visual perception of natural images, showing that the global salience of an image is rather affected by the semantics of the content. For instance, images involving socially relevant content such as humans or faces led to higher global salience than images containing purely indoor, urban, or natural scenes.
To gain further insight regarding this aspect, we computed the salience maps using Deep Gaze II (
Kümmerer et al., 2016), a computational salience model that is not limited to low-level features but also makes use of high-level cues, obtained by pretraining the model with image object recognition tasks. We repeated the same analyses as with the GBVS model and we found that metrics derived from Deep Gaze salience maps did have a nonzero, yet moderate correlation with our proposed global salience. This, together with previous evidence about the importance of low- and high-level features in detecting fixations (
Kümmerer et al., 2017), matches our finding that global salience cannot be explained by low-level properties of the images. However, the relatively low correlation further suggests that the initial preference for one of the images does not depend only on properties of the individual salience maps.
According to previous research, initial eye movements in young adults are based on bottom-up image features, whereas socially relevant content is fixated later in time (
Açık et al., 2010). Interestingly, as described above, we found that this was not the case when two images were shown at the same time. Considering the very short reaction time between stimulus onset and the observers’ reaction to fixate one of the two images, it is remarkable that participants must have prescanned both images in their peripheral visual field before initiating the first saccade. Thus, in contrast to classical salience maps, we might argue that the global salience of an image is highly related to its semantic and socially relevant content.
In order to further investigate the effects of the global image salience, we also evaluated the total time of image exploration, that is, the dwell time. We hereby found that, unlike for the initial saccade, the spatial location of images did not affect the time participants explored the individual images of each image pair. However, the task and familiarity had an effect. We saw that in the task where participants had to select the new image, new images were explored longer than previously seen images. In contrast, the task asking to select the old image led to an almost equal exploration time on new and familiar images. Therefore, we conclude that participants in general tended to explore new images for a slightly longer time. Nevertheless, and most importantly, we saw generally—and independent of the spatial location, task, and familiarity—that images with higher global salience were explored for a longer time. Thus, images with larger global salience not only attracted initial eye movements after stimulus onset but also led to longer exploration times. These results support our assumption that the global salience score of an image can also be interpreted as a measure of the general attraction of an image in comparison to other images.
In this regard, note that although we considered the location of the first fixation as the target variable to model the global salience scores and carry out the subsequent analyses, the same computational model and procedures can be used to model alternative aspects of the behavioral responses. For instance, the model could be trained to fit the dwell time—which we have found to be positively correlated with the global salience based on the first fixations—the engagement (time until fixating away), or the number of saccades.
Despite the high performance of our computational model and its potential to assign reliable global salience scores to natural images, an important limitation is that the model and thus the scores are dependent on the image set that we used. Whereas local salience maps rely on image features, our proposed global salience model relies on the differences between the stimuli and the behavioral differences that they elicit on the participants. We observed significant differences between image categories, for example, humans versus indoor scenes, but this is only one initial step, and future work should investigate what other factors influence the global image salience. For example, it would be interesting to train a deep neural network with a possibly larger set of images and the global salience scores learned by our model as labels, similarly to how Deep Gaze was trained to predict fixation locations. This could shed more light on what features make an image more globally salient.
Another related, interesting avenue for future work is investigating the global salience in homogeneous data sets, that is, with images of similar content. Our work has shown that large differences exist between images with somewhat different content, for instance, containing humans or not. However, we did not observe significantly different global salience between natural and urban scenes (see
Figure 2b), although significant differences do exist between specific images. An interesting question is:
What makes one image more likely to attract the first fixation, when presented alongside a semantically similar image? We think an answer to this question can be sought by combining a similar experimental setup to the one presented in this work, with additional data, and making use of advanced feature analysis, such as deep artificial neural networks, as mentioned above.
For instance, small changes in the context information of single images might already have a dramatic influence on reaction times in decision tasks (
Kietzmann & König, 2015). In addition, the global salience scores were based on the eye movement behavior of our particular sample of participants. Depending on the choice of participants, for example, different culture, age, personal interests, and emotions, our model could have revealed different results (
Balcetis & Dunning, 2006;
Dowiasch et al., 2015;
Kaspar et al., 2015). Again, further studies might use the model on a wider range of participants, in order to validate the specific global salience and thus attraction of images.
In contrast, differences in the global salience between participant groups could be a great advantage in certain research fields. In medical applications, for instance, researchers could identify specific conditions, such as autism spectrum disorder (ASD). In such an example, our method could generate a model of the global visual salience of both control participants and individuals with certain conditions, and then be used for diagnosis. Another use of our model could be in marketing research, where the attraction of different images could be compared adequately based on intuitive visual behavior. Thus, depending on the research question, the global image salience might provide new insight into the prediction and analysis of visual behavior.
The authors thank the Deutsche Forschungsgemeinschaft (DFG) and the Open Access Publishing Fund of Osnabrück University.
Supported by BMBF and Max Planck Society.
This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie Grant agreement No 641805.
Commercial relationships: none.
Corresponding author: Alex Hernández-García.
Address: Wachsbleiche 27, 49090 Osnabrück (Germany).
While there is no consensus about the best metric for the evaluation of logistic regression, the coefficient of discrimination
\(R^2\) proposed by
Tjur (2009) has been widely adopted recently, as it is more intuitive than other definitions of coefficients of determination and still asymptotically related to them.
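Tjur's coefficient of discrimination is the difference between the mean predicted probability among trials with a positive outcome and the mean among trials with a negative outcome; a minimal sketch:

```python
# Sketch of Tjur's coefficient of discrimination.
import numpy as np

def tjur_r2(y_true, y_prob):
    """Mean predicted probability of the positive class among positives
    minus the same mean among negatives (Tjur, 2009)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_prob = np.asarray(y_prob, dtype=float)
    return y_prob[y_true].mean() - y_prob[~y_true].mean()

print(tjur_r2([1, 1, 0, 0], [0.9, 0.7, 0.4, 0.2]))  # ~0.5 (0.8 minus 0.3)
```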