**Most visual saliency models that integrate top-down factors process task and context information using machine learning techniques. Although these methods have been successful in improving prediction accuracy for human attention, they require significant training data and are unable to provide an understanding of what makes information relevant to a task such that it will attract gaze. This means that we still lack a general theory for the interaction between task and attention or eye movements. Recently, Tanner and Itti (2017) proposed the theory of goal relevance to explain what makes information relevant to goals. In this work, we record eye movements of 80 participants who each played one of four variants of a Mario video game and construct a combined saliency model using features from three sources: bottom-up, learned top-down, and goal relevance. We use this model to predict the eye behavior and find that the addition of goal relevance significantly improves the Normalized Scanpath Saliency score of the model from 4.35 to 5.82 (p < 1 × 10^{–100}).**

*D* with respect to an agent's probability distribution *P*(*S*) over the set *S* of possible ways it could achieve its goals as a distance measure, *d*(·, ·), between the prior distribution of beliefs *P*(*S*) and the posterior distribution *P*(*S*|*D*) after observation of data *D*:
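As a rough illustration of this definition, the sketch below computes a distance between prior and posterior over a small discrete solution set. The choice of Kullback-Leibler divergence (posterior relative to prior) for *d*(·, ·), the solution names, and the likelihood values are all assumptions made for the example, not details taken from the text.

```python
import math

def posterior(prior, likelihood):
    """Bayes' rule over a discrete solution set: P(s|D) is proportional to P(D|s) P(s)."""
    unnormalized = {s: prior[s] * likelihood[s] for s in prior}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("the data has zero probability under the prior")
    return {s: v / total for s, v in unnormalized.items()}

def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p[s] == 0 contribute nothing."""
    return sum(p[s] * math.log2(p[s] / q[s]) for s in p if p[s] > 0)

def goal_relevance(prior, likelihood):
    """d(P(S), P(S|D)), with d assumed here to be KL(posterior || prior)."""
    return kl_divergence(posterior(prior, likelihood), prior)

# Hypothetical solution set: two equally likely ways past an obstacle.
prior = {"jump_over": 0.5, "walk_under": 0.5}

# Data that is equally likely under both solutions leaves beliefs unchanged.
irrelevant = goal_relevance(prior, {"jump_over": 0.2, "walk_under": 0.2})

# Data that rules out one solution shifts all belief onto the other.
blocking = goal_relevance(prior, {"jump_over": 1.0, "walk_under": 0.0})
```

Under these assumptions, data that does not change the distribution over solutions has zero relevance, while data that eliminates one of two equally likely solutions carries one bit.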

- Mario has unlimited life. When he takes damage, the participant receives feedback but Mario is not killed.
- There is no time limit for completing a level.
- None of the levels contain any pits into which Mario could fall and die.
- All coins and power-ups have been removed from the game.
- Mario is unable to run and can only walk.
- When a Koopa enemy is destroyed, no shell is left behind.
- At the end of each level, an itemized score is displayed based on the time taken and how many times Mario took damage.

*t* is the amount of time taken, in seconds, and *d* is the number of times Mario received damage. This equation means that a single instance of taking damage costs the same number of points as spending an additional 30 seconds. Our custom levels are designed so that taking damage never saves that much time, so it is never worth taking damage to reach the end of the level more quickly. The optimal strategy is to prioritize avoiding damage and otherwise proceed as quickly as possible. This was explained to the participants, and the score display after each level showed the calculations to reinforce this point. The intention of this scoring metric was to ensure that the participants focused on avoiding the enemies and obstacles, rather than simply running as fast as possible without regard for the environment. From our observations of the participants, we believe this succeeded in keeping them highly engaged in safely navigating the levels.
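The trade-off between time and damage can be made concrete with a small sketch. Only the 30-second equivalence comes from the text; the base score and the per-second rate below are illustrative placeholders, not the game's actual constants.

```python
DAMAGE_PENALTY_SECONDS = 30  # from the text: one hit costs as much as 30 s

def level_score(t, d, base_points=10000, points_per_second=10):
    """Hypothetical score: a base amount minus a cost on time-equivalent seconds.

    t: time taken in seconds; d: number of times Mario took damage.
    base_points and points_per_second are assumed values for illustration.
    """
    return base_points - points_per_second * (t + DAMAGE_PENALTY_SECONDS * d)

# One hit is worth exactly 30 extra seconds.
same_cost = level_score(100, 1) == level_score(130, 0)

# A shortcut that saves 5 s but costs one hit scores worse than the safe route.
shortcut_is_worse = level_score(95, 1) < level_score(100, 0)
```

Any scoring rule of this shape makes damage avoidance the dominant consideration as long as no shortcut in a level saves 30 seconds or more.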

- Normal: Our modified Mario World game as described above.
- Small: Mario is half as tall as in the normal version. This allows him to fit through some gaps not accessible to normal Mario.
- High Jump: Mario can jump about twice as high as normal Mario.
- Invulnerable: Mario does not take damage when coming into contact with enemies and instead simply passes through them.

*h* of each node in the graph:

*x* and *y* are the coordinates of Mario in the level, *t* is the amount of time currently elapsed, in seconds, and *d* is the number of times Mario has received damage. *e*(*x*) is a function for estimating the amount of time remaining before Mario reaches the level end and is simply calculated as the remaining distance to the goal divided by Mario's maximum speed. Note that this equation is the same as Equation 3, except that the time component is broken down into current time and estimated remaining time, which is available to the model. The players had no direct information about the estimated remaining time. However, all of the levels are the same length, so players could have noticed this and gained an approximate sense of it. The model first computes the optimal solution to Equation 4 using A*. The fitness of that solution is then used to limit the breadth-first search, in which each node is evaluated using the same fitness function. Any node whose fitness is lower than the optimal solution's fitness by more than a threshold *δ*_{fitness} is pruned; see the *δ*_{fitness} on NSS scores subsection.
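The remaining-time estimate and the pruning rule can be sketched as follows. The estimate follows the description above (remaining distance divided by maximum speed). The fitness form, a negative time-equivalent cost using the 30-second damage equivalence, and all parameter names are assumptions for illustration and may differ in scale from Equation 4.

```python
def estimated_time_remaining(x, goal_x, max_speed):
    """e(x): remaining horizontal distance divided by Mario's maximum speed."""
    return max(goal_x - x, 0.0) / max_speed

def node_fitness(x, t, d, goal_x, max_speed, damage_penalty_s=30.0):
    """Assumed fitness: higher is better, so the projected cost is negated.

    t is elapsed time in seconds and d is the damage count so far.
    """
    projected_time = t + estimated_time_remaining(x, goal_x, max_speed)
    return -(projected_time + damage_penalty_s * d)

def keep_node(fitness, best_fitness, delta_fitness):
    """Keep a node unless it trails the A* optimum by more than delta_fitness."""
    return fitness >= best_fitness - delta_fitness

remaining = estimated_time_remaining(x=50.0, goal_x=100.0, max_speed=5.0)
fitness = node_fitness(x=50.0, t=20.0, d=1, goal_x=100.0, max_speed=5.0)
```

With `delta_fitness = 0`, any node that falls below the best solution's fitness is dropped immediately, which is the strictest form of the rule.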

*test objects*. This first experiment uses these to see if changing the solution space causes a predictable change in human gaze behavior.

*δ*_{test} of the object's center, to account for the fact that subjects often attend to locations adjacent to intended objects and for minor eye-tracking miscalibrations that arise from small head movements. We set *δ*_{test} = 4° visual angle, which is approximately the height of Mario. These AR and GRR values are then averaged across participants, separately for each of the four versions of the game. This gives four AR averages and four GRR averages for each object. Finally, we analyze these values within the following object subsets:

- Flying enemies (Figure 1A)
- Custom wall lowers: The lower portion of the specially designed wall sections (Figure 1B)
- Custom wall uppers: The upper portion of the specially designed wall sections (Figure 1B)
- Tunnels: Long horizontal blocks under which only small Mario can fit (Figure 1C)
- Raised ledges: Blocks onto which only high-jumping Mario can jump (Figure 1D)
- Uninteractable enemies: Enemies hidden behind blocks such that no version can interact with them (Figure 1E,F). Flying enemies are also considered uninteractable for game versions other than High Jump
- Interactable enemies: All enemies not considered uninteractable
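The distance test and the per-version averaging described before this list can be sketched as below. Coordinates are assumed to be expressed in degrees of visual angle, the boundary is treated as inclusive, and the function names are hypothetical.

```python
import math
from collections import defaultdict

DELTA_TEST_DEG = 4.0  # from the text: roughly the height of Mario

def attends(gaze, center, delta_test=DELTA_TEST_DEG):
    """True if the gaze point lies within delta_test of the object's center.

    Both points are (x, y) tuples in degrees of visual angle (an assumption).
    """
    return math.dist(gaze, center) <= delta_test

def mean_by_version(values):
    """Average per-participant values separately for each game version.

    values: iterable of (game_version, value) pairs, one per participant.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for version, value in values:
        sums[version] += value
        counts[version] += 1
    return {version: sums[version] / counts[version] for version in sums}

averages = mean_by_version([
    ("Normal", 0.2), ("Normal", 0.4), ("Small", 1.0),
])
```

Running the same averaging once over AR values and once over GRR values yields the four averages per object for each measure.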

*x*_{0}–*x*_{6} are coefficients selected using the simplex search method (Lagarias, Reeds, Wright, & Wright, 1998), an iterative algorithm for minimizing nonlinear functions. To avoid overfitting, we randomly set aside one participant from each group (two men, two women) and use the results from these four participants in the simplex search. The remaining analysis proceeds with these participants excluded, leaving 20 (10 men, 10 women) in each group.

- A green shade, if the AR values are significantly different from the baseline AR in the same direction as the GRR difference (up or down)
- A yellow shade, if the AR values are not significantly different from the baseline
- A red shade, if the AR values are significantly different but in the opposite direction as GRR
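The shading rule above amounts to a three-way classification, sketched here. The significance threshold `alpha` is an assumed value, and the differences are taken relative to the baseline version.

```python
def shade(ar_diff, grr_diff, p_value, alpha=0.05):
    """Classify one comparison against the baseline.

    ar_diff:  AR minus baseline AR
    grr_diff: GRR minus baseline GRR (assumed nonzero)
    p_value:  significance of the AR difference
    alpha:    significance threshold (an assumption for this sketch)
    """
    if p_value >= alpha:
        return "yellow"  # AR not significantly different from baseline
    same_direction = (ar_diff > 0) == (grr_diff > 0)
    return "green" if same_direction else "red"

examples = [
    shade(ar_diff=0.10, grr_diff=0.30, p_value=0.001),   # both up
    shade(ar_diff=0.01, grr_diff=0.30, p_value=0.40),    # not significant
    shade(ar_diff=-0.10, grr_diff=0.30, p_value=0.001),  # opposite directions
]
```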

*r*(162,218) = 0.1217, *p* < 1 × 10^{–100}. This correlation is impressive because the AR computation, unlike the GRR, captures all influences on human gaze, not just the top-down influences. This means that many of the saccades may be guided by other factors that goal relevance is not designed to capture. It is well known that task plays a role in attention (Hayhoe & Ballard, 2014), but it is clearly not the only factor, so it is expected that there be a significant portion of the AR values that are a priori uncorrelated with the GRR. Because of this, the GRR values are likely much more strongly correlated with the human notion of goal relevance than indicated by the numbers alone.

*x*_{2} and *x*_{4}–*x*_{6} are set to 0 and *x*_{0}, *x*_{1}, and *x*_{3} are found using the simplex search method as above. These combinations allow us to see the effects of adding or removing individual components from the combined model.

*F*(279,648) = −313.1510, *p* < 1 × 10^{–100}. Similarly, adding GR to the TD mask was also significant, *F*(279,648) = −91.2175, *p* < 1 × 10^{–100}, as well as adding GR to the BuTd model, *F*(279,648) = −89.9835, *p* < 1 × 10^{–100}. These results also hold for each game version individually, indicating that there was not much difference between versions in terms of the NSS scores.
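For readers unfamiliar with the metric, the sketch below shows a standard way to compute a Normalized Scanpath Saliency score for one frame: the saliency map is normalized to zero mean and unit standard deviation, and the normalized values are averaged at the fixated locations. The use of the population standard deviation and the row/column indexing are choices made for this example and may differ from the analysis code used here.

```python
from statistics import mean, pstdev

def nss(saliency_map, fixations):
    """Normalized Scanpath Saliency for one map.

    saliency_map: list of rows, each a list of numbers
    fixations:    list of (row, col) indices of fixated pixels
    """
    values = [v for row in saliency_map for v in row]
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (a choice)
    if sigma == 0:
        raise ValueError("saliency map is constant; NSS is undefined")
    normalized_at_fixations = [
        (saliency_map[r][c] - mu) / sigma for r, c in fixations
    ]
    return mean(normalized_at_fixations)

toy_map = [[0.0, 0.0],
           [0.0, 4.0]]
on_peak = nss(toy_map, [(1, 1)])   # fixation on the salient pixel
off_peak = nss(toy_map, [(0, 0)])  # fixation on a non-salient pixel
```

A score of 0 corresponds to chance, positive scores mean fixations landed on locations the model rated above average, and negative scores mean the opposite.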

*r*(78) = 0.31, *p* = 5.1 × 10^{–3}. However, filtering the model results based on this question did not reveal any significant differences in model performance based on player experience. That is, experience level affected the score but did not significantly affect the agreement between our model predictions and the players' eye movements.

*IEEE Transactions on Systems, Man, and Cybernetics: Systems*, 44, 523–538.

*arXiv preprint arXiv:1604.03605*.

*Perspectives on Psychological Science*, 3, 20–29.

*IEEE Transactions on Systems Science and Cybernetics*, 4, 100–107.

*Trends in Cognitive Sciences*, 9, 188–194.

*Current Biology*, 24, R622–R628.

*Advances in Neural Information Processing Systems*, 18, 547.

*IEEE Transactions on Pattern Analysis and Machine Intelligence*, 20, 1254–1259.

*Proceedings British Machine Vision Conference* 2014. BMVA Press. http://dx.doi.org/10.5244/C.28.73

*arXiv preprint arXiv:1711.09464*.

*Proceedings of the National Academy of Sciences*, 112, 16054–16059.

*SIAM Journal on Optimization*, 9, 112–147.

*Vision Research*, 41, 3559–3565.

*Nature*, 369, 742.

*Vision Research*, 45, 205–231.

*IEEE Computer Vision and Pattern Recognition* (pp. 1–8). Washington, DC: IEEE.

*Journal of Vision*, 13(3):16, 1–14, https://doi.org/10.1167/13.3.16.

*Vision Research*, 41, 3535–3545.

*Advances in neural information processing systems* (pp. 1467–1474).

*Psychological Review*, 124, 168.

*IEEE Transactions on Pattern Analysis and Machine Intelligence*, 39, 576–588.

*δ*_{fitness} on NSS scores

*δ*_{fitness} below the fitness of the best path found using an A* search. However, computational constraints required that we set this threshold to 0, meaning that we pruned branches as soon as they became worse than the best solution. To confirm that this did not affect our results, we reran the analysis using higher values but only on the four validation participants used for the simplex search (Lagarias et al., 1998). This is shown in Figure 7. Although the NSS scores were slightly lower with the higher values, the difference is small enough that the significance of the results is not affected.

*F*(279,648) = −269.6313, *p* < 1 × 10^{–100}. Adding GR to the TD mask was also significant, *F*(279,648) = −6.8460, *p* = 7.60 × 10^{–12}, as well as adding GR to the BuTd model, *F*(279,648) = −6.7976, *p* = 1.07 × 10^{–11}. Although the differences between the model scores are not as large, the significance of the results is unaffected.