Methods  |   September 2013
A computational model for task inference in visual search
Author Affiliations
  • Amin Haji-Abolhassani
    Center for Intelligent Machines, Department of Electrical and Computer Engineering, McGill University, Montreal, Quebec, Canada
    amin@cim.mcgill.ca
  • James J. Clark
    Center for Intelligent Machines, Department of Electrical and Computer Engineering, McGill University, Montreal, Quebec, Canada
    clark@cim.mcgill.ca
Journal of Vision September 2013, Vol.13, 29. doi:https://doi.org/10.1167/13.3.29
Abstract
We develop a probabilistic framework to infer the ongoing task in visual search by revealing what the subject is looking for during the search process. Based on the level of difficulty, two types of tasks, easy and difficult, are investigated in this work, and individual models are customized for them according to their specific dynamics. We use Hidden Markov Models (HMMs) to model the human cognitive process that is responsible for directing the center of gaze (COG) according to the task at hand during visual search and for generating task-dependent eye trajectories. This generative model is then used to estimate the likelihood term in a Bayesian inference formulation to infer the task given the eye trajectory. In the easy task, the focus of attention (FOA) often lands on targets, whereas in the difficult one, in addition to the on-target foci of attention, attention is frequently deployed on nontarget objects. Therefore, we suggest a single-state and a multi-state HMM to serve as the cognitive process models of attention for the easy and difficult tasks, respectively.

Introduction
Human vision is an active, dynamic process in which the viewer seeks out specific visual inputs according to the ongoing cognitive and behavioral activity. A critical aspect of active vision is directing a spatially circumscribed region of the visual field (about 3°) corresponding to the highest resolution region of the retina, the so-called fovea, to the task-relevant stimuli in the environment. In this way our brain gets a clear view of the conspicuous locations in an image and will be able to build up an internal, task-specific, representation of the scene (Breitmeyer, Kropfl, & Julesz, 1982; Findlay & Gilchrist, 2001). 
In visual activities, the eyes make rapid movements, called saccades, typically between two and five times per second, in order to bring environmental information into the fovea. Pattern information is only acquired during periods of relative gaze stability, called fixations, owing to the retina's temporal filtering and the brain's suppression of information during the saccades (Matin, 1974). Gaze planning, thus, is the process of directing the fovea through a scene in real time in the service of ongoing perceptual, cognitive, and behavioral activity. The question of exactly what is happening during fixations is still something of a puzzle, but the effect of visual task on the pattern and specifications of eye movements has been long studied in the literature (Kowler, 2011). 
In two seminal studies, Yarbus (1967) and Buswell (1935) showed that visual task has a great influence on specific parameters of eye movement control. Figure 1 shows Yarbus's observation that eye fixations are not randomly distributed in a scene, but instead tend to cluster on some regions at the expense of others. In this figure we can see how visual task modulates the conspicuity of different regions and as a result changes the pattern of eye movements. 
Figure 1

Eye trajectories measured by Yarbus (1967) of viewers carrying out different tasks. (Upper right) No specific task. (Lower left) Estimate the wealth of the family. (Lower right) Give the ages of the people in the painting. The figure is adapted from Yarbus (1967) with permission from Springer Publishing Company.
Perhaps the Yarbus effect is best studied for the task of reading. Clark and O'Regan (1998) showed that when reading a text, the center of gaze (COG) lands on the locations that minimize the ambiguity of the word arising from the incomplete recognition of the letters. Feature integration theory (Treisman & Gelade, 1980) and guided search (Wolfe, Cave, & Franzel, 1989) are two seminal works that study the effect of visual search on how our brain directs eyes through a scene. In another study Bulling, Ward, Gellersen, and Tröster (2009) also showed that eye movement analysis is a rich modality for activity recognition. 
Although the effect of visual task on eye movement pattern has been investigated for various tasks, there is not much done in the area of visual task inference from the eye movements. In a forward Yarbus process, visual task is given as an input, and the output is task-dependent trajectories of eye movements. In this work, on the other hand, our goal is to develop a method to realize an inverse Yarbus process whereby we can infer the ongoing task by observing the eye movements of the viewer. However, solving an inverse Yarbus mapping is an ill-posed problem and cannot be solved directly. For instance, in a study by Greene, Liu, and Wolfe (2011) an unsuccessful attempt was made to directly solve an inverse Yarbus problem, which led to the conclusion that: “The famous Yarbus figure may be compelling but, sadly, its message appears to be misleading. Neither humans nor machines can use scan paths to identify the task of the viewer. (Greene et al., 2011, p. 1)” That said, in this work we find a way to regularize the ill-posedness of the problem and suggest a model to infer a visual task from eye movements. 
Visual search is one of the main ingredients of many complex tasks. When we are looking for a face in a crowd or counting the number of certain objects in a cluttered scene, we are unconsciously performing visual search to look for certain features in the faces or in the objects in a scene. As proposed in Treisman and Gelade (1980), the level of difficulty in a search task can vary according to the number of features distinguishing the target object from the distractors. For instance, targets defined by a unique color or a unique orientation are found more easily compared to the ones defined by a conjunction of features (e.g., red vertical bars). On this basis, we investigate two types of visual search, which we call easy search and difficult search. In the easy search task, the complexity of determining the presence of a target is lower than that of the difficult search, which reduces the number of fixations on nontarget objects compared to the difficult search task. 
In this work we develop a method to infer the task in visual search, which is essentially equivalent to finding out what the viewer is looking for. This is helpful in applications where knowing the target objects in a scene can help us improve the user experience in interaction with an interface. For instance, knowing what the user is seeking in a webpage combined with a dynamic design can lead to a smart webpage that highlights the relevant information according to the ongoing visual task. The same idea applies to intelligent signage that changes its contents to show relevant advertisements according to the foci of attention inferred from each viewer's eye movements. 
The model we propose is based on the generative model of Hidden Markov Models (HMM). For each task, we train a task-specific HMM to model the cognitive process in the human brain that generates eye movements given the task. The output of each HMM, then, would be task-dependent eye trajectories along with their respective likelihoods. In order to infer the ongoing task, we use this likelihood term in a Bayesian inference framework that incorporates the likelihood with a-priori knowledge about the task and gives the posterior probability of various tasks given the eye trajectory. 
HMMs have been successfully applied in speech recognition (Rabiner, 1990), anomaly detection in video surveillance (Nair & Clark, 2002), and handwriting recognition (Hu, Brown, & Turin, 1996). In the studies related to eye movement analysis, Salvucci and Goldberg (2000) used HMMs to break down eye trajectories into fixations and saccades. In another study Salvucci and Anderson (2001) developed a method for automated analysis of eye movement trajectories in the task of equation solving by using HMMs. Simola, Salojärvi, and Kojo (2008) used HMMs to classify eye movements into three phases, each representing a processing state of the visual cognitive system during a reading task. Van Der Lans, Pieters, and Wedel (2008) analyzed the ongoing attentional processes during the visual search. By training a HMM over a database of search trajectories, they showed that HMMs are a powerful tool to model the two underlying cognitive processes of a search operation—that is, localization of objects and identification of the target among the distractors. 
These studies unanimously concur that eye movements can be well modeled as the outcome of a sequential process that deploys visual resources on the field-of-view during a visual task based on a Markov process (also Ellis & Stark, 1986; Pieters, Rosbergen, & Wedel, 1999; Stark & Ellis, 1981). They also model the fixation positions as noisy outcomes of such a process, where the possible deviation of eye position from the focus of attention can be mapped to the disparity between the observation and hidden states in the context of HMMs (also Hayashi, Oman, & Zuschlag, 2003; Rimey & Brown, 1991). However, most of the studies only consider one specific task, such as reading a text or seeking a specific object among a group of distractors, and do not study the effect of task on the visual attention. 
Studying the effect of visual task on the parameters of the cognitive model of attention can help us come up with more accurate models that incorporate task specific features of visual attention. Besides, we can use such task-dependent models in the context of Bayesian inference to infer the ongoing task based on the a-posteriori knowledge of eye movements. To the best of our knowledge, the task-dependent modeling of attention by HMMs and using them to infer the visual task is a topic that has not been investigated in the literature related to computational modeling. 
To demonstrate our approach we ran two experiments. In the first study we dealt with inferring the ongoing task in an easy search, where targets are defined by a single feature and can be spotted without landing many fixations on the distractors. The proposed model infers what objects are being sought given the spatial location of fixations in an eye trajectory and the stimulus on which the visual search task was performed. 
In the second experiment we extended our model to infer the task in a more difficult search task. The application we designed for the difficult search is a soft keyboard, whereby the subjects can type a word by directing their COG onto characters appearing on a screen (eye-typing). In this case, inferring the task is equivalent to recognizing the word that is eye-typed, based on the eye movements and the keyboard layout. In order to allow for the deployment of attention on nontarget objects, induced by the augmented complexity of the task, we add a secondary state to the HMM structure that accounts for these off-target allocations of attention on distractors. The double-state structure of the HMM also enables us to capture the dynamics of state transitions, as well as fixations' spatial information, to highlight the more frequently visited, attention-demanding locations that involve heavier interaction with the working memory. 
In this experiment we also show how off-target fixations can help us infer the task by exhibiting task-specific patterns based on the appearance of the target. Therefore, in an extension to our double-state model, we add an extra state to the structure of the HMM, forming a tri-state HMM, to segregate directing the focus of attention (FOA) on similar-to-target or dissimilar-to-target distractors. We compare the results of these models on a database of eye movements obtained in the soft keyboard application to see how the introduction of new states to the model affects the results of task inference in a difficult search task. 
Inverse Yarbus process via Bayesian inference
Generative learning is a class of supervised learning that classifies data in a probabilistic manner (MacKay, 2003). By applying probabilistic inference we can develop a mathematical framework for merging all sources of information and presenting the result in the form of a probability density function. In the case of developing an inverse projection from eye movement space to visual task space, this structure will be useful in the sense that it can incorporate prior knowledge about the tasks. Moreover, we need the inference to give us a probability distribution over different possible tasks rather than providing us with a single task as the output. In this way we can design a higher-level process that makes decisions about the task and provides us with the degree of confidence in the decision. 
Suppose we have data of the form <Q, k>, where k ∈ K is a task label in the set of all task labels K and Q is the vector containing the observation sequence of fixation locations (q⃗1, q⃗2, … , q⃗T) sampled from a stochastic process {q⃗t} at discrete times t = {1, 2, … , T} over random image locations. Each q⃗i, itself, is a vector containing the coordinates of a fixation at time i defined by (xi, yi), where xi and yi are the x and y coordinates of the ith fixation, respectively. 
In general, generative learning algorithms model two entities: 
  • P(k): The prior probability of each task k ∈ K.
  • P(Q|k): The task conditional distribution, which is also referred to as the likelihood function.
We can write the probability of task k given an observed new sequence Q by a simple application of Bayes's rule:

P(k|Q) = P(Q|k) P(k) / P(Q).   (Equation 1)

Thus, in order to make an inference, we need to obtain the likelihood term and modulate it by our prior knowledge about the tasks. The likelihood term can be considered as an objective evaluation of the forward Yarbus process in the sense that it evaluates the probability of seeing an observation given a task. The likelihood term can be expressed as follows:

P(Q|k) = P(q⃗1, q⃗2, … , q⃗T | k).   (Equation 2)

The standard approach for quantifying the likelihood is to use a saliency map as an indicator of how attractive a given part of the field-of-view is to attention (Itti, Koch, & Niebur, 1998). In the theories of visual attention there are two major viewpoints that either emphasize bottom-up, image-based, and task-independent effects of the visual stimuli on the saliency map or top-down, volition-controlled, and task-dependent modulation of such maps. 
In bottom-up models, the allocation of attention is based on the characteristics of the visual stimuli and does not employ any top-down guidance or task information to shift attention (i.e., P[Q|k] is assumed to be equal to P[Q]). Moreover, in this model it is assumed that the observations q⃗i are conditionally independent, which reduces the likelihood term to:

P(Q|k) = P(Q) = ∏ P(q⃗i), i = 1, … , T.   (Equation 3)

Bottom-up models have been extensively researched and are quite well-developed (Itti & Koch, 2001a), but empirical evaluations of such models show that they are disappointingly poor at accounting for actual attention allocations when a visual task is involved (Einhäuser, Rutishauser, & Koch, 2008). In our view the bulk of this shortfall is due to the lack of task-dependence in the models. 
On the other hand, the top-down models (e.g., Itti & Koch, 2001b; Rutishauser & Koch, 2007) improve on the bottom-up models by incorporating task-dependency and can be used to generate the likelihood term of Equation 2 by the following equation:

P(Q|k) = ∏ P(q⃗i|k), i = 1, … , T.   (Equation 4)

Although top-down models somewhat address the problem of task independency of bottom-up models, they are based on some assumptions that degrade their performance in obtaining the likelihood term. In our proposed model we use HMMs to relax these assumptions and address the shortcomings of saliency-based models. 
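To make Equations 3 and 4 concrete, the following minimal Python sketch (our own illustration, not code from the study; the function and variable names are hypothetical) evaluates both likelihoods from saliency values read out at the fixated pixels, assuming each saliency map has been normalized to sum to one so that it can be treated as a probability map.

```python
import numpy as np

def saliency_loglik(fixations, saliency_map):
    """Sum of log-saliencies at the fixated pixels, assuming the fixations
    are conditionally independent (the product in Equations 3 and 4)."""
    p = saliency_map[fixations[:, 0], fixations[:, 1]]
    return np.sum(np.log(p + 1e-12))

# Bottom-up (Equation 3): a single task-independent map, so the score is the
# same for every task and cannot discriminate between them.
#   loglik_bu = saliency_loglik(fixations, bottom_up_map)
# Top-down (Equation 4): one task-modulated map per task k, giving P(Q|k).
#   logliks_td = {k: saliency_loglik(fixations, task_maps[k]) for k in task_maps}
```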
Covert versus overt visual attention
When a visual task is given to an observer, although correctly executing the task needs directing the FOA to certain targets in an image, the observed COG trajectory can vary from subject to subject.1 In other words, eye position does not tell the whole story when it comes to tracking attention (Carrasco, 2011). 
While it is well known that there is a strong link between eye movements and attention (Rizzolatti, Riggio, & Sheliga, 1994), the attentional focus is nevertheless frequently well away from the current eye position (Fischer & Weber, 1993). Eye tracking methods may be appropriate when the subject is carrying out a task that requires foveation. However, these methods are of little use (and even counterproductive) when the subject is engaged in tasks requiring peripheral vigilance. 
Figure 2 shows two different eye trajectories recorded while viewers were counting the number of “A”s in an image. As can be seen, these two images illustrate different levels of linkage between the COG and FOA. In the left figure, fixation points mainly land on the targets of interest (overt attention), whereas in the right figure, the COG does not necessarily follow the FOA and sometimes our awareness of a target does not imply foveation on that target (covert attention). 
Figure 2

Eye trajectories recorded while executing a task given the same stimulus. In the trajectories, straight lines depict saccades between two consecutive fixations (shown by dots). Two snapshots of the eye movements during the task of counting the "A"s are shown. The results from counting the characters were correct for both cases. Thus, the target that seems to be skipped over (the middle right "A" in the right figure) has been attended at some point.
The first scientist to provide an experimental demonstration of covert attention is known to be Helmholtz (1896). In his experiment, Helmholtz briefly illuminated the inside of a box by lighting a spark and looked at it through two pinholes. Before the flash he attended to a particular region of his visual field without moving his eyes in that direction. He showed that only the objects in the attended area could be recognized, implying that attention can be directed away from the point of fixation. 
In real life, human attention often deviates from the locus of fixation to give us knowledge about the parafoveal and peripheral environment. This knowledge can help the brain decide on the location of the next fixation that is most informative for building the internal representation of the scene. This discrepancy between the FOA and the COG helps us efficiently investigate a scene, and at the same time makes the FOA covert and consequently hard to track. An off-target fixation could also be attributed to an accidental, attention-independent movement of the eye, equipment bias, undershooting or overshooting of the target (Becker, 1972), or the phenomenon of COG fixations (He & Kowler, 1989; Najemnik & Geisler, 2005; Zelinsky, Rao, Hayhoe, & Ballard, 1997). Such off-target fixations can cause a single task to produce dissimilar eye trajectories in the forward mapping from the task space, and therefore make the inverse mapping from eye movement space into visual-task space (inverse Yarbus) an ill-posed problem. 
One solution to the problem of generating FOA trajectories is to force the binding of attention to eye movement by lowering the ratio between the target salience (signal strength) and the distractor salience (noise strength). In this way the targets become more resolution demanding, thus needing foveation in order to be distinguished from their surrounding distractors. The resulting COG trajectory, then, will be the same as the FOA trajectory. The manipulation of signal to noise ratio (SNR) has been studied before by Koch and Ullman (1985) as a way to attract attention. In another study Wolfe, Butcher, Lee, and Hyle (2003) proposed maximizing the SNR to decrease search time. Although by decreasing the SNR we obtain attention trajectories more easily, manipulation of the image SNR is not always feasible. For instance, in natural images, or more generally in nonsynthetic stimuli, we have limited control over the image to adjust the saliency of targets. 
Another way to regularize the ill-posed problem of the inverse Yarbus process is to consider the fact that although a task can generate dissimilar eye movements, it always has its own targets of interest in an image that can be revealed by tracking overt and covert attention. Thus, if we could find a way to track covert attention (tracking overt attention is merely equivalent to eye tracking) given the eye movements, we could regularize the ill-posed inverse problem and find a solution to our task inference problem. However, covert attention is different than eye position and revealing its location needs more complex tracking algorithms than simply tracking the eye position. 
One important point to be noted is that “off-stimulus” fixations do not necessarily mean that the FOA is away from the fixation (i.e., covert). For instance, in a phenomenon known as the center of gravity (also known as the global effect; He & Kowler, 1989; Najemnik & Geisler, 2005; Zelinsky et al., 1997), the stimulus is actually the collection of features, and the location of the stimulus is the center-of-mass of this collection. Hence the FOA is in this case overt. 
In the rest of the paper we elaborate on a method for tracking the focus of attention, whether overt or covert, that is based on the theory of HMMs. Unlike the saliency models, the proposed model should be able to represent off-stimulus locations and infer the task regardless of the attentional overtness. 
Hidden Markov Models
A HMM is a statistical model based on Markov processes in which the states are unobservable. In other words, HMMs model situations in which we receive a sequence of observations (that depend on a dynamic system), but do not observe the state of the system itself. 
More specifically, a HMM is a finite state machine (FSM) where each state is associated with a probability distribution or observation function. A set of transition probabilities govern the transitions between the states and after each transition an observation is generated according to the state's observation function. The only outcome from a HMM is the observation sequence and the generating state sequence is hidden to the observer; hence we call the technique Hidden Markov Model (see Figure 3). 
Figure 3

A sample first-order HMM. A HMM is defined by its number of states, transition probabilities, observation pdfs, and initial state distribution. By definition, the states are hidden to the observer and the output is a series of observations that are the outcomes of the observation pdfs. At each time step, the process picks a state according to the initial and transition probabilities and the output is a series of observations according to observation pdf of the state that is being visited at the time.
A typical discrete-time, continuous HMM λ can be defined by a set of parameters λ = (A, B, Π) where A is the set of state transition probabilities that governs the transitions between the states; B is the set of the parameters defining the observation probability density function of each state; and Π is the initial state distribution and defines the probability of starting from each of the states given an observation. 
In the literature related to HMMs we find three major problems: evaluation, decoding, and training. Assume we have a HMM λ and a sequence of observations Q. Evaluation or scoring is the computation of the probability of observing the sequence given the HMM, i.e., P(Q|λ). Decoding finds the best state sequence that maximizes the probability of the observation sequence given the model parameters. Finally, training adjusts model parameters to maximize the probability of generating a given observation sequence (training data). The most commonly used algorithms for the evaluation, decoding, and training problems are the forward, Viterbi, and Baum-Welch algorithms, respectively. Details about the methods can be found in Rabiner (1990) and Huang, Ariki, and Jack (1990). 
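As a minimal sketch of the evaluation problem, the following Python function (our own illustration with hypothetical names, not the authors' implementation) computes log P(Q|λ) with the scaled forward recursion for an HMM whose states emit fixation locations from 2-D Gaussian mixtures, which is the kind of emission model used in this paper.

```python
import numpy as np
from scipy.stats import multivariate_normal

def forward_loglik(fixations, A, pi, gmms):
    """Evaluation problem: log P(Q | lambda) via the forward algorithm,
    rescaled at each step to avoid numerical underflow.

    fixations : (T, 2) array of fixation coordinates (the observations Q).
    A         : (N, N) state transition matrix.
    pi        : (N,) initial state distribution.
    gmms      : list of N emission models; gmms[j] is a list of
                (weight, mean, cov) tuples defining state j's Gaussian mixture.
    """
    T, N = len(fixations), len(pi)
    # b[t, j] = P(q_t | state j), evaluated from state j's mixture of Gaussians.
    b = np.zeros((T, N))
    for j, mixture in enumerate(gmms):
        for w, mean, cov in mixture:
            b[:, j] += w * multivariate_normal(mean, cov).pdf(fixations)

    loglik = 0.0
    alpha = pi * b[0]                      # alpha_1(j) = pi_j * b_j(q_1)
    for t in range(T):
        if t > 0:
            alpha = (alpha @ A) * b[t]     # induction step
        scale = alpha.sum()
        loglik += np.log(scale)
        alpha /= scale                     # rescale so alpha stays well-conditioned
    return loglik
```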
In our proposed model, each state is composed of a certain number of targets. These targets tell us all possible locations of the FOA while we are in a state. The targets can be obtained from the saliency maps, or in the case of discrete objects, we can consider the center of each object in the two-dimensional (2-D) image as a potential target, from which we assign a number of them to each state. 
Selection of the objects to be assigned to each state depends on the task. In the following experiments we design a single-state and a multi-state HMM for inferring the task in an easy (Experiment 1) and a difficult (Experiment 2) search task, respectively. In the easy task, the targets can be distinguished from the distractors by a single feature, making them easy to spot. In the difficult search task, however, a combination of features should be considered in order to locate the targets, which makes the task more difficult. The level of difficulty in the tasks is directly reflected in the number of fixations that are made on nontarget objects. In the easy task the targets are easy to spot and finding them involves very few (if any) fixations on nontargets. However, in the difficult task, finding the target usually involves frequent foveation of the distractors in order to investigate the distinguishing features of the target. In the single-state design, thus, we only select the targets and put them in a single-state HMM. This model selection implies that most of the fixations are on targets, and mapping the fixations to the relevant targets of each task can reveal the task. In Experiment 2 we show that in order to design a more realistic generative model for a difficult task, we need to add another state and put all the nontarget objects into the secondary state. 
In Experiment 2 we also show that not all nontarget objects are fixated with the same frequency during a difficult task. The pattern of fixations on nontarget objects shows that the ones that are similar to the target get fixated more often. This feature suggests further classifying the nontargets based on their similarity to the target. Therefore, in the next attempt, we improve the task inference accuracy of the double-state HMM by adding a third state to the model, where one state is dedicated to the targets, one to the nontarget objects that are similar to the target, and one to the ones that are dissimilar to the target. 
Each observation density function is defined by a 2-D Gaussian function centered on each target and forms a mixture of Gaussians for all the targets in a state. In other words directing attention (covert or overt) to a target is equivalent to going to the state containing that target and selecting the Gaussian observation function that represents it. 
The location of the COGs is modeled as the outcome of the Gaussian mixture models (GMMs), which can be away from the target that is being attended. The eye movement trajectories are used to train the GMMs, transition probabilities, and initial state distributions to form task-dependent models λk, which in turn can be used to evaluate the probability P(Q|λk) for test trajectories. By this interpretation of variables we can use a sequence of fixations (Q) to represent the hidden foci of attention in the Bayesian inference and modify Equation 1 to:

P(k|Q) = P(Q|λk) P(k) / P(Q).   (Equation 5)

The likelihood term can be obtained by the forward algorithm, and the training of the model parameters (i.e., A, B, and Π), which are required by the forward algorithm, can be done with the expectation maximization (EM) based Baum-Welch algorithm (Rabiner, 1990). 
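Given per-task likelihoods P(Q|λk) from the forward algorithm, Equation 5 amounts to normalizing the likelihood-prior products over tasks. A short sketch of that final step (ours; the names are hypothetical):

```python
import numpy as np
from scipy.special import logsumexp

def task_posterior(logliks, log_priors):
    """Equation 5: P(k|Q) from per-task HMM log-likelihoods log P(Q|lambda_k)
    and log-priors log P(k); the evidence P(Q) falls out of the normalization.

    logliks, log_priors : dicts mapping each task label k to a float.
    """
    tasks = sorted(logliks)
    joint = np.array([logliks[k] + log_priors[k] for k in tasks])
    post = np.exp(joint - logsumexp(joint))   # normalize in log space
    return dict(zip(tasks, post))

# With equal priors, the maximum a-posteriori task reduces to the
# maximum-likelihood task, as noted for Experiment 1.
```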
In order to better understand how the concepts of hidden state and observation in a HMM relate to the covert FOA and overt COG, respectively, here we sketch a prototype that employs HMMs as a cognitive process model of attention. Figure 4 Left and Middle show a sample stimulus and its corresponding bottom-up saliency map (see Itti & Koch, 2001a, for more information on how to obtain the saliency map). The bottom-up saliency map shows the conspicuous locations in an image that are potential targets of attention, regardless of the task at hand (i.e., in free viewing). Therefore, in our model we consider these locations as the comprising targets of the hidden states of a HMM. Figure 4 Right shows eye fixation locations of a subject while performing a task (counting characters) superimposed on the original stimulus. The fixation locations sometimes undershoot or overshoot the targets due to the noisiness of the eye tracker or oculomotor properties of human eyes. In our model we posit that these fixation locations constitute the observations in the HMM context. We postulate that these observations are random outcomes from a 2-D Gaussian probability density function (with features x and y in Cartesian coordinates), which is maximum on the target location and fades off as we become more distant (Euclidean) from the targets. 
Figure 4

(Left) The original stimulus. (Middle) The corresponding saliency map. (Right) Eye trajectory superimposed on the image.
As opposed to the classical methods of attention tracking, the proposed model relaxes the overtness constraint postulated in both bottom-up and top-down methods by using HMMs to model the task-dependent attention process. HMMs allow for covert attention by postulating the fixations to be the outcome of observation distributions, which can be a point away from the FOA, whereas in the top-down and bottom-up attention models the fixation location is assumed to be on attention demanding spots in an image. Therefore, by using λk (rather than k) in Equation 5 we can track the covert attention by finding the most probable, hidden foci of attention in λk (for all tasks) that can generate the given observation sequence and calculate the likelihood term in order to make an inference. 
Experiment 1: Task inference in an easy visual search task
All the experiments conducted in this research were approved by the McGill Ethics Review Board. The Research Ethics Boards of McGill University adhere to and are required to follow the Canadian Institutes of Health Research, Natural Sciences and Engineering Research Council of Canada, and Social Sciences and Humanities Research Council of Canada Tri-Council Policy Statement: Ethical Conduct for Research Involving Humans, December 2010. This policy espouses the core principles of Respect for Persons, Concern for Welfare and Justice, in keeping with leading international ethical norms, such as the Declaration of Helsinki. 
Methods
In this experiment we use the proposed HMM-based model to infer the ongoing task in an easy visual search. The inference is made by applying Bayes rule (Equation 5) to the observation sequences of a database of task-dependent eye movements. In order to obtain the HMMs for each visual task, we need to train the parameters by using the eye movement database of the corresponding task. To do so, first we need to define the structure of our desired HMMs by creating a generic HMM. 
Since in the easy search the targets are distinguished from the distractors by a single feature, attention is mostly directed to task-relevant objects in the scene. This gives us a good basis to compare the results of HMMs and top-down models, because top-down models are ineffective in dealing with off-target attention deployment and assume the FOA to be mostly on targets. To enable this comparison, in our HMM-based model for the easy search we also assume the FOAs to be mostly on targets and degenerate the conventional structure of HMMs to a single-state, self-returning one, which results in a model that allows for covert attention but is otherwise similar to the top-down models. 
In the generic structure of the single-state HMM (SSHMM) the state represents the target locations for different tasks. For the target locations we postulate that the observations are random outcomes of a mixture of 2-D Gaussians with features x and y in Cartesian coordinates that are maximum on the centroids of the targets and fade away as we become more distant, as measured by a Euclidean metric, from them (see Figure 5 Left). In Figure 5 Middle we put the observation pdfs of all the objects together and superimposed them on the original image and its corresponding bottom-up saliency map. It is from this grid of Gaussians that we select the ones related to the task and combine them into a Gaussian mixture model (GMM) with equal weights to represent the state's observation pdf. Figure 5 Right shows an example of the GMM distribution for the task of counting the characters. From the pool of pdfs in Figure 5 Middle, only the Gaussians centered on the characters are selected. 
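A sketch of how such a single-state observation pdf could be assembled from the task-relevant target centroids (our own illustration under the equal-weight, shared-covariance assumptions stated here; the function and argument names are hypothetical):

```python
import numpy as np
from scipy.stats import multivariate_normal

def single_state_gmm(target_centroids, cov):
    """Observation pdf of the SSHMM: an equal-weight mixture of 2-D Gaussians,
    one per task-relevant target centroid, all sharing one covariance.

    target_centroids : (M, 2) array of (x, y) centers of the task's targets.
    cov              : (2, 2) covariance matrix (trained; see the text).
    Returns a function p(q) giving the fixation-location density at q.
    """
    M = len(target_centroids)
    components = [multivariate_normal(mu, cov) for mu in target_centroids]

    def pdf(q):
        return sum(c.pdf(q) for c in components) / M

    return pdf

# Example: for the character-counting task, target_centroids would hold the
# centers of all "A"s in the stimulus; feeding the same routine the red
# objects instead yields the SSHMM for counting red bars.
```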
Figure 5

The structure of the SSHMM. (Left) The generic SSHMM for task inference in an easy visual search. The transition matrix (a) is composed of a deterministic loop from the state to itself (a = 1) and the observation pdf comprises a mixture of Gaussians centered on task-relevant objects in the image. (Middle) Observation pdfs give us the probability of seeing an observation given a hidden state. In this figure we put the fixation location pdfs of all the targets together and superimposed them on the original image and its corresponding bottom-up saliency map. (Right) In this figure we show the GMM that constitutes the SSHMM of the task of character counting. From the pool of pdfs in the middle figure, only the Gaussians centered on the characters are selected.
As we can see, in this model only the hiddenness of the HMMs is emphasized and the Markovness of the sequences is marginalized to make a maximum likelihood estimator with a mixture distribution. 
In other words, the main difference between the classical top-down models and the proposed SSHMM is that in the top-down model we associate each fixation to the nearest neighbor target, whereas in the SSHMMs a fixation on an object might be a noisy observation of an attentional focus on another target. In this way, by comparing the results of the top-down model and the SSHMM we can examine the importance of using the observation distribution in HMMs and highlight the significance of considering covert attention in task inference. 
Having defined the general structure of the SSHMM, we can obtain task-dependent HMMs by training the generic HMM with task-specific eye trajectories by using the expectation maximization-based (EM-based) algorithm of Baum-Welch (Rabiner, 1990). In the training, we fix the means of the Gaussians to align with the center of the task-relevant objects and use a uniform distribution for the mixture of Gaussians' prior class probabilities to remove any spatial bias towards any target in stimuli. Moreover, since we have a deterministic state transition (A) and initial state distributions (Π), the only parameters to be trained are the covariances of the observation pdfs (C). 
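Under these constraints the Baum-Welch update reduces to estimating the Gaussian covariances. The sketch below (our simplified illustration, not the authors' code) additionally ties a single covariance across all components, an assumption the paper itself adopts later through parameter tying:

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_tied_covariance(fixations, means, cov0, n_iter=20):
    """EM for the SSHMM's only free parameter: one tied 2x2 covariance.

    fixations : (T, 2) array of all training fixation locations for a task.
    means     : (M, 2) fixed Gaussian means (task-relevant target centroids).
    cov0      : (2, 2) initial covariance (e.g., the sample covariance of
                nearest-neighbor residuals, as described in the text).
    """
    cov = cov0
    for _ in range(n_iter):
        # E-step: responsibility of each target (equal-weight mixture
        # component) for each fixation.
        dens = np.stack([multivariate_normal(mu, cov).pdf(fixations)
                         for mu in means], axis=1)              # (T, M)
        resp = dens / (dens.sum(axis=1, keepdims=True) + 1e-12)

        # M-step: a single covariance shared (tied) across all components.
        diff = fixations[:, None, :] - means[None, :, :]         # (T, M, 2)
        cov = np.einsum("tm,tmi,tmj->ij", resp, diff, diff) / len(fixations)
    return cov
```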
After training the task-dependent SSHMM (λk) for each task, we can calculate the likelihood term (P[Q|λk]) by applying the parameters of λk to the forward algorithm. In this way, we will be able to make inferences about the tasks given an eye trajectory by plugging the likelihood term into Equation 5. 
Experimental procedure
In order to perform the evaluation, we compare the results of our model with those of the top-down models. To build a database of task-dependent eye trajectories, we ran 1,080 trials and recorded the eye movements of six subjects while performing a set of predefined simple visual tasks. Six McGill graduate students (three females and three males), aged between 18 and 30, with normal or corrected-to-normal vision volunteered to participate in this experiment and all were naive about the purpose of the experiment. Five different visual stimuli were generated by a computer and displayed on a 1280 × 800 pixel screen at a viewing distance of 45 centimeters (1° of visual angle corresponds to 30 pixels, approximately). Each stimulus was composed of 30 objects, each randomly selected from a set of nine objects (horizontal bar, vertical bar, and character “A” in red, green, and blue colors) that were placed at the nodes of an imaginary 5 × 6 grid (6.75° × 8.1°) superimposed on a black background (see the lower layer of Figure 5 Middle). The visual tasks were counting red bars, green bars, blue bars, horizontal bars, vertical bars, or characters; hence six tasks in total. Each of the tasks was defined so that their corresponding targets can be distinguished from distractors by a single feature. For instance, characters are the only objects responding to slanted orientation filters and red objects can be detected by red feature maps alone (Itti & Koch, 2001a). 
At the beginning of each trial, a textual message defining the targets to be sought (e.g., red, green, or characters) followed by a fixation mark of size 0.26° × 0.26° appeared at the center of the screen. After foveating the fixation mark, the participant initiated the trial with a key-press. Once the trial was triggered, one of the five stimuli was shown on the display and the eye movements of the subject were recorded while the subject performed the specified visual task. 
An eye tracker (RK-726PCI, ISCAN Inc., Woburn, MA) was used to record the participant's left eye positions at 60 Hz and a chin rest was used to minimize head movements. According to the manufacturer's device description (ISCAN, version 1.1.01), the eye tracker's resolution is approximately 0.3° over ±20° horizontal and vertical range using the pupil/corneal reflection difference (the actual accuracy is likely to be poorer). An LCD monitor was used for displaying the images and the subjects used both eyes to conduct the experiments. 
Each subject did six segments of experiments, each of which consisted of performing the six tasks on five stimuli resulting in 180 trials for each subject (1,080 trials in total).2 
At the beginning of each session, we calibrated the eye tracker by having the participant look at a 20-point calibration grid (4 × 5) that extended to 8.0° × 10° of visual angle. The area covered by the calibration grid extends beyond the stimuli, which span 6.75° × 8.1° of visual angle. 
After recording eye movements, data analysis was carried out on each trial wherein we removed the blinks, outliers, and trials with wrong answers in the verification phase from the data and classified the eye movement data into saccades and fixations using the velocity-threshold identification (I-VT) method (Erkelens & Vogels, 1995) with a 50°/s threshold. It is generally agreed upon that visual and cognitive processing occur during fixations and little or no visual processing can be achieved during a saccade (Fuchs, 1971); therefore, in our analysis we only considered the fixation points. 
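A minimal sketch of the I-VT step described above, using the 60 Hz sampling rate and the approximate 30 pixels-per-degree conversion reported earlier; the function and its parameter names are our own illustration:

```python
import numpy as np

def ivt_fixations(gaze_xy, hz=60.0, px_per_deg=30.0, thresh_deg_s=50.0):
    """Split a gaze trace into fixation and saccade samples (I-VT).

    gaze_xy : (T, 2) array of gaze positions in pixels, sampled at `hz`.
    Returns a boolean mask that is True for fixation samples; consecutive
    True runs can then be averaged to obtain the fixation locations q_t.
    """
    # Point-to-point velocity in degrees of visual angle per second.
    step_px = np.linalg.norm(np.diff(gaze_xy, axis=0), axis=1)
    vel = (step_px / px_per_deg) * hz
    is_fix = np.concatenate([[True], vel < thresh_deg_s])
    return is_fix
```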
In training the task-dependent HMMs we set the initial values of the means equal to the centroids of the task-relevant targets on the bottom-up saliency map. Moreover, in order to be able to use a trained HMM independent of the stimuli they have been trained on, we use a technique called parameter tying (Rabiner, 1990) to force a unique, task and stimuli independent covariance matrix across all the Gaussian distributions in the mixtures (Appendix, “Baum-Welch algorithm” section). In other words, both in testing and training phases we build models by dynamically changing the means of GMMs according to the task-relevant targets in the stimulus. This is a crucial aspect of our model, since it is the definition of these GMMs that, unlike the saliency models, enables our model to represent off-stimulus locations, as well as on-stimulus ones. In the training phase we use nearest neighbor matching to find the closest state to each fixation point in the training database and use the sample covariance as the initial estimate of the covariance matrix in the generic HMM and in the test phase we use the trained covariance matrix. 
In top-down attention models, the viewer's task emphasizes the conspicuity of the relevant targets and provides us with a task modulated saliency map. Since top-down methods only model overt attention, we used a nearest neighbor approach to find the closest target to each fixation location to represent its attentional allocation. To train the top-down model, we set the target maps by manually selecting the to-be-counted objects in each stimulus (e.g., red objects in the task of counting the number of red objects). For optimization we used MATLAB optimization toolbox function fminsearch. The training was done in batch mode and a fixed sum of weights equal to one was used to avoid divergence. The resulting weight vectors were used to acquire the task-dependent saliency maps using a saliency toolbox (Walther & Koch, 2006) based on Itti and Koch's (2001a) model. By normalizing the resulting saliencies to one we could use them as probabilities P(q⃗i|k) and calculate the likelihood term of Equation 4. The viewer's task, then, was calculated by plugging the likelihood term into Equation 1. Since we used equal a-priori probabilities for all tasks, the inference was reduced to a maximum likelihood (ML) estimator; but with unequal prior probabilities we could use our prior knowledge about the tasks and turn the inference to a maximum a-posteriori (MAP) estimator. 
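The weight-fitting step could be sketched along the following lines (in Python, with Nelder-Mead playing the role of MATLAB's fminsearch). The objective shown here, maximizing the normalized saliency on the manually selected targets, is our own hedged reconstruction of the procedure, and all names are hypothetical:

```python
import numpy as np
from scipy.optimize import minimize

def fit_topdown_weights(feature_maps, target_mask):
    """Find feature weights that emphasize saliency on the task's targets.

    feature_maps : (F, H, W) array of bottom-up feature conspicuity maps.
    target_mask  : (H, W) boolean map of manually selected target objects.
    The weights are renormalized to sum to one, mirroring the fixed-sum
    constraint used in the batch training described above.
    """
    F = len(feature_maps)

    def neg_target_saliency(w):
        w = np.abs(w)
        w = w / w.sum()                               # fixed sum of weights = 1
        sal = np.tensordot(w, feature_maps, axes=1)   # weighted combination
        sal = sal / (sal.sum() + 1e-12)               # normalize to a probability map
        return -np.log(sal[target_mask] + 1e-12).sum()

    res = minimize(neg_target_saliency, np.ones(F) / F, method="Nelder-Mead")
    w = np.abs(res.x)
    return w / w.sum()
```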
Besides being incapable of dealing with covert attention, the top-down attention-tracking techniques are computationally complex. Finding task-dependent weight vectors requires performing an optimization for each stimulus. However, in the HMM framework we are able to achieve a considerable reduction in training by dynamically redesigning the structure of task-dependent HMMs. Namely, once we estimate the covariance matrix in the training, we can use the computed values to dynamically build task-dependent models across different stimuli. 
Results
Figure 6 shows the accuracy of the SSHMMs and the top-down models in inferring the viewer's task in terms of the number of correctly classified instances of a task. Each bar summarizes the accuracy of its corresponding model by representing the mean (%) along with its standard error of the mean (SEM) in correctly inferring the visual task. For each bar we ran a 10-fold cross-validation (Bishop, 2006) on a dataset of 1,080 task-specific eye trajectories to train/test the model and compared the performance of the models by drawing their corresponding bars for each visual task. As can be seen the SSHMM significantly outperforms the top-down method in all six cases, and that is mostly due to relaxing the overtness of attention constraint imposed in the top-down models. 
Figure 6

Comparison of the accuracy of visual task inference using the SSHMM and the top-down models. Each bar demonstrates the recognition rate (%) of inferring simple visual tasks of counting red (R), green (G), blue (B), horizontal (H), and vertical (V) bars, as well as counting the characters (C). The mean value and the SEM are represented by bars and the numerical values are given in the lower table.
Experiment 2: Task inference in a difficult visual search task
In Experiment 1 we successfully applied the idea of using HMMs to infer the ongoing task in an easy visual search where the targets differed from the surrounding distractors by a unique visual feature, such as color, orientation, size, or shape, and could be located in a stimulus within a short period of time. Nevertheless, in real life we usually encounter situations where the target is surrounded by distractors with similar features and can be distinguished from them only by comparing a combination of visual features. In this experiment we extend our method to a more complicated group of tasks and investigate the applicability of our HMM-based method in task inference in a difficult visual search. Moreover, Experiment 1 was done on synthetic stimuli with a limited set of simple tasks, whereas in this section we will use a more realistic application to evaluate our proposed model on a wider range of tasks. 
In our experiment we made the task difficult by making the targets distinguishable by a combination of features. The difficulty of the task increased the response time and caused several attentional deployments on nontarget objects (off-target FOAs) in order to examine their task relevant features and dismiss them from potential target locations. These off-target FOAs on objects that are not directly relevant to the ongoing task are not fully in line with the structure of the SSHMM model and will presumably cause an attenuation in the accuracy of task inference. In this section we tailor our model to allow for off-target FOAs in a difficult search paradigm. 
Methods
In this experiment we investigate task inference in a difficult search task by developing an eye-typing application, where users can type a character string by directing their gaze to an on-screen keyboard. In this scenario inferring the task is equivalent to determining what word has been eye-typed by observing the eye movements of the subject while performing the task; hence there is a wide range of potential tasks. In order to force visual search, we randomized the location of characters in the keyboard layout in each trial. After each trial we also ran a verification phase in which a question is asked about the location of one of the characters in the word the subject has just typed, to monitor the correctness of the process (see Figure 7). 
Figure 7

Eye-typing application setup. (Left) The schematic of the on-screen keyboard used in the experiments. We removed a letter ("Z") in order to have a square layout to reduce directional bias. Also, the location of each character is randomized in each trial so that the user has to search for the characters. (Middle) Eye movements of a subject are overlaid on the keyboard layout on which the trial was executed. The subject searches among the characters to eye-type the word "TWO." The dots indicate the fixations, and their diameters increase as the fixation duration rises. The connecting lines between the dots show the saccades that bring the new fixation location into the COG. (Right) After each trial a question is asked of the user about the location of a character that appeared in the word to validate the result.
In the Results section we show that the SSHMMs are not as effective for inferring task in a difficult search task. The bulk of this shortfall is due to the off-target FOAs that take place as a result of the increase in task difficulty. In the new models proposed here we add extra states to the model that represent the off-target FOAs in trajectories. This adds to the training burden due to the introduction of extra parameters to the model for the new states' observation matrices and state transition probabilities, but the advantage is two-fold: first, we allow for the off-target FOAs both in training and testing; second, the transition matrix becomes stochastic as opposed to the deterministic, self-returning transitions we had in SSHMM. A stochastic transition matrix introduces another source of information to the model by capturing the pattern of transitions between the states. This information can be matched against the test data to see how well the model accords with it. In this way we can consider the dynamics of attention and use pattern matching to locate the targets and predict the ongoing task. 
Double-state word HMM
In a first attempt to model the human process that generates eye trajectories, we add another state to the structure of the SSHMM. Figure 8 Left shows the amendment we made to the model in order to make it capable of dealing with the off-target FOAs. Here, for each state, a mixture of Gaussian distributions gives the probability of observing different fixation locations. For the on-target state, we postulate that fixation locations are random outcomes of a 2-D GMM with equal weights, which is maximum on the target character locations. Having defined the 2-D GMM of the on-target state, we can define the pdf of the off-target state simply by complementing the distribution function of the target state and normalizing it. The rationale behind this is that the probability of fixating a point in the close vicinity of a character is usually higher than at other locations when that character is the target; conversely, the farther from the character we fixate, the more probably we are attending a nontarget location. In Figure 8 Right we superimposed the 2-D observation pdfs of all the characters of a keyboard on the keyboard itself. 
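One way to realize the complemented off-target density on a discretized screen is sketched below; the grid construction, the particular reading of "complementing," and all names are our own assumptions rather than the authors' implementation:

```python
import numpy as np
from scipy.stats import multivariate_normal

def dswhmm_emission_maps(char_centroids, cov, height, width):
    """Discretized emission densities of the double-state word HMM.

    char_centroids : (M, 2) pixel centers (x, y) of the word's characters.
    cov            : (2, 2) covariance of each Gaussian component.
    Returns (p_on, p_off): two (height, width) maps that each sum to one.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    grid = np.stack([xs.ravel(), ys.ravel()], axis=1)

    # On-target state: equal-weight mixture of Gaussians on the characters.
    p_on = np.zeros(height * width)
    for mu in char_centroids:
        p_on += multivariate_normal(mu, cov).pdf(grid)
    p_on /= p_on.sum()

    # Off-target state: one possible reading of "complementing" the on-target
    # density, then renormalizing so it is again a probability map.
    p_off = p_on.max() - p_on
    p_off /= p_off.sum()

    return p_on.reshape(height, width), p_off.reshape(height, width)
```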
Figure 8

Structure of the DSWHMM. (Left) For each state (on-target and off-target) a GMM with equal weights defines the pdf of fixation locations. The transition probabilities (aij, i, j ∈ {O, T}) give us the probability of directing the attention from a target/nontarget object to a target/nontarget object in the image. For instance, if aTT > aTO it means that the targets are easy to spot and the chances of off-target fixations are low, and if aTT < aTO it means targets are hard to spot and finding them involves fixations on distractors. The initial state probabilities (Πi, i ∈ {O, T}) give us the probability of starting a search from each of the states. For instance, in the case that targets are hard to spot, the probability of starting the quest from the O state (off-target) is higher. (Right) In this figure we put the target states' fixation location pdfs of all the characters together and superimposed them on the original keyboard.
Beyond spatially allowing for off-target FOAs, the new design encapsulates another aspect of human attention that scrutinizes attention dynamics from a temporal point of view. The relation between working memory and visual attention is established in neurophysiological and psychophysiological studies of human attention (De Fockert, Rees, Frith, & Lavie, 2001; Lavie, 1995). It has been known that attention interacts with working memory and the intensity of this interaction depends on the memory requirement of the ongoing task. In the eye-typing application, after finding each character, the cognitive process that is responsible for driving visual attention retrieves the next character in the word string from short-term memory to direct attention towards relevant features in the image. Different variations of this interaction between the attentional system and short-term memory can be seen in most of the real world visual search tasks as well. In another example, when a viewer counts the number of an object in a scene, the interaction is invoked after finding each target to increase the count by one in the working memory. Moreover, as shown in Figure 7, after each trial a question is asked of the user to verify if the characters are attended correctly or not. Therefore, when the target character is found, the short-term memory keeps track of the coordinates of the characters in order to correctly respond to the verification question. 
Due to the different neural circuitry of visual attention and working memory in the brain (Goldman-Rakic, 1995), more interaction between these two functionalities will cause a longer stay on memory demanding target objects. In a recent study Mills, Van der Stigchel, Hoffman, and Dodd (2011) investigated the influence of task on temporal characteristics of eye movements during scene perception. They showed that in visual search, task set biases spatial (e.g., saccade amplitude), rather than temporal (e.g., latency), parameters of eye movement. This effect is also demonstrated in a study by Castelhano, Mack, and Henderson (2009), where they examined the influence of task on fixation duration and did not find a significant difference in average fixation duration when memory is involved more heavily in performing the task. However, they showed that for objects that require more thorough processing to be encoded into memory, a strategy is adopted to increase the number, rather than duration, of fixations on them. This conclusion is also supported by an earlier study by Loftus (1972) who showed that an increase in the interaction between attention and memory does not affect the duration of fixation, but rather increases the number of fixations made in the region. 
The design of the double-state word HMM (DSWHMM) allows us to exploit this phenomenon to spot a pattern in memory interaction and locate the target objects. In a difficult search task, where the level of memory involvement changes significantly during the search, the level of interaction with memory rises when a target is found, leading to multiple fixations on the same object. When a task-dependent model is trained, this pattern is therefore reflected in the transition matrix, which shows a bias in transitions from the on-target state to itself. This information, along with the spatial information embedded in the observation pdf, can be used to locate the task-relevant objects in a scene and to infer the task based on them. 
Double-state character HMM
In the DSWHMMs we showed how, by introducing a second state to the SSHMM, we can capture the temporal dynamics of visual attention while at the same time allowing for off-target FOAs. Although in the Results we show that the DSWHMM is a practical method for task inference with a small dictionary, it is not clear what portion of the obtained accuracy is due to the a-priori constraints set by the dictionary (zero probability for nonexistent words and equal probabilities for the existing ones) and what portion is due to the likelihood provided by the HMM structure. 
The dictionary size plays an important role in the classification accuracy of the DSWHMMs. On the one hand, the word model proposed in the DSWHMM is insensitive to the order of characters and cannot tell anagrams (words with the same characters in a different order) apart. On the other hand, the model performs best when the dictionary contains short words. For longer words, since the target GMM includes more Gaussians, it covers a larger area of the keyboard, which leads to classifying most of the fixations as on-target. Thus, as the dictionary grows, the chances of encountering long words or anagrams increase, which causes the performance to drop. 
The main reason behind this dependency on dictionary size is that in the word models all characters are treated the same and are all represented by a single GMM. This makes the model insensitive to the order of the constituent characters. Moreover, in long words, since the GMM of the on-target state spans a larger area of the keyboard, off-target fixations become harder to spot. 
The solution we suggest is to assign double-state HMMs to characters rather than to whole words. Namely, we suggest training 25 character models (the letter “Z” is omitted from the keyboard layout) and concatenating them according to a word's spelling to build each word model in the dictionary. In this way not only do we make the model robust to word length by treating each character independently of the preceding and following characters, but we also respect the order of the constituent characters while building the word models. Thus, we expect the model to be more robust to dictionary size. 
In Figure 9 Left we show the general structure of a double-state character HMM (DSCHMM) for the character “C.” While the structure is the same as in the DSWHMM, the observation distributions are centered on a single character rather than on all characters of a word. Similar to the DSWHMM, the transition matrix governs the transitions between the off-target and on-target states, and the initial state distribution defines the odds of starting from each state. Figure 9 Right shows how to concatenate the character models to build up a word model. Since the transition from one character to the next is comparable to starting a search for a new character, we assign the corresponding initial state probabilities to the transitions into each state. 
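To make the concatenation concrete, the following minimal Python sketch (our illustration, not the code used in the experiments) assembles a word-level transition matrix by chaining double-state character parameters; the numerical values of a and Π are borrowed from Table 1 purely for illustration, and the wiring of the between-character transitions is one plausible reading of the scheme described above.

```python
import numpy as np

# Illustrative character-level parameters (values taken from Table 1).
# Rows/columns: 0 = off-target (O), 1 = on-target (T).
a = np.array([[0.71, 0.29],
              [0.33, 0.67]])
pi = np.array([0.95, 0.05])

def word_transition_matrix(word_len, a, pi):
    """Chain `word_len` double-state character HMMs into one word HMM.

    Within a character the character-level transition matrix applies;
    leaving the on-target state of character k starts the search for
    character k + 1, so that transition is distributed according to pi.
    """
    n = 2 * word_len
    A = np.zeros((n, n))
    for k in range(word_len):
        o, t = 2 * k, 2 * k + 1                      # O and T states of character k
        A[o, o], A[o, t] = a[0, 0], a[0, 1]
        if k < word_len - 1:
            A[t, t] = a[1, 1]                        # refixations on the found target
            A[t, 2 * (k + 1)] = a[1, 0] * pi[0]      # next character, off-target
            A[t, 2 * (k + 1) + 1] = a[1, 0] * pi[1]  # next character, on-target
        else:
            A[t, o], A[t, t] = a[1, 0], a[1, 1]      # last character: stay local
    return A

A_word = word_transition_matrix(3, a, pi)            # e.g., a 3-character word
assert np.allclose(A_word.sum(axis=1), 1.0)          # rows remain stochastic
```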
Figure 9
 
Structure of the DSCHMM. (Left) shows the general structure of DSCHMMs for character “C.” The parameters aij and Πi are defined exactly the same as in the DSWHMM. The only difference between this model and the DSWHMM is that here we train a separate model for searching each character. Thus, in the target state we only have one Gaussian observation pdf around the location of the character in the image. Therefore, in order to build a word model we have to concatenate the models of the word's constituent characters. (Right) shows how to concatenate the character models to build up a word model (here for a hypothetical word “CA”). In this model (xC, yC) and (xA, yA) are the coordinates of characters “C” and “A,” respectively. For the transitions between the sub-HMMs (one per character), we use the initial state probabilities Πi, since looking for the next character after finding the preceding one is similar to starting a new search for the new character.
Tri-state HMM
Modeling subword units (i.e., characters), rather than words, allows us to investigate the fixations in more detail. In the DSCHMM structure we classified attention deployments as either on-target or on nontarget characters. However, we believe that even attended nontarget characters carry information about the sought character. Figure 10 Left shows the top nine bins of the histogram of fixations on characters (fixation distribution histogram) when looking for character “W.” It can be seen that even off-target fixations show a pattern, in the sense that visually similar characters tend to draw attention more often than dissimilar ones. 
Figure 10
 
Spatial distribution of fixations (fixation distribution histogram) while searching for a character. (Left) shows the top nine bins of the fixation distribution histogram when looking for character “W.” Similar characters tend to draw attention towards themselves, which is in line with psychological experiments. (Middle) shows the result of the experiment on perceptual measurement of image similarity (based on Gilmore et al., 1979). The figure is reproduced with permission from Springer Publishing Company. (Right) shows the average of the top nine fixation location histogram bins, along with their respective SEM, when looking for different characters on the keyboard.
This phenomenon is studied in the psychological literature on perceptual measurement of image similarity (Keren & Baggen, 1981). In particular, our finding is in line with a psychophysical experiment by Gilmore, Hersh, Caramazza, and Griffin (1979, figure 1) that classifies uppercase English letters according to their similarity in appearance. In Figure 10 Middle the result of this classification is shown in the form of a hierarchical clustering of the characters. The lower the line connecting two clusters, the higher the similarity between them. 
In our database of eye movements we saw a similar pattern when analyzing other characters as well. Figure 10 Right shows the average of the top nine fixation location histogram bins when looking for different characters. This trend suggests that off-target fixations can also be used as another source of information. Namely, when looking for a target, similar characters are more likely to be found among the off-target fixations, which can help us narrow down our choices in the inference process. Figure 11 Top shows the new structure we propose for task inference in the character recognition application. In this new setup, we split the off-target state into a dissimilar state (D-state) and a similar state (S-state) according to the similarity of the attended object to the target character. We believe the new tri-state HMM (TSHMM) for character recognition allows us to investigate the fixations in more detail and gives us more information about the sought character. In this model not only are the dynamics of on-target FOAs taken into account, but the off-target FOAs also play an important role in revealing the target character. Figure 11 Bottom shows how we build a word model by concatenating the HMMs of the constituent characters. As can be seen, the transitions between the characters are made from the target state, and the structure is otherwise similar to the DSCHMM. We heuristically select the top two characters of the fixation distribution histogram of each target to model the GMM of its S-state. The distribution function of the D-state is obtained by complementing the mixture of the target Gaussian pdf and the GMM of the S-state (all with the same weight). Similar to the DSCHMM, when going from one character to another, the transition probabilities are assumed to be proportional to the initial state probabilities. 
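As an illustration of how the S-state and D-state observation models could be assembled from training data, the sketch below picks the top two nontarget characters from a fixation histogram for the S-state and assigns the remaining characters to the D-state. The fixed 5 × 5 key coordinates and the fixation counts are hypothetical; in the actual experiments the key locations are randomized per trial, so the means would be set per trial.

```python
import numpy as np

# Hypothetical fixed 5 x 5 key layout ("Z" omitted, as in the experiments).
chars = list("ABCDEFGHIJKLMNOPQRSTUVWXY")
coords = {c: np.array(divmod(i, 5), dtype=float) for i, c in enumerate(chars)}

def tri_state_observation_means(target, fixation_counts, top_k=2):
    """Split nontarget characters into S-state (similar) and D-state (dissimilar).

    fixation_counts: dict mapping each nontarget character to the number of
    training fixations it received while `target` was being sought. The
    S-state GMM is centered on the top_k most-fixated nontargets; the D-state
    GMM covers the remaining characters with equal weights.
    """
    ranked = sorted(fixation_counts, key=fixation_counts.get, reverse=True)
    similar = ranked[:top_k]
    dissimilar = [c for c in chars if c != target and c not in similar]
    return (coords[target],
            np.stack([coords[c] for c in similar]),
            np.stack([coords[c] for c in dissimilar]))

# Toy counts: while searching for "W", fixations fell mostly on "V" and "M".
counts = {c: 1 for c in chars if c != "W"}
counts.update({"V": 12, "M": 9})
target_mean, s_means, d_means = tri_state_observation_means("W", counts)
```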
Figure 11
 
The structure of the TSHMM for character recognition. (Top) shows the TSHMM for a single character. The mean vectors of the S-state's GMM are centered on the top two characters in the fixation distribution histogram of the fixations made during the search for the target character in the training data. Similar to the other models, the transition probabilities aij govern the transitions between the states and the initial state distributions Πi give us the odds of starting a search from each state. Similar to the DSCHMM, the HMMs are trained for each character separately and concatenated in order to make a word model. (Bottom) shows how to concatenate the character models to build up the word model. Similar to the DSCHMM, the transitions between the states are governed by the initial state probabilities.
Experimental procedure
To build a database of task-dependent eye trajectories, we ran a set of trials and recorded the eye movements of six McGill graduate students (three females and three males), aged between 18 and 30, while they eye-typed 26 different 3-character words. The subjects had normal or corrected-to-normal vision and all were naive about the purpose of the experiment. The trials started with a fixation mark of size 0.26° × 0.26° appearing at the center of the screen. After foveating the fixation mark, the participant initiated the trial with a key-press. Once a trial was triggered, a textual message at the center of the screen showed the word to be eye-typed. Once the subject indicated their readiness by pressing a key, another fixation mark appeared at the center, followed by an on-screen keyboard similar to the one shown in Figure 7 Left. In this phase subjects eye-typed the word by searching for its characters as quickly as possible and signaled completion by pressing a key (subjects were only told to eye-type the words as quickly as possible and press a key when done). 
Each trial was followed by a verification phase in which a forced-choice question was asked about the location of a randomly selected character in the word. The selected character appeared as the label of two keys on the keyboard, only one of which corresponded to the character's original location in the layout (see Figure 7 Right). The viewer selected one of the keys as the correct location of the character by fixating it and pressing a button. In the data processing phase, we took the answer to this question as an indication of whether the subject had performed the task attentively. Once the question was answered, the next word was shown and the trial sequence carried on. 
Each keyboard was composed of 25 uppercase English characters randomly located on a 5 × 5 grid superimposed on a gray background (we removed the letter “Z” in order to have a square layout and reduce directional bias). The 3-letter words were selected so that no character was repeated within a word. At the beginning of every experimental session, we calibrated the eye tracker by having the participant look at a 16-point calibration display (4 × 4) that extended over 10° × 10° of visual angle. The area covered by the calibration grid therefore extended beyond the stimuli, which spanned 6.7° × 6.7° of visual angle. 
An eye tracker (ISCAN RK-726PCI) was used to record the participant's left eye positions at 60 Hz and a chinrest was used to minimize head movements. According to the manufacturer's device description (ISCAN, version 1.1.01), the eye tracker's resolution is approximately 0.3° over ±20° horizontal and vertical range using the pupil/corneal reflection difference (the actual accuracy is likely to be poorer). An LCD monitor was used for displaying the images and the subjects used both eyes to conduct the experiments. 
After recording the eye movements, data analysis was carried out on each trial: we removed blinks, outliers, and trials answered incorrectly in the verification phase, and classified the eye movement data into saccades and fixations using the velocity-threshold identification (I-VT) method (Erkelens & Vogels, 1995) with a 50°/s threshold. For the same reason as in Experiment 1, in our analysis we only considered the fixation points and removed the eye positions in between fixations. Moreover, in some of the initial trials, after eye-typing the word, the viewer returned to the characters to double-check their locations; in order to simulate a real eye-typing application, we also removed these parts of the trajectories during preprocessing. 
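For reference, a minimal sketch of the velocity-threshold idea is given below (our illustration, not the preprocessing code used here). It assumes gaze samples already expressed in degrees of visual angle at 60 Hz, and it omits the blink and outlier handling and any minimum-duration criterion used in the actual preprocessing.

```python
import numpy as np

def ivt_fixations(x_deg, y_deg, rate_hz=60.0, threshold=50.0):
    """Velocity-threshold identification (I-VT) of fixations.

    x_deg, y_deg: gaze coordinates in degrees of visual angle sampled at
    rate_hz. Samples whose point-to-point velocity is below `threshold`
    (deg/s) are grouped into fixations; each fixation is returned as
    (x, y, duration_s) using the centroid of its samples.
    """
    x, y = np.asarray(x_deg, float), np.asarray(y_deg, float)
    v = np.hypot(np.diff(x), np.diff(y)) * rate_hz        # deg/s between samples
    is_fix = np.concatenate([[True], v < threshold])      # simplification: label first sample as fixation
    fixations, start = [], None
    for i, f in enumerate(is_fix):
        if f and start is None:
            start = i
        elif not f and start is not None:
            seg = slice(start, i)
            fixations.append((x[seg].mean(), y[seg].mean(), (i - start) / rate_hz))
            start = None
    if start is not None:
        fixations.append((x[start:].mean(), y[start:].mean(), (len(x) - start) / rate_hz))
    return fixations
```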
After the preprocessing, we obtained a database of 145 trajectories {Q1, Q2, … , Q145}, each of the form (q⃗1, q⃗2, …, q⃗T), where each q⃗i contains the coordinates and duration of the fixation at time i in the form of a tuple (xi, yi, di). Each tuple gives us the x-coordinate, y-coordinate, and duration of the ith fixation, respectively. 
In order to train the models in the DSWHMM, DSCHMM, and TSHMM, we have to adjust the mean vectors of the 2-D Gaussians according to the training word so that they align with the centers of the target character locations. According to Rabiner (1990), a uniform (or random) initial estimate of Π and A is adequate for a useful re-estimation of these parameters (subject to the stochastic and nonzero value constraints). Thus, in the generic HMM, we set random initial values for the transition and initial state probabilities and ran the Baum-Welch algorithm on the training set to obtain the final model. Again we used the parameter tying technique (Rabiner, 1990) to force a single, task- and stimulus-independent covariance matrix across all the Gaussian distributions in the mixtures. Thus, we can build the word model for the test data by dynamically changing the means of the states according to the character locations of the words and using the estimated covariances of the characters. 
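A small sketch of the mean-anchoring and covariance-tying idea follows; the key coordinates and function name are hypothetical, and the 3.6° value is simply the standard deviation later reported in Table 1, used here for illustration.

```python
import numpy as np

def word_observation_model(word, key_coords, sigma_deg=3.6):
    """Anchor the on-target Gaussian means at the word's key centers and
    tie a single isotropic covariance across all of them."""
    means = np.array([key_coords[c] for c in word])
    cov = (sigma_deg ** 2) * np.eye(2)        # shared diagonal covariance (deg^2)
    return means, cov

# Hypothetical key centers (in degrees) for one randomized keyboard layout.
keys = {"T": (1.0, 2.5), "W": (3.0, 0.5), "O": (2.0, 4.0)}
means, cov = word_observation_model("TWO", keys)
```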
The resulting HMM is used to estimate the likelihood term of Equation 5 for every word in the dictionary by running the forward method on the trained model. Plugging this likelihood into Equation 5 yields the posterior probability of each word, where the a-priori term encodes our prior knowledge about the words and Q is the observation sequence containing the fixation locations. 
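The inference step itself can be summarized as below. The function is our own illustration of the Bayesian combination in Equation 5: it takes per-word log-likelihoods (e.g., obtained by running the forward algorithm on each word model) and optional log-priors, and returns the MAP word; the numerical values in the usage line are made up.

```python
import numpy as np

def infer_word(log_likelihoods, log_priors=None):
    """MAP word inference: argmax_w  log P(Q | w) + log P(w).

    log_likelihoods: dict word -> log P(Q | word-HMM).
    log_priors: dict word -> log P(word); omitted => uniform prior (ML estimate).
    """
    words = list(log_likelihoods)
    if log_priors is None:
        log_priors = {w: -np.log(len(words)) for w in words}
    log_post = {w: log_likelihoods[w] + log_priors[w] for w in words}
    z = np.logaddexp.reduce(list(log_post.values()))       # normalizer (log-sum-exp)
    posterior = {w: np.exp(v - z) for w, v in log_post.items()}
    best = max(posterior, key=posterior.get)
    return best, posterior

# Toy usage with made-up log-likelihoods for a three-word dictionary.
best, post = infer_word({"TWO": -41.2, "OWL": -44.8, "CAT": -52.3})
```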
Results
In a first attempt at inferring the task in the difficult search, we compare the top-down, SSHMM, and DSWHMM models for task inference and show the results together in Figure 12. We can see that although the SSHMM performs slightly better than the top-down technique, the accuracy decreases significantly compared to the results of task inference in the easy search. The rightmost bar shows the result of task inference when using the DSWHMM. Each bar summarizes the accuracy of its corresponding model by representing the mean (%) along with its SEM in correctly inferring the visual task. For each bar we ran a 10-fold cross validation on our database of 145 trajectories in order to define the training and test sets, and we used the same folds across all the methods. We also used equal probabilities as the word priors, which reduces Equation 5 to an ML estimator. 
Figure 12
 
Comparison of task classification accuracy using different models in a difficult visual search. Each bar demonstrates the mean classification rate (%) of correctly recognizing the intended word in the eye-typing application. The mean value and the SEM are represented by bars and the numerical values are given in the lower table.
As can be seen, the DSWHMM significantly outperforms the SSHMM and the top-down methods. The parameter estimates after training are shown in Table 1. A very interesting phenomenon seen in the training results is the standard deviation (SD) of the Gaussian distributions around the characters (σ2D), which extends over about 3.6° of visual angle. This angle appears consistent with previous estimates of the size of the operational fovea as the central 3° of vision (Johansson, Westling, Bäckström, & Flanagan, 2001). Besides, Carpenter (1991) shows that targets within 4° of central vision are still perceived at 50% of maximal acuity. Based on the current evidence we cannot tell whether this finding is a real effect or merely a coincidence; an additional experiment in which the distance between the observer and the screen is varied could help determine that. 
Table 1
 
Parameters of the DSWHMMs after training.
Parameter Value
aOO 71%
aOT 29%
aTT 67%
aTO 33%
πO 95%
πT  5%
σ2D 3.6°
In this experiment we also study the effect of dictionary size on the accuracy of task inference in the difficult task. We test the accuracy of task inference using the DSWHMM, DSCHMM, and TSHMM for different dictionary sizes. We created four dictionaries of 26, 52, 104, and 312 English words using the Carnegie Mellon pronouncing dictionary (CMPD) (Weide, 2005). All dictionaries contained the original 26 words and included all the words of the smaller dictionaries. The words were selected randomly from the CMPD, and word lengths varied between three and five characters. In all HMMs, we tied the covariances to a single diagonal covariance matrix and obtained the variance, transition matrix, initial state distribution, and mean vectors from the training set. We used 10-fold cross validation to define the training and test sets on the database of eye movements. For the DSCHMM and TSHMM we built the word models according to the templates shown in Figures 9 Right and 11 Bottom, respectively. 
In the TSHMM the characters selected for the S-state represented the top two bins in the fixation distribution histogram of the target character obtained in the training phase. Similar to Figure 10, the fixation distribution histogram was created by counting the number of fixations on each character (using nearest neighbor) when seeking a target. To do so, we manually labeled all 145 eye trajectories and split them into three parts, each of which represents the eye movements of the subject while looking for one character. The distribution function of the D-state is obtained by complementing the mixture of the target Gaussian pdf and the GMM of the S-state (all with the same weight). 
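A sketch of the nearest-neighbor counting used to build such a histogram could look as follows; the character coordinates and fixations in the usage line are hypothetical.

```python
import numpy as np

def fixation_histogram(fixations, key_coords):
    """Count fixations per character using nearest-neighbor key assignment.

    fixations: array-like of shape (n, 2) with fixation (x, y) locations.
    key_coords: dict character -> (x, y) center of its key.
    """
    chars = list(key_coords)
    centers = np.array([key_coords[c] for c in chars])
    counts = dict.fromkeys(chars, 0)
    for f in np.atleast_2d(np.asarray(fixations, float)):
        nearest = chars[int(np.argmin(np.linalg.norm(centers - f, axis=1)))]
        counts[nearest] += 1
    return counts

# Toy usage with two fixations and two hypothetical keys.
hist = fixation_histogram([(0.1, 0.2), (2.9, 0.1)], {"A": (0, 0), "B": (3, 0)})
```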
Figure 13 shows the accuracy of word inference using the DSWHMM, DSCHMM, and TSHMM methods across the four dictionary sizes. Although the accuracy of the DSCHMM is slightly less than that of the DSWHMM (74.2% vs. 76.4%) for the 26-word case, it shows less decline as the dictionary size increases. As expected, the TSHMM starts off even better than the DSWHMM and stays in the same range over different dictionary sizes. The table below the figure shows the accuracy and the SEM of each corresponding bar. 
Figure 13
 
Comparison of task classification accuracy using the TSHMM, DSCHMM, and DSWHMM methods in a difficult visual search. Each bar demonstrates the mean classification rate (%) of correctly recognizing the intended word in the eye-typing application. The mean value and the SEM are represented by bars and the numerical values are given in the following table.
Discussions
We applied our model, based on the theory of Hidden Markov Models (HMMs), to two tasks: an easy and a difficult search. For the easy search of Experiment 1, the application of a single-state HMM (SSHMM) with a mixture-of-Gaussians observation distribution gave us good results in task inference. However, Experiment 2 showed that the SSHMMs are not as effective for task inference in more complex tasks, due to the frequent off-target deployment of attention. 
Based on the literature related to the effect of task on eye movements, we know that complex tasks impose patterns on transitions rather than changing the aggregate measures of eye movements. For instance, Mills et al. (2011) showed that while the average duration of fixations or the mean fixation amplitude remains unchanged across tasks, the chances of refixating a target increase when that target is informative to the ongoing task. A similar effect is reported by Castelhano et al. (2009), who showed that the visual task does not influence the features obtained from individual fixations; rather, it is the pattern of fixations that changes across tasks. In the new model, along with the spatial information of the 2-D Gaussian mixture models (GMMs), we used transition pattern information to elicit information about the task. By introducing a second state, the double-state word HMMs (DSWHMMs) were able to capture the transition dynamics of the eye movement data and use self-returning transitions as a sign of the intensity of interaction with short-term memory, which in turn was used as an indication of whether the focus of attention (FOA) is on a target or a nontarget object. 
In another attempt we tried to reduce the a-priori constraints set by using a small dictionary in the DSWHMM. To do so, we proposed to model the attention cognitive process that drives eye movements as seeking characters as the targets, rather than the word as a whole. In this way not only were we able to respect the order of characters in a word, but we also allowed for longer words in the dictionary. The results showed that modeling characters using a double-state character HMM (DSCHMM) increases the consistency of word inference across different dictionary sizes, whereas the DSWHMM proved sensitive to the dictionary size. 
In the last model we proposed that even off-target FOAs show a specific pattern given the target. Namely, we found that in a visual search for a character, attention tends to land on characters similar to the target in order to narrow down the potential locations of the target. Thus, we proposed to split the off-target state into two separate states representing FOAs on characters similar and dissimilar to the target. The results showed that not only is the tri-state HMM (TSHMM) robust to the size of the dictionary, but the additional information elicited from the off-target fixations also helps us better infer the task. 
In general, we used the SSHMMs to model the covert attention in the easy visual search where the fixations are mainly on the targets. In the double-state HMMs (DSHMMs) we showed how we can capture the temporal dynamics of human attention and at the same time allow for off-target deployment of attention while maintaining the support for covert attention, inherited from the SSHMMs, by introducing a second state to the HMM structure. Two variations of the DSHMM were used to infer the task in whole (DSWHMM) or in part (DSCHMM). In the TSHMMs we took advantage of the information hidden in the off-target FOAs by introducing a third state to the model. Overall, the results supported the idea of attention modeling using the HMMs and suggested a solid probabilistic framework for task inference. 
In our view there are several reasons behind our improved results as compared to those in Greene et al. (2011). In Greene's experiment only aggregate features of eye movements, such as the number of fixations, duration of fixations, etc., were used to classify the trajectories. These features, however, have previously been shown (e.g., Castelhano & Henderson, 2008) not to be reliable for task inference and therefore cannot be used to regularize the ill-posedness of the problem. 
Another reason behind the failure of the aggregate-based method in inferring the task is that no information about the image is used in the classification, even though the relation between image content and eye movements is well studied and image context has been shown to have a major effect on eye movements (Torralba, Oliva, Castelhano, & Henderson, 2006). 
The purpose of this paper was to infer the visual task based on recordings of the fixations made on an image, not to infer the FOA. This can specifically be noticed in the first experiment, where all the fixations are assumed to be on target, while in real-world situations off-target fixations, even for simple tasks, are inevitable. In Experiment 2, however, the hidden states of the model intuitively come closer to our expectation of the FOA: the states are dedicated to on-target and off-target fixations, which can safely be assumed to occur in the FOA trajectories during task execution, and the off-target fixations are further broken down into those on characters similar and dissimilar to the target, which is in line with the experimental results of Gilmore et al. (1979) shown in Figure 10. Thus, the maximum a-posteriori (MAP) estimate of the hidden state sequence given an eye trajectory (which can be obtained using the Viterbi method) can be used as an estimate of the possible targets of visual attention for a given task. 
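For completeness, a minimal log-domain Viterbi sketch is shown below (our own illustration, not the implementation used here); the interpretation of the decoded states (on-target, similar, dissimilar) depends on which of the models above is used, and the toy parameters are made up.

```python
import numpy as np

def viterbi(pi, A, B):
    """MAP hidden-state sequence argmax_q P(q | O, lambda), computed in log space.

    pi: (N,) initial distribution, A: (N, N) transition matrix,
    B: (T, N) observation likelihoods with B[t, j] = b_j(O_t).
    """
    logA = np.log(A)
    T, N = B.shape
    delta = np.log(pi) + np.log(B[0])
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + logA            # scores[i, j] = delta_i + log a_ij
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + np.log(B[t])
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):                 # backtrack the best path
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Toy two-state example with made-up parameters.
pi = np.array([0.9, 0.1])
A = np.array([[0.7, 0.3], [0.3, 0.7]])
B = np.array([[0.8, 0.1], [0.2, 0.9], [0.1, 0.9]])
print(viterbi(pi, A, B))                          # -> [0, 1, 1]
```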
That said, if we were to accurately compare the hidden states of our models with the focus of attention, we would have to use methods such as attentional probes, microsaccade detection (Hafed & Clark, 2002), or functional magnetic resonance imaging (fMRI) recordings (Wojciulik, Kanwisher, & Driver, 1998) in order to locate the FOA and measure the correlation between these estimates and the centroids of the HMM states. 
In all the experiments we used equal probabilities as the task priors for the existing words in the dictionary, which results in a maximum likelihood (ML) estimator. However, in the proposed framework we can fuse other sources of information about the tasks as priors and increase the accuracy of the results. 
While the results presented in this report are very promising, further investigations could extend the idea to natural scenes and more realistic situations like those of Yarbus. The models proposed for both the easy and the difficult visual search are general and independent of the application; thus, we believe that, given a database of eye movements in natural scenes, the models can be applied for task inference. 
Although fixation locations and their transition patterns are good parameters for distinguishing between on-target and off-target fixations, the application of other parameters (modalities) of eye movements, such as blink rate, pupil diameter (Privitera, Renninger, Carney, Klein, & Aguilar, 2010), and saccade amplitude, should be evaluated as well. 
Conclusion
In this article we presented a probabilistic framework for task inference in visual search. This is helpful in applications where knowing the target objects in a scene can help us improve the user experience in interaction with an interface. For instance, knowing what the user is seeking in a webpage combined with a dynamic design can lead to a smart webpage that highlights the relevant information in a page according to the ongoing visual task. The same idea applies to an intelligent signage that changes its contents to show relevant advertisements according to the foci of attention inferred from each viewer's eye movements. 
We showed that the theory of HMMs provides a reliable model for the attention cognitive process of the human brain. In particular the observation generation of HMMs is fully compatible with the phenomenon of covert attention. 
In Experiment 1, we developed an HMM-based method to infer the visual task in an easy search task. In order to show the advantage of using the observation densities, we degenerated the structure of the HMM to a single-state one (SSHMM) that, similar to the top-down models, assumes the fixations to be mostly on targets. The results emphasize the importance of observation densities in attention modeling, which allow for covert attention and other sources of discrepancy between the COG and the FOA. 
In Experiment 2 we extended the suggested model to a more sophisticated one that was able to infer the visual task in difficult search tasks as well. In the DSWHMM we introduced another state to the SSHMM to allow for off-target fixations along with the on-target ones. By introducing the new state we were able to improve the results by capturing the dynamics of attention that applies to the resulting eye movements during transitions between on-target and off-target fixations. 
The same structure was then tested to infer subtasks rather than the whole task, where we split a task into its fundamental elements and inferred the subtasks using the DSCHMM. This gave us the opportunity to expand the size of the test dictionary of valid words and to recognize new entries based on their constituent characters. Furthermore, this model led to a third model that not only extracts information from on-target fixations but also considers the off-target fixations made during the execution of each subtask as a source of information, narrowing down the target character based on the preceding off-target fixations. Therefore, we added another state to the DSCHMM and formed a TSHMM to distinguish between off-target fixations made on similar-to-target and dissimilar-to-target objects in a scene and further improve the results of task inference based on this information. 
Acknowledgments
The authors would like to thank the Fonds Québécois de la Recherche sur la Nature et les Technologies (FQRNT) and the Natural Sciences and Engineering Research Council of Canada (NSERC) for their support of this work, the editor and the anonymous reviewers for their valuable comments on our submitted manuscript, and Prasun Lala for his helpful comments and discussions. 
Commercial relationships: none. 
Corresponding author: Amin Haji-Abolhassani. 
Email: amin@cim.mcgill.ca. 
Address: Center for Intelligent Machines, Department of Electrical and Computer Engineering, McGill University, Montreal, Quebec, Canada. 
References
Becker W. (1972). The control of eye movements in the saccadic system. Cerebral control of eye movements and motion perception, 82, 233–243.
Bishop C. (2006). Pattern recognition and machine learning (Vol. 4). New York: Springer.
Breitmeyer B. G. Kropfl W. Julesz B. (1982). The existence and role of retinotopic and spatiotopic forms of visual persistence. Acta Psychologica, 52 (3), 175–196. [CrossRef] [PubMed]
Bulling A. Ward J. Gellersen H. Tröster G. (2009). Eye movement analysis for activity recognition. Proceedings of the 11th International Conference on Ubiquitous Computing, (pp. 41–50 ). New York: ACM.
Buswell G. (1935). How people look at pictures: A study of the psychology of perception in art. Chicago: University of Chicago Press.
Carpenter R. (1991). The visual origins of ocular motility. Vision & Visual Function, 8, 1–10.
Carrasco M. (2011). Visual attention: The past 25 years. Vision Research, 51 (13), 1484–1525. [CrossRef] [PubMed]
Castelhano M. Mack M. Henderson J. (2009). Viewing task influences eye movement control during active scene perception. Journal of Vision, 9 (3): 6, 1–15, http://www.journalofvision.org/content/9/3/6, doi:10.1167/9.3.6. [PubMed] [Article] [CrossRef] [PubMed]
Castelhano M. S. Henderson J. M. (2008). Stable individual differences across images in human saccadic eye movements. Canadian Journal of Experimental Psychology, 62 (1), 1–14. [CrossRef] [PubMed]
Clark J. O'Regan J. (1998). Word ambiguity and the optimal viewing position in reading. Vision Research, 39 (4), 843–857.
De Fockert J. Rees G. Frith C. Lavie N. (2001). The role of working memory in visual selective attention. Science, 291 (5509), 1803. [CrossRef] [PubMed]
Einhäuser W. Rutishauser U. Koch C. (2008). Task-demands can immediately reverse the effects of sensory-driven saliency in complex visual stimuli. Journal of Vision, 8 (2): 2, 1–19, http://www.journalofvision.org/content/8/2/2, doi:10.1167/8.2.2. [PubMed] [Article] [CrossRef] [PubMed]
Ellis S. R. Stark L. (1986). Statistical dependency in visual scanning. Human Factors: The Journal of the Human Factors & Ergonomics Society, 28 (4), 421–438.
Erkelens C. Vogels I. (1995). The initial direction and landing position of saccades. Eye Movement Research: Mechanisms, Processes, & Applications, 6, 133–144.
Findlay J. M. Gilchrist I. D. (2001). Visual attention: The active vision perspective. In Jenkin M. Harris L. (Eds.), Visual attention: The active vision perspective (pp. 83–103). New York: Springer.
Fischer B. Weber H. (1993). Express saccades and visual attention. Behavioral & Brain Sciences, 16, 553. [CrossRef]
Fuchs A. (1971). The saccadic system. In Collins C. C. Hyde J. E. (Eds.), The control of eye movements (pp. 343-362). New York: Academic Press.
Gilmore G. Hersh H. Caramazza A. Griffin J. (1979). Multidimensional letter similarity derived from recognition errors. Attention, Perception, & Psychophysics, 25 (5), 425–431. [CrossRef]
Goldman-Rakic P. (1995). Cellular basis of working memory. Neuron, 14 (3), 477. [CrossRef] [PubMed]
Greene M. Liu T. Wolfe J. (2011). Reconsidering Yarbus: Pattern classification cannot predict observer's task from scan paths. Journal of Vision, 11 (11), 498, http://www.journalofvision.org/content/11/11/498, doi:10.1167/11.11.498. [Abstract] [CrossRef]
Hafed Z. Clark J. J. (2002). Microsaccades as an overt measure of covert attention shifts. Vision Research, 42 (22), 2533–2545. [CrossRef] [PubMed]
Hayashi M. Oman C. M. Zuschlag M. (2003). Hidden Markov models as a tool to measure pilot attention switching during simulated ILS approaches. In Proceedings of the 12th International Symposium on Aviation Psychology, (pp. 502–507). Presented April 14-17, 2003, Dayton, OH.
He P. Kowler E. (1989). The role of location probability in the programming of saccades: Implications for center-of-gravity tendencies. Vision Research, 29 (9), 1165–1181. [CrossRef] [PubMed]
Helmholtz H. (1896). Handbuch der physiologischen optik, Dritter Abschnitt. Zweite Auflage ed.
Hu J. Brown M. Turin W. (1996). HMM based on-line handwriting recognition. IEEE Transactions on Pattern Analysis & Machine Intelligence, 18 (10), 1039–1045.
Huang X. Ariki Y. Jack M. (1990). Hidden Markov models for speech recognition. Edinburgh: Edinburgh University Press.
ISCAN , I. (version 1.1.01). Operation instruction: Rk-726pci pupil/corneal reflection tracking system. Retrieved from ftp://ftp.cmrr.umn.edu/iscan/iscan_Track.pdf
Itti L. Koch C. (2001a). Computational modelling of visual attention. Nature Reviews Neuroscience, 2 (3), 194–204. [CrossRef]
Itti L. Koch C. (2001b). Feature combination strategies for saliency-based visual attention systems. Journal of Electronic Imaging, 10, 161–169. [CrossRef]
Itti L. Koch C. Niebur E. (1998). A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis & Machine Intelligence, 20 (11), 1254–1259. [CrossRef]
Johansson R. Westling G. Bäckström A. Flanagan J. (2001). Eye–hand coordination in object manipulation. Journal of Neuroscience, 21 (17), 6917–6932. [PubMed]
Keren G. Baggen S. (1981). Recognition models of alphanumeric characters. Attention, Perception, & Psychophysics, 29 (3), 234–246. [CrossRef]
Koch C. Ullman S. (1985). Shifts in selective visual attention: Towards the underlying neural circuitry. Human Neurobiology, 4 (4), 219. [PubMed]
Kowler E. (2011). Eye movements: The past 25 years. Vision Research, 51 (13), 1457–1483. [CrossRef] [PubMed]
Lavie N. (1995). Perceptual load as a necessary condition for selective attention. Journal of Experimental Psychology: Human Perception & Performance, 21 (3), 451. [CrossRef]
Loftus G. (1972). Eye fixations and recognition memory for pictures. Cognitive Psychology, 3 (4), 525–551. [CrossRef]
MacKay D. (2003). Information theory, inference, and learning algorithms. Cambridge, UK: Cambridge University Press.
Matin E. (1974). Saccadic suppression: A review and an analysis. Psychological Bulletin, 81 (12), 899–917. [CrossRef] [PubMed]
Mills M. Hollingworth A. Van der Stigchel S. Hoffman L. S. Dodd M. D. (2011). Examining the influence of task set on eye movements and fixations. Journal of Vision, 11 (8): 17, 1–15, http://www.journalofvision.org/content/11/8/17, doi:10.1167/11.8.17. [PubMed] [Article] [CrossRef] [PubMed]
Nair V. Clark J. (2002). Automated visual surveillance using hidden Markov models. In the 15th International Conference on Vision Interface (pp. 88-93). Calgary, Canada: Canadian Image Processing and Pattern Recognition Society.
Najemnik J. Geisler W. S. (2005). Optimal eye movement strategies in visual search. Nature, 434 (7031), 387–391. [CrossRef] [PubMed]
Pieters R. Rosbergen E. Wedel M. (1999). Visual attention to repeated print advertising: A test of scanpath theory. Journal of Marketing Research, 424–438.
Privitera C. Renninger L. Carney T. Klein S. Aguilar M. (2010). Pupil dilation during visual target detection. Journal of Vision, 10 (10): 3, 1–14, http://www.journalofvision.org/content/10/10/3, doi:10.1167/10.10.3. [PubMed] [Article] [CrossRef] [PubMed]
Rabiner L. (1990). A tutorial on hidden Markov models and selected applications in speech recognition. Readings in Speech Recognition, 53 (3), 267–296.
Rimey R. D. Brown C. M. (1991). Controlling eye movements with hidden Markov models. International Journal of Computer Vision, 7 (1), 47–65. [CrossRef]
Rizzolatti G. Riggio L. Sheliga B. (1994). Space and selective attention. Attention and Performance XV: Conscious and Nonconscious information Processing, 15, 231–265.
Rutishauser U. Koch C. (2007). Probabilistic modeling of eye movement data during conjunction search via feature-based attention. Journal of Vision, 7 (6): 5, 1–20, http://www.journalofvision.org/content/7/6/5, doi:10.1167/7.6.5. [PubMed] [Article] [CrossRef] [PubMed]
Salvucci D. Anderson J. (2001). Automated eye-movement protocol analysis. Human-Computer Interaction, 16 (1), 39–86. New York: ACM. [CrossRef]
Salvucci D. D. Goldberg J. H. (2000). Identifying fixations and saccades in eye-tracking protocols. In Proceedings of the 2000 Symposium on Eye Tracking Research & Applications, (pp. 71–78).
Simola J. Salojärvi J. Kojo I. (2008). Using hidden Markov model to uncover processing states from eye movements in information search tasks. Cognitive Systems Research, 9 (4), 237–251. [CrossRef]
Stark L. W. Ellis S. R. (1981). Scanpaths revisited: Cognitive models direct active looking. In Fisher D. F. Monty R. A. Senders J. W. (Eds.), Eye movements and psychological processes (pp. 192–226). Hillsdale, NJ: Lawrence Erlbaum Associates.
Torralba A. Oliva A. Castelhano M. S. Henderson J. M. (2006). Contextual guidance of eye movements and attention in real-world scenes: The role of global features in object search. Psychological Review, 113 (4), 766–786. [CrossRef] [PubMed]
Treisman A. Gelade G. (1980). A feature-integration theory of attention. Cognitive Psychology, 12 (1), 97–136. [CrossRef] [PubMed]
Van Der Lans R. Pieters R. Wedel M. (2008). Eye-movement analysis of search effectiveness. Journal of the American Statistical Association, 103 (482), 452–461. [CrossRef]
Walther D. Koch C. (2006). Modeling attention to salient proto-objects. Neural Networks, 19 (9), 1395–1407. [CrossRef] [PubMed]
Weide R. (2005). The Carnegie Mellon pronouncing dictionary. Retrieved from http://www.speech.cs.cmu.edu/cgi-bin/cmudict.
Wojciulik E. Kanwisher N. Driver J. (1998). Covert visual attention modulates face-specific activity in the human fusiform gyrus: fMri study. Journal of Neurophysiology, 79 (3), 1574–1578. [PubMed]
Wolfe J. Butcher S. Lee C. Hyle M. (2003). Changing your mind: On the contributions of top-down and bottom-up guidance in visual search for feature singletons. Journal of Experimental Psychology: Human Perception & Performance, 29 (2), 483–501. [CrossRef]
Wolfe J. Cave K. Franzel S. (1989). Guided search: An alternative to the feature integration model for visual search. Journal of Experimental Psychology, 15, 419–433. [PubMed]
Yarbus A. (1967). Eye movements and vision (translated from the Russian by B. Haigh). New York: Plenum Press.
Zelinsky G. J. Rao R. P. Hayhoe M. M. Ballard D. H. (1997). Eye movements reveal the spatiotemporal dynamics of visual search. Psychological Science, 448–453.
Footnotes
1  So far we have used COG and FOA interchangeably, but from now on, after explaining the difference between these two phenomena, we will distinguish between these two terms.
2  In order to reduce the memory effect, we set up the experiments so that each stimulus is displayed once in every five trials. Moreover, in each segment, each combination of stimulus-task is executed only once, which results in a repetition in executing a task on an image only in every other 30 trials. Given the nature of synthetic stimuli that comprise similar objects at random locations, we believe the effect of memory is not significant.
Appendix
Hidden Markov Models
The Hidden Markov Model (HMM) is a statistical model based on the Markov process in which the states are unobservable. In other words, HMMs model situations in which we receive a sequence of observations (that depend on a dynamical system), but we do not observe the state of the system itself. 
A typical discrete-time, continuous HMM λ can be defined by a set of parameters λ = {A, B, Π} where, 
  •  N: number of states in the model
  •  States: S = {s1, s2, … , sN}
  •  State at time t: qt ∈ S
  •  M: number of feature values in each observation
  •  O: sequence of T observations (O1, O2, … , OT)
  •  Ot (t ∈ [1, T]): an observation sample consisting of M feature values (ot,1, ot,2, … , ot,M)
  •  Π = {πi}: initial state distribution
  •  πi: probability of starting in state i
  •  A = {aij}: state transition probability distribution
  •  B = {bj(Ot)}: observation probability distribution in state j
Due to their probabilistic nature, HMMs can be used as a generative model to reproduce sequences of observations that are consistent with the implicit Markovian structure of the model. To generate a sample trajectory of length T, we choose an initial state according to the initial state distribution Π, choose an observation (Ot) according to the observation parameters (B) of the current state, choose the next state according to the state transition probabilities (A), and repeat the process T times (see Figure 14). 
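A minimal sketch of this generative recipe, with 2-D Gaussian observation pdfs and made-up parameters, is given below.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_hmm(T, pi, A, means, cov):
    """Generate T observations from an HMM with Gaussian observation pdfs.

    pi: initial state distribution, A: transition matrix,
    means: per-state 2-D observation means, cov: shared 2 x 2 covariance.
    """
    states, obs = [], []
    s = rng.choice(len(pi), p=pi)                            # initial state ~ Pi
    for _ in range(T):
        states.append(s)
        obs.append(rng.multivariate_normal(means[s], cov))   # O_t ~ b_s
        s = rng.choice(len(pi), p=A[s])                      # next state ~ A[s, :]
    return np.array(states), np.array(obs)

# Toy two-state example with made-up parameters.
pi = np.array([0.9, 0.1])
A = np.array([[0.7, 0.3], [0.3, 0.7]])
means = np.array([[0.0, 0.0], [3.0, 3.0]])
cov = np.eye(2)
states, observations = sample_hmm(10, pi, A, means, cov)
```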
Figure 14
 
HMMs as generating models. We can generate a sample observation sequence (O1, O2, … , OT) by using the probabilistic parameters of HMMs {A, B, Π}, where Ot = (ot,1, ot,2) (i.e., M = 2).
In the literature related to HMMs, three problems are the main focus of attention: evaluation, decoding, and training. Assume we are given an HMM λ and a sequence of observations O. Evaluation, or scoring, is the computation of the probability of the observation sequence given the HMM, i.e., P(O|λ). Decoding finds the best state sequence, the one that maximizes the probability of the state sequence given the observations and the model parameters. Finally, training adjusts the model parameters to maximize the probability of generating a given observation sequence (the training data). The algorithms that address the evaluation, decoding, and training problems are the forward (backward), Viterbi, and Baum-Welch algorithms, respectively. Of these three problems, we review evaluation and training, the two used in this paper. 
Evaluation
In the evaluation we want to calculate the probability P(O|λ). One method would be to evaluate it exhaustively over all possible state sequences: P(O|λ) = Σ_Q P(O|Q, λ) P(Q|λ), where the sum runs over all state sequences Q. For a given state sequence Q = q1 q2 … qT we would have P(O|Q, λ) = b_q1(O1) b_q2(O2) … b_qT(OT) and P(Q|λ) = π_q1 a_q1q2 a_q2q3 … a_q(T−1)qT. Therefore, P(O|λ) = Σ_Q π_q1 b_q1(O1) a_q1q2 b_q2(O2) … a_q(T−1)qT b_qT(OT). However, with T observations and N states in the model there are N^T possible state sequences and approximately 2T·N^T operations. A more efficient way to calculate the term P(O|λ) is to use one of the iterative procedures known as the forward and backward algorithms. 
Forward algorithm
In this method we define αt(i) = P(O1O2…Ot, qt = si|λ) as the probability of observing O1 to Ot and being in state si at time t, given the HMM λ. The probability P(O|λ) can then be computed with the following recursion: initialization, α1(i) = πi bi(O1); induction, αt+1(j) = [Σ_i αt(i) aij] bj(Ot+1) for t = 1, …, T − 1; and termination, P(O|λ) = Σ_i αT(i). 
This way, with T observations and N states, we require approximately N^2·T operations. 
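A compact implementation of this recursion might look as follows; the observation likelihoods bj(Ot) are assumed to be precomputed into a T × N matrix, and the toy parameters are made up.

```python
import numpy as np

def forward_likelihood(pi, A, B):
    """Forward algorithm: P(O | lambda) in roughly N^2 * T operations.

    pi: (N,) initial state distribution, A: (N, N) transition matrix,
    B: (T, N) matrix with B[t, j] = b_j(O_t), the observation likelihoods.
    """
    alpha = pi * B[0]                   # alpha_1(i) = pi_i * b_i(O_1)
    for t in range(1, len(B)):
        alpha = (alpha @ A) * B[t]      # alpha_{t+1}(j) = [sum_i alpha_t(i) a_ij] b_j(O_{t+1})
    return alpha.sum()                  # P(O | lambda) = sum_i alpha_T(i)

# Toy example with made-up parameters.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.2], [0.1, 0.8], [0.9, 0.2]])
print(forward_likelihood(pi, A, B))
```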
Backward algorithm
Another way to compute the term P(O|λ) efficiently is to run the recursion in the reverse direction of the forward algorithm. In this method we define βt(i) = P(Ot+1Ot+2…OT | qt = si, λ) as the probability of observing Ot+1 to OT given that the state at time t is si and the HMM λ. Starting from βT(i) = 1, the recursion is βt(i) = Σ_j aij bj(Ot+1) βt+1(j) for t = T − 1, …, 1, and finally P(O|λ) = Σ_i πi bi(O1) β1(i). 
Similar to the forward method, the complexity of this method for T observations and N states is approximately N^2·T operations. 
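The corresponding backward recursion, under the same conventions as the forward sketch above, could be written as below; the final line illustrates that the same P(O|λ) can be read off the backward pass.

```python
import numpy as np

def backward_probs(A, B):
    """Backward recursion: beta[t, i] = P(O_{t+1} ... O_T | q_t = s_i, lambda)."""
    T, N = B.shape
    beta = np.ones((T, N))                      # beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[t + 1] * beta[t + 1])  # sum_j a_ij b_j(O_{t+1}) beta_{t+1}(j)
    return beta

# Consistency check with made-up parameters.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.2], [0.1, 0.8], [0.9, 0.2]])
beta = backward_probs(A, B)
print((pi * B[0] * beta[0]).sum())              # equals the forward-algorithm result
```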
Training
In order to train the parameters of an HMM we need sequences of observations O. The goal is to adjust the HMM parameters λ = {A, B, Π} so that the probability P(O|λ) is maximized. In the literature on HMMs there is no closed-form solution to this problem. Instead, an iterative method based on dynamic programming, called Baum-Welch (also known as forward-backward), is used to find a solution. 
Baum-Welch algorithm
In this method, given an initial HMM λ, we estimate a new set of model parameters, λ̂, such that P(O|λ̂) ≥ P(O|λ). In order to do this we make extensive use of the forward and backward methods and, based on their results, iteratively improve the parameters. 
We define ξt(i, j) = P(qt = si, qt+1 = sj|O, λ) as the probability of being in state si at time t and state sj at time t + 1, given the model λ and the observation sequence O. Thus, we have ξt(i, j) = αt(i) aij bj(Ot+1) βt+1(j) / P(O|λ). Based on the above definition, we also define the state posterior γt(i) = P(qt = si|O, λ) = Σ_j ξt(i, j) = αt(i) βt(i) / P(O|λ), which will be used in the iterative method for finding the model parameters. 
Given these quantities, we can re-estimate the model parameters as follows: 
  •  Initial state probabilities: the expected frequency of being in state si at time t = 1, i.e., π̂i = γ1(i).
  •  Transition probabilities: the expected number of transitions from si to sj divided by the expected number of transitions out of si, i.e., âij = Σ_{t=1…T−1} ξt(i, j) / Σ_{t=1…T−1} γt(i).
  •  Observation probabilities: the expected number of times in state sj while observing a given value divided by the expected number of times in state sj; for the Gaussian observation pdfs used here, this amounts to re-estimating the mean (and covariance) of state sj as γt(j)-weighted averages of the observations, e.g., μ̂j = Σ_{t=1…T} γt(j) Ot / Σ_{t=1…T} γt(j).
Therefore, the Baum-Welch algorithm tunes the parameters to the training set using the following iterations: 
  •  Initialization: obtain an initial estimate for λ = {A, B, Π}.
  •  Likelihood computation: compute the likelihoods αt(i) and βt(i) and the posterior probabilities ξt(i, j) and γt(i) for i, j = 1, …, N and t = 1, …, T.
  •  Parameter update: given the likelihoods and posterior probabilities, compute π̂i, âij, and b̂j(Ot).
  •  Termination: repeat the likelihood computation and parameter update steps until convergence, i.e., until the improvement P(O|λ̂) − P(O|λ) falls below a small threshold.
In our HMM structure we modeled the observation probabilities of each state j (i.e., bj) by a normal distribution (Nj) with mean μj and covariance Cj. The benefit of modeling the observations by a normal distribution is that we can fix the means to the targets in the image and train only the covariance matrix. Moreover, since the foveal region of vision is circular, we can assume a diagonal covariance matrix. This technique is called parameter tying and significantly reduces the training burden in the above iterative algorithm. Using this assumption also increases the convergence rate of the Baum-Welch method by reducing the number of parameters to be updated in the parameter update step. 
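As an illustration of how the re-estimation simplifies under such tying, the sketch below runs Baum-Welch iterations with the observation model held entirely fixed (means anchored, covariance tied), so that only Π and A are updated; this is a further simplification relative to the paper, which also re-estimates the tied covariance, and the observation likelihood matrix and initialization in the usage lines are hypothetical.

```python
import numpy as np

def baum_welch_transitions(obs_lik, pi, A, n_iter=20):
    """Re-estimate pi and A with Baum-Welch, keeping the observation model fixed.

    obs_lik: (T, N) matrix with obs_lik[t, j] = b_j(O_t).
    """
    T, N = obs_lik.shape
    for _ in range(n_iter):
        # E-step: forward and backward passes.
        alpha = np.zeros((T, N)); beta = np.ones((T, N))
        alpha[0] = pi * obs_lik[0]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * obs_lik[t]
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (obs_lik[t + 1] * beta[t + 1])
        evidence = alpha[-1].sum()
        gamma = alpha * beta / evidence                          # gamma_t(i)
        xi = (alpha[:-1, :, None] * A[None] *
              (obs_lik[1:] * beta[1:])[:, None, :]) / evidence   # xi_t(i, j)
        # M-step: re-estimate initial and transition probabilities.
        pi = gamma[0]
        A = xi.sum(0) / gamma[:-1].sum(0)[:, None]
    return pi, A

# Toy usage with a made-up observation likelihood matrix and flat initialization.
pi0 = np.array([0.5, 0.5])
A0 = np.array([[0.5, 0.5], [0.5, 0.5]])
obs_lik = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
pi_hat, A_hat = baum_welch_transitions(obs_lik, pi0, A0)
```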
Figure 1
 
Eye trajectories measured by Yarbus (1967) by viewers carrying out different tasks. (Upper right) No specific task. (Lower left) Estimate the wealth of the family. (Lower right) Give the ages of the people in the painting. The figure is adapted from Yarbus (1967) with permission from Springer Publishing Company.
Figure 2
 
Eye trajectories recorded while executing a task given the same stimulus. In the trajectories, straight lines depict saccades between two consecutive fixations (shown by dots). In this figure, two snapshots of the eye movements during the task of counting the “A”s are shown. The results from counting the characters were correct in both cases. Thus, the target that seems to be skipped over (the middle right “A” in the right panel) has been attended at some point.
Figure 3
 
A sample first-order HMM. A HMM is defined by its number of states, transition probabilities, observation pdfs, and initial state distribution. By definition, the states are hidden to the observer and the output is a series of observations that are the outcomes of the observation pdfs. At each time step, the process picks a state according to the initial and transition probabilities and the output is a series of observations according to observation pdf of the state that is being visited at the time.
Figure 4
 
(Left) The original stimulus. (Middle) The corresponding saliency map. (Right) Eye trajectory superimposed on the image.
Figure 5
 
The structure of the SSHMM. (Left) The generic SSHMM for task inference in an easy visual search. The transition matrix (a) is composed of a deterministic loop from the state to itself (a = 1) and the observation pdf comprises mixture of Gaussians centered on target-relevant objects in the image. (Middle) Observation pdfs give us the probability of seeing an observation given a hidden state. In this figure we put the fixation location pdfs of all the targets together and superimposed them on the original image and its corresponding bottom-up saliency map. (Right) In this figure we show the GMM that constitutes the SSHMM of the task of character counting. From the pool of pdfs in the middle figure, only the Gaussians centered on the characters are selected.
Figure 6
 
Comparison of the accuracy of visual task inference using the SSHMM and the top-down models. Each bar demonstrates the recognition rate (%) of inferring simple visual tasks of counting red (R), green (G), blue (B), horizontal (H), and vertical (V) bars, as well as counting the characters (C). The mean value and the SEM are represented by bars and the numerical values are given in the lower table.
Figure 7
 
Eye-typing application setup. (Left) The schematic of the on-screen keyboard used in the experiments. We removed a letter (“Z”) in order to have a square layout and reduce directional bias. Also, the location of each character is randomized in each trial so that the user has to search for the characters. (Middle) The eye movements of a subject are overlaid on the keyboard layout on which the trial was executed. The subject searches among the characters to eye-type the word “TWO.” The dots indicate the fixations, and their diameters increase as the fixation duration rises. The connecting lines between the dots show the saccades that bring the new fixation location into the COG. (Right) After each trial the user is asked a question about the location of a character that appeared in the word to validate the result.
Figure 8
 
Structure of the DSWHMM. (Left) For each state (on-target and off-target), a GMM with equal weights defines the pdf of fixation locations. The transition probabilities (aij, i, j ∈ {O, T}) give the probability of shifting attention from a target/nontarget object to a target/nontarget object in the image. For instance, aTT > aTO means that the targets are easy to spot and the chance of off-target fixations is low, whereas aTT < aTO means that the targets are hard to spot and finding them involves fixations on distractors. The initial state probabilities (Πi, i ∈ {O, T}) give the probability of starting a search in each state. For instance, when targets are hard to spot, the probability of starting the search in the O (off-target) state is higher. (Right) The target states' fixation-location pdfs of all the characters, shown together and superimposed on the original keyboard.
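A minimal sketch of this two-state structure, scored with the standard forward algorithm, is given below. The transition and initial probabilities use the trained values later reported in Table 1; the per-state observation densities, however, are single-Gaussian stand-ins for the GMMs described in the caption, and their means and variance are hypothetical.

```python
import numpy as np
from scipy.stats import multivariate_normal

# State order: 0 = T (on-target), 1 = O (off-target). Probabilities from Table 1.
pi = np.array([0.05, 0.95])           # pi_T, pi_O: searches usually start off-target
A  = np.array([[0.67, 0.33],          # a_TT, a_TO
               [0.29, 0.71]])         # a_OT, a_OO

# Stand-in observation log-densities (a real model would use the per-state GMMs).
state_means = np.array([[200.0, 200.0],   # T: around the current target key (hypothetical)
                        [350.0, 300.0]])  # O: spread over the distractors (hypothetical)
def obs_logpdf(s, f):
    return multivariate_normal.logpdf(f, mean=state_means[s], cov=(40.0 ** 2) * np.eye(2))

def forward_loglik(fixations):
    """Log-likelihood of a fixation sequence under the two-state HMM (forward algorithm)."""
    log_alpha = np.log(pi) + np.array([obs_logpdf(s, fixations[0]) for s in range(2)])
    for f in fixations[1:]:
        log_b = np.array([obs_logpdf(s, f) for s in range(2)])
        # log-sum-exp over the previous state for numerical stability
        log_alpha = log_b + np.array([
            np.logaddexp.reduce(log_alpha + np.log(A[:, s])) for s in range(2)])
    return np.logaddexp.reduce(log_alpha)
```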
Figure 9
 
Structure of the DSCHMM. (Left) The general structure of the DSCHMM for character "C." The parameters aij and Πi are defined exactly as in the DSWHMM. The only difference is that here a separate model is trained for the search for each character; thus, the target state has a single Gaussian observation pdf centered on the location of that character in the image. To build a word model, we therefore concatenate the models of the word's constituent characters. (Right) How the character models are concatenated to build a word model (here for a hypothetical word "CA"). In this model, (xC, yC) and (xA, yA) are the coordinates of characters "C" and "A," respectively. For the transitions between the sub-HMMs (one per character), we use the initial state probabilities Πi, since looking for the next character after finding the preceding one is similar to starting a new search for the new character.
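One simple way to realize this concatenation in code (a sketch only; how the chain advances from one character to the next, and the advance probability, are assumptions not spelled out in the caption) is to build a block transition matrix in which each character contributes a T/O pair of states, the within-character dynamics reuse the DSWHMM transition probabilities, and the jump into the next character's states follows the initial distribution Πi.

```python
import numpy as np

# Per-character two-state parameters (state order within a character: T, O).
A_char  = np.array([[0.67, 0.33],
                    [0.29, 0.71]])
pi_char = np.array([0.05, 0.95])

def word_hmm(n_chars, p_advance=0.5):
    """Chain n_chars character models into one word model.
    From a character's T (target found) state, the chain advances to the next
    character's states via pi_char with hypothetical probability p_advance;
    otherwise it follows the within-character dynamics A_char."""
    n = 2 * n_chars
    A = np.zeros((n, n))
    for k in range(n_chars):
        i = 2 * k
        A[i:i + 2, i:i + 2] = A_char                 # within-character dynamics
        if k < n_chars - 1:
            A[i, i:i + 2] *= (1.0 - p_advance)       # scale down the T-state block...
            A[i, i + 2:i + 4] = p_advance * pi_char  # ...and enter the next character via pi
    pi = np.zeros(n)
    pi[:2] = pi_char                                 # the search starts with the first character
    return A, pi

A_word, pi_word = word_hmm(n_chars=3)                # e.g., a three-letter word such as "TWO"
```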
Figure 10
 
Spatial distribution of fixations (fixation distribution histogram) while searching for a character. (Left) The top nine bins of the fixation distribution histogram when looking for the character "W." Similar characters tend to draw attention towards themselves, which is in line with the psychological experiments. (Middle) The result of the experiment on perceptual measurement of image similarity (based on Gilmore et al., 1979). The figure is reproduced with permission from Springer Publishing Company. (Right) The average of the top nine fixation-location histogram bins, along with their respective SEM, when looking for different characters on the keyboard.
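A fixation distribution histogram of this kind can be computed, for example, by assigning each training fixation to the nearest key and ranking the keys by their share of fixations; the grid layout and the nearest-key assignment below are simplifying assumptions.

```python
import numpy as np

# Hypothetical 5x5 key layout ("Z" removed, as in Figure 7); the real layout is
# randomized on every trial, so the centers would be looked up per trial.
keys = [chr(c) for c in range(ord('A'), ord('Z'))]          # 'A'..'Y', 25 keys
centers = np.array([[80 + 100 * (i % 5), 80 + 100 * (i // 5)] for i in range(25)])

def fixation_histogram(fixations, top=9):
    """Assign each fixation to the nearest key and return the `top` keys
    with the largest share of fixations."""
    counts = np.zeros(len(keys))
    for f in np.asarray(fixations, dtype=float):
        counts[np.argmin(np.linalg.norm(centers - f, axis=1))] += 1
    share = counts / max(counts.sum(), 1)
    order = np.argsort(share)[::-1][:top]
    return [(keys[i], share[i]) for i in order]
```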
Figure 11
 
The structure of the TSHMM for character recognition. (Top) The TSHMM for a single character. The mean vectors of the S-state's GMM are centered on the top two characters in the fixation distribution histogram of the fixations made while searching for the target character in the training data. As in the other models, the transition probabilities aij govern the transitions between states, and the initial state distribution Πi gives the probability of starting a search in each state. As in the DSCHMM, the HMMs are trained for each character separately and concatenated to form a word model. (Bottom) How the character models are concatenated to build the word model. As in the DSCHMM, the transitions between the sub-HMMs are governed by the initial state probabilities.
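A hedged sketch of the resulting three-state observation model is shown below: the T state is centered on the target key, the S state is an equal-weight two-component GMM centered on the top two confusable keys taken from the training fixation histogram, and the O state covers the remaining keys. The shared variance and the exact choice of the O-state support are assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

sigma2 = 40.0 ** 2   # hypothetical shared variance of all Gaussian components

def make_tshmm_obs(target_xy, confusable_xy, all_key_xy):
    """Observation log-densities for the three states of a character's TSHMM.
    target_xy: (x, y) of the target key; confusable_xy: the two keys that attract
    the most non-target fixations in training; all_key_xy: every key center."""
    supports = {'T': [target_xy], 'S': list(confusable_xy), 'O': list(all_key_xy)}
    def logpdf(state, fixation):
        densities = [multivariate_normal.pdf(fixation, mean=m, cov=sigma2 * np.eye(2))
                     for m in supports[state]]
        return np.log(np.mean(densities))    # equal-weight GMM for each state
    return logpdf
```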
Figure 12
 
Comparison of task classification accuracy using different models in a difficult visual search. Each bar shows the mean classification rate (%) of correctly recognizing the intended word in the eye-typing application. The bars represent the mean value and the SEM; the numerical values are given in the table below.
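The classification underlying these rates amounts to scoring the observed eye trajectory under each candidate word model and picking the best-scoring word; a minimal sketch of that decision rule is given below, where `word_loglik` stands for any of the scoring functions sketched earlier and both the vocabulary and the optional word prior are hypothetical.

```python
def infer_word(fixations, vocabulary, word_loglik, log_prior=None):
    """Return the word whose HMM assigns the trajectory the highest score.
    word_loglik(word, fixations) is assumed to return log p(fixations | word);
    log_prior, if provided, maps a word to its log prior probability."""
    scores = {w: word_loglik(w, fixations) + (log_prior(w) if log_prior else 0.0)
              for w in vocabulary}
    return max(scores, key=scores.get), scores
```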
Figure 13
 
Comparison of task classification accuracy using the TSHMM, DSCHMM, and DSWHMM methods in a difficult visual search. Each bar shows the mean classification rate (%) of correctly recognizing the intended word in the eye-typing application. The bars represent the mean value and the SEM; the numerical values are given in the table below.
Figure 14
 
HMMs as generative models. A sample observation sequence (O1, O2, … , OT) can be generated using the probabilistic parameters of the HMM, {A, B, Π}, where Ot = (ot,1, ot,2) (i.e., M = 2).
Table 1
 
Parameters of the DSWHMMs after training.
Parameter Value
aOO 71%
aOT 29%
aTT 67%
aTO 33%
πO 95%
πT  5%
σ2D 3.6°
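One simple sanity check on these trained values (a sketch under a plain Markov-chain reading of the transition matrix, not an analysis from the paper) is to compute the chain's stationary distribution over the on-target and off-target states; doing so suggests that, in the long run, close to half of the fixations during a difficult search land off-target, which fits the high off-target start probability πO.

```python
import numpy as np

# Trained DSWHMM transition matrix from Table 1 (state order: T = on-target, O = off-target).
A = np.array([[0.67, 0.33],    # a_TT, a_TO
              [0.29, 0.71]])   # a_OT, a_OO

# Stationary distribution: the left eigenvector of A with eigenvalue 1, normalized to sum to one.
eigvals, eigvecs = np.linalg.eig(A.T)
stationary = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
stationary = stationary / stationary.sum()
print(dict(zip(["T (on-target)", "O (off-target)"], stationary.round(3))))
```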