Human vision is an active, dynamic process in which the viewer seeks out specific visual inputs according to the ongoing cognitive and behavioral activity. A critical aspect of active vision is directing a spatially circumscribed region of the visual field (about 3°), corresponding to the highest-resolution region of the retina, the so-called fovea, to the task-relevant stimuli in the environment. In this way our brain obtains a clear view of the conspicuous locations in an image and can build up an internal, task-specific representation of the scene (Breitmeyer, Kropfl, & Julesz, 1982; Findlay & Gilchrist, 2001).
In visual activities, the eyes make rapid movements, called saccades, typically between two and five times per second, in order to bring environmental information into the fovea. Pattern information is acquired only during periods of relative gaze stability, called fixations, owing to the retina's temporal filtering and the brain's suppression of information during saccades (Matin, 1974). Gaze planning, thus, is the process of directing the fovea through a scene in real time in the service of ongoing perceptual, cognitive, and behavioral activity. Exactly what happens during fixations remains something of a puzzle, but the effect of the visual task on the pattern and parameters of eye movements has long been studied in the literature (Kowler, 2011).
In two seminal studies, Yarbus (1967) and Buswell (1935) showed that visual task has a great influence on specific parameters of eye movement control. Figure 1 shows Yarbus's observation that eye fixations are not randomly distributed in a scene but instead tend to cluster on some regions at the expense of others. In this figure we can see how the visual task modulates the conspicuity of different regions and, as a result, changes the pattern of eye movements.
Perhaps the Yarbus effect is best studied for the task of reading. Clark and O'Regan (1998) showed that when reading a text, the center of gaze (COG) lands on the locations that minimize the ambiguity of a word arising from the incomplete recognition of its letters. Feature integration theory (Treisman & Gelade, 1980) and guided search (Wolfe, Cave, & Franzel, 1989) are two seminal works that study how visual search shapes the way our brain directs the eyes through a scene. In another study, Bulling, Ward, Gellersen, and Tröster (2009) showed that eye movement analysis is a rich modality for activity recognition.
Although the effect of the visual task on eye movement patterns has been investigated for various tasks, little has been done on inferring the visual task from the eye movements themselves. In a forward Yarbus process, the visual task is given as input, and the output is a set of task-dependent eye movement trajectories. In this work, by contrast, our goal is to develop a method that realizes an inverse Yarbus process, whereby we infer the ongoing task by observing the eye movements of the viewer. However, the inverse Yarbus mapping is an ill-posed problem and cannot be solved directly. For instance, in a study by Greene, Liu, and Wolfe (2011), an unsuccessful attempt to solve the inverse Yarbus problem directly led to the conclusion that “The famous Yarbus figure may be compelling but, sadly, its message appears to be misleading. Neither humans nor machines can use scan paths to identify the task of the viewer” (Greene et al., 2011, p. 1). That said, in this work we find a way to regularize the ill-posedness of the problem and propose a model to infer the visual task from eye movements.
Visual search is one of the main ingredients of many complex tasks. When we are looking for a face in a crowd or counting the instances of a certain object in a cluttered scene, we are unconsciously performing visual search, looking for certain features in the faces or objects in the scene. As proposed by Treisman and Gelade (1980), the difficulty of a search task varies with the number of features distinguishing the target object from the distractors. For instance, targets defined by a unique color or a unique orientation are found more easily than those defined by a conjunction of features (e.g., red vertical bars). On this basis, we investigate two types of visual search, which we call easy search and difficult search. In the easy search task, the complexity of determining the presence of a target is lower than in the difficult search task, which reduces the number of fixations on nontarget objects.
In this work we develop a method to infer the task in visual search, which is essentially equivalent to finding out what the viewer is looking for. This is useful in applications where knowing the sought-after objects in a scene can improve the experience of interacting with an interface. For instance, knowing what the user is seeking in a webpage, combined with a dynamic design, can lead to a smart webpage that highlights the relevant information according to the ongoing visual task. The same idea applies to intelligent signage that changes its contents to show advertisements relevant to the foci of attention inferred from each viewer's eye movements.
The model we propose is based on the generative framework of hidden Markov models (HMMs). For each task, we train a task-specific HMM to model the cognitive process in the human brain that generates eye movements given that task. The output of each HMM is then a set of task-dependent eye trajectories along with their respective likelihoods. To infer the ongoing task, we use this likelihood term in a Bayesian inference framework that combines the likelihood with a priori knowledge about the task and yields the posterior probability of each task given the eye trajectory.
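Concretely, and in notation introduced here only for illustration (a fixation sequence F, the parameters λ_i of the HMM trained for task i, and a task prior), the inference step amounts to an application of Bayes' rule:

\[
P(\mathrm{task}_i \mid F) \;=\; \frac{P(F \mid \lambda_i)\, P(\mathrm{task}_i)}{\sum_j P(F \mid \lambda_j)\, P(\mathrm{task}_j)},
\]

where P(F | λ_i) is the likelihood of the observed fixation sequence under the HMM trained for task i; the inferred task is the one that maximizes this posterior.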
HMMs have been successfully applied to speech recognition (Rabiner, 1990), anomaly detection in video surveillance (Nair & Clark, 2002), and handwriting recognition (Hu, Brown, & Turin, 1996). In studies related to eye movement analysis, Salvucci and Goldberg (2000) used HMMs to break eye trajectories down into fixations and saccades. In another study, Salvucci and Anderson (2001) developed an HMM-based method for the automated analysis of eye movement trajectories in an equation-solving task. Simola, Salojärvi, and Kojo (2008) used HMMs to classify eye movements into three phases, each representing a processing state of the visual cognitive system during a reading task. Van Der Lans, Pieters, and Wedel (2008) analyzed the attentional processes underlying visual search. By training an HMM on a database of search trajectories, they showed that HMMs are a powerful tool for modeling the two cognitive processes underlying a search operation, namely localization of objects and identification of the target among the distractors.
These studies unanimously concur that eye movements can be well modeled as the outcome of a sequential, Markovian process that deploys visual resources over the field of view in the service of a visual task (see also Ellis & Stark, 1986; Pieters, Rosbergen, & Wedel, 1999; Stark & Ellis, 1981). They also model fixation positions as noisy outcomes of such a process, where the possible deviation of the eye position from the focus of attention maps to the disparity between observations and hidden states in the context of HMMs (see also Hayashi, Oman, & Zuschlag, 2003; Rimey & Brown, 1991). However, most of these studies consider only one specific task, such as reading a text or seeking a specific object among a group of distractors, and do not study the effect of the task on visual attention.
Studying the effect of the visual task on the parameters of a cognitive model of attention can help us build more accurate models that incorporate task-specific features of visual attention. Moreover, such task-dependent models can be used in a Bayesian inference framework to infer the ongoing task from the observed eye movements. To the best of our knowledge, task-dependent modeling of attention with HMMs, and its use for inferring the visual task, has not been investigated in the computational modeling literature.
To demonstrate our approach we ran two experiments. In the first, we infer the ongoing task in an easy search, where targets are defined by a single feature and can be spotted without many fixations landing on the distractors. The proposed model infers which objects are being sought, given the spatial locations of fixations in an eye trajectory and the stimulus on which the visual search was performed.
In the second experiment we extend our model to infer the task in a more difficult search. The application we designed for the difficult search is a soft keyboard, with which subjects type a word by directing their COG to characters appearing on a screen (eye typing). In this case, inferring the task is equivalent to recognizing the eye-typed word from the eye movements and the keyboard layout. To allow for the deployment of attention on nontarget objects, induced by the increased complexity of the task, we add a second state to the HMM structure that accounts for these off-target allocations of attention to distractors. The double-state structure of the HMM also enables us to capture the dynamics of state transitions, as well as the spatial information of fixations, highlighting the more frequently visited, attention-demanding locations that involve heavier interaction with working memory.
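To make the double-state idea concrete, the sketch below trains one two-state Gaussian HMM per candidate word on (x, y) fixation sequences and scores a new trajectory against each trained model. It is a minimal illustration under our own assumptions: the hmmlearn library, Gaussian emissions, and all variable names are ours, not the implementation used in the experiments.

```python
# Minimal sketch (not the authors' code): one two-state Gaussian HMM per
# candidate word, trained on (x, y) fixation sequences recorded while that
# word was eye-typed. Roughly, one hidden state captures on-target fixations
# and the other captures off-target fixations on distractor keys.
import numpy as np
from hmmlearn import hmm

def train_word_model(fixation_seqs, n_states=2, seed=0):
    """fixation_seqs: list of (T_i, 2) arrays of fixation coordinates."""
    X = np.concatenate(fixation_seqs)            # stack sequences for hmmlearn
    lengths = [len(s) for s in fixation_seqs]    # per-sequence lengths
    model = hmm.GaussianHMM(n_components=n_states, covariance_type="full",
                            n_iter=100, random_state=seed)
    model.fit(X, lengths)
    return model

def infer_word(models, priors, trajectory):
    """Pick the word whose HMM gives the highest posterior for `trajectory`."""
    log_post = {w: m.score(trajectory) + np.log(priors[w])  # log-likelihood + log prior
                for w, m in models.items()}
    return max(log_post, key=log_post.get)
```

Here a trajectory is simply a (T, 2) array of fixation coordinates, and the priors could, for example, be derived from word frequencies.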
In this experiment we also show how off-target fixations can help us infer the task, because they exhibit task-specific patterns that depend on the appearance of the target. Therefore, as an extension to our double-state model, we add an extra state to the structure of the HMM, forming a tri-state HMM, to separate allocations of the focus of attention (FOA) to similar-to-target distractors from those to dissimilar-to-target distractors. We compare the results of these models on a database of eye movements obtained with the soft keyboard application to see how introducing new states into the model affects task inference in a difficult search task.
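In terms of the hypothetical sketch above, the tri-state variant amounts to training the same per-word models with three hidden states, so that fixations on similar-to-target and dissimilar-to-target distractors can be captured by separate states; training_seqs_by_word below is a hypothetical dictionary mapping each candidate word to its training fixation sequences.

```python
# Hypothetical tri-state variant: same training routine, one extra hidden state
# intended to separate similar-to-target from dissimilar-to-target distractors.
tri_state_models = {word: train_word_model(seqs, n_states=3)
                    for word, seqs in training_seqs_by_word.items()}
```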