Abstract
Human visual attention is highly structured around gathering information relevant to the underlying goals or sub-goals one wishes to accomplish. Typically, this has been modelled either with qualitative top-down saliency models or with highly reductionist psychophysical experiments. Modelling the information-gathering process in these ways, however, often ignores the rich and complex repertoire of behaviour that makes up ecologically valid gaze. We propose a new way of analysing naturalistic data, suited to studying the temporal structure of visual attention during complex, freely moving tasks in real-world environments. To achieve this, we capture visual information from subjects performing an unconstrained task in the real world: in this case, cooking in a kitchen. We use eye-tracking glasses with a built-in scene camera (SMI ETG 2W @ 120 Hz) to record n=15 subjects setting up, cooking breakfast, and eating in a real-world kitchen. We process the visual data with a deep-learning-based pipeline (Auepanwiriyakul et al., 2018, ETRA) to obtain the stream of objects in the field of view. Eye tracking then gives us the sequence of objects on which subjects focus their overt attention throughout the task; we resolve ambiguities using pixel-level object segmentation and classification techniques. We analyse these sequences using hidden Markov models (HMMs) and context-free grammar induction (IGGI), revealing a potential hierarchical structure that is invariant across subjects. We compare this grammatical structure against a "ground truth", the WordNet lexical database of semantic relationships between objects (Miller et al., 1995), and find surprising similarities and counterintuitive differences between the attention-derived and text-based structures of objects, suggesting that the difference between how we look at tasks and how we verbally reason about them remains an open question in cognition and attention.
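As a minimal illustrative sketch (not the pipeline used in the study), the analysis step can be pictured as fitting a discrete-state model over the per-fixation object labels and then scoring neighbouring labels with WordNet similarity. The `gaze_labels` list, the choice of three hidden states, and the use of hmmlearn and NLTK in place of the IGGI grammar-induction model are all assumptions made purely for illustration.

```python
# Sketch only: hypothetical gaze-object labels and off-the-shelf libraries
# (hmmlearn, NLTK WordNet) standing in for the actual HMM/IGGI pipeline.
# Requires: pip install hmmlearn nltk  and  nltk.download('wordnet')
import numpy as np
from hmmlearn import hmm
from nltk.corpus import wordnet as wn

gaze_labels = ["kettle", "mug", "kettle", "toaster", "bread", "mug"]  # hypothetical

# 1) Fit a small discrete HMM over the symbol sequence of attended objects.
vocab = sorted(set(gaze_labels))
index = {w: i for i, w in enumerate(vocab)}
obs = np.array([[index[w]] for w in gaze_labels])
model = hmm.CategoricalHMM(n_components=3, n_iter=100, random_state=0)
model.fit(obs)
states = model.predict(obs)  # candidate latent "sub-goal" states per fixation

# 2) Score consecutive attended objects with WordNet path similarity,
#    the kind of lexical "ground truth" the abstract compares against.
def wordnet_similarity(a, b):
    sa, sb = wn.synsets(a, pos=wn.NOUN), wn.synsets(b, pos=wn.NOUN)
    if not sa or not sb:
        return None
    return max(s1.path_similarity(s2) or 0.0 for s1 in sa for s2 in sb)

for a, b in zip(gaze_labels, gaze_labels[1:]):
    print(a, b, wordnet_similarity(a, b))
```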