Vision Sciences Society Annual Meeting Abstract  |   December 2022
Gaze Grammars - Is there an invariant hierarchical sequential structure of human visual attention in natural tasks?
Author Affiliations & Notes
  • John Harston
    Brain and Behaviour Lab, Imperial College London
  • Roshan Chainani
    Brain and Behaviour Lab, Imperial College London
  • Aldo Faisal
    Brain and Behaviour Lab, Imperial College London
    UKRI Centre in AI for Healthcare, Imperial College London
    MRC London Institute of Medical Sciences
  • Footnotes
    Acknowledgements  EPSRC
Journal of Vision December 2022, Vol.22, 3894. doi:https://doi.org/10.1167/jov.22.14.3894
Abstract

Human visual attention is highly structured around gathering information relevant to the underlying goals or sub-goals one wishes to accomplish. Typically, this has been modelled either with qualitative top-down saliency models or with highly reductionist psychophysical experiments. Modelling the information-gathering process in these ways, however, often ignores the rich and complex repertoire of behaviours that make up ecologically valid gaze. We propose a new way of analysing natural data, suited to capturing the temporal structure of visual attention in complex tasks where subjects move freely. To achieve this, we capture visual information from subjects performing an unconstrained task in the real world: in this case, cooking in a kitchen. We use eye-tracking glasses with a built-in scene camera (SMI ETG 2W @ 120Hz) to record n=15 subjects setting up, cooking, and eating breakfast in a real-world kitchen. We process the visual data using a deep-learning-based pipeline (Auepanwiriyakul et al., 2018, ETRA) to obtain the stream of objects in the field of view. Eye tracking then gives us the sequence of objects on which subjects focus their overt attention throughout the task, with ambiguities resolved using pixel-level object segmentation and classification techniques. We analyse these sequences using hidden Markov models (HMMs) and context-free grammar induction (IGGI), revealing a putative hierarchical structure that is invariant across subjects. We compare this grammatical structure against a “ground truth”, the WordNet lexical database of semantic relationships between objects (Miller et al., 1995), and find some surprising similarities and counterintuitive differences between the attention-derived and text-based structure of objects, suggesting that the differences between how we look at tasks and how we verbally reason about them remain an open question of cognition and attention.
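To make the sequence representation concrete, the minimal Python sketch below estimates first-order transition probabilities between attended-object labels from a single gaze-labelled recording. This is only an illustration of the kind of input the analysis operates on, not the authors' HMM/IGGI pipeline; the object labels and the helper function name are hypothetical.

```python
from collections import Counter, defaultdict

def transition_probabilities(labels):
    """Estimate first-order transition probabilities between
    consecutive attended-object labels in a gaze sequence."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(labels, labels[1:]):
        counts[prev][nxt] += 1
    return {
        prev: {nxt: n / sum(nxts.values()) for nxt, n in nxts.items()}
        for prev, nxts in counts.items()
    }

# Hypothetical fixation-by-fixation object labels from one subject
gaze_sequence = ["kettle", "mug", "kettle", "mug", "spoon", "mug",
                 "pan", "hob", "pan", "egg", "pan", "plate"]

for obj, probs in transition_probabilities(gaze_sequence).items():
    print(obj, probs)
```

The hierarchical, grammar-like regularities described in the abstract go beyond such first-order statistics and require models like HMMs or grammar induction, but they take the same object-label sequence as input.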
