September 2015
Volume 15, Issue 12
Vision Sciences Society Annual Meeting Abstract  |   September 2015
How we look tells us what we do: Action recognition using human gaze
Author Affiliations
  • Kiwon Yun
    Department of Computer Science, Stony Brook University
  • Gary Ge
    Ward Melville High School
  • Dimitris Samaras
    Department of Computer Science, Stony Brook University
  • Gregory Zelinsky
    Department of Computer Science, Stony Brook University; Department of Psychology, Stony Brook University
Journal of Vision September 2015, Vol.15, 121. doi:
Kiwon Yun, Gary Ge, Dimitris Samaras, Gregory Zelinsky; How we look tells us what we do: Action recognition using human gaze. Journal of Vision 2015;15(12):121.

© ARVO (1962-2015); The Authors (2016-present)

Can a person’s interpretation of a scene, as reflected in their gaze patterns, be harnessed to recognize different classes of actions? Behavioral data were acquired from a previous study in which participants (n=8) viewed 500 images from the PASCAL VOC 2012 Actions image set. Each image was freely viewed for 3 seconds and was followed by a 10-alternative forced-choice (10-AFC) test in which the depicted human action had to be selected from among 10 action classes: walking, running, jumping, riding-horse, riding-bike, phoning, taking-photo, using-computer, reading, and playing-instrument. To quantify the spatio-temporal information in gaze, we labeled segments in each image (person, upper-body, lower-body, context) and derived gaze features, which included the number of transitions between segment pairs, the average and maximum of the fixation-density map per segment, the dwell time per segment, and a measure of when fixations were made on the person versus the context. For baseline comparison, we also derived purely visual features using a Convolutional Neural Network trained on fixed subregions of the persons. Three linear Support Vector Machine classifiers were trained: one using visual features alone, one using gaze features alone, and one using both in combination. Although average precision across the ten action categories was poor, the gaze classifier revealed four distinct, behaviorally meaningful subgroups (walking+running+jumping, riding-horse+riding-bike, phoning+taking-photo, and using-computer+reading+playing-instrument), where actions within each subgroup were highly confusable. Retraining the classifiers to discriminate between these four subgroups significantly improved performance for the gaze classifier, from 43.9% to 81.2% (for the phoning+taking-photo subgroup: gaze = 81.6%, vision = 65.4%). Moreover, the gaze+vision classifier outperformed both the gaze-alone and vision-alone classifiers, suggesting that gaze features and vision features each contribute to the classification decision. These results have implications for both behavioral and computer vision: gaze patterns can reveal how people group similar actions, which in turn can improve automated action recognition.
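
As an illustration only, two of the gaze features named above (dwell time per segment and number of transitions between segment pairs) and the four-subgroup relabeling could be sketched in Python as follows. This is not the authors' implementation. The input format (a time-ordered list of fixations, each already assigned a segment label and a duration in milliseconds), the function names, and the choice to count transitions without regard to direction are all assumptions. The fixation-density and timing features, the CNN features, and the SVM training are omitted.

```python
from collections import Counter
from itertools import combinations

# Segment labels as listed in the abstract.
SEGMENTS = ("person", "upper-body", "lower-body", "context")

# Four subgroups reported in the abstract, keyed by the original action class.
SUBGROUPS = {
    "walking": "walking+running+jumping",
    "running": "walking+running+jumping",
    "jumping": "walking+running+jumping",
    "riding-horse": "riding-horse+riding-bike",
    "riding-bike": "riding-horse+riding-bike",
    "phoning": "phoning+taking-photo",
    "taking-photo": "phoning+taking-photo",
    "using-computer": "using-computer+reading+playing-instrument",
    "reading": "using-computer+reading+playing-instrument",
    "playing-instrument": "using-computer+reading+playing-instrument",
}


def gaze_features(fixations):
    """Build a feature vector from time-ordered (segment, duration_ms) fixations.

    Hypothetical helper: returns 4 dwell times (one per segment) followed by
    6 transition counts (one per unordered segment pair).
    """
    # Dwell time: total fixation duration spent on each segment.
    dwell = {segment: 0.0 for segment in SEGMENTS}
    for segment, duration_ms in fixations:
        dwell[segment] += duration_ms

    # Transitions: consecutive fixations that land on different segments.
    # Direction is ignored here (an assumption, not stated in the abstract).
    transitions = Counter()
    for (prev_seg, _), (next_seg, _) in zip(fixations, fixations[1:]):
        if prev_seg != next_seg:
            transitions[frozenset((prev_seg, next_seg))] += 1

    pairs = list(combinations(SEGMENTS, 2))
    return [dwell[s] for s in SEGMENTS] + [transitions[frozenset(p)] for p in pairs]


def to_subgroup(action):
    """Map one of the 10 action classes to its 4-way subgroup label."""
    return SUBGROUPS[action]


# Made-up example scanpath for a single image.
example = [("upper-body", 200), ("context", 150), ("upper-body", 300), ("lower-body", 100)]
features = gaze_features(example)
label = to_subgroup("phoning")
```

Vectors of this kind, one per viewing of an image, are the sort of input a linear classifier could be trained on, first with the 10 action labels and then with the 4 subgroup labels.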

Meeting abstract presented at VSS 2015

