Kiwon Yun, Gary Ge, Dimitris Samaras, Gregory Zelinsky; How we look tells us what we do: Action recognition using human gaze. Journal of Vision 2015;15(12):121. doi: https://doi.org/10.1167/15.12.121.
© ARVO (1962-2015); The Authors (2016-present)
Can a person’s interpretation of a scene, as reflected in their gaze patterns, be harnessed to recognize different classes of actions? Behavioral data were acquired from a previous study in which participants (n=8) viewed 500 images from the PASCAL VOC 2012 Actions image set. Each image was freely viewed for 3 seconds and was followed by a 10-alternative forced-choice (10-AFC) test in which the depicted human action had to be selected from among 10 action classes: walking, running, jumping, riding-horse, riding-bike, phoning, taking-photo, using-computer, reading, and playing-instrument. To quantify the spatio-temporal information in gaze, we labeled segments in each image (person, upper-body, lower-body, context) and derived gaze features, which included the number of transitions between segment pairs, the average and maximum of the fixation-density map per segment, dwell time per segment, and a measure of when fixations were made on the person versus the context. For baseline comparison, we also derived purely visual features using a Convolutional Neural Network trained on fixed subregions of the persons. Three linear Support Vector Machine classifiers were trained: one using visual features alone, one using gaze features alone, and one using both in combination. Although average precision across the ten action categories was poor, the gaze classifier revealed four distinct, behaviorally meaningful subgroups (walking+running+jumping, riding-horse+riding-bike, phoning+taking-photo, and using-computer+reading+playing-instrument), where actions within each subgroup were highly confusable. Retraining the classifiers to discriminate between these four subgroups resulted in significantly improved performance for the gaze classifier, up from 43.9% to 81.2% (and, for the phoning+taking-photo subgroup, gaze = 81.6% versus vision = 65.4%). Moreover, the gaze+vision classifier outperformed both the gaze-alone and vision-alone classifiers, suggesting that gaze features and vision features each contribute to the classification decision. These results have implications for both behavioral and computer vision: gaze patterns can reveal how people group similar actions, which in turn can improve automated action recognition.
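Two of the gaze features named above (dwell time per segment and the number of transitions between segment pairs), together with the regrouping of the ten action classes into four subgroups, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The input format (a time-ordered list of `(segment, duration_ms)` fixations, each assigned to exactly one segment), the use of unordered segment pairs, and the names `gaze_features` and `SUBGROUPS` are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

# Segment labels as listed in the abstract.
SEGMENTS = ("person", "upper-body", "lower-body", "context")

# The four subgroups reported in the abstract, keyed by action class.
SUBGROUPS = {
    "walking": "walking+running+jumping",
    "running": "walking+running+jumping",
    "jumping": "walking+running+jumping",
    "riding-horse": "riding-horse+riding-bike",
    "riding-bike": "riding-horse+riding-bike",
    "phoning": "phoning+taking-photo",
    "taking-photo": "phoning+taking-photo",
    "using-computer": "using-computer+reading+playing-instrument",
    "reading": "using-computer+reading+playing-instrument",
    "playing-instrument": "using-computer+reading+playing-instrument",
}


def gaze_features(fixations):
    """Return a feature vector for one viewing of one image.

    `fixations` is a hypothetical input format: a time-ordered list of
    (segment, duration_ms) tuples. The vector holds the dwell time per
    segment followed by transition counts per unordered segment pair.
    """
    # Dwell time: total fixation duration accumulated on each segment.
    dwell = {segment: 0.0 for segment in SEGMENTS}
    for segment, duration_ms in fixations:
        dwell[segment] += duration_ms

    # Transitions: consecutive fixations that land on different segments.
    # Treating pairs as unordered is an assumption of this sketch.
    transitions = Counter()
    for (prev, _), (curr, _) in zip(fixations, fixations[1:]):
        if prev != curr:
            transitions[frozenset((prev, curr))] += 1

    pairs = [frozenset(pair) for pair in combinations(SEGMENTS, 2)]
    return [dwell[s] for s in SEGMENTS] + [transitions[p] for p in pairs]


# Made-up fixation sequence, for illustration only.
example = [
    ("context", 200),
    ("upper-body", 300),
    ("upper-body", 250),
    ("context", 100),
    ("lower-body", 150),
]
features = gaze_features(example)
subgroup_label = SUBGROUPS["phoning"]
```

Vectors of this kind, one per viewing, could be passed to a linear classifier with either the ten original labels or the four subgroup labels as targets. The fixation-density and timing features mentioned in the abstract are omitted here.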
Meeting abstract presented at VSS 2015