Vision Sciences Society Annual Meeting Abstract | August 2014
Does an interaction catch the eye? Decoding eye movements to predict scene understanding
Author Affiliations
  • Gregory Zelinsky
    Department of Psychology, Stony Brook University
  • Hossein Adeli
    Department of Psychology, Stony Brook University
Journal of Vision August 2014, Vol. 14, 763. doi: https://doi.org/10.1167/14.10.763
Abstract

Can eye movements made during scene viewing be decoded to predict how a scene will be understood? Participants (n=15) freely viewed a scene for 100 ms, 1000 ms, or 5000 ms, then freely described the scene they had just viewed. All 96 scenes depicted two people in various contexts, but were divided into interacting and non-interacting conditions depending on whether a given participant mentioned an interaction in their description. Scenes were manually segmented into objects, and fixations were associated with described segments. The probability of fixating an object given that it was described was .80 after 5000 ms of viewing, higher than the .52 probability after 1000 ms of viewing. There were no significant differences between the interacting and non-interacting conditions. The probability of describing an object given that it was fixated was lower (.58, averaged over conditions) and did not depend on interaction condition or viewing time. These patterns suggest that some objects must be fixated to be described, and that 1000 ms of viewing did not always provide this opportunity. The probability of mentioning an interaction was also lower with 100 ms of viewing, further suggesting a role for fixations in scene understanding. To explore whether fixation behavior contains sufficient information to predict whether a scene would be described as interacting, we derived 22 gaze features capturing the order in which key object types were fixated, and used these features to train an SVM classifier. Interaction classification was above chance (64%), indicating that this high-level aspect of scene understanding can be determined solely from fixation behavior. Further analysis revealed that fixations on a person, or between people, are predictive of an interaction description, whereas fixations on objects, or between objects and people, are predictive of a non-interaction description. Not only are eye movements important for achieving deeper levels of scene understanding; they can also be decoded to predict how a scene will be understood.

Meeting abstract presented at VSS 2014
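
For readers who want a concrete picture of the two analyses, the sketch below shows one way they could be implemented. Everything here is illustrative: the data are random placeholders, the definitions of the 22 gaze features are not given in the abstract, and scikit-learn's linear-kernel SVM with 5-fold cross-validation is an assumption, since the abstract does not specify the classifier's kernel, software, or validation scheme.

```python
# Minimal, hypothetical sketch of the two analyses described in the
# abstract. All data are randomly generated stand-ins.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# --- Analysis 1: conditional fixation/description probabilities ---
# One entry per (trial, object) pair: whether the object was fixated
# and whether it was mentioned in the scene description.
fixated = rng.integers(0, 2, size=500).astype(bool)    # placeholder data
described = rng.integers(0, 2, size=500).astype(bool)  # placeholder data

p_fix_given_desc = (fixated & described).sum() / described.sum()
p_desc_given_fix = (fixated & described).sum() / fixated.sum()
print(f"P(fixated | described) = {p_fix_given_desc:.2f}")
print(f"P(described | fixated) = {p_desc_given_fix:.2f}")

# --- Analysis 2: decoding interaction descriptions from gaze ---
# One row per trial: 22 gaze features capturing the order in which key
# object types (people, objects) were fixated. Feature definitions are
# assumed; the abstract only states that 22 such features were derived.
n_trials, n_features = 96, 22
X = rng.normal(size=(n_trials, n_features))   # stand-in gaze features
y = rng.integers(0, 2, size=n_trials)         # 1 = "interacting" description

# Standardize features, then fit a linear-kernel SVM (kernel assumed).
clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))

# Cross-validated accuracy; chance is ~50% for two balanced classes,
# and the abstract reports 64% classification accuracy.
scores = cross_val_score(clf, X, y, cv=5)
print(f"mean cross-validated accuracy: {scores.mean():.2f}")
```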
