Abstract
The study of gaze behavior benefits from the classification of the gaze time series into distinct movement types, or events, such as saccades, pursuits, and fixations. Because manual identification of events is time consuming and subjective, there is a need for automated classifiers. Although several solutions exist for classifying the eye-in-head signal, there are no established solutions for classifying the coordinated movements of the eyes and head that occur in less constrained contexts, for example, when wearing a virtual or augmented reality display. Our approach involves training various temporal classifiers on our new Gaze-in-Wild dataset, recorded from over 20 unrestrained participants and hand-coded by 5 practiced labellers. Participants were instrumented with a 6-axis 100 Hz inertial measurement unit (mean drift: 0.03 deg/sec), a 30 Hz ZED stereo camera, and a 120 Hz Pupil Labs eye tracker (mean calibration angular error < 1 deg within 10 deg of the calibration pattern center) to record eye and head orientation. The effort culminated in over 2 hours and 20 minutes of hand-labelled, head-free gaze behavior data, with approximately 20,000 detected fixational movements, 18,000 saccades, 1,400 pursuit events, and 4,000 blinks. We use these hand-labelled data to benchmark standard machine learning classifiers on our dataset and to train a recurrent network model that leverages multiple Neural Arithmetic Logic Units to classify gaze behavior directly from raw, unfiltered eye-in-head and head vectors. Activation maps of various hidden units provide insight into the learned representations of eye-head coordination and reveal velocity-based feature representations that are directly comparable to hand-crafted features. Evaluation with various event-based metrics shows that our classifier attains near-human-level classification performance (kappa > 0.70, event F1 > 0.85).
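For illustration, the following is a minimal sketch of a single Neural Arithmetic Logic Unit of the kind referenced above; the framework (PyTorch), layer sizes, and initialization are assumptions for this example and do not reproduce the exact configuration of our recurrent network.

import torch
import torch.nn as nn

class NALU(nn.Module):
    """Neural Arithmetic Logic Unit: a learned gate mixes an additive
    path with a multiplicative (log-space) path over the same weights."""
    def __init__(self, in_dim, out_dim, eps=1e-7):
        super().__init__()
        self.eps = eps
        self.W_hat = nn.Parameter(torch.randn(out_dim, in_dim) * 0.1)
        self.M_hat = nn.Parameter(torch.randn(out_dim, in_dim) * 0.1)
        self.G = nn.Parameter(torch.randn(out_dim, in_dim) * 0.1)

    def forward(self, x):
        # Accumulator weights, softly constrained toward {-1, 0, 1}
        W = torch.tanh(self.W_hat) * torch.sigmoid(self.M_hat)
        add = x @ W.t()                                        # additive path
        mul = torch.exp(torch.log(x.abs() + self.eps) @ W.t())  # multiplicative path
        g = torch.sigmoid(x @ self.G.t())                      # learned gate
        return g * add + (1 - g) * mul

# Hypothetical usage: map a 4-D input (e.g., eye-in-head and head
# velocity components) to 8 hidden units; sizes are illustrative only.
x = torch.randn(32, 4)
print(NALU(4, 8)(x).shape)  # torch.Size([32, 8])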
Acknowledgement: This work was supported by a Google Daydream grant.