Abstract
As human life integrates further with machines, there is a greater need for intuitive human-machine interfaces. Gaze has long been studied as a window into the human mind, with gaze-control interfaces serving to manipulate a variety of systems, from computers to drones. Current approaches, however, do not rely on natural gaze cues; instead, they use concepts such as dwell time or gaze cursors to capture the user's command while avoiding the Midas Touch problem. We present a deep learning approach to decoding human object-manipulation intention based solely on natural gaze cues. We ran data-collection experiments with healthy right-handed adults (n=15, 11 males, 4 females), who interacted with 6 different objects on a table when cued with tasks requiring either inspection or manipulation. Each participant completed a three-hour session, with breaks, during which they were asked to perform visuomotor tasks. In total, around 14,000 individual trials were completed and recorded. This led to a dataset of human motor and non-motor intentions coupled with high-frequency eye-movement data, in the context of a dining-table object-manipulation scenario. We modelled the task as a time-series classification problem and took inspiration from Natural Language Processing sentiment analysis models to design an architecture based on bidirectional LSTMs. Our model was trained and evaluated on our dataset using 5-fold cross-validation. Results show that we can decode human intention of manipulation, as opposed to inspection, solely from natural gaze data, with 78.5% average accuracy (1.64% standard deviation). This demonstrates the feasibility of natural gaze interfaces for human-machine interaction, particularly for robotic systems that seamlessly support their users with object manipulation in different settings, whether assistive, for patients with movement impairments, or collaborative, in industrial or service robots.
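To make the modelling choice concrete, the sketch below shows a minimal bidirectional-LSTM time-series classifier of the general kind the abstract describes: a gaze sequence is read in both directions and the final hidden states are mapped to two classes (inspection vs. manipulation). All layer sizes, the input feature layout, and the class names are illustrative assumptions, not the paper's exact architecture or hyperparameters.

```python
# Illustrative sketch only: a bidirectional-LSTM classifier for gaze sequences.
# Feature layout (2-D gaze coordinates per sample), hidden sizes, and sequence
# length are assumptions for illustration, not the authors' reported settings.
import torch
import torch.nn as nn


class GazeIntentClassifier(nn.Module):
    """Binary classifier: manipulation vs. inspection intent from a gaze sequence."""

    def __init__(self, n_features: int = 2, hidden_size: int = 64, num_layers: int = 2):
        super().__init__()
        self.lstm = nn.LSTM(
            input_size=n_features,   # e.g. (x, y) gaze coordinates per time step
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,
            bidirectional=True,      # read the gaze sequence forwards and backwards
        )
        # Concatenated forward and backward final hidden states -> 2 class logits
        self.head = nn.Linear(2 * hidden_size, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time_steps, n_features) high-frequency gaze samples
        _, (h_n, _) = self.lstm(x)
        # h_n: (num_layers * 2, batch, hidden_size); last layer's two directions
        h_fwd, h_bwd = h_n[-2], h_n[-1]
        return self.head(torch.cat([h_fwd, h_bwd], dim=1))  # logits: inspection vs. manipulation


if __name__ == "__main__":
    model = GazeIntentClassifier()
    dummy_gaze = torch.randn(8, 300, 2)  # batch of 8 trials, 300 gaze samples each
    logits = model(dummy_gaze)
    print(logits.shape)                  # torch.Size([8, 2])
```

In practice, such a model would be trained with a cross-entropy loss over the trial dataset and evaluated with 5-fold cross-validation, as described in the abstract.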