Abstract
Early qualitative eyetracking studies pioneered by Yarbus revealed strong inter-viewer consistency and task influence. Recent advances in eyetracking technology make it possible to investigate these effects in large-scale datasets acquired under specific task constraints, and to apply machine learning techniques to human visual saliency and eye movement prediction, with various practical applications. We present experimental and computational modeling work at the intersection of human visual attention and computer vision, with four components. First, we introduce three novel large-scale datasets (exceeding 1.7 million fixations) containing eye movement annotations for real-world images and video, collected under action and context recognition tasks. Second, we quantitatively analyze inter-subject consistency and task influence on visual patterns. We use clustering and Hidden Markov Models to automatically locate areas of interest from eyetracking data. Scanpaths are then treated as strings of characters and compared using string matching algorithms (Needleman-Wunsch). Our analysis reveals strong spatial and sequential inter-subject consistency. Third, we apply human eye movement data to computer-based action recognition. We compare computer vision spatio-temporal interest point sampling strategies to human fixations and show their potential for automatic visual recognition. Spatio-temporal Harris corners are poorly correlated with human fixations, whereas interest point operators derived from empirical saliency maps outperform existing computer vision operators for action recognition. We use SVMs together with Histogram of Oriented Gradients (HOG) features to train effective top-down models of human visual saliency, and use them within end-to-end bag-of-words action recognition systems to achieve state-of-the-art results on computer vision benchmarks. Fourth, for sequential eye movement prediction, we learn task-sensitive reward functions from eye movement data within inverse optimal control models, and find the resulting models superior to existing bottom-up scanpath predictors. We hope that our work will foster further communication, benchmarking, and methodology sharing between the human and computer vision communities.
Meeting abstract presented at VSS 2014
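As a rough illustration of the scanpath comparison described in the abstract, here is a minimal Needleman-Wunsch sketch. It assumes fixations have already been mapped to area-of-interest labels (one character per fixation); the scoring values and the example strings are illustrative placeholders, not the parameters used in the study.

import numpy as np

def needleman_wunsch(a: str, b: str, match: int = 1,
                     mismatch: int = -1, gap: int = -1) -> int:
    """Global alignment score between two scanpath strings, where each
    character stands for a fixation's area-of-interest label."""
    n, m = len(a), len(b)
    score = np.zeros((n + 1, m + 1), dtype=int)
    score[:, 0] = gap * np.arange(n + 1)   # aligning a prefix against gaps
    score[0, :] = gap * np.arange(m + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1, j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i, j] = max(diag, score[i - 1, j] + gap, score[i, j - 1] + gap)
    return int(score[n, m])

# Two viewers' scanpaths over hypothetical AOIs labeled A-D: a high score
# relative to the sequence lengths indicates sequential consistency.
print(needleman_wunsch("ABCDB", "ABDCB"))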
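The empirical saliency maps and saliency-derived interest points mentioned in the third component could be sketched as follows. The Gaussian bandwidth, local-maximum window, and function names are hypothetical choices for illustration, not the operators evaluated in the work.

import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def empirical_saliency(fixations, height, width, sigma=25.0):
    """Empirical saliency map: a fixation-count histogram smoothed with
    an isotropic Gaussian (sigma in pixels is an illustrative value)."""
    counts = np.zeros((height, width), dtype=float)
    for x, y in fixations:                # (x, y) pixel coordinates
        counts[int(y), int(x)] += 1.0
    sal = gaussian_filter(counts, sigma)
    return sal / (sal.max() + 1e-12)      # normalize to [0, 1]

def saliency_interest_points(sal, num_points=100, window=15):
    """Select local maxima of the saliency map as interest points, a
    simple stand-in for a saliency-derived interest point operator."""
    is_peak = (sal == maximum_filter(sal, size=window))
    ys, xs = np.nonzero(is_peak)
    order = np.argsort(sal[ys, xs])[::-1][:num_points]
    return list(zip(xs[order], ys[order]))

Sampling descriptors at such peaks, rather than at generic corner detections, is the intuition behind comparing saliency-derived operators with spatio-temporal Harris corners.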
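A schematic of the classifier-based top-down saliency component, assuming image patches already labeled as fixated or non-fixated. The HOG hyperparameters and the use of scikit-learn's LinearSVC are illustrative assumptions, not the study's exact pipeline.

import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def train_topdown_saliency(patches, labels):
    """Train a linear SVM to separate fixated from non-fixated patches
    using HOG descriptors; patches are 2D grayscale arrays and labels
    are 1 (fixated) or 0 (not fixated)."""
    X = np.array([hog(p, orientations=9, pixels_per_cell=(8, 8),
                      cells_per_block=(2, 2)) for p in patches])
    clf = LinearSVC(C=1.0)                # C is an illustrative setting
    clf.fit(X, labels)
    return clf

# At test time, clf.decision_function over densely sampled patches
# yields a task-specific (top-down) saliency score per location.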
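Finally, a highly simplified sketch of the feature-matching idea underlying inverse optimal control: recover a linear reward direction that separates expert (human) scanpaths from baseline scanpaths in feature space. This is a one-step schematic of that principle, not the authors' actual model.

import numpy as np

def feature_expectations(trajectories, phi, gamma=0.95):
    """Discounted feature expectations over state sequences; phi[s] is
    the feature vector of state s (e.g. an image region)."""
    mu = np.zeros(phi.shape[1])
    for traj in trajectories:
        for t, s in enumerate(traj):
            mu += (gamma ** t) * phi[s]
    return mu / len(trajectories)

def reward_direction(expert_trajs, baseline_trajs, phi):
    """Reward weights that most increase the value of expert scanpaths
    relative to a baseline: the core feature-matching step in IOC."""
    w = (feature_expectations(expert_trajs, phi)
         - feature_expectations(baseline_trajs, phi))
    return w / (np.linalg.norm(w) + 1e-12)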