Abstract
While previous research on eye fixation prediction has typically relied on integrating low-level features (e.g., color, edges) to form a saliency map, it has recently been found that the structural organization of these features into perceptual objects (proto-objects) can play a significant role, often a more important one than the low-level features themselves. In this work, we present a computational framework based on a deep network to demonstrate that proto-object representations can be learned naturally from low-resolution image patches taken from fixation regions. We advocate the use of low-resolution inputs for several reasons: (1) Stimuli that trigger eye movements usually fall in para-foveal or peripheral regions of the retina, which have lower resolution than the fovea. (2) People can perceive or recognize objects well even at low resolution. (3) Fixations on lower-resolution images can predict fixations on higher-resolution images. In the proposed computational model, we extracted multi-scale image patches at fixation regions from eye fixation datasets, resized them to low resolution, and fed them into a two-layer neural network. With layer-wise unsupervised feature learning, we found that many proto-object-like features, responsive to different shapes of object blobs, were learned in the second layer. Visualizations also show that these features are selective to potential objects in the scene, and that the responses of these features, when combined with learned weights, work well in predicting eye fixations on the images.
Meeting abstract presented at VSS 2015
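
The patch-preparation step described in the abstract (multi-scale patches at fixation regions, resized to low resolution) could be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function name, the patch scales, the 16x16 output size, the grayscale input, and the use of nearest-neighbour downsampling are all assumptions, since the abstract does not specify them.

```python
import numpy as np

def extract_low_res_patches(image, fixations, scales=(32, 64, 128), out_size=16):
    """Cut square patches centred on each fixation at several scales and
    shrink each one to out_size x out_size (hypothetical helper).

    image     : 2-D grayscale array (assumption: single channel)
    fixations : iterable of (row, col) pixel coordinates
    scales    : patch side lengths in pixels (assumed values)
    out_size  : side length of the low-resolution output (assumed value)
    """
    height, width = image.shape
    patches = []
    for row, col in fixations:
        for size in scales:
            half = size // 2
            top, left = row - half, col - half
            # Skip patches that would extend beyond the image border.
            if top < 0 or left < 0 or top + size > height or left + size > width:
                continue
            patch = image[top:top + size, left:left + size]
            # Nearest-neighbour downsampling: keep out_size evenly spaced
            # rows and columns of the patch.
            idx = (np.arange(out_size) * size) // out_size
            patches.append(patch[np.ix_(idx, idx)])
    if not patches:
        return np.empty((0, out_size * out_size))
    # One flattened low-resolution patch per row, ready to feed to a network.
    return np.stack(patches).reshape(len(patches), -1)
```

Each row of the returned array is one flattened low-resolution patch, which is the kind of input that would then be passed to the two-layer network; the unsupervised feature learning itself is not sketched here because the abstract does not name the algorithm used.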