Abstract
Most models of fixation prediction operate at the feature level, best exemplified by the Itti-Koch (I-K) saliency model. Others suggest that objects are more important (Einhäuser et al., 2008), but defining objects requires human annotation. We propose a computationally explicit middle ground: predicting fixations using a combination of saliency and mid-level representations of shape known as proto-objects (POs). For 384 real-world scenes, we computed an I-K saliency map and a proto-object segmentation, the latter using the model from Yu et al. (2014). We then averaged the saliency values internal to each PO to obtain a salience for each PO segment. The maximally salient PO determined the next fixation, with the specific (x, y) position being the saliency-weighted centroid of the PO's shape. To generate sequences of saccades, we inhibited fixated locations in the saliency map, as in the I-K model. We found that this PO-saliency model outperformed (p < .001) the I-K saliency model in predicting fixation-density maps obtained from 12 participants freely viewing the same 384 scenes (3 seconds each). Comparison to the GBVS saliency model showed a similarly significant benefit. We also manipulated the coarseness of the PO segmentation for each scene over five levels on a fixation-by-fixation basis, so that the first predicted fixation was based on the coarsest segmentation and the fifth predicted fixation on the finest. Doing this revealed considerable improvements relative to the other tested saliency models, largely because it captured a relationship between center bias and ordinal fixation position. Rather than being an ad hoc addition to a saliency model, a center bias falls out of our model via its coarse-to-fine segmentation of a scene over time (fixations). We conclude that fixations are best modeled at the level of proto-objects, which combine the benefit of objects with the computability of features.
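The fixation-selection loop described above (average the saliency within each PO, fixate the saliency-weighted centroid of the most salient PO, then inhibit the fixated location) might be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the saliency map and PO segmentation are taken as precomputed inputs, label 0 is assumed to mean "no proto-object", inhibition is assumed to zero a disc of radius `ior_radius` around each fixation, and the names `predict_fixations`, `mean_saliency`, and `ior_radius` are hypothetical. Plain nested lists are used to keep the sketch dependency-free.

```python
from collections import defaultdict


def mean_saliency(sal, pts):
    """Average saliency over the (x, y) pixels of one proto-object."""
    return sum(sal[y][x] for x, y in pts) / len(pts)


def predict_fixations(saliency, labels, n_fixations=5, ior_radius=2.0):
    """Return a list of predicted (x, y) fixations.

    saliency: 2D list of non-negative saliency values (rows of pixels).
    labels:   2D list of integer PO labels, same shape as saliency;
              0 is ASSUMED to mean "not part of any proto-object".
    """
    sal = [[float(v) for v in row] for row in saliency]  # working copy

    # Group pixel coordinates by proto-object label.
    pixels = defaultdict(list)
    for y, row in enumerate(labels):
        for x, po in enumerate(row):
            if po != 0:
                pixels[po].append((x, y))
    if not pixels:
        return []

    fixations = []
    for _ in range(n_fixations):
        # The PO with the highest mean saliency determines the next fixation.
        best = max(pixels, key=lambda po: mean_saliency(sal, pixels[po]))
        pts = pixels[best]

        # Fixation position: saliency-weighted centroid of the PO's shape.
        total = sum(sal[y][x] for x, y in pts)
        if total > 0:
            fx = sum(x * sal[y][x] for x, y in pts) / total
            fy = sum(y * sal[y][x] for x, y in pts) / total
        else:
            # Fully inhibited PO: fall back to the unweighted centroid.
            fx = sum(x for x, _ in pts) / len(pts)
            fy = sum(y for _, y in pts) / len(pts)
        fixations.append((fx, fy))

        # ASSUMED inhibition: zero the saliency in a disc around the fixation.
        for y, row in enumerate(sal):
            for x in range(len(row)):
                if (x - fx) ** 2 + (y - fy) ** 2 <= ior_radius ** 2:
                    row[x] = 0.0
    return fixations
```

In this sketch, a coarse-to-fine variant would supply a different `labels` map for each ordinal fixation (coarsest first) rather than a single segmentation; the form of inhibition and the number of fixations are free parameters here, not values reported in the abstract.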
Meeting abstract presented at VSS 2017