Abstract
Traditional saliency models predict fixations during scene viewing by computing local contrast between low-level color, intensity, and orientation features; the higher the summed contrast, the greater the predicted probability of fixation. Evidence also suggests that high-level properties of objects are predictive of fixation locations. We attempt fixation prediction from proto-objects, a mid-level representation existing between features and objects. Using our previously reported proto-object model (Yu et al., 2014, JoV), we segmented 384 images of real-world scenes into proto-objects (fragments of visual space) at multiple resolutions (feature-space bandwidths). From these segmentations we built a saliency map by computing feature contrast between each proto-object and its local neighbors using intensity, color, orientation, and now size and shape features. Center-surround size contrast was computed by comparing pixel area between a given proto-object and each "surrounding" neighbor. To compute shape contrast we first normalized a proto-object and a neighbor to have the same area, aligned them to maximize area overlap, and then computed the overlap as the intersection of the two areas divided by their union, with smaller overlap (averaged over neighbors) coding higher shape contrast. Computing these contrasts for each proto-object, then combining the contrast signals across features and resolutions, generates a proto-object saliency map, which we used to predict the fixation behavior of 12 participants who freely viewed the same scenes (each for 3 seconds) in anticipation of a memory test. We found that our proto-object saliency map predicted fixations as well as or better than an Itti-Koch saliency model, and was nearly as predictive as the upper limit defined by a Subject model obtained using the leave-one-out method. We conclude that size and shape features, quantified in terms of proto-objects, are used to guide overt visual attention, and that saliency-based models of fixation prediction need to recognize the importance of these mid-level visual features.
Meeting abstract presented at VSS 2016
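Below is a minimal sketch of the size- and shape-contrast computations described above, assuming proto-objects are represented as binary NumPy masks. The abstract does not specify the exact size-contrast formula or the alignment procedure, so this sketch assumes a relative area difference for size contrast and approximates "maximum area-overlap" alignment with centroid centering plus a small exhaustive shift search; all function names here are hypothetical, not the authors' code.

```python
import numpy as np

def size_contrast(po_mask, nb_mask):
    # Assumed formulation: relative difference in pixel area between a
    # proto-object and one neighbor (the abstract only says areas are
    # compared). Masks are assumed non-empty boolean arrays.
    a, b = po_mask.sum(), nb_mask.sum()
    return abs(int(a) - int(b)) / max(a, b)

def rescale_mask(mask, factor):
    # Nearest-neighbor rescale of a binary mask by a linear factor.
    h, w = mask.shape
    nh, nw = max(1, round(h * factor)), max(1, round(w * factor))
    rows = np.minimum((np.arange(nh) / factor).astype(int), h - 1)
    cols = np.minimum((np.arange(nw) / factor).astype(int), w - 1)
    return mask[np.ix_(rows, cols)]

def _centered_canvas(mask, size):
    # Place a mask on a square canvas with its centroid at the center.
    canvas = np.zeros((size, size), dtype=bool)
    ys, xs = np.nonzero(mask)
    oy = int(round(size / 2 - ys.mean()))
    ox = int(round(size / 2 - xs.mean()))
    canvas[ys + oy, xs + ox] = True
    return canvas

def shape_contrast(po_mask, nb_mask, search=3):
    # 1. Normalize: rescale the neighbor so its pixel area approximately
    #    matches the proto-object's (area scales with factor squared).
    factor = np.sqrt(po_mask.sum() / nb_mask.sum())
    nb = rescale_mask(nb_mask, factor)
    # 2. Align: centroid-center both masks, then search nearby shifts for
    #    the maximum overlap (a stand-in for "maximum area-overlap";
    #    padding by 2 * search prevents wraparound under np.roll).
    size = 2 * max(*po_mask.shape, *nb.shape) + 2 * search
    A = _centered_canvas(po_mask, size)
    B = _centered_canvas(nb, size)
    best_iou = 0.0
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            Bs = np.roll(B, (dy, dx), axis=(0, 1))
            iou = np.logical_and(A, Bs).sum() / np.logical_or(A, Bs).sum()
            best_iou = max(best_iou, iou)
    # 3. Smaller overlap codes higher shape contrast.
    return 1.0 - best_iou

# Toy example: a square proto-object vs. a roughly same-area disc neighbor.
po = np.zeros((40, 40), dtype=bool); po[10:30, 10:30] = True
yy, xx = np.ogrid[:40, :40]
nb = (yy - 20) ** 2 + (xx - 20) ** 2 <= 11 ** 2
print(size_contrast(po, nb), shape_contrast(po, nb))
```

Because shape contrast is one minus an intersection-over-union score, identically shaped regions score near 0 after area normalization and alignment, while disjoint or very dissimilar shapes approach 1, matching the abstract's convention that smaller overlap codes higher contrast.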