Abstract
"Meaning" has recently been added to the list of factors shown to attract fixations. We directly compare the predictive success of meaning maps to that of other factors known to affect fixation locations in visual search and free-viewing tasks, namely bottom-up saliency, center bias, and target features (for search). We add to this list a new "objectness" feature, and propose an image-computable method for obtaining scene objectness estimates using an image-segmentation model (Mask R-CNN). An obstacle to using the meaning map method is that the dataset for which meaning estimates are available contains only 40 images. To apply the method more broadly, we trained a dilated Inception network to predict meaningful regions in scene images (based on meaning labels from 30 images), and found an average Cross Correlation of 0.82 on the 10 withheld images. With this Deep Meaning model, we can obtain meaning maps for image datasets for which ground-truth meaning labels do not exist. We compared predictions (using NSS) from each factor to ground-truth fixations in COCO-Search18 and four free-viewing datasets: OSIE, MIT1003, the meaning-map dataset, and COCO-FreeView, a new dataset paralleling COCO-Search18. We also manipulated whether factor-unrelated processing (multiplicative center bias, histogram matching) was used in the priority computation and in the comparison to the ground-truth fixation-density maps. We found that the most predictive factor depended on the dataset and on the factor-unrelated processing used, which is undesirable. For example, objectness was most predictive without a multiplicative center bias, while meaning was most predictive when one was added. We observed similar differences across free-viewing datasets. For search, target features dominated all other factors in predicting target-present search, and meaning best predicted target-absent search.
Our findings underscore the importance of reporting modeling results for multiple datasets, and the need for transparent discussion of how predictive success depends on factor-unrelated processing.
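
To make the evaluation concrete, the following is a minimal sketch of how an NSS score might be computed for a priority map, and of how a multiplicative center bias can change that score. It is an illustration under stated assumptions, not the pipeline used in this work: the function names, the isotropic Gaussian form of the center bias, and the `sigma_frac` parameter are all hypothetical, and the toy map is invented for the example.

```python
import math
from statistics import mean, pstdev


def nss(priority_map, fixations):
    """Normalized Scanpath Saliency: mean z-scored map value at fixated pixels.

    priority_map: 2D list of floats (rows x columns).
    fixations: list of (row, col) pixel coordinates.
    """
    values = [v for row in priority_map for v in row]
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return 0.0  # a flat map carries no information about fixations
    return mean((priority_map[r][c] - mu) / sigma for r, c in fixations)


def apply_center_bias(priority_map, sigma_frac=0.25):
    """Multiply the map by an isotropic Gaussian centered on the image.

    ASSUMPTION: the Gaussian form and sigma_frac (width as a fraction of the
    longer image side) are illustrative choices, not values from the paper.
    """
    h, w = len(priority_map), len(priority_map[0])
    cy, cx = (h - 1) / 2, (w - 1) / 2
    sigma = sigma_frac * max(h, w)
    return [
        [
            priority_map[r][c]
            * math.exp(-((r - cy) ** 2 + (c - cx) ** 2) / (2 * sigma ** 2))
            for c in range(w)
        ]
        for r in range(h)
    ]


# Toy 5x5 map with two equally strong peaks: one in a corner, one at the center.
toy_map = [[0.0] * 5 for _ in range(5)]
toy_map[0][0] = 1.0
toy_map[2][2] = 1.0
center_fixations = [(2, 2)]

raw_score = nss(toy_map, center_fixations)
biased_score = nss(apply_center_bias(toy_map), center_fixations)
```

In this toy case the centered fixation scores higher once the center bias is applied, because the competing corner peak is attenuated while the central peak is preserved. This is the sense in which factor-unrelated processing can change which factor appears most predictive.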