Abstract
Predicting free-viewing fixation locations has a long history in both vision science and computer vision. Recent high-performing models are deep-learning based, trained on an eye movement dataset such as MIT1003 and subsequently evaluated on benchmarks such as the MIT/Tuebingen Saliency Benchmark, which assess model performance on one or multiple datasets. An important challenge that has so far been addressed only marginally is the need for saliency models to generalize across different domains, correctly predicting fixation densities for any image and recording setup. In this work, we combine a substantial range of eye movement datasets, including MIT1003, CAT2000, COCO Freeview, FIGRIM, NUSEF, OSIE and others, into a large-scale compound dataset that we envision growing further over time, aiming for maximal size and diversity. On this dataset, we train a fixation prediction model that is an extended and improved variant of DeepGaze IIE, combining multiple pretrained deep backbones in a joint readout architecture. After training on all or a subset of these datasets, the model is evaluated on the validation splits of all datasets. Our best model improves on the state of the art by a significant margin on many commonly used benchmark datasets, including MIT300, CAT2000 and COCO Freeview. Our modeling paradigm allows us to assess the degree to which gaze patterns from one dataset generalize to other datasets, whether using multiple datasets creates synergy effects due to the larger diversity of the data, and where different datasets show conflicting patterns. For example, we find that different datasets require different rescalings of local priority values, in a way that is partially, but not fully, explained by different presentation times. Such analyses hint at underlying mechanisms that need to be understood and incorporated in order to build fixation prediction models that are reliably applicable in diverse contexts.
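To make the two architectural ideas in the abstract concrete, the following is a minimal sketch, not the authors' implementation: several frozen pretrained backbones feed a shared readout that predicts a fixation log-density, and each dataset receives its own learned rescaling of the local priority values before normalization. All module names, channel sizes, and the specific readout layers here are illustrative assumptions.

```python
# Minimal sketch of a joint readout over multiple frozen backbones with a
# per-dataset rescaling of priority values. Hypothetical names and sizes;
# a center-bias term, as used in DeepGaze-style models, is omitted for brevity.
import torch
import torch.nn as nn
import torch.nn.functional as F


class JointReadoutModel(nn.Module):
    def __init__(self, backbones, feature_channels, num_datasets):
        super().__init__()
        # Pretrained backbones are kept frozen; only the readout is trained.
        self.backbones = nn.ModuleList(backbones)
        for backbone in self.backbones:
            for p in backbone.parameters():
                p.requires_grad = False
        # Joint readout: 1x1 convolutions over the concatenated feature maps.
        self.readout = nn.Sequential(
            nn.Conv2d(sum(feature_channels), 64, kernel_size=1),
            nn.Softplus(),
            nn.Conv2d(64, 1, kernel_size=1),
        )
        # One learned log-scale per dataset (initialized to 0, i.e. scale 1),
        # stretching or compressing local priority values per recording setup.
        self.log_scale = nn.Parameter(torch.zeros(num_datasets))

    def forward(self, image, dataset_index):
        # Extract features from all backbones and align them spatially.
        feats = [backbone(image) for backbone in self.backbones]
        size = feats[0].shape[-2:]
        feats = [F.interpolate(f, size=size, mode="bilinear") for f in feats]
        priority = self.readout(torch.cat(feats, dim=1))  # (B, 1, H, W)
        # Dataset-specific rescaling, then normalize to a log fixation
        # density over the image via a spatial softmax.
        scale = self.log_scale[dataset_index].exp().view(-1, 1, 1, 1)
        priority = priority * scale
        b, _, h, w = priority.shape
        log_density = F.log_softmax(priority.view(b, -1), dim=-1)
        return log_density.view(b, 1, h, w)
```

In a training setup along these lines, the readout and the per-dataset scales would be fit by maximizing the log-likelihood of recorded fixation locations under the predicted density; the learned scales then expose how strongly priority values must be rescaled for each dataset, the effect the abstract attributes partly to differing presentation times.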