Abstract
Recently, computational models of saliency have achieved large breakthroughs in performance on standard saliency benchmarks (saliency.mit.edu). The top performers are artificial neural networks that consistently outperform traditional models, and some evaluation metrics have begun to saturate. This motivates the following questions: Have saliency models begun to converge on human performance? Where are people looking that saliency models are not? Using a collection of natural images from the MIT Saliency Benchmark with ground-truth eye movements, we aggregated fixations on each image to create fixation maps. By thresholding these fixation maps, we obtained a set of highly fixated image regions. We asked participants to label these regions (in the context of the full images) using one of two tasks: (1) selecting the tags that describe each region (e.g., face, text, animal, background), or (2) answering a set of binary questions about the region (e.g., "are any of the people in the image looking at something inside the highlighted region?"). We used Amazon's Mechanical Turk crowdsourcing platform to collect a fixed number of labels per region. To provide the first direct quantitative analysis of model mistakes, we evaluated whether saliency models made correct predictions in each type of region. Our analysis revealed that the best neural network models miss the same types of regions, including objects of gaze, locations of implied action or motion, important text regions, and unusual elements. Together, these region types account for up to 60% of the remaining model errors. To continue to approach human-level performance, we argue that saliency models will need to reason about the relative importance of image regions, such as focusing on the most important person in the room or the most informative sign on the road. Tracking performance on saliency benchmarks more accurately will require finer-grained evaluations and metrics, and pushing performance further will require higher-level image understanding.
Meeting abstract presented at VSS 2017
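To make the region-extraction and evaluation procedure described above concrete, the following is a minimal sketch (not the authors' actual pipeline) of how per-image fixations could be aggregated into a fixation map, thresholded into highly fixated regions, and compared against a model's saliency map. The function names and parameter values (e.g., the Gaussian blur `sigma` and the `threshold_pct` percentile) are illustrative assumptions, not details taken from the benchmark.

```python
# Illustrative sketch: aggregate fixations into a fixation map, threshold it
# into highly fixated regions, and measure how much saliency a model assigns
# to each region. Parameter values below are assumptions for demonstration.
import numpy as np
from scipy.ndimage import gaussian_filter, label

def fixation_map(fixations, image_shape, sigma=25):
    """Blur discrete fixation locations into a continuous fixation map.

    fixations: iterable of (row, col) gaze positions pooled across observers.
    sigma: Gaussian blur in pixels (assumed value, roughly one degree of visual angle).
    """
    fmap = np.zeros(image_shape, dtype=float)
    for r, c in fixations:
        fmap[int(r), int(c)] += 1.0
    fmap = gaussian_filter(fmap, sigma)
    return fmap / fmap.max() if fmap.max() > 0 else fmap

def highly_fixated_regions(fmap, threshold_pct=95):
    """Threshold the fixation map at a high percentile and return the
    connected components as candidate regions (label array, region count)."""
    mask = fmap >= np.percentile(fmap, threshold_pct)
    return label(mask)

def region_saliency_scores(saliency_map, labels, num_regions):
    """Mean model saliency inside each highly fixated region; a low score
    suggests the model under-predicts (misses) that region."""
    return [saliency_map[labels == i].mean() for i in range(1, num_regions + 1)]

# Usage with synthetic data in place of real fixations and model output:
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    H, W = 240, 320
    fixations = rng.integers(0, [H, W], size=(200, 2))   # stand-in observer fixations
    fmap = fixation_map(fixations, (H, W))
    labels, n = highly_fixated_regions(fmap)
    model_saliency = rng.random((H, W))                  # stand-in for a model's saliency map
    print(region_saliency_scores(model_saliency, labels, n))
```

In this sketch, regions with low mean model saliency would be the candidates handed to crowdworkers for labeling (face, text, object of gaze, etc.), which is how region types that models systematically miss could be tallied.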