September 2017
Volume 17, Issue 10
Open Access
Vision Sciences Society Annual Meeting Abstract | August 2017
Towards cognitive saliency: narrowing the gap to human performance
Author Affiliations
  • Adria Recasens
    Computer Science and Artificial Intelligence Lab, MIT
  • Zoya Bylinskii
    Computer Science and Artificial Intelligence Lab, MIT
  • Ali Borji
    Center for Research in Computer Vision, UCF
  • Fredo Durand
    Computer Science and Artificial Intelligence Lab, MIT
  • Antonio Torralba
    Computer Science and Artificial Intelligence Lab, MIT
  • Aude Oliva
    Computer Science and Artificial Intelligence Lab, MIT
Journal of Vision August 2017, Vol. 17, 542. https://doi.org/10.1167/17.10.542
Abstract

Recently, computational models of saliency have achieved large breakthroughs in performance on standard saliency benchmarks (saliency.mit.edu). The top performers are artificial neural networks, and they consistently outperform traditional models. Some evaluation metrics have begun to saturate. This motivates the following questions: Have saliency models begun to converge on human performance? Where are people looking that saliency models are not? Using a collection of natural images from the MIT saliency benchmark with ground-truth eye movements, we aggregated fixations on each image to create fixation maps. By thresholding these fixation maps, we obtained a set of highly-fixated image regions. We asked participants to label these regions (in the context of the full images) using one of two tasks: (1) selecting the tags that describe each region (e.g., face, text, animal, background, etc.), or (2) answering a set of binary questions about the region (e.g., "are any of the people in the image looking at something inside the highlighted region?"). We used Amazon's Mechanical Turk crowdsourcing platform to collect a fixed set of labels per region. To provide the first direct quantitative analysis of model mistakes, we evaluated whether saliency models made predictions in each type of region. Our analysis revealed that the best neural network models miss the same types of regions, including objects of gaze, locations of implied action or motion, important text regions, and unusual elements. We quantified up to 60% of the remaining model errors. To continue to approach human-level performance, we argue that saliency models will need to reason about the relative importance of image regions, such as focusing on the most important person in the room or the most informative sign on the road. More accurately tracking performance on saliency benchmarks will require finer-grained evaluations and metrics. Pushing performance further will require higher-level image understanding.

Meeting abstract presented at VSS 2017
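
To make the region-extraction step described in the abstract concrete, below is a minimal sketch (not the authors' actual pipeline) of how aggregated fixations can be turned into a blurred fixation map, thresholded into highly-fixated regions, and compared against a model's saliency output. The image size, blur width, threshold percentile, and random stand-in data are illustrative assumptions, not values from the study.

```python
# Sketch only: build a fixation map from aggregated fixations, threshold it
# into highly-fixated regions, and measure how much of a model's saliency
# mass falls inside each region. All parameters are illustrative assumptions.
import numpy as np
from scipy.ndimage import gaussian_filter, label

def fixation_map(fixations, height, width, sigma=25):
    """Aggregate (x, y) fixations into a blurred, normalized fixation map."""
    counts = np.zeros((height, width))
    for x, y in fixations:
        counts[int(y), int(x)] += 1
    blurred = gaussian_filter(counts, sigma=sigma)
    return blurred / blurred.max()

def highly_fixated_regions(fix_map, percentile=90):
    """Threshold the fixation map and return connected high-fixation regions."""
    mask = fix_map >= np.percentile(fix_map, percentile)
    labeled, n_regions = label(mask)
    return [labeled == i for i in range(1, n_regions + 1)]

def saliency_mass_in_region(saliency, region_mask):
    """Fraction of the model's total saliency that falls inside one region."""
    return saliency[region_mask].sum() / saliency.sum()

# Toy usage with random data standing in for real fixations and model output.
rng = np.random.default_rng(0)
H, W = 480, 640
fixations = rng.uniform(low=0, high=(W - 1, H - 1), size=(200, 2))
fmap = fixation_map(fixations, H, W)
model_saliency = gaussian_filter(rng.random((H, W)), sigma=15)

for k, region in enumerate(highly_fixated_regions(fmap)):
    frac = saliency_mass_in_region(model_saliency, region)
    print(f"region {k}: {frac:.1%} of model saliency mass")
```

In the study itself, the thresholded regions were additionally labeled by crowdworkers (tags and binary questions) so that model misses could be tallied per region type; the sketch covers only the map-building, thresholding, and coverage-checking steps.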
