Vision Sciences Society Annual Meeting Abstract  |  August 2023
Volume 23, Issue 9
Open Access
Benchmarking Human Mid-Level Scene Understanding
Author Affiliations & Notes
  • Yoni Friedman
    MIT
  • Thomas O'Connell
    MIT
  • Daniel Bear
    Stanford
  • Jiajun Wu
    Stanford
  • Judy Fan
    Stanford
    UCSD
  • Josh Tenenbaum
    MIT
  • Dan Yamins
    Stanford
  • Footnotes
    Acknowledgements  This work was supported by an ONR MURI grant: "Compositional Scene Understanding with Self-Supervised Object-Centric Dorso-Ventral Neural Networks" (00010803, PO #BB01540322)
Journal of Vision August 2023, Vol.23, 5798. doi:https://doi.org/10.1167/jov.23.9.5798
      Yoni Friedman, Thomas O'Connell, Daniel Bear, Jiajun Wu, Judy Fan, Josh Tenenbaum, Dan Yamins; Benchmarking Human Mid-Level Scene Understanding. Journal of Vision 2023;23(9):5798. https://doi.org/10.1167/jov.23.9.5798.

      © ARVO (1962-2015); The Authors (2016-present)

Abstract

People can perceive 3D structure in arbitrary scenes, inferring geometry, semantic information, and more. Large-scale vision benchmarks have been instrumental in driving the development of computer vision models. Previous benchmarking work targeting 3D scene understanding collected sparse annotations of segmentation, depth, and surface normals for a limited set of domains. However, these datasets lack sufficient measurement resolution to ensure statistical reliability, limiting the interpretability of model-to-human comparisons. Additionally, these efforts focused on natural images without ground-truth scene geometry, making it difficult to assess human accuracy. Here, we collected a large benchmark dataset measuring human performance on several mid-level scene understanding tasks, including segmentation (n=342), relative depth (n=342), and surface normals (n=335), across a variety of synthetic and natural images. Images were sourced from synthetic indoor-scene datasets, a novel Gestalt-inspired dataset we developed, and the Natural Scenes Dataset (Allen, 2021). In the segmentation and depth tasks, two dots were placed on an image, and participants judged whether the dots lay on the same or different objects, or which dot was closer to the camera. In the surface normal task, participants oriented an arrow with a circular base to match the orientation of a visible surface in the image. In contrast to previous work, which focused on natural scene images, the inclusion of several synthetic datasets permits precise comparison of human responses to ground truth. We also collected responses from multiple participants per point in each image, allowing us to obtain precise estimates of inter-annotator reliability. We found that people are both accurate and reliable when making segmentation judgments, and are reliable, though less accurate, when making judgments about relative depth or surface orientation. We envision this dataset as a resource to drive further developments in computational models of visual scene understanding.
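To make the two-dot task structure concrete, the sketch below shows one way such judgments could be scored against ground-truth instance and depth maps, and how per-point inter-annotator reliability might be estimated via split-half agreement. This is a minimal illustration only; the array names (instance_map, depth_map), probe format, and reliability estimator are assumptions for the example, not the authors' actual analysis pipeline.

# Illustrative scoring for the two-dot segmentation and relative-depth tasks.
# All names, shapes, and response formats here are assumptions for the sketch.
import numpy as np

def score_segmentation(instance_map, probes, responses):
    # instance_map: (H, W) int array of ground-truth object instance IDs
    # probes: list of ((y1, x1), (y2, x2)) dot locations
    # responses: list of bools, True = participant judged "same object"
    correct = []
    for ((y1, x1), (y2, x2)), resp in zip(probes, responses):
        same_gt = instance_map[y1, x1] == instance_map[y2, x2]
        correct.append(resp == same_gt)
    return float(np.mean(correct))

def score_relative_depth(depth_map, probes, responses):
    # depth_map: (H, W) float array; smaller values are closer to the camera
    # responses: list of 1 or 2, indicating which dot was judged closer
    correct = []
    for ((y1, x1), (y2, x2)), resp in zip(probes, responses):
        closer_gt = 1 if depth_map[y1, x1] < depth_map[y2, x2] else 2
        correct.append(resp == closer_gt)
    return float(np.mean(correct))

def split_half_reliability(judgments, n_splits=1000, seed=0):
    # judgments: (n_participants,) binary responses for a single dot pair;
    # agreement is measured between majority votes of random halves
    rng = np.random.default_rng(seed)
    judgments = np.asarray(judgments)
    n = len(judgments)
    agree = []
    for _ in range(n_splits):
        perm = rng.permutation(n)
        a, b = judgments[perm[:n // 2]], judgments[perm[n // 2:]]
        agree.append(float((a.mean() > 0.5) == (b.mean() > 0.5)))
    return float(np.mean(agree))

Collecting many responses per dot pair, as described above, is what makes a split-half estimate like this meaningful: with only one annotator per point, reliability cannot be separated from accuracy.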
