Abstract
People can perceive 3D structure in arbitrary scenes, inferring geometry, semantic information, and more. Large-scale vision benchmarks have been instrumental in driving the development of computer vision models. Previous benchmarking work targeting 3D scene understanding collected sparse annotations of segmentation, depth, and surface normals for a limited set of domains. However, these datasets lack the measurement resolution needed to ensure statistical reliability, limiting the interpretability of model-to-human comparisons. Additionally, these efforts focused on natural images without ground-truth scene geometry, making it difficult to assess human accuracy. Here, we collected a large benchmark dataset measuring human performance on several mid-level scene understanding tasks – segmentation (n=342), relative depth (n=342), and surface normals (n=335) – across a variety of synthetic and natural images. Images were sourced from synthetic indoor-scene datasets, a novel Gestalt-inspired dataset we developed, and the Natural Scenes Dataset (Allen et al., 2021). In the segmentation and depth tasks, two dots were placed on an image and participants judged whether the dots lay on the same or different objects (segmentation), or which dot was closer to the camera (depth). In the surface normal task, participants oriented an arrow with a circular base to match the orientation of a visible surface in the image. In contrast to previous work, which focused on natural scene images, the inclusion of several synthetic datasets permits precise comparison of human responses to ground truth. We also collected responses from multiple participants for each probed point in each image, allowing us to obtain precise estimates of inter-annotator reliability. We found that people are both accurate and reliable when making segmentation judgments, and are reliable, though less accurate, when making judgments about relative depth or surface orientation. We envision this dataset as a resource to drive further developments in computational models of visual scene understanding.
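As a minimal sketch of how the binary two-dot judgments described above could be scored, the example below computes majority-vote accuracy against ground truth and a split-half estimate of inter-annotator agreement. The array shapes, variable names, and the use of random placeholder data are assumptions for illustration; the paper's actual analysis pipeline is not specified in this abstract.

```python
import numpy as np

# Hypothetical response matrix: rows = probed point pairs, columns = participants.
# Each entry is a binary judgment (e.g., 1 = "same object", 0 = "different objects").
rng = np.random.default_rng(0)
responses = rng.integers(0, 2, size=(200, 30)).astype(float)  # placeholder data
ground_truth = rng.integers(0, 2, size=200)                   # placeholder labels

# Accuracy of the majority vote across participants at each probed point pair.
majority = (responses.mean(axis=1) >= 0.5).astype(int)
accuracy = np.mean(majority == ground_truth)

# Split-half reliability: agreement between majority votes of two random
# halves of the participant pool, as one way to quantify inter-annotator agreement.
perm = rng.permutation(responses.shape[1])
half_a, half_b = perm[: len(perm) // 2], perm[len(perm) // 2:]
vote_a = responses[:, half_a].mean(axis=1) >= 0.5
vote_b = responses[:, half_b].mean(axis=1) >= 0.5
reliability = np.mean(vote_a == vote_b)

print(f"majority-vote accuracy: {accuracy:.3f}")
print(f"split-half agreement:   {reliability:.3f}")
```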