Vision Sciences Society Annual Meeting Abstract | September 2019
Journal of Vision, Volume 19, Issue 10 | Open Access
Monocular depth discrimination in natural scenes: Humans vs. deep networks
Author Affiliations & Notes
  • Kedarnath Vilankar
    Centre for Vision Research, York University
  • Hengchao Xiang
    Centre for Vision Research, York University
  • Krista Ehinger
    Centre for Vision Research, York University
  • Wendy Adams
    Psychology Department, University of Southampton
  • Erich Graf
    Psychology Department, University of Southampton
  • James Elder
    Centre for Vision Research, York University
Journal of Vision September 2019, Vol.19, 176. doi:

      Kedarnath Vilankar, Hengchao Xiang, Krista Ehinger, Wendy Adams, Erich Graf, James Elder; Monocular depth discrimination in natural scenes: Humans vs. deep networks. Journal of Vision 2019;19(10):176.

      © ARVO (1962-2015); The Authors (2016-present)


Objective. Humans use a number of monocular cues to estimate depth, but little is known about how accuracy varies with depth, or how human performance compares with recent deep network models for monocular depth estimation. Here we measure and compare monocular depth acuity for humans and deep network models.

Methods. Stimuli were drawn from natural outdoor scenes in the SYNS database of spherical imagery with registered ground-truth range data. From each spherical image we extracted 62 × 49 deg sub-images sampled at regular intervals along the horizon. Four observers viewed randomly selected images monocularly. Two points in each image were indicated by coloured crosshairs, and observers judged which was closer. The difference in depth was varied to sweep out psychometric functions at four mean depths. Four deep network models were run on the same task.

Results. Absolute JNDs increased with mean depth faster than a Weber law for both humans and most models, possibly due to the increased foreshortening of the ground surface with depth. Humans outperformed the deep network models, but a kernel regression model that uses only the elevation angle (height in the image) outperformed both at nearer depths, and both humans and the networks struggled when the two points were constrained to have the same elevation. This suggests that both humans and deep networks rely largely upon this simple elevation cue, although superior human performance at greater depths indicates that humans can recruit additional image cues. While luminance, colour and spatial frequency cues were all correlated with depth, most of this variance is shared with elevation, and adding these cues to the kernel regression model failed to improve its performance.

Conclusions. While human monocular depth acuity surpasses current state-of-the-art deep networks, both appear to rely heavily upon gaze elevation to estimate depth.
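The elevation-cue baseline described above can be illustrated with a minimal Nadaraya-Watson kernel regression that predicts depth from elevation angle alone. This is a sketch under stated assumptions, not the authors' actual model: the Gaussian kernel, the bandwidth, the eye height, and the flat-ground geometry (depth = h / tan(theta) for a point seen theta degrees below the horizon) are all illustrative choices.

```python
import numpy as np

def kernel_regression(elev_train, depth_train, elev_query, bandwidth=2.0):
    """Nadaraya-Watson kernel regression: predict depth from elevation angle.

    A Gaussian kernel weights training samples by their proximity (in
    degrees of elevation) to each query point. The kernel and bandwidth
    are illustrative assumptions, not the authors' parameters.
    """
    # Pairwise elevation differences between query and training points (deg)
    diff = elev_query[:, None] - elev_train[None, :]
    w = np.exp(-0.5 * (diff / bandwidth) ** 2)   # Gaussian weights
    # Weighted average of training depths at each query elevation
    return (w * depth_train[None, :]).sum(axis=1) / w.sum(axis=1)

# Synthetic ground-plane geometry: for a viewpoint at height h above a
# flat ground plane, a point seen theta degrees below the horizon lies at
# distance h / tan(theta) -- the foreshortening that compresses far depths.
h = 1.6                                  # assumed eye height in metres
theta = np.linspace(1.0, 30.0, 200)      # degrees below the horizon
depth = h / np.tan(np.radians(theta))

pred = kernel_regression(theta, depth, np.array([5.0, 20.0]))
print(pred)  # the point nearer the horizon (5 deg) is predicted farther away
```

Because the ground-plane mapping from elevation to depth flattens rapidly near the horizon, equal angular differences correspond to ever larger depth differences at greater distances, which is consistent with the faster-than-Weber growth of JNDs reported above.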

Acknowledgement: Vision: Science to Applications (VISTA), and Intelligent Systems for Sustainable Urban Mobility (ISSUM) 
