Abstract
Objective. Humans use a number of monocular cues to estimate depth, but little is known about how accuracy varies with depth or how human performance compares with that of recent deep network models for monocular depth estimation. Here we measure and compare monocular depth acuity for humans and deep network models. Methods. Stimuli were drawn from natural outdoor scenes in the SYNS database of spherical imagery with registered ground truth range data. From each spherical image we extracted 62×49 deg sub-images sampled at regular intervals along the horizon. Four observers viewed randomly selected images monocularly. Two points on the image were indicated by coloured crosshairs; observers were asked to judge which was closer. The difference in depth was varied to sweep out psychometric functions at four mean depths. Four deep network models were run on the same task. Results. Absolute JNDs were found to increase with mean depth faster than a Weber law for humans and most models, possibly due to the increased foreshortening of the ground surface with depth. While humans outperformed deep network models, a kernel regression model that used only the elevation angle (height in the image) outperformed both for nearer depths, and we found that both humans and the networks struggled when the two points were fixed to have the same elevation. This suggests that both humans and deep networks may rely largely upon this simple elevation cue, although superior human performance at greater depths indicates that humans can recruit additional image cues. While luminance, colour and spatial frequency cues were all correlated with depth, most of the variance was shared with elevation, and adding these cues to the kernel regression model failed to improve its performance. Conclusions. While human monocular depth acuity surpasses current state-of-the-art deep networks, both appear to rely heavily upon gaze elevation to estimate depth.
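As an illustration of the elevation-only baseline described in the Results, the following is a minimal sketch of a kernel regression from elevation angle to depth, applied to the two-point "which is closer" judgement. The abstract does not specify the kernel, bandwidth, or training procedure, so everything here is an assumption: a Nadaraya-Watson estimator with a Gaussian kernel, regression on log depth, a 1 deg bandwidth, and synthetic flat-ground training data standing in for SYNS. The function names `nw_predict` and `closer_point` are hypothetical.

```python
import numpy as np

def nw_predict(query_elev, train_elev, train_depth, bandwidth=1.0):
    """Nadaraya-Watson estimate of log depth at each query elevation (deg).

    Assumed form: Gaussian kernel, regression on log depth.
    """
    q = np.atleast_1d(np.asarray(query_elev, dtype=float))
    # Gaussian kernel weights between every query and every training elevation
    diff = (q[:, None] - train_elev[None, :]) / bandwidth
    w = np.exp(-0.5 * diff ** 2)
    # Kernel-weighted average of log depth for each query
    return (w @ np.log(train_depth)) / w.sum(axis=1)

def closer_point(elev_a, elev_b, train_elev, train_depth, bandwidth=1.0):
    """Return 'a' if point a is predicted to be strictly closer, else 'b'."""
    pred = nw_predict([elev_a, elev_b], train_elev, train_depth, bandwidth)
    return "a" if pred[0] < pred[1] else "b"

# Synthetic training data (NOT SYNS): a flat ground plane viewed from an
# assumed eye height of 1.6 m, with multiplicative depth noise.
rng = np.random.default_rng(0)
eye_height = 1.6
train_elev = rng.uniform(-30.0, -2.0, size=2000)  # degrees; negative = below horizon
train_depth = eye_height / np.tan(np.radians(-train_elev))
train_depth = train_depth * np.exp(rng.normal(0.0, 0.1, size=train_elev.size))

# A point lower in the image (-20 deg) versus one nearer the horizon (-5 deg)
choice = closer_point(-20.0, -5.0, train_elev, train_depth)
```

In this sketch the prediction depends only on elevation, so two points at the same elevation receive identical predictions and the model has no basis for a judgement, which is the condition the equal-elevation manipulation is designed to probe.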
Acknowledgement: Vision: Science to Applications (VISTA), and Intelligent Systems for Sustainable Urban Mobility (ISSUM)