Abstract
With only a glance at a novel scene, we can recognize its meaning and estimate its mean volume. Here, we studied how the perception of depth layout in natural scenes unfolds within this glance: how does three-dimensional content emerge from the two-dimensional visual image? One hypothesis is that depth layout is constructed locally: points that are close together in the two-dimensional image will be easier to distinguish in depth than points separated by a larger pixel distance. An alternative hypothesis is that depth layout is constructed over the global scene: points lying on the foreground and background surfaces will be distinguishable in depth earlier than surfaces at intermediate distances, independent of their proximity in the two-dimensional image. We superimposed two colored target dots on gray-level pictures of natural scenes, and participants reported which dot lay on the shallower surface. The locations of the two dots were pre-cued, and the scene image was displayed for durations ranging from 40 to 240 ms and then masked. The results suggest that depth information becomes available in a coarse-to-fine, scene-based representation: when the two targets had the greatest depth disparity in the scene (irrespective of their pixel distance), participants accurately selected the closer surface at shorter presentation times than when the two surfaces were closer together in depth. These data support the hypothesis that the representation of depth available at a glance is based on rapidly computed global depth information rather than on local, image-based information.