Abstract
Visual spatial perception depends on integrating multiple sources of information over time to form a stable percept. The abundance of potential cues and interactions makes data-driven investigations difficult, but here we leverage natural scene image databases paired with a novel approach to crowd-sourced data collection. A series of fully preregistered studies reveals a remarkable level of systematicity in spatial perception that generalizes strongly across tasks and participants. In Experiment 1, participants (N = 192) reported distances to a target embedded in 192 scene images, each presented once at one of four durations (250, 500, 750, or 1000 ms). Distance estimates for individual scenes showed increasing sensitivity to target distance at longer viewing durations (F = 15.32, p < 10^-9). Critically, even when the effect of actual distance was regressed out, the per-scene residuals were significant and replicated consistently across durations (R2 ~ .56, p ~ 10^-35), despite no overlap in participants. In Experiment 2, new participants viewed two different images in sequence, drawn from the same set of 192 and containing targets at varying distances, and reported which target was closer. Discrimination performance was strongly predicted by signal detection analyses applied to the verbal estimates from Experiment 1 for the particular stimuli in each discrimination (R2 = .57, p < 10^-14). This generalization across tasks and participants indicates that stimulus-based cues consistently drive 3D spatial perception in the domain of picture perception. In further experiments, we examine the impact of potential cues (e.g., familiar size, ground plane), whose effects generalize across verbal judgments and perceptual discrimination of depth and height. Overall, these results suggest a consistent and strong relationship between verbal estimation and discrimination, one that enables an explicit model relating visual cues to 3D picture perception across participants and tasks.
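To make the cross-task linkage concrete, the following is a minimal illustrative sketch (not the authors' analysis code) of how a signal-detection model could turn per-scene verbal distance estimates from Experiment 1 into predicted two-alternative discrimination accuracy for Experiment 2; all data structures, parameter values, and the simulated inputs are assumptions for demonstration only.

```python
# Illustrative sketch: predicting 2AFC distance discrimination from verbal
# estimates via a standard signal-detection (equal-variance Gaussian) model.
# All inputs here are simulated; the real study used empirical estimates.
import numpy as np
from scipy.stats import norm, linregress

rng = np.random.default_rng(0)

# Assumed per-scene summaries of Experiment 1 verbal estimates (simulated):
n_scenes = 192
true_dist = rng.uniform(2, 30, n_scenes)                  # hypothetical target distances (m)
est_mean = 0.9 * true_dist + rng.normal(0, 2, n_scenes)   # mean verbal estimate per scene
est_sd = np.full(n_scenes, 3.0)                           # estimate variability per scene

# Scene pairs, as in the sequential discrimination task (self-pairs removed)
pairs = rng.choice(n_scenes, size=(300, 2), replace=True)
pairs = pairs[pairs[:, 0] != pairs[:, 1]]
i, j = pairs[:, 0], pairs[:, 1]

# Signal-detection prediction: probability that the sampled estimate for the
# farther scene exceeds the estimate for the nearer scene (proportion correct)
delta = np.abs(est_mean[i] - est_mean[j])
pooled_sd = np.sqrt(est_sd[i] ** 2 + est_sd[j] ** 2)
predicted_correct = norm.cdf(delta / pooled_sd)

# Simulated "observed" Experiment 2 accuracy for the same pairs
observed_correct = np.clip(predicted_correct + rng.normal(0, 0.05, len(delta)), 0, 1)

# How well do the Experiment 1-based predictions explain discrimination?
fit = linregress(predicted_correct, observed_correct)
print(f"R^2 = {fit.rvalue ** 2:.2f}, p = {fit.pvalue:.2g}")
```

The key design choice in this sketch is the 2AFC formula P(correct) = Phi(|mu_1 - mu_2| / sqrt(sd_1^2 + sd_2^2)), which treats each scene's verbal estimates as a Gaussian internal distance signal; the reported cross-task R2 in the abstract corresponds to how well such stimulus-specific predictions track observed discrimination.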