Abstract
Visual systems estimate the three-dimensional (3D) structure of the environment from two-dimensional (2D) retinal images. To improve accuracy, visual systems use multiple sources of information. Here, we examine how the human visual system uses prior information about the world to improve the estimation of 3D surface tilt. We analyzed the statistics of 3D tilt in natural scenes using a large stereo-image database with co-registered distance information at each pixel, and we found a systematic pattern governing how tilts are spatially related in natural scenes. We then designed a hierarchical model that pools local tilt estimates in accordance with these scene statistics. The model first computes a Bayes-optimal local estimate from three image cues (i.e., luminance, texture, and disparity). The model then computes a “global” estimate by pooling the local estimates within a neighborhood centered on the target location; the orientation and aspect ratio of each pooling neighborhood were dictated by the natural scene statistics. We evaluated how accurately the model estimated groundtruth tilt in natural scenes and how accurately it predicted human performance. Human performance was measured in a psychophysical experiment. Observers viewed natural scenes through a stereoscopically defined circular aperture 3 deg in diameter and estimated the surface tilt at the center of the patch via a mouse-controlled probe. Four human observers participated in two experiments; each experiment contained 3600 unique stimuli. We found that the global model provides more accurate estimates of groundtruth tilt and better predictions of human performance than the local model. We also found that the pooling neighborhood areas that maximized estimation accuracy were very similar to the pooling neighborhood areas that best predicted human performance.
Taken together, these results suggest that the human visual system integrates local estimates in accordance with the statistics of surface tilt in natural scenes.