Abstract
The derivation of three-dimensional (3D) shape from retinal projections is a classic problem of psychology and computer vision. Central to the study of this problem is the issue of how the visual system combines the different two-dimensional (2D) depth cues present in the retinal images (e.g., stereo, motion, shading). The most recent theory of depth-cue integration postulates that the human visual system combines independent estimates of 3D shape arising from separate depth modules through an optimal Bayesian estimator (Landy et al., 1995). This approach, however, has overlooked the remarkable isomorphic relations that exist in real-world situations among 2D depth cues. In fact, in any 2D projection of the natural environment, the signals specified by different depth cues (for example, stereo and motion) are necessarily related. Here we propose a new theory of depth-cue integration based on the assumption that the visual system does make use of these natural co-variations. Specifically, we argue that: (1) the visual system reduces the dimensionality of the stimulus space defined by multiple signals in order to capture their natural co-variations, and (2) 3D properties are derived from this lower-dimensional space. New results from several experiments on stereo-motion and stereo-shading integration confirm these predictions; current models of cue integration cannot predict these results. The general framework that we propose is consistent both with the findings unaccounted for by the modified-weak-fusion model (Landy et al., 1998) and with previous findings reported in the depth-cue combination literature.
Supported by NSF grant 0078441
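As a minimal illustration of prediction (1) above — not the authors' model — the co-variation of two noisy cue signals driven by a common depth source can be captured by a standard dimensionality-reduction step such as principal component analysis. All signal names, gains, and noise levels in this sketch are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example: one underlying depth value drives two 2D cues
# (a stereo signal and a motion signal), each observed with noise.
depth = rng.normal(size=1000)
stereo = 1.0 * depth + 0.1 * rng.normal(size=1000)
motion = 0.8 * depth + 0.1 * rng.normal(size=1000)

# Stack the cue signals into a 2D stimulus space and center it.
X = np.column_stack([stereo, motion])
X -= X.mean(axis=0)

# PCA via eigendecomposition of the covariance matrix
# (np.linalg.eigh returns eigenvalues in ascending order).
cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# Because both cues co-vary (driven by the same depth source), the
# leading component captures nearly all the variance: the effective
# dimensionality of the cue space is close to one.
explained = eigvals[-1] / eigvals.sum()
print(f"variance explained by first component: {explained:.3f}")
```

With these assumed parameters, the first component accounts for well over 90% of the variance, so a 3D estimate read out from that single axis would use almost all of the information in the two-cue stimulus space.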