Estimating the 3D shape of our surroundings is essential for many everyday behaviors. A smooth surface can be closely approximated over a small neighborhood of any point by a plane. Thus, the most local and fundamental measure of shape is local surface orientation. Local surface orientation is often specified in terms of slant and tilt (Stevens, 1983). Slant is the angle between the surface normal (the unit vector perpendicular to the surface) and the line of sight; equivalently, it is the angle by which the surface is rotated away from the frontoparallel plane (Figure 1A). Tilt is the orientation of the vector formed by projecting the surface normal onto the frontoparallel plane (Figure 1B).
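To make the geometry concrete, here is a minimal sketch (using our own illustrative convention, in which the line of sight is the z-axis and the frontoparallel plane is the x-y plane) of computing slant and tilt from a surface normal:

```python
import numpy as np

def slant_tilt(normal):
    """Slant and tilt (degrees) of a surface normal, assuming the line of
    sight is the +z axis and the frontoparallel plane is the x-y plane."""
    nx, ny, nz = normal / np.linalg.norm(normal)
    slant = np.degrees(np.arccos(abs(nz)))         # 0 deg = frontoparallel
    tilt = np.degrees(np.arctan2(ny, nx)) % 360.0  # direction of projected normal
    return slant, tilt

# A surface rotated 40 deg away from frontoparallel about a vertical axis:
n = np.array([np.sin(np.radians(40.0)), 0.0, np.cos(np.radians(40.0))])
print(slant_tilt(n))  # -> (40.0, 0.0)
```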
A common view of 3D shape perception is that it begins with the estimation of local slants and tilts, which are then integrated into a representation of the 3D shape. Thus, not surprisingly, a large number of studies have been directed at measuring and understanding the perception of 3D slant and tilt (e.g., see Howard & Rogers, 2012). Here, we focus on the perception of the 3D slant of planar surfaces.
Under natural conditions (without head or scene movement), the image information available for 3D slant estimation typically consists of the binocular cue of disparity (the differences between the images formed in the two eyes) together with various monocular cues (e.g. linear perspective). The primary goal of the current study was to measure slant discrimination under naturalistic conditions and to compare human performance in our task with that of an ideal observer, and several subideal observers, for slant estimation from binocular-disparity cues.
A number of studies have measured slant-discrimination performance from binocular disparity using sparse random dot stereograms (
Hibbard, Bradshaw, Langley, & Rogers, 2002;
Knill & Saunders, 2003;
Hillis, Watt, Landy, & Banks, 2004;
Girshick & Banks, 2009;
Burge, Girshick, & Banks, 2010). In most of these studies, the stimuli were presented in two temporal intervals. In natural viewing, however, it is probably more typical for humans to compare the 3D orientations of densely textured surfaces located within the same scene at different distances (
Burge, McCann, & Geisler, 2016;
Kim & Burge, 2018;
Kim & Burge, 2020). Here, we measured slant-discrimination performance for surfaces textured with naturalistic noise (see Figure 2 in Methods). The two planar surfaces were presented in a single-interval task, in which the smaller test surface was located in front of a surrounding reference surface by a distance that varied randomly by a small amount from trial to trial. The stimuli were accurately rendered and hence contained both monocular and binocular cues to surface orientation. To reduce the usefulness of the monocular cues, the texture contained few regularities, and the shape (i.e., silhouette) of the test surface was jittered (see Methods). A control experiment confirmed that the performance of our subjects was completely dominated by the binocular cues (see Results), which allowed us to focus on models of slant discrimination from binocular disparity. Finally, to limit performance, and to allow principled comparison of human and model observers, we added a different (uncorrelated) sample of white noise to the test region in each eye's image.
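As an illustration of this last manipulation (the image arrays and noise level here are hypothetical, not the actual stimulus parameters), adding an independent white-noise sample to each eye's image might look like:

```python
import numpy as np

rng = np.random.default_rng(1)

def add_binocular_noise(img_left, img_right, noise_rms):
    """Add independent (uncorrelated) zero-mean white noise to each eye's
    image; noise_rms sets the RMS contrast of the noise. Illustrative only."""
    noisy_left = img_left + rng.normal(0.0, noise_rms, img_left.shape)
    noisy_right = img_right + rng.normal(0.0, noise_rms, img_right.shape)
    return noisy_left, noisy_right
```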
The modeling begins with the derivation of an approximate Bayesian ideal observer for slant discrimination of planar surfaces from binocular disparity. Ideal-observer models reveal the fundamental computational principles of the task, set a proper benchmark to compare with human performance, and can be used to evaluate the effectiveness of heuristic (suboptimal) mechanisms (
Green & Swets, 1966;
Geisler, 2011;
Burge, 2020).
In the Bayesian framework, it is convenient to divide the problem of estimating slant from binocular disparity into two problems (e.g.
Marr & Poggio, 1979). The first is the “correspondence problem”: determining which points in the left image and which points in the right image correspond to the same points in the 3D scene. Here, we define the “disparity” between the two images as the transformation that maps one image into the other. It is important to note that there are multiple ways to describe the same transformation. For example, a disparity might be described most compactly as a global transformation with just a few parameters, but it can also be described as a list of the vertical and horizontal translations needed to align each point in one image with the corresponding point in the other image. Solving the correspondence problem can be difficult because of false matches and partial occlusions.
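To illustrate the point about multiple descriptions of the same disparity: for a planar surface, the left-to-right image mapping can be written compactly as a single homography (a global transformation with a handful of parameters) or, equivalently, expanded into a dense list of per-point translations. A minimal sketch under textbook pinhole-camera assumptions (the variable names and camera model are ours, not the paper's notation):

```python
import numpy as np

def plane_homography(K, baseline, n, d):
    """Global description: the homography induced by the plane n . X = d
    (n = unit normal, d = distance) between two pinhole cameras with
    identity rotation, translation `baseline`, and shared intrinsics K."""
    t = np.asarray(baseline, dtype=float).reshape(3, 1)
    n = np.asarray(n, dtype=float).reshape(1, 3)
    return K @ (np.eye(3) - t @ n / d) @ np.linalg.inv(K)

def displacement_field(H, width, height):
    """Equivalent local description: the (dx, dy) translation of every pixel."""
    xs, ys = np.meshgrid(np.arange(width), np.arange(height))
    pts = np.stack([xs, ys, np.ones_like(xs)], axis=-1) @ H.T
    mapped = pts[..., :2] / pts[..., 2:]            # perspective divide
    return mapped - np.stack([xs, ys], axis=-1)     # per-pixel translations
```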
The second problem is to translate the estimated disparity, or disparities, into an estimate of the 3D surface orientation. Solving this problem requires knowing or estimating the pose of the eyes (e.g. vergence and version), which may be estimated from image cues, oculomotor cues, head orientation cues, or some combination of these cues.
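As a toy illustration of this second step, consider the simplest textbook case of parallel viewing axes and a known interocular separation (a simplification of the eye-pose model used later): a horizontal location disparity then implies a distance, and the change of disparity across the surface implies its slant.

```python
def depth_from_disparity(disparity_px, focal_px, interocular_m):
    """Distance implied by a horizontal location disparity under the
    textbook parallel-axis approximation (illustrative values below)."""
    return focal_px * interocular_m / disparity_px

# Nearby points with different disparities lie at different depths; the
# resulting depth gradient across the surface determines the 3D slant.
print(depth_from_disparity(12.0, 1200.0, 0.065))  # -> 6.5 (meters)
```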
The ideal and subideal observers described here assume that the pose of the eyes is known, and hence the current focus is more on the correspondence problem. This focus differs from that of the related human vision literature, which assumes that the correspondence problem has been solved and focuses instead on the estimation of surface orientation from cues in the matched binocular images (such as the horizontal and vertical size ratios and disparity gradients), and from other eye-pose cues (e.g., see
Backus, Banks, van Ee, & Crowell, 1999).
For every possible slant and distance of the surface, the ideal observer computes the predicted image in one eye given the image observed in the other eye and the rules of backward- and forward-projection (see Methods). The estimated slant and distance are the pair that gives the smallest prediction error. We call this optimal model of slant and distance estimation the “planar matching” (PM) model. This observer is optimal because it uses all of the available geometric information, given that the surfaces are planar. By generating predictions via backward- and forward-projection for every possible distance and slant, the PM observer considers exactly the set of differences that can exist between the left and right images of a given planar surface. It then picks the distance and slant that best explain the difference between the two images, and thus it simultaneously solves the correspondence and slant-estimation problems with the estimation of just two global parameters.
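In outline, the PM computation is a brute-force search over these two global parameters; a minimal sketch (predict_other_eye is a hypothetical stand-in for the backward- and forward-projection step described in Methods):

```python
import numpy as np

def pm_estimate(img_left, img_right, candidate_slants, candidate_distances,
                predict_other_eye):
    """Planar matching (PM), sketched: exhaustively search (slant, distance)
    pairs and keep the pair whose predicted right-eye image best matches
    the observed one. `predict_other_eye` is a hypothetical function
    implementing backward- and forward-projection."""
    best_pair, best_err = None, np.inf
    for slant in candidate_slants:
        for dist in candidate_distances:
            predicted = predict_other_eye(img_left, slant, dist)
            err = np.mean((predicted - img_right) ** 2)  # prediction error
            if err < best_err:
                best_pair, best_err = (slant, dist), err
    return best_pair
```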
Although, for simplicity, we assume that eye pose is known and that the model observers compute absolute slant and distance, we argue later that there are nearly equivalent model observers that compute relative slant and are robust to modest uncertainty in eye pose (see Methods and Discussion).
In general, human performance deviates from optimal performance. A principled approach for generating plausible suboptimal models is to replace one or more of the optimal computations with simpler, more biologically plausible computations, to incorporate sources of internal noise, and/or to incorporate other plausible biological limitations (e.g., response nonlinearities or foveation).
One simplifying and more biologically plausible computation is to perform PM locally to solve the local correspondence problem, and then combine the local slant and distance estimates over the whole test region (
Jones & Malik, 1992;
Super & Klarquist, 1997; see also
Wildes, 1991). We will call this the “local planar matching” (LPM) model. The LPM model represents the correspondence between local image regions in the two eyes as a “structural disparity”: a difference in both the location and the spatial pattern of the corresponding regions in the two eyes (location disparities and pattern disparities). To be clear, in this paper we regard structural disparities as the result of an initial binocular matching process, not as second-level cues (like horizontal and vertical size ratios) computed after the correspondence problem has been solved.
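Continuing the sketch above (again with hypothetical helpers), LPM runs the same search within small patches and then pools the local estimates over the test region:

```python
def lpm_estimate(patches_left, patches_right, candidate_slants,
                 candidate_distances, predict_other_eye):
    """Local planar matching (LPM), sketched: run the PM search on each
    local patch pair, then combine the local (slant, distance) estimates
    over the test region (here by simple averaging; the actual pooling
    rule is described in Methods)."""
    local = [pm_estimate(pl, pr, candidate_slants, candidate_distances,
                         predict_other_eye)
             for pl, pr in zip(patches_left, patches_right)]
    slants, dists = zip(*local)
    return sum(slants) / len(slants), sum(dists) / len(dists)
```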
A further simplification is to solve the local correspondence problem by performing “local frontoparallel matching” (LFM), which is essentially equivalent to standard cross-correlation (
Tyler & Julesz, 1978;
Cormack, Stevenson, & Schor, 1991;
Banks, Gepshtein, & Landy, 2004). The LFM model represents the correspondence between local image regions in the two eyes as only a “location disparity”—the difference in the location of the corresponding regions in the two eyes. This assumption is made by most models of human stereo vision. The estimated surface slant is then computed by combining the distances specified by the location disparities. Formally, this LFM model is a special case of the LPM model where the local slant is assumed to be zero (see Methods). Models based on LFM have been successful in accounting for many aspects of human disparity discrimination (
Tyler & Julesz, 1978;
Cormack et al., 1991;
Banks et al., 2004;
Filippini & Banks, 2009), and in explaining the response properties of disparity-selective neurons in visual cortex (
Ohzawa, DeAngelis, & Freeman, 1990;
DeAngelis, Ohzawa, & Freeman, 1991;
Ohzawa, DeAngelis, & Freeman, 1997;
Cumming & DeAngelis, 2001).
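A minimal sketch of the LFM computation for a single patch (window size and shift range are illustrative; this is essentially a normalized cross-correlation):

```python
import numpy as np

def lfm_location_disparity(patch_left, strip_right, max_shift):
    """Local frontoparallel matching (LFM), sketched: slide a left-eye patch
    horizontally along the corresponding right-eye strip (which extends
    max_shift pixels beyond the patch on each side) and return the shift
    (location disparity) with the highest normalized cross-correlation."""
    _, w = patch_left.shape
    pl = patch_left - patch_left.mean()
    best_shift, best_corr = 0, -np.inf
    for shift in range(-max_shift, max_shift + 1):
        window = strip_right[:, max_shift + shift : max_shift + shift + w]
        pr = window - window.mean()
        corr = (pl * pr).sum() / (np.linalg.norm(pl) * np.linalg.norm(pr) + 1e-12)
        if corr > best_corr:
            best_shift, best_corr = shift, corr
    return best_shift
```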
It has long been known that introducing an orientation or scale difference between the left and right images can produce vivid perceptions of surface slant (
Wheatstone, 1838;
Ogle, 1938;
Blakemore, 1970a). These results seem to suggest that local structural disparities are directly exploited by the visual system to estimate 3D surface orientation. However, as mentioned above, these structural disparities can also be described in terms of the location disparities between corresponding points in the two images. Thus, it has been difficult to rule out the hypothesis that location disparities are computed first and only later combined to determine the 3D surface orientation (
Fiorentini & Maffei, 1971;
Wilson, 1976;
Mitchison & McKee, 1990;
Cagenello & Rogers, 1993;
Halpern, Wilson, & Blake, 1996;
Greenwald & Knill, 2009). One aim of the current study was to discriminate between these hypotheses.
In the current study, we measured slant-discrimination thresholds for the human and model observers as a function of the reference slant and the contrast of the uncorrelated white-noise samples added to each eye's image. As expected, we found that the PM model had the lowest thresholds, followed by the LPM model, the LFM model, and finally the human subjects. All three models capture the qualitative trends in the human thresholds, but none provides good quantitative predictions of the trends, even when their average sensitivities (d′ values) are scaled by an arbitrary efficiency parameter. However, if we include another plausible factor, a fixed level of internal estimation noise, then all three models make good quantitative predictions. Although the LPM observer does not predict the pattern of human thresholds significantly better than the LFM observer, its absolute performance is substantially better and more robust across analysis patch size (e.g., receptive-field size); thus, there may have been evolutionary pressure to incorporate similar structural-disparity computations into the early visual system. We also measured depth discrimination with the same stimuli and found a trend for human observers to be more efficient (relative to ideal) at slant discrimination than at depth discrimination.
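To make these two correction factors concrete (the names and formulation here are our own sketch, not necessarily the paper's exact model): scaling a model's d′ by an efficiency factor is, when d′ is approximately linear in the slant difference, equivalent to dividing its threshold by that factor, and a fixed, independent internal estimation noise then adds to the threshold in quadrature.

```python
import numpy as np

def predicted_threshold(model_threshold, efficiency, internal_noise_sd):
    """Sketch: human threshold predicted from a model observer's threshold,
    an efficiency scale factor, and a fixed internal estimation noise that
    adds in quadrature. Illustrative, not the paper's exact formulation."""
    scaled = np.asarray(model_threshold) / efficiency
    return np.sqrt(scaled ** 2 + internal_noise_sd ** 2)
```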