We investigated the ability to use linear perspective to perceive depth from monocular images. Specifically, we focused on the information provided by convergence of parallel lines in an image due to perspective projection. Our stimuli were trapezoid-shaped projected contours, which appear as rectangles slanted in depth. If converging edges of a contour are assumed to be parallel edges of a 3D object, then it is possible in principle to recover its 3D orientation and relative dimensions. This 3D interpretation depends on projected size; hence, if an image contour were scaled, accurate use of perspective predicts changes in perceived slant and shape. We tested this prediction and measured the accuracy and precision with which observers can judge depth from perspective alone. Observers viewed monocular images of slanted rectangles and judged whether the rectangles appeared longer versus wider than a square. The projected contours had varying widths (7, 14, or 21 deg) and side angles (7 or 25 deg), and heights were varied by a staircase procedure to compute a point of subjective equality and 75% threshold for each condition. Observers were able to reliably judge aspect ratios from the monocular images: Weber fractions were 6–9% for the largest rectangles, increasing to as high as 17% for small rectangles with high simulated slant. Overall, the contours judged to be squares were taller than the projections of actual squares, consistent with perceptual underestimation of depth. Judgments were modulated by image size in the direction expected from perspective geometry, but the effect of size was only about 20–30% of what was predicted. We simulated the performance of a Bayesian ideal observer that integrated perspective information with an a priori bias toward compression of depth and which was able to qualitatively model the pattern of results.

*pictorial depth cues* (for taxonomies, see Cutting and Vishton, 1995, or Kubovy, 1986). For example, in Figure 1, depth can be inferred from the gradient of size for the square tiles or from the convergence (in the image) of lines that recede in the world. These cues are typically correlated, but each is conditional on different assumptions: that squares lying on a surface have constant size in the world, or that lines on the surface are parallel in the world. In this article, we focus on the information provided by the latter cue, which we will refer to as *perspective convergence*.

*slant* and *tilt* (Stevens, 1983) relative to a line of sight that intersects the plane. Once the slant and tilt are known, it would then be possible to reconstruct the shape of the object up to a scale factor for distance (e.g., by back projecting onto the slanted plane).

*a* is the angle of the converging side edges relative to the tilt direction (i.e., tan(*a*) is the slope relative to vertical), *s* is the egocentric slant of the surface (as measured at the origin), and *w* is the angular width measured through the center. This formulation follows that of Freeman (1966b; equivalent formulations include Braunstein & Payne, 1969, and Flock, 1962). A similar relation holds for more generic poses but would involve both the slant and tilt of the surface. We describe perspective convergence in terms of projection onto an image plane because it is convenient; the same geometric relationship could also be described in spherical coordinates.

*w* in Equation 1). Accurate computation of slant in depth from perspective would therefore require the visual system to know the absolute angular size of a projected image. One way to intuitively understand the size dependence is to think of perspective convergence as identifying the location of the horizon within a projected image. Surface slant is the complement of the visual angle between a reference point and the horizon. To obtain this visual angle from a projected contour, one must also know the visual angle subtended by the contour itself, which we will refer to as its *projected size*. As illustrated in Figure 3, similar shapes, when scaled to different sizes, correspond to different 3D interpretations. Thus, if the visual system does use perspective convergence in a geometrically correct way, one would expect that rescaling a perspective image would change perceived 3D structure in predictable ways. As projected size increases, the perceived aspect ratio of the rectangle (length-to-width) should decrease, and it should appear less slanted. The experiments presented here test this prediction as a means to identify the contribution of perspective convergence.
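To make the size dependence concrete, here is a minimal Python sketch. It assumes the relation tan(*a*) = tan(*s*)·tan(*w*/2), which is our reading of the form of Equation 1 from the definitions above; the paper's exact equation may differ.

```python
import math

def slant_from_convergence(a_deg, w_deg):
    """Slant implied by perspective convergence, assuming
    tan(a) = tan(s) * tan(w/2), with a the side-edge angle relative to
    vertical and w the projected angular width (all angles in degrees)."""
    a = math.radians(a_deg)
    half_w = math.radians(w_deg / 2.0)
    return math.degrees(math.atan(math.tan(a) / math.tan(half_w)))

# Rescaling an image leaves the side angle a unchanged but changes the
# projected width w, so the geometrically correct slant interpretation changes:
implied_slants = [slant_from_convergence(25.0, w) for w in (7.0, 14.0, 21.0)]
```

For a fixed side angle of 25 deg, the implied slant decreases as the projected width grows from 7 to 21 deg, which is the direction of the prediction tested in the experiments.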

*both* scaled and unscaled images. Smith and Gruber (1958) compared depth judgments based on either photographs or actual views of a corridor. To the extent that perspective information contributes to perceived depth of the real scene, however, this paradigm has the same limitation as the other studies.

*foreshortening* or *compression* of a projected contour. If the overall shape of a planar object is assumed to be *isotropic* (Garding, 1993; Witkin, 1981), then the aspect ratio of its projected contour provides a cue to its slant. Figure 4a illustrates this cue for the case of trapezoidal projected contours.

*high-slant* and *low-slant* conditions, respectively (Figure 5, top and bottom rows). Both shapes were presented with a range of projected sizes. If subjects use perspective convergence correctly at a given projected size, one would expect a large difference across size conditions in the projected shape that is judged to be square, with the largest projected size having to be the tallest to appear square. Figure 6 shows the contour shapes that accurately correspond to the projections of a square for each of the six size and slant conditions. In addition to overall shape, we also varied the presence or absence of an internal grid texture (Figure 5, right vs. left). The textured stimuli contained an additional cue to depth: the gradient of compression of vertical spacing. This cue is effective in its own right (Andersen et al., 1998). As with perspective convergence, the 3D interpretation of the compression cue depends on projected size; hence, varying size might have a larger effect for textured than untextured rectangles.

*w*_{bottom} + *w*_{top})/(2*h*), where *w*_{bottom} and *w*_{top} are the widths of its bottom and top edges and *h* is its projected height. This ratio is related to the aspect ratio of the corresponding 3D rectangle by a factor of the cosine of slant (Braunstein & Payne, 1969). Thus, the same Weber fraction describes discriminability for the 3D rectangles and for their projections. Note that the "mean width" used to compute aspect ratio for the Weber fraction is not the same as the width measured through the center of the trapezoid.
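As a concrete illustration, the sketch below computes the projected ratio as defined above and inverts the cosine relation under a weak-perspective approximation (projected height compressed by cos(slant), widths unchanged). This is an idealization for illustration, not the authors' code.

```python
import math

def projected_ratio(w_bottom, w_top, h):
    # Mean width over projected height, as defined in the text.
    return (w_bottom + w_top) / (2.0 * h)

def length_to_width_3d(proj_ratio, slant_deg):
    # Weak-perspective approximation: projected height = 3D length * cos(slant),
    # so the 3D length-to-width ratio is 1 / (proj_ratio * cos(slant)).
    return 1.0 / (proj_ratio * math.cos(math.radians(slant_deg)))
```

For example, a unit square slanted 60 deg projects to height 0.5 and width 1 under this approximation, and the recovered length-to-width ratio is 1.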

*F*(2,55) = 19.33, *p* < .001; low slant: *F*(2,55) = 39.53, *p* < .001. There was no evidence that the presence or absence of texture made a difference in the effect of size, high slant: *F*(2,55) = 0.628, *p* = .47 (*ns*); low slant: *F*(2,55) = 0.57, *p* = .57 (*ns*). Eleven of 12 subjects showed the size effect. There were also small but significant main effects of texture for both high- and low-slant conditions, high slant: *F*(1,55) = 6.119, *p* = .016; low slant: *F*(1,55) = 10.96, *p* = .002.

*F*(1,55) = 8.7, *p* = .005, whereas for the low-slant conditions, there was no reliable difference for texture, *F*(1,55) = 1.5, *p* = .23 (*ns*). Threshold decreased as projected size increased in the high-slant conditions, *F*(2,55) = 12.6, *p* < .001; this trend was not significant in the low-slant conditions, *F*(2,55) = 1.4, *p* = .25 (*ns*). There was no interaction between size and the presence of texture for either slant condition, high slant: *F*(2,55) = 0.36, *p* = .70 (*ns*); low slant: *F*(2,55) = 0.14, *p* = .87 (*ns*).

*discrimination* more difficult, but it could not directly account for size-dependent *biases* in which projected shapes appear to be square.

*same* shape (Figure 10b)? From the data of Experiment 1, we can infer that a contour that appears to be a square object at an intermediate size would appear as an elongated rectangle when presented at smaller sizes and as a shortened object at larger sizes. However, the extent to which the perceived 3D objects appear elongated or shortened cannot necessarily be determined.

*y* = 1 on the plots in Figure 13). If there were a significant interaction between perspective convergence and aspect ratio, these points might be similar across projected size conditions, even if the perceived length-to-width ratio for a given aspect ratio changed by a large amount. This is clearly not the case.

*w*), aspect ratio (*r*_{proj}), and side angles (*a*_{proj}). We assumed that slant was around a horizontal axis (vertical tilt direction); thus, the 3D object would be a symmetric trapezoid as well. The 3D object was also specified by its size, length-to-width ratio (*r*_{obj}), and the angle of its sides relative to its midline (*a*_{obj}). There is an unavoidable ambiguity with respect to the overall size and distance of the 3D object; hence, without loss of generality, we ignore its size parameter. Thus, in terms of the defined parameters, the model's task was to estimate the slant (*s*) and shape (*r*_{obj}, *a*_{obj}) of the 3D object from a projected contour with a given projected width (*w*) and shape (*r*_{proj}, *a*_{proj}). The results of Experiment 2 suggest that perceived slant does not depend on projected aspect ratio; thus, we initially estimated *s* and *a*_{obj} based solely on *w* and *a*_{proj}. The estimate of slant was combined with *r*_{proj} to determine *r*_{obj}, which was what the model (and subjects) judged.
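The two-stage estimate described here can be sketched as follows. The slant step below uses the simple parallel-sides relation tan(*a*_{proj}) = tan(*s*)·tan(*w*/2) as a stand-in for the model's actual MAP estimate, and the cos(slant) combination step is an assumption on our part.

```python
import math

def slant_from_parallel_sides(a_proj, w):
    # Parallel-sides interpretation of convergence (angles in degrees).
    r = math.radians
    return math.degrees(math.atan(math.tan(r(a_proj)) / math.tan(r(w / 2.0))))

def judged_length_to_width(r_proj, a_proj, w):
    """Stage 1: estimate slant from (w, a_proj) alone.
    Stage 2: combine the slant estimate with the projected length-to-width
    ratio r_proj to obtain the judged 3D ratio r_obj."""
    s = slant_from_parallel_sides(a_proj, w)
    return r_proj / math.cos(math.radians(s))
```

With *r*_{proj} fixed, a smaller projected width implies a higher slant and hence a larger judged length-to-width ratio.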

*s* and *a*_{obj} was the combination that maximized the posterior probability function *P*(*s*, *a*_{obj} ∣ *a*_{proj}, *w*). By applying Bayes' rule, this can be expressed in terms of the likelihood function *P*(*a*_{proj} ∣ *s*, *a*_{obj}, *w*) and the priors on *s* and *a*_{obj}:

*P*(*s*, *a*_{obj} ∣ *a*_{proj}, *w*) ∝ *P*(*a*_{proj} ∣ *s*, *a*_{obj}, *w*) *P*(*s*) *P*(*a*_{obj})

*P*(*a*_{proj} ∣ *s*, *a*_{obj}, *w*), we assumed that the image measures of *w* and *a*_{proj} were unbiased but corrupted by noise and then marginalized over the possible true values. Details of the noise model are given in 2.

*P*(*a*_{proj} ∣ *s*, *a*_{obj}, *w*) computed for an example image trapezoid (top left). There is a range of 3D interpretations with high likelihood lying along a curve. Two particular points along the curve are marked for illustration. One is the zero-slant interpretation; in this case, the 3D object has the same trapezoid shape as the projected contour. The other special case marked in the figure is where the high-likelihood curve intersects the axis *a*_{obj} = 0, which corresponds to the 3D interpretation assuming parallel sides. The other points with high likelihood are intermediate cases, where the 3D shape is a trapezoid with less steep sides than the 2D projected contour and is less slanted than the parallel-sides interpretation.

*P*(*a*_{proj} ∣ *s*, *a*_{obj}, *w*) with the different priors for *s* and *a*_{obj}. The top middle panel shows *P*(*s*, *a*_{obj}) assuming a uniform prior for *s* and a Gaussian prior for the shape parameter *a*_{obj}, centered around zero with standard deviation *σ*_{persp} = 6 deg. This prior assigns higher likelihood to interpretations for which the object's sides are near parallel. The bottom middle panel shows the result of multiplying these priors with *P*(*a*_{proj} ∣ *s*, *a*_{obj}, *w*) to obtain the posterior *P*(*s*, *a*_{obj} ∣ *a*_{proj}, *w*). As might be expected, the maximum of this function is very close to the parallel-sides interpretation. The top right panel shows a different set of priors. The prior on *a*_{obj} is the same, but the prior on *s* is weighted toward zero, *P*(*s*) = cos(*s*). This particular prior has been suggested by Hillis et al. (2004), who point out that it describes the distribution of viewer-relative slants in an environment where all 3D surface orientations are equally likely. When this biased slant prior is integrated with perspective information, the peak of *P*(*s*, *a*_{obj} ∣ *a*_{proj}, *w*) is shifted away from the correct parallel interpretation to a point with lower slant (bottom right).
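A runnable sketch of this computation on a discrete grid is below. The projection function and the noise value (*σ*_{a} = 2 deg here) are our own assumptions, with the forward model chosen only to be consistent with the parallel-sides relation tan(*a*_{proj}) = tan(*s*)·tan(*w*/2); it nevertheless reproduces the qualitative effect described: a cos(*s*) slant prior pulls the MAP slant below the parallel-sides interpretation.

```python
import math

def f_proj(a_obj, s, w):
    """Hypothetical projection function (degrees): projected side angle of a
    symmetric trapezoid with object side angle a_obj, slant s, and projected
    width w. Reduces to tan(a_proj) = tan(s)*tan(w/2) when a_obj = 0."""
    r = math.radians
    return math.degrees(math.atan(
        (math.tan(r(w / 2.0)) * math.sin(r(s)) + math.tan(r(a_obj)))
        / math.cos(r(s))))

def map_estimate(a_proj, w, sigma_a=2.0, sigma_persp=6.0, slant_prior="uniform"):
    """Grid search for the MAP (slant, a_obj) given a projected contour."""
    best, best_lp = (0.0, 0.0), -float("inf")
    for i in range(357):                  # slant: 0 to 89 deg in 0.25 deg steps
        s = 0.25 * i
        for j in range(161):              # a_obj: -20 to 20 deg in 0.25 deg steps
            a_obj = -20.0 + 0.25 * j
            lp = -0.5 * ((a_proj - f_proj(a_obj, s, w)) / sigma_a) ** 2
            lp += -0.5 * (a_obj / sigma_persp) ** 2        # Gaussian shape prior
            if slant_prior == "cos":
                lp += math.log(math.cos(math.radians(s)))  # slant prior P(s)=cos(s)
            if lp > best_lp:
                best_lp, best = lp, (s, a_obj)
    return best
```

With a uniform slant prior, the MAP solution sits at the parallel-sides interpretation; with the cos(*s*) prior, it shifts several degrees toward lower slant, with slightly convergent object sides.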

*P*(*a*_{obj}), which in our model is specified by the single parameter *σ*_{persp}. Setting this parameter much higher than in our simulations can lead to qualitative discrepancies. In addition, there is one discrepancy between model performance and human data that cannot be explained within our formulation, regardless of parameters. In the 21 deg, low-slant condition, subjects' judgments on average were close to veridical, and for some subjects, they were biased slightly in the direction opposite to that in the other conditions. In our model, the prior toward low slants is the only factor that produces deviations from veridical performance; hence, one would never expect biases in the direction opposite to perceptual compression of depth.

*f*(*a*_{obj}, *s*, *w*′) being inaccurate. Perceptual distortions have been observed even for a very simple shape matching task (Henriques et al., 2005); hence, the possibility of systematic distortions in 3D interpretations cannot be discounted.

*with identical perspective information* were judged to have equivalent depth structure. This is essentially a cue conflict paradigm. In the case of the experiment by Smith and Gruber (1958), which compared judgments for photos and actual scenes, the conflict would consist of any cues that differ between views of an actual scene and a photograph. In other studies, scaled and unscaled photos with matching perspective convergence were compared (Bengston et al., 1980; Lumsden, 1983; Smith, 1958a, 1958b). In this case, conflicts would arise from size-invariant monocular depth cues, such as texture compression or familiar size. One can imagine an analogous variant of our experiment, in which judgments were based on either rendered stimuli or monocular views of actual checkerboard surfaces, constructed to have matching projected images. If results were similar, it would not imply that perspective convergence was interpreted in an accurate, scale-dependent way. Rather, it would imply that perspective convergence dominated other 3D cues. Similarly, to the extent that results differed, it would imply that other cues influenced judgments. Thus, this type of design addresses the question of how much perspective contributes relative to other depth information. This is very different from asking, as in our experiment, whether perceived depth from convergence changes in a geometrically accurate way when projected size is varied.

*minimized expected entropy* staircase method.

{*x*_{k}, *r*_{k}}, were used to estimate a posterior probability distribution *P*(*μ*, *σ* ∣ *x*_{1}, *r*_{1}, *x*_{2}, *r*_{2}, …, *x*_{n}, *r*_{n}), where *μ* is the PSE and *σ* is the difference between the PSE and the 75% point. The next probe *x*_{n+1} was chosen to minimize the expected entropy, −∑ *p* log(*p*), of the posttrial posterior function, *P*(*μ*, *σ* ∣ *x*_{1}, *r*_{1}, *x*_{2}, *r*_{2}, …, *x*_{n+1}, *r*_{n+1}). The entropy cost function rewards probes that would be expected to result in a more peaked and concentrated posterior distribution over the space of possible combinations of *μ* and *σ*, consistent with the goal of estimating *μ* and *σ* with minimal bounds of uncertainty.
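The selection rule can be sketched as follows. This is a simplified, runnable version: the logistic psychometric function and its 0.025–0.975 scaling follow the text, but the grids and parameter values are illustrative only.

```python
import math

def psychometric(x, mu, sigma):
    # Logistic scaled to the range 0.025-0.975; sigma is the distance from
    # the PSE to the 75% point (slope ln(3)/sigma before scaling).
    p = 1.0 / (1.0 + math.exp(-math.log(3.0) * (x - mu) / sigma))
    return 0.025 + 0.95 * p

def posterior(history, mus, sigmas):
    # Normalized posterior over a discrete (mu, sigma) grid, flat prior.
    post, total = {}, 0.0
    for mu in mus:
        for sg in sigmas:
            p = 1.0
            for x, r in history:
                q = psychometric(x, mu, sg)
                p *= q if r == 1 else (1.0 - q)
            post[(mu, sg)] = p
            total += p
    return {k: v / total for k, v in post.items()}

def entropy(post):
    return -sum(p * math.log(p) for p in post.values() if p > 0.0)

def next_probe(history, mus, sigmas, probes):
    # Choose the probe that minimizes the expected posttrial posterior
    # entropy, marginalizing the response probability over the current
    # posterior (as in the text's marginalization over mu and sigma).
    post = posterior(history, mus, sigmas)
    best_x, best_h = probes[0], float("inf")
    for x in probes:
        p1 = sum(p * psychometric(x, mu, sg) for (mu, sg), p in post.items())
        h = (p1 * entropy(posterior(history + [(x, 1)], mus, sigmas))
             + (1.0 - p1) * entropy(posterior(history + [(x, 0)], mus, sigmas)))
        if h < best_h:
            best_x, best_h = x, h
    return best_x
```

Driving this staircase with a simulated observer whose PSE lies inside the probe range concentrates the posterior around that PSE within a few dozen trials.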

*P*(*r*_{n+1} = 0 ∣ *x*_{n+1}) and *P*(*r*_{n+1} = 1 ∣ *x*_{n+1}). If *μ* and *σ* were known, these probabilities would be directly determined by the model psychometric function. Thus, to estimate *P*(*r*_{n+1} ∣ *x*_{n+1}), we marginalized over *μ* and *σ*, using the posterior distribution computed from the previous response history as an estimate of *P*(*μ*, *σ*):

*P*(*r*_{n+1} ∣ *x*_{n+1}, *μ*, *σ*), rather than a more standard cumulative Gaussian, to simplify computation during probe selection. Also, the function was scaled to range from 0.025 to 0.975 rather than from 0 to 1, to reduce the effect of lapses of attention and guessing on the probe selection. The space of possible bias and threshold values was discretely sampled to carry out the marginalization, with *σ* sampled linearly from the set {0.05, 0.1, …, 0.8} and *μ* sampled exponentially from the set {0.26, 0.274, …, 3.87} for low-slant conditions and from the set {0.094, 0.100, …, 1.42} for high-slant conditions.

*μ* and *σ* have converged, the probe choices tend to oscillate between two values, symmetric around the PSE, and to be anticorrelated with the previous response. This occurs because the expected entropy function at this point has two local minima that are very similar, such that a single response switches their relative depths. Consequently, probe values tend to alternate, which could influence a subject's behavior. In the experiment reported here, there were many interleaved conditions in each block and a modest number of trials per staircase; hence, temporal correlations were not a concern. However, in a design with few conditions and many trials per staircase, this would be a more serious problem. A simple solution is to use a random subset of the response history to estimate the posterior function, rather than the whole history, once a sufficient number of trials have been recorded. Because the method converges to a rough estimate quickly (within 15–20 trials), excluding a subset of trials has little effect on the final distribution of probe samples. Note that, with this modification, there is no need to run multiple interleaved staircases using our method, as is commonly done with standard staircases.

*a*_{proj} as being Gaussian, with a width parameter *σ*_{a} that was set based on previous psychophysical measures of 2D orientation discrimination. Discrimination of 2D orientations exhibits an oblique effect: Thresholds are higher away from the horizontal and vertical axes (Heeley et al., 1997; Regan & Price, 1986; Snippe & Koenderink, 1994). In the case of our stimuli, this would imply that orientations are encoded less reliably in our high-slant conditions than in our low-slant conditions. Uncertainty in shape measurement could alternatively be modeled as a function of the corner angles of the projected figure, as opposed to the orientations of its side edges. Thresholds for 2D angle discrimination follow an M-shaped function of base angle, with a local minimum at 90 deg (Chen and Levi, 1996; Heeley and Buchanan-Smith, 1996; Regan, Gray, et al., 1996); thus, one would similarly expect greater noise in the high-slant conditions. On the basis of these various results, we estimated that the effective orientation/angle noise for the high-slant conditions would be about twice as high as for the low-slant conditions, with all other factors equal. Orientation discrimination for 2D lines has also been found to depend strongly on line length. For extended lines, thresholds decrease roughly with the square root of length (Heeley & Buchanan-Smith, 1998; Orban, Vandenbussche, and Vogels, 1984). Incorporating this length dependence, our model for noise in projected shape was 2.5 deg/sqrt(*L*) for the low-slant conditions and 5 deg/sqrt(*L*) for the high-slant conditions, where *L* is the length of the side edges (in degrees of visual angle).
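In code, this orientation-noise model reads as follows (values taken directly from the text):

```python
import math

def orientation_noise_deg(edge_length_deg, high_slant):
    """Standard deviation of side-edge orientation noise: 2.5/sqrt(L) deg for
    the low-slant conditions and 5/sqrt(L) deg for the high-slant conditions,
    where L is the side-edge length in degrees of visual angle."""
    scale = 5.0 if high_slant else 2.5
    return scale / math.sqrt(edge_length_deg)
```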

*w*) is a Gaussian with deviation *σ*_{w} = 0.05 log units. This would be consistent with results from studies of interval length discrimination, which have found that thresholds increase proportionally with length, with a Weber fraction of approximately 0.05, across a range of conditions (Burbeck, 1987; Toet, van Eekhout, Simons, & Koenderink, 1987; Whitaker & Latham, 1997). We found that this noise parameter could be varied somewhat without affecting the qualitative performance of the model, provided that it remained small compared with the uncertainty introduced by orientation noise.

*a*′_{proj} is determined by perspective geometry; hence, only the projected width parameter *w*′ needs to be explicitly marginalized. Using the noise models, the desired likelihood function becomes:

*P*(*a*_{proj} ∣ *s*, *a*_{obj}, *w*) = ∫ *Z*((*a*_{proj} − *f*(*a*_{obj}, *s*, *w*′))/*σ*_{a}) *Z*((log *w* − log *w*′)/*σ*_{w}) d*w*′

where *Z* is a standard Gaussian distribution and *f* is the projection function mapping *a*_{obj} to *a*_{proj} for a given slant and width.
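A numerical sketch of this marginalization is below (rectangle-rule integration over *w*′, unnormalized). The projection function *f* here is a hypothetical stand-in chosen to match the parallel-sides relation, not necessarily the paper's exact *f*.

```python
import math

def gauss(z):
    # Standard Gaussian density.
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def f_proj(a_obj, s, w):
    # Assumed projection function (degrees), matching the parallel-sides
    # relation tan(a_proj) = tan(s)*tan(w/2) when a_obj = 0.
    r = math.radians
    return math.degrees(math.atan(
        (math.tan(r(w / 2.0)) * math.sin(r(s)) + math.tan(r(a_obj)))
        / math.cos(r(s))))

def likelihood(a_proj, s, a_obj, w, sigma_a, sigma_w=0.05):
    """P(a_proj | s, a_obj, w), up to a constant factor: marginalize over the
    true width w', with Gaussian noise sigma_a (deg) on the projected angle
    and Gaussian noise sigma_w (log units) on the measured width."""
    total, n = 0.0, 200
    for i in range(n):
        # Sample candidate true widths spanning +/- 4 SD around the measurement.
        log_wp = math.log10(w) + 8.0 * sigma_w * (i / (n - 1.0) - 0.5)
        wp = 10.0 ** log_wp
        total += (gauss((a_proj - f_proj(a_obj, s, wp)) / sigma_a)
                  * gauss((math.log10(w) - log_wp) / sigma_w))
    return total / n
```

As a sanity check, the likelihood of a given projected angle is far higher at the slant consistent with the parallel-sides geometry than at an arbitrary lower slant.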