**Estimating three-dimensional (3D) surface orientation (slant and tilt) is an important first step toward estimating 3D shape. Here, we examine how three local image cues from the same location (disparity gradient, luminance gradient, and dominant texture orientation) should be combined to estimate 3D tilt in natural scenes. We collected a database of natural stereoscopic images with precisely co-registered range images that provide the ground-truth distance at each pixel location. We then analyzed the relationship between ground-truth tilt and image cue values. Our analysis is free of assumptions about the joint probability distributions and yields the Bayes optimal estimates of tilt, given the cue values. Rich results emerge: (a) typical tilt estimates are only moderately accurate and strongly influenced by the cardinal bias in the prior probability distribution; (b) when cue values are similar, or when slant is greater than 40°, estimates are substantially more accurate; (c) when luminance and texture cues agree, they often veto the disparity cue, and when they disagree, they have little effect; and (d) simplifying assumptions common in the cue combination literature are often justified for estimating tilt in natural scenes. The fact that tilt estimates are typically not very accurate is consistent with subjective impressions from viewing small patches of natural scenes. The fact that estimates are substantially more accurate for a subset of image locations is also consistent with subjective impressions and with the hypothesis that perceived surface orientation, at more global scales, is achieved by interpolation or extrapolation from estimates at key locations.**

$r(x, y)$ is the average range in the neighborhood of $(x, y)$, with $x$ and $y$ in degrees of visual angle. The average range is given by the convolution of the range image with the Gaussian kernel, $r(x, y) = \mathit{rng}(x, y) \ast g(x, y; \sigma_{blur})$, where $g(x, y; \sigma_{blur})$ is an isotropic two-dimensional Gaussian with mean zero and standard deviation $\sigma_{blur}$ of 0.1°. For notational convenience, we leave implicit the $(x, y)$ coordinates on the right side of Equation 1. Note that blurring the range image with a Gaussian and then taking derivatives in $x$ and $y$ is equivalent to convolving the range image with Gaussian derivative kernels in $x$ and $y$ (see Figure 5). Also note that normalizing by the range in Equation 1 is necessary so that a planar surface will be assigned the same slant independent of range; however, this normalization has no effect on the definition of tilt because the normalization term cancels out. Finally, note that this definition of ground-truth tilt means that ground-truth tilt depends in part on the size of the analysis neighborhood (see Discussion).
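The equivalence between blurring and then differentiating, and convolving once with a Gaussian derivative kernel, follows from the associativity of convolution. The sketch below checks it numerically in one dimension. The central-difference kernel, the kernel size, the value of sigma (in samples), and the toy range values are all assumptions chosen for illustration; the analysis in the text is two-dimensional.

```python
import math

def conv_full(a, b):
    """Full discrete convolution of two 1D sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def gaussian_kernel(sigma, half_width):
    """Sampled 1D Gaussian, normalized to unit sum (sigma in samples)."""
    vals = [math.exp(-0.5 * (k / sigma) ** 2)
            for k in range(-half_width, half_width + 1)]
    total = sum(vals)
    return [v / total for v in vals]

# Central-difference derivative kernel; convolving with it gives
# (f[n+1] - f[n-1]) / 2 at the kernel's center sample.
DIFF = [0.5, 0.0, -0.5]

# Toy 1D "range" profile in meters (made-up values, not from the database).
rng_row = [10.0, 10.2, 10.1, 10.6, 11.0, 11.9, 12.1, 12.0, 12.4, 13.0]
g = gaussian_kernel(sigma=1.5, half_width=4)

# Route 1: blur the range profile, then differentiate.
blur_then_diff = conv_full(conv_full(rng_row, g), DIFF)

# Route 2: build a Gaussian derivative kernel once, then convolve.
gauss_deriv = conv_full(g, DIFF)
one_pass = conv_full(rng_row, gauss_deriv)

max_gap = max(abs(p - q) for p, q in zip(blur_then_diff, one_pass))
print(f"largest difference between the two routes: {max_gap:.2e}")
```

The two routes agree up to floating-point rounding, so a single pass with a precomputed derivative kernel can stand in for the two-step procedure.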

$\delta(x, y) = \mathit{dsp}(x, y) \ast g(x, y; \sigma_{blur})$. Again, normalizing by the (local average) disparity is necessary so that a planar surface will be assigned the same slant independent of viewing distance (but has no effect on the tilt estimate).
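A small numerical illustration of why the normalization leaves tilt untouched: dividing both gradient components by the same positive local average rescales the gradient vector but does not rotate it. The gradient values and local averages below are hypothetical, and the sketch assumes a positive normalizer.

```python
import math

def tilt_deg(gx, gy):
    """Orientation of the gradient vector (gx, gy), in degrees on [0, 360)."""
    return math.degrees(math.atan2(gy, gx)) % 360.0

def normalize(gx, gy, local_mean):
    """Divide both gradient components by the (positive) local average."""
    return gx / local_mean, gy / local_mean

# Hypothetical raw gradient components and three local averages.
gx, gy = 0.03, 0.04
results = []
for local_mean in (0.5, 2.0, 8.0):
    nx, ny = normalize(gx, gy, local_mean)
    results.append((local_mean, math.hypot(nx, ny), tilt_deg(nx, ny)))

for local_mean, magnitude, tilt in results:
    print(f"mean={local_mean:4.1f}  |gradient|={magnitude:.4f}  tilt={tilt:.2f} deg")
```

The gradient magnitude, which carries the slant information, changes with the normalizer, while the orientation stays the same in every row.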

$l(x, y) = \mathit{lum}(x, y) \ast g(x, y; \sigma_{blur})$. Here we divide by the (average local) luminance so that the luminance gradient vector corresponds to the signed Weber contrasts in the horizontal and vertical directions. The luminance tilt cue is defined as the orientation of the luminance gradient:

$$\phi_l = \tan^{-1}\left(\frac{\partial l / \partial y}{\partial l / \partial x}\right)$$

$(x, y)$. We then take the Fourier transform of the windowed image and compute the amplitude spectrum. Finally, we use singular value decomposition to find the major (principal) axis of the amplitude spectrum (the orientation along which there is the greatest variance around the origin). We define the tilt cue as the orientation of the major axis in the Fourier domain:

$$\phi_t = \tan^{-1}\left(\frac{u_y}{u_x}\right)$$

where $(u_x, u_y)$ is the unit vector defining the principal axis.

$\phi_r$ is the ground-truth tilt (the latent variable) and $\phi$ is the observed vector of cue values [e.g., $\{\phi_d, \phi_l, \phi_t\}$].

The joint probability distribution $p(\phi_r, \phi_d, \phi_l, \phi_t)$ of image cue values and ground-truth 3D tilt is four-dimensional (range, disparity, luminance, texture), and estimating it accurately would require far more data than our already quite large data set contains. However, measuring conditional means requires much less data and is practical for a data set of our size. The direct way to determine the conditional means is to (a) compute a running count of the number of occurrences of each unique vector of image cue values, (b) compute a running sum of the variable of interest (the unit vector in the ground-truth tilt direction) for each unique vector of image cue values, and (c) compute the argument (arg) of the vector average:

$$E(\phi_r \mid \phi) = \arg\left[\frac{1}{N(\phi)} \sum_{\phi_r \in \Omega(\phi)} e^{i\phi_r}\right]$$

where $\Omega(\phi)$ indicates the set of ground-truth values that co-occur with a particular vector of cue values and $N(\phi)$ is the count of the number of occurrences of each unique vector. The circular variance of the optimal estimate (i.e., the inverse of reliability) is one minus the complex absolute value of the vector average:

$$\sigma^2 = 1 - \left|\frac{1}{N(\phi)} \sum_{\phi_r \in \Omega(\phi)} e^{i\phi_r}\right|$$

$\phi = \{\phi_d, \phi_l, \phi_t\}$. For continuous variables such as gradients, the number of possible combinations is infinite. Therefore, it is necessary to quantize the cue values. Here, we quantize each of the cue values into 64 bins, each of which is 2.8° wide. (This bin width appears to be sufficiently narrow given the smoothness of the space.) With a triplet of cue values, this quantization results in $64^3$ total bins, which means that ∼260,000 total conditional means must be computed. Estimating 260,000 conditional means requires a substantial amount of data. Our data set contains approximately 1 billion pixels. If the image cue triplets were uniformly distributed, each bin would have approximately 4,000 samples. In practice, we find that the minimum number of samples is 618 and the maximum 86,838. This number of samples is sufficient to reliably estimate the mean and variance for each bin.
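The running-count and running-sum procedure, combined with the 64-bin quantization, can be sketched as a small lookup table keyed by the quantized cue triplet. This is a minimal illustration rather than the authors' implementation. The class and function names are hypothetical, and the 180° period is an assumption inferred from 64 bins of about 2.8° each; the sketch scales tilt so that one period spans the full circle before averaging, and setting `PERIOD_DEG = 360.0` would instead average unit vectors in the tilt direction itself.

```python
import cmath
import math
from collections import defaultdict

N_BINS = 64           # bins per cue, as in the text
PERIOD_DEG = 180.0    # assumption: 64 bins of ~2.8 deg span 180 deg of tilt
TOTAL_BINS = N_BINS ** 3

def bin_index(cue_deg):
    """Quantize one cue value (degrees) into one of N_BINS equal bins."""
    return int((cue_deg % PERIOD_DEG) / PERIOD_DEG * N_BINS) % N_BINS

def unit_vector(tilt_deg):
    """Unit vector for a tilt, scaled so that one period spans the circle."""
    return cmath.exp(1j * 2.0 * math.pi * tilt_deg / PERIOD_DEG)

class ConditionalMeanTable:
    """Running count and running vector sum for each quantized cue triplet."""

    def __init__(self):
        self.count = defaultdict(int)
        self.vector_sum = defaultdict(complex)

    def add(self, cues_deg, true_tilt_deg):
        key = tuple(bin_index(c) for c in cues_deg)
        self.count[key] += 1                                 # step (a)
        self.vector_sum[key] += unit_vector(true_tilt_deg)   # step (b)

    def estimate(self, cues_deg):
        """Return (conditional mean tilt in deg, circular variance), or None."""
        key = tuple(bin_index(c) for c in cues_deg)
        n = self.count.get(key, 0)
        if n == 0:
            return None
        mean_vec = self.vector_sum[key] / n
        tilt = cmath.phase(mean_vec) * PERIOD_DEG / (2.0 * math.pi)  # step (c)
        return tilt % PERIOD_DEG, 1.0 - abs(mean_vec)

# Two made-up samples that fall in the same cue bin.
table = ConditionalMeanTable()
table.add((10.0, 12.0, 11.0), true_tilt_deg=8.0)
table.add((10.0, 12.0, 11.0), true_tilt_deg=12.0)
estimate, circ_var = table.estimate((10.0, 12.0, 11.0))
print(TOTAL_BINS, round(1e9 / TOTAL_BINS), estimate, circ_var)
```

Storing only a count and a complex sum per bin keeps memory fixed at one entry per occupied bin, however many pixels are processed. The first two printed numbers reproduce the bin arithmetic in the text: $64^3 = 262{,}144$ bins and roughly 3,800 samples per bin for 1 billion uniformly distributed pixels.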

$(\hat{\Phi}_{r|\delta}, \hat{\Phi}_{r|l}, \hat{\Phi}_{r|t})$ are the estimates from the individual cues, $(\rho_{r|\delta}, \rho_{r|l}, \rho_{r|t})$ are their relative reliabilities, and $w_0$ and $\phi_0$ are constants to correct for any overall bias because the prior contributes to each of the individual estimates. In the present case, the constants are equal to zero because there is no overall bias. There are several things to note about Equation 12. First, this vector summation rule is appropriate for circular variables. Mittelstaedt (1983, 1986) was the first to show that vector summation weighted by reliability can account for human performance in cue combination experiments. Murray and Morgenstern (2010) showed that it is near optimal under some circumstances. Second, for circular variables, using the reliabilities instead of the relative reliabilities yields the same result. Third, the linear estimate combination is different from the linear cue combination. However, for standard (not circular) variables, and the usual conditional independence and Gaussian assumptions, the linear cue combination and linear estimate combination give the same estimates (see Supplement). The advantage here of linear estimate combination is that the individual estimates (from the conditional means) are guaranteed to be optimal, independent of the shape of the individual cue posterior probability distributions and the prior probability distribution (see above). Fourth, even so, the statistical properties that are known to guarantee optimality of linear estimate combination do not hold here. Nonetheless, one can still ask how well the linear estimate combination approximates the optimal estimates.
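The second point, that reliabilities and relative reliabilities give the same combined estimate, holds because rescaling all weights by a common factor changes the length of the summed vector but not its direction. The sketch below shows this for a reliability-weighted vector sum with the bias constants set to zero, as stated in the text. It is an illustration of the general rule, not a transcription of Equation 12; the function name, the 180° period, and all numerical values are assumptions.

```python
import cmath
import math

PERIOD_DEG = 180.0  # assumption: tilt estimates repeat every 180 deg

def combine_estimates(estimates_deg, reliabilities, period_deg=PERIOD_DEG):
    """Reliability-weighted vector sum of single-cue estimates (degrees)."""
    scale = 2.0 * math.pi / period_deg
    total = sum(r * cmath.exp(1j * scale * e)
                for e, r in zip(estimates_deg, reliabilities))
    return (cmath.phase(total) / scale) % period_deg

# Hypothetical single-cue estimates (disparity, luminance, texture) and
# reliabilities; none of these numbers come from the paper.
estimates = [20.0, 40.0, 70.0]
reliabilities = [3.0, 1.0, 0.5]
relative = [r / sum(reliabilities) for r in reliabilities]

from_raw = combine_estimates(estimates, reliabilities)
from_relative = combine_estimates(estimates, relative)
print(f"{from_raw:.3f} deg vs {from_relative:.3f} deg")
```

Both calls print the same angle, and the combined estimate is pulled toward the cue with the largest reliability.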

$E(\phi_r \mid \phi_l)$, $E(\phi_r \mid \phi_t)$, and $E(\phi_r \mid \phi_d)$. For image cue tilt measurements near 0° and 90°, the estimates are equal to the measured cue value. However, for image cue measurements near 45° and 135°, estimates are shifted toward 0° and 90°. This shift is largely due to the effect of the prior. The prior distribution exhibits a strong cardinal bias. Surfaces slanted around horizontal axes (tilt = 90°) or vertical axes (tilt = 0°) are much more likely than other tilts. When only one image cue is measured, the information it provides is not highly reliable (Figure 7D). However, if all three cues are measured and agree, the influence of the prior on the conditional means $E(\phi_r \mid \phi_l = \phi_t = \phi_d)$ is nearly eliminated (Figure 7C, black curve), and the estimate reliability increases substantially (Figure 7D, black curve).

$E(\phi_r \mid \phi_l, \phi_d)$, are shown in Figure 8A. The pattern of results is intuitive but complex. Depending on the particular values of the disparity and luminance cues, we see several different types of behavior: disparity dominance, cue averaging, and cue switching. For example, when disparity equals 90°, $E(\phi_r \mid \phi_l, \phi_d = 90°)$, we observe disparity dominance; that is, the luminance cue exerts almost zero influence on the estimate (vertical midline of Figure 8A; see Figure 8B inset). On the other hand, when luminance equals 90°, $E(\phi_r \mid \phi_l = 90°, \phi_d)$, the disparity cue exerts a strong influence on the estimate (horizontal midline of Figure 8A). When luminance and disparity agree, $E(\phi_r \mid \phi_l = \phi_d)$, the single-cue estimates are approximately averaged (positive oblique of Figure 8A). When luminance and disparity disagree by 90°, $E(\phi_r \mid |\phi_l - \phi_d| = 90°)$, the best estimates switch from 0° to 90° abruptly when the disparity cue approaches ∼65° and then switch abruptly back from 90° to 0° when the disparity cue approaches ∼115°. All of these effects can be seen more readily by examining the value of the estimate as a function of the disparity cue for different luminance cue values (Figure 8B) and as a function of the luminance cue for different disparity cue values (Figure 8C).

$E(\phi_r \mid \phi_d, \phi_l = \phi_t) - E(\phi_r \mid \phi_d)$. When luminance and texture agree, they override disparity unless disparity equals 90° (or unless luminance and texture approximately agree with disparity). Figure 11B, D shows that luminance and texture have progressively less effect as the difference between them increases. Figure 11E shows that when luminance and texture differ by 90°, they have virtually no effect on the optimal estimate (disparity dominance). That is, the difference $E(\phi_r \mid \phi_d, |\phi_l - \phi_t| = 90°) - E(\phi_r \mid \phi_d)$ is near zero for virtually all values of disparity, luminance, and texture. Thus, in general, luminance and texture override disparity when they agree and have little effect when they disagree.

$\bar{\delta}$, the mean luminance $\bar{l}$, and the RMS contrast $\bar{c}$. Each of these is a weighted average computed over the same analysis area as the three main cues and can take on an arbitrary value $v$. To evaluate the information provided by each of these auxiliary cues, we computed the single-cue estimates and their variances (and hence reliabilities), conditional on the value of the tilt cue and the value $v$ of the auxiliary cue:

$$\hat{\Phi}_{r|\delta,v} = E(\phi_r \mid \phi_\delta, v), \quad \mathrm{var}(\phi_r \mid \phi_\delta, v),$$

$$\hat{\Phi}_{r|l,v} = E(\phi_r \mid \phi_l, v), \quad \mathrm{var}(\phi_r \mid \phi_l, v),$$

$$\hat{\Phi}_{r|t,v} = E(\phi_r \mid \phi_t, v), \quad \mathrm{var}(\phi_r \mid \phi_t, v).$$

To illustrate the broad effects of these cues, Figures 13A-C plot the average relative reliability for the three main cues as a function of each of the auxiliary variables. As can be seen, the average relative reliability of tilt estimates from disparity decreases with absolute disparity and RMS contrast, but the average relative reliability of the other estimators is largely unaffected by these auxiliary cues. Note that for very large distances, disparity can play no role because then changes in depth will create changes in disparity below the disparity detection threshold. Luminance has very little effect on any of the estimators. This result is intuitive. In general, the disparity gradient information should decrease with distance because of the inverse square relationship between disparity and distance. We suspect that the disparity reliability decreases with RMS contrast in natural scenes because high-contrast regions are correlated with depth discontinuities and because disparity information is generally poor at depth discontinuities (i.e., high-disparity gradients). Figure 13D shows how the variability of the individual cue values (across tilt) changes with the disparity-defined distance in meters.
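The inverse square relationship between disparity and distance can be made concrete with a short calculation: the disparity produced by a fixed depth step shrinks by roughly a factor of four each time the viewing distance doubles. The sketch below uses the standard vergence-angle geometry for a point straight ahead. The interocular distance of 0.065 m and the 1 cm depth step are assumed values, not taken from the paper.

```python
import math

IPD_M = 0.065  # assumed interocular distance in meters (not from the paper)

def vergence_deg(distance_m, ipd_m=IPD_M):
    """Vergence angle (degrees) needed to fixate a point straight ahead."""
    return math.degrees(2.0 * math.atan(ipd_m / (2.0 * distance_m)))

def disparity_step_deg(distance_m, depth_step_m, ipd_m=IPD_M):
    """Disparity between points at distance_m and distance_m + depth_step_m."""
    near = vergence_deg(distance_m, ipd_m)
    far = vergence_deg(distance_m + depth_step_m, ipd_m)
    return near - far

DEPTH_STEP_M = 0.01  # a hypothetical 1 cm depth change
steps = {d: disparity_step_deg(d, DEPTH_STEP_M) for d in (1.0, 2.0, 4.0, 8.0)}
for d, s in steps.items():
    print(f"{d:4.1f} m: {s * 60.0:.3f} arcmin")
```

At a sufficiently large distance the same depth step produces a disparity below any fixed detection threshold, which is the situation described above for very distant surfaces.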

$\phi_r$ → $\theta_r$ → $\theta_r$ rather than on $\cos(\theta_r)$), the estimated marginal slant distribution is dramatically different. The slant prior distribution computed without an area-preserving projection has effectively zero probability mass near zero. Taken at face value, such a result would lead to the erroneous conclusion that surfaces with slants near zero almost never occur in natural scenes (Figure 15C, gray curves).

*maximum a posteriori* (MAP) estimates, is less appropriate for many estimation tasks because it does not give credit for being close to the correct estimate. Another limitation of this cost function is that it requires characterizing the posterior distributions sufficiently to determine the mode, which, because of data limitations, would be impossible without strong assumptions about the form of the joint distribution. However, MAP estimates are appropriate for other tasks such as recognition of specific objects or faces, in which close does not count. Also, if the likelihood distributions are symmetrical about the peak (e.g., Gaussian), the MAP and MMSE estimates are the same. Finally, for certain strong assumptions (e.g., statistical independence of the cue distributions), it is widely believed that MAP estimators are more biologically plausible.
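The relationship between the two estimators can be illustrated with a toy discrete posterior: when the distribution is symmetrical about its peak, the mode (MAP) and the mean (MMSE) coincide, and when it is skewed they separate. The sketch treats the estimated variable as an ordinary linear quantity rather than a circular one, and the probability values are invented for illustration.

```python
def posterior_mean(values, probs):
    """MMSE estimate: the mean of the posterior."""
    return sum(v * p for v, p in zip(values, probs))

def posterior_mode(values, probs):
    """MAP estimate: the value with the highest posterior probability."""
    best = max(range(len(probs)), key=lambda k: probs[k])
    return values[best]

values = [0.0, 10.0, 20.0, 30.0, 40.0]
symmetric = [0.10, 0.20, 0.40, 0.20, 0.10]  # symmetrical about the peak
skewed = [0.40, 0.30, 0.15, 0.10, 0.05]     # mass piled at one end

for name, probs in (("symmetric", symmetric), ("skewed", skewed)):
    print(name, posterior_mode(values, probs), posterior_mean(values, probs))
```

In the skewed case the mode sits at the edge of the range while the mean moves toward the bulk of the probability mass, which is the sense in which the MAP estimate gives no credit for being close.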

$\bar{f}$ is computed. The result of this computation at each pixel location is a centroid-frequency image. Finally, we compute the gradient of the centroid-frequency image and define the tilt cue as the orientation of the centroid gradient:

*Current Biology*, 14, 257–262. Retrieved from http://doi.org/10.1016/j.cub.2004.01.029

*Perception*, 28, 217–242.

*Journal of Neuroscience*, 24, 2077–2089. Retrieved from http://doi.org/10.1523/JNEUROSCI.3852-02.2004

*Vision Research*, 33, 1723–1737.

*Journal of Neuroscience*, 30, 7269–7280. Retrieved from http://doi.org/10.1523/JNEUROSCI.5551-09.2010

*Proceedings of the National Academy of Sciences, USA*, 108, 16849–16854. Retrieved from http://doi.org/10.1073/pnas.1108491108

*Optimal defocus estimates from individual images for autofocusing a digital camera*. Presented at the Proceedings of the IS&T/SPIE 47th Annual Meeting, Proceedings of SPIE. Retrieved from http://doi.org/10.1117/12.912066

*Journal of Vision*, 14(2):1, 1–18, doi:10.1167/14.2.1.

*Nature Communications*, 6, 7900. Retrieved from http://doi.org/10.1038/ncomms8900

*Journal of Neuroscience*, 30, 7714–7721. Retrieved from http://doi.org/10.1523/JNEUROSCI.6427-09.2010

*Data fusion for sensory information processing systems*. Boston: Kluwer Academic Publishers.

*IEEE Transactions on Pattern Analysis and Machine Intelligence*, 24, 536–549.

*Nature*, 415, 429–433. Retrieved from http://doi.org/10.1038/415429a

*Proceedings of the National Academy of Sciences, USA*, 108, 20438–20443. Retrieved from http://doi.org/10.1073/pnas.1114619109/-/DCSupplemental

*Journal of Vision*, 4(9):10, 798–820, doi:10.1167/4.9.10.

*Proceedings of the British Machine Vision Conference*, 71.1–71.10, doi:10.5244/C.21.71.

*Journal of Vision*, 9(13):17, 1–16, doi:10.1167/9.13.17.

*Journal of Vision*, 5(11):7, 1013–1023, doi:10.1167/5.11.7.

*Science*, 298, 1627–1630.

*Journal of Vision*, 4(12):1, 967–992, doi:10.1167/4.12.1.

*Journal of Vision*, 16(12):1413, doi:10.1167/16.12.1413 [Abstract].

*Vision Research*, 38, 2635–2656.

*Vision Research*, 38, 1655–1682.

*Journal of Vision*, 7(7):5, 1–24, doi:10.1167/7.7.5.

*Vision Research*, 43, 2539–2558. Retrieved from http://doi.org/10.1016/S0042-6989(03)00458-9

*Vision Research*, 35, 389–412.

*Computer Vision—ECCV'94*, 351–364.

*International Journal of Computer Vision*, 23(2), 149–168.

*Trends in Cognitive Sciences*, 2, 288–295.

*International Journal of Computer Vision*, 76, 165. Retrieved from http://doi.org/10.1007/s11263-007-0048-x

*Die Naturwissenschaften*, 70, 272–281.

*Acta Psychologica*, 63, 63–85.

*Journal of Neurophysiology*, 110, 190–203. Retrieved from http://doi.org/10.1152/jn.01055.2012

*Journal of Vision*, 10(11):15, 1–11, doi:10.1167/10.11.15.

*Journal of Neuroscience*, 24, 2065–2076. Retrieved from http://doi.org/10.1523/JNEUROSCI.3887-03.2004

*Journal of Experimental Psychology*, 44, 253–259. Retrieved from http://doi.org/10.1037/h0057643

*Psychological Science*, 19, 77–84. Retrieved from http://doi.org/10.1111/j.1467-9280.2008.02049.x

*Cognitive Psychology*, 25, 383–429. Retrieved from http://www.sciencedirect.com/science/article/pii/S0010028583710108

*Journal of the Optical Society of America. A, Optics, Image Science, and Vision*, 20, 1292–1303.

*Proceedings of the National Academy of Sciences, USA*, 111, 18043–18048. Retrieved from http://doi.org/10.1073/pnas.1421131111

*Journal of Neuroscience*, 33, 19352–19361. Retrieved from http://doi.org/10.1523/JNEUROSCI.3174-13.2013

*Journal of Neurophysiology*, 107, 2109–2122. Retrieved from http://doi.org/10.1152/jn.00578.2011

*International Journal of Computer Vision*, 76, 53–69. Retrieved from http://doi.org/10.1007/s11263-007-0071-y

*Proceedings of the International Joint Conference on Artificial Intelligence*, 20, 2197–2203.

*Perception & Psychophysics*, 33, 241–250. Retrieved from http://doi.org/10.3758/BF03202860

*Journal of Neurophysiology*, 86, 2856–2867.

*Vision Research*, 18, 101–105.

*International Journal of Computer Vision*, 27, 203–225.

*Nature Neuroscience*, 8, 820–827. Retrieved from http://doi.org/10.1038/nn1461

*Network (Bristol, England)*, 14, 371–390.

$\phi$ is a vector of cue values, $\bar{\Phi}_r$ is the mean angle, and $\bar{A}_r$ is the length of the mean vector. The mean angle is given by the argument of the vector sum and the length of the mean vector is given by the complex absolute value of the mean vector.

$\bar{\Phi}$ is the mean, $\kappa$ determines the circular variance, $\sigma^2 = 1 - I_1(\kappa)/I_0(\kappa)$, and $I_0(\kappa)$ and $I_1(\kappa)$ are modified Bessel functions of orders zero and one. Note that given an estimate of $\sigma^2$, we can obtain an estimate of $\kappa$ by solving the equation $I_1(\hat{\kappa})/I_0(\hat{\kappa}) = 1 - \hat{\sigma}^2$.
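The equation relating $\hat{\kappa}$ to $\hat{\sigma}^2$ has no closed-form solution, but the ratio $I_1(\kappa)/I_0(\kappa)$ increases monotonically from 0 toward 1, so a simple bisection search recovers $\kappa$. The sketch below evaluates the Bessel functions from their power series to stay dependency-free. The function names, the search bounds, and the iteration count are assumptions, and a library routine for the modified Bessel functions would serve equally well.

```python
import math

def bessel_i(order, x, tol=1e-16, max_terms=2000):
    """Modified Bessel function I_order(x) for x >= 0 via its power series."""
    term = (x / 2.0) ** order / math.factorial(order)
    total = term
    q = x * x / 4.0
    for m in range(1, max_terms):
        term *= q / (m * (m + order))
        total += term
        if term <= tol * total:
            break
    return total

def circular_variance(kappa):
    """Circular variance of a von Mises distribution: 1 - I1(k)/I0(k)."""
    return 1.0 - bessel_i(1, kappa) / bessel_i(0, kappa)

def kappa_from_variance(var, lo=0.0, hi=500.0, iters=200):
    """Solve I1(k)/I0(k) = 1 - var by bisection (variance falls as k grows)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if circular_variance(mid) > var:
            lo = mid   # variance still too large, so kappa must be larger
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Round trip for an illustrative concentration value.
var_8 = circular_variance(8.0)
kappa_back = kappa_from_variance(var_8)
print(var_8, kappa_back)
```

The round trip returns the concentration it started from, which is a quick check that the solver and the variance formula are consistent with each other.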

$\sigma_1$ is the standard deviation of the first Gaussian, $\sigma_2$ is the standard deviation of the second Gaussian, and $\alpha$ is a mixing parameter, which is constrained to lie on [0, 1]. The best-fit values are $\sigma_1 = 10^\circ$, $\sigma_2 = 42^\circ$, and $\alpha = 0.5$. (Note that a mixture of von Mises distributions also provides a good approximation to the slant prior, where $\kappa_1$ and $\kappa_2$ are the concentration parameters of the two distributions. The best-fit values are $\kappa_1 = 8$, $\kappa_2 = 0.8$, and $\alpha = 0.5$.)

$\bar{L}$ is the local mean of the left image, $\bar{R}$ is the local mean of the right image, and $g(x, y)$ is a Gaussian window. Negative disparities indicate uncrossed disparities; positive disparities indicate crossed disparities. The estimated disparity was the offset that maximized the correlation between the left- and right-eye patches.
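A one-dimensional sketch of disparity estimation by windowed correlation is given below. For each candidate offset it subtracts the window-weighted local means from the left and right patches, computes a Gaussian-weighted normalized correlation, and keeps the offset with the highest value. The exact correlation formula of the paper is not reproduced here; the function names, window size, search range, and toy data are assumptions, and the mapping from the sign of the offset to crossed versus uncrossed disparity is left open.

```python
import math

def gaussian_window(half_width, sigma):
    """Unnormalized Gaussian weights over 2 * half_width + 1 samples."""
    return [math.exp(-0.5 * (k / sigma) ** 2)
            for k in range(-half_width, half_width + 1)]

def patch(signal, center, half_width):
    return signal[center - half_width: center + half_width + 1]

def weighted_correlation(a, b, w):
    """Window-weighted correlation after removing weighted local means."""
    sw = sum(w)
    mean_a = sum(wi * ai for wi, ai in zip(w, a)) / sw
    mean_b = sum(wi * bi for wi, bi in zip(w, b)) / sw
    cov = sum(wi * (ai - mean_a) * (bi - mean_b) for wi, ai, bi in zip(w, a, b))
    var_a = sum(wi * (ai - mean_a) ** 2 for wi, ai in zip(w, a))
    var_b = sum(wi * (bi - mean_b) ** 2 for wi, bi in zip(w, b))
    if var_a == 0.0 or var_b == 0.0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)

def estimate_offset(left, right, center, window, max_offset):
    """Offset (in samples) that maximizes the left/right patch correlation."""
    hw = len(window) // 2
    ref = patch(left, center, hw)
    return max(range(-max_offset, max_offset + 1),
               key=lambda d: weighted_correlation(
                   ref, patch(right, center + d, hw), window))

def toy_row(n, seed=7):
    """Deterministic pseudo-random luminance row (toy data only)."""
    x, out = seed, []
    for _ in range(n):
        x = (1103515245 * x + 12345) % 2 ** 31
        out.append(x / 2 ** 31)
    return out

left = toy_row(60)
TRUE_SHIFT = 3
right = [left[i - TRUE_SHIFT] for i in range(len(left))]  # circularly shifted copy
offset = estimate_offset(left, right, center=30,
                         window=gaussian_window(6, 3.0), max_offset=8)
print(offset)
```

Subtracting the local means before correlating makes the match insensitive to overall luminance differences between the two eyes' patches, and the Gaussian weights emphasize samples near the location being analyzed.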