**Estimating an accurate and naturalistic dense depth map from a single monocular photographic image is a difficult problem. Nevertheless, human observers have little difficulty understanding the depth structure implied by photographs. Two-dimensional (2D) images of the real-world environment contain significant statistical information regarding the three-dimensional (3D) structure of the world that the vision system likely exploits to compute perceived depth, monocularly as well as binocularly. Toward understanding how this might be accomplished, we propose a Bayesian model of monocular depth computation that recovers detailed 3D scene structures by extracting reliable, robust, depth-sensitive statistical features from single natural images. These features are derived using well-accepted univariate natural scene statistics (NSS) models and recent bivariate/correlation NSS models that describe the relationships between 2D photographic images and their associated depth maps. This is accomplished by building a dictionary of canonical local depth patterns from which NSS features are extracted as prior information. The dictionary is used to create a multivariate Gaussian mixture (MGM) likelihood model that associates local image features with depth patterns. A simple Bayesian predictor is then used to form spatial depth estimates. The depth results produced by the model, despite its simplicity, correlate well with ground-truth depths measured by a current-generation terrestrial light detection and ranging (LIDAR) scanner. Such a strong form of statistical depth information could be used by the visual system when creating overall estimated depth maps incorporating stereopsis, accommodation, and other conditions. Indeed, even in isolation, the Bayesian predictor delivers depth estimates that are competitive with state-of-the-art “computer vision” methods that utilize highly engineered image features and sophisticated machine learning algorithms.**

Natural scene statistics (NSS) models have been shown to provide good descriptions of the statistical laws that govern the behavior of the 3D world and 2D images of it. NSS models have proven to be deeply useful tools both for understanding the evolution of human vision systems (HVS; Olshausen & Field, 1996; Simoncelli & Olshausen, 2001) and for modeling diverse visual problems (Portilla, Strela, Wainwright, & Simoncelli, 2003; Tang, Joshi, & Kapoor, 2011; Wang & Bovik, 2011; Bovik, 2013). In particular, prior work has explored the 3D NSS of depth/disparity maps of the world, how they correlate with 2D luminance/color NSS, and how such models can be applied. For example, Potetz and Lee (2006) examined the relationships between luminance and range over multiple scales and applied their results to a shape-from-shading problem. Y. Liu, Cormack, and Bovik (2011) explored the statistical relationships between luminance and disparity in the wavelet domain, and applied the derived models to improve a canonical Bayesian stereo algorithm. Su, Cormack, and Bovik (2013) proposed new models of the marginal and conditional statistical distributions of the luminances/chrominances and the disparities/depths associated with natural images, and used these models to significantly improve a chromatic Bayesian stereo algorithm. Recently, Su, Cormack, and Bovik (2014b, 2015a) developed new bivariate and correlation NSS models that capture the dependencies between spatially adjacent bandpass responses like those of area V1 neurons, and applied them to model both natural images and depth maps. The authors further utilized these models to create a blind 3D perceptual image quality model (Su, Cormack, & Bovik, 2015b) that operates on distorted stereoscopic image pairs. An algorithm derived from this model was shown to deliver quality predictions that correlate very highly with recorded human subjective judgments of 3D picture quality.

[…] $\mathbf{x} \in \mathbb{R}^{N}$, […] $y \in \mathbb{R}^{+}$. Note that when […]

[…] $\mathbf{x} \in \mathbb{R}^{N}$, which may embed dependencies in $\mathbf{x} \in \mathbb{R}^{N}$ (i.e., the spatially neighboring bandpass image responses). In order to capture these second-order statistics, we adopt a closed-form correlation model, which is described in detail in the next subsection, to extract the corresponding NSS features. In our implementation, we model the bivariate empirical histograms of horizontally adjacent subband responses of each image patch using a bivariate generalized Gaussian distribution (BGGD) with $\mathbf{x} \in \mathbb{R}^{2}$, and estimate the BGGD model parameters using the maximum likelihood estimator (MLE) algorithm described in (Su, Cormack, & Bovik, 2014a). In our case, the scatter matrix […]

[…] $\theta_{2} - \theta_{1} = k\pi$, $k \in \mathbb{Z}$, yielding a three-parameter exponentiated cosine model: […]

[…] $\left[\, \big(D(x+1, y) - D(x-1, y)\big) \;\; \big(D(x, y+1) - D(x, y-1)\big) \,\right]^{\top}$
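The expression above pairs a horizontal and a vertical central difference of the depth map $D$. A minimal sketch of that computation follows. The function name `depth_gradient` and the row-major list layout (`D[y][x]`) are assumptions made for illustration, and no scaling factor is applied because none appears in the expression.

```python
def depth_gradient(D, x, y):
    """Return [D(x+1, y) - D(x-1, y), D(x, y+1) - D(x, y-1)] at an interior pixel.

    D is a depth map stored as a list of rows, so D[y][x] is the depth at
    column x and row y. This layout is an assumption of the sketch.
    """
    height, width = len(D), len(D[0])
    if not (1 <= x <= width - 2 and 1 <= y <= height - 2):
        raise ValueError("central differences need an interior pixel")
    dx = D[y][x + 1] - D[y][x - 1]  # horizontal central difference
    dy = D[y + 1][x] - D[y - 1][x]  # vertical central difference
    return [dx, dy]


# Toy planar depth map D(x, y) = 2x + 3y, so the differences are constant.
depth = [[2.0 * x + 3.0 * y for x in range(5)] for y in range(4)]
grad = depth_gradient(depth, 2, 1)
print(grad)  # [4.0, 6.0]
```

On a planar depth map the two differences are twice the slopes along each axis, which is a convenient sanity check for the indexing convention.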

[…] $\mathbf{x} = \mathbf{f}_{I} \in \mathbb{R}^{K}$. For each canonical depth pattern, an MGM model is created using the feature vectors extracted from all of the image patches within the pattern. Therefore, the likelihood of encountering an image patch with a specific extracted feature […]

[…] *y*-coordinate, […]
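To make the role of the per-pattern MGM likelihood concrete, the following sketch evaluates a Gaussian mixture for each canonical depth pattern and combines it with a prior through Bayes' rule. It is an illustration under stated assumptions, not the model fitted in this work: the diagonal covariances, the function names, and the toy two-pattern parameters are all hypothetical.

```python
import math


def diag_gaussian_pdf(f, mean, var):
    """Density of a Gaussian with diagonal covariance at feature vector f."""
    log_p = 0.0
    for fi, mi, vi in zip(f, mean, var):
        log_p += -0.5 * math.log(2.0 * math.pi * vi) - 0.5 * (fi - mi) ** 2 / vi
    return math.exp(log_p)


def mixture_likelihood(f, components):
    """p(f | pattern): weighted sum over (weight, mean, var) components."""
    return sum(w * diag_gaussian_pdf(f, m, v) for w, m, v in components)


def pattern_posterior(f, mixtures, priors):
    """Bayes' rule over patterns: p(n | f) is proportional to p(f | n) p(n)."""
    joint = [mixture_likelihood(f, mix) * p for mix, p in zip(mixtures, priors)]
    total = sum(joint)
    return [j / total for j in joint]


# Hypothetical model: two depth patterns, K = 2 features per patch.
mixtures = [
    [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [1.0, 1.0], [1.0, 1.0])],
    [(1.0, [4.0, 4.0], [1.0, 1.0])],
]
priors = [0.5, 0.5]

posterior = pattern_posterior([0.2, 0.1], mixtures, priors)
best_pattern = max(range(len(posterior)), key=posterior.__getitem__)
```

Here `best_pattern` is the index of the most probable pattern for the query feature. A full predictor would go on to map that pattern and the feature to a depth estimate, which this sketch does not attempt.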

[…] which reflects the performance consistencies of the examined depth estimation algorithms. Natural3D delivers more consistent performance in terms of Log10, while providing similar or better Rel. and RMS performance than Depth Transfer.
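For reference, the three error metrics named above can be computed as in the sketch below. It assumes the definitions commonly used in the monocular depth estimation literature (mean relative error, mean absolute log10 error, and root-mean-square error); the function name `depth_errors` and the toy values are illustrative only.

```python
import math


def depth_errors(truth, estimate):
    """Return (Rel, Log10, RMS) for paired positive depth values."""
    n = len(truth)
    rel = sum(abs(d - e) / d for d, e in zip(truth, estimate)) / n
    log10 = sum(abs(math.log10(d) - math.log10(e))
                for d, e in zip(truth, estimate)) / n
    rms = math.sqrt(sum((d - e) ** 2 for d, e in zip(truth, estimate)) / n)
    return rel, log10, rms


# Toy example: one depth overestimated by a factor of ten, one exact.
truth = [10.0, 100.0]
estimate = [100.0, 100.0]
rel, log10_err, rms = depth_errors(truth, estimate)
```

Rel and Log10 are scale-normalized, so they weight near and far errors more evenly than RMS, which is dominated by large absolute deviations.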

[…] *k*-means algorithm for learning the depth prior. To demonstrate the influence of the number of canonical depth patterns on the performance of Natural3D, we trained and tested the algorithm using different numbers of clusters in the *k*-means algorithm, and plotted the three error metrics as a function of the number of canonical depth patterns in Figure 23. While the relative error drops slightly as the number of canonical depth patterns increases, the RMS error increases. This result suggests that, although using more canonical depth patterns may help in estimating relative distances between objects, the increased number of depth priors may result in inferior regression performance when estimating absolute distances. This result also agrees with our observation during the prior model development that five canonical depth patterns are the most common in natural environments. Using more than five clusters in the *k*-means algorithm may therefore produce redundant depth patterns: image features belonging to similar depth patterns may be assigned to different clusters, so the regression model for each redundant pattern is trained on incomplete image data, which degrades the estimation of absolute distances. As a result, to achieve the best depth estimation performance, we chose to use five canonical depth patterns: five clusters in the *k*-means algorithm.
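The clustering step discussed above can be illustrated with a small implementation of Lloyd's *k*-means algorithm. This is a sketch under assumptions, not the training code used here: the deterministic initialization, the function names, and the toy two-dimensional "patches" are hypothetical, and real depth patches would be flattened to longer vectors and clustered with `k=5`.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans(points, k, iterations=50):
    """Lloyd's algorithm; returns (centroids, labels)."""
    # Deterministic initialization for the sketch: pick k points spread
    # evenly over the input order.
    centroids = [list(points[i * len(points) // k]) for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                dim = len(members[0])
                new_centroids.append(
                    [sum(m[d] for m in members) / len(members) for d in range(dim)]
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, labels


# Toy data: two well-separated groups standing in for flattened depth patches.
patches = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]]
centroids, labels = kmeans(patches, k=2)
```

Each returned centroid plays the role of one canonical depth pattern, and `labels` records which pattern every training patch was assigned to.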

*Proceedings of the IEEE Winter Conference on Applications of Computer Vision*, 145–152.

*Proceedings of the IEEE*, 101 (9), 2008–2024.

*Data Mining and Knowledge Discovery*, 2, 121–167.

*ACM Transactions on Intelligent Systems and Technology (TIST)*, 2 (3): 27, 1–27. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm/.

*Pattern Recognition*, 22 (6), 707–717.

*Vision Research*, 34 (5), 607–620.

*Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition*, 1, 886–893.

*Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition*, 2, 2418–2428.

*Journal of the Royal Statistical Society, Series B*, 39 (1), 1–38.

*Vision Research*, 26 (6), 973–990.

*Advances in Neural Information Processing Systems*, 26, 2366–2374.

*IEEE Transactions on Pattern Analysis and Machine Intelligence*, 32 (9), 1627–1645.

*Journal of the Optical Society of America A*, 4 (12), 2379–2394.

*Philosophical Transactions of the Royal Society of London. Series A: Mathematical, Physical and Engineering Sciences*, 357 (1760), 2527–2542.

*Proceedings of the IEEE International Conference on Computer Vision*, 3392–3399.

*Proceedings of the Conference on Computer Vision and Pattern Recognition Workshop*, 15–22.

*Visual Neuroscience*, 9 (2), 181–197.

*ACM Transactions on Graphics*, 24 (3), 577–584.

*Vision Research*, 30 (12), 1955–1970.

*Computer Vision, Graphics, and Image Processing*, 47 (3), 292–326.

*Proceedings of the European Conference on Computer Vision*, 7576, 775–788.

*Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition*, 89–96.

*IEEE Journal of Selected Topics in Signal Processing*, 3 (2), 202–211.

*Proceedings of the IEEE International Conference on Computer Vision*, 683–691.

*Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition*, 1253–1260.

*IEEE Transactions on Image Processing*, 20 (9), 2515–2530.

*IEEE Transactions on Information Theory*, 28 (2), 129–137.

*Proceedings of the IEEE International Conference on Computer Vision*, 2, 1150–1157.

*Neural Computation*, 23, 2942–2973.

*International Journal of Computer Vision*, 48 (2), 75–90.

*International Journal of Computer Vision*, 23 (2), 149–168.

*IEEE Transactions on Pattern Analysis and Machine Intelligence*, 11 (7), 674–693.

*IEEE Transactions on Acoustics, Speech, and Signal Processing*, 37 (12), 2091–2110.

*Journal of the Society for Industrial & Applied Mathematics*, 11 (2), 431–441.

*IEEE Transactions on Image Processing*, 20 (12), 3350–3364.

*Proceedings of the IEEE International Conference on Image Processing*, 2, 561–564.

*International Journal of Computer Vision*, 42 (3), 145–175.

*Network: Computation in Neural Systems*, 7 (2), 333–339.

*Neural Computation*, 17 (8), 1665–1699.

*Proceedings of the IEEE International Conference on Computer Vision* (pp. 33–40). Piscataway, NJ: IEEE Publishing.

*International Journal of Computer Vision*, 40 (1), 49–70.

*IEEE Transactions on Image Processing*, 12 (11), 1338–1351.

*Journal of the Optical Society of America A*, 20 (7), 1292–1303.

*Advances in Neural Information Processing Systems*, 18, 1089–1096.

*SPIE International Conference on Human Vision and Electronic Imaging XV*, Vol. 7527.

*Network: Computation in Neural Systems*, 5 (4), 517–548.

*Advances in Neural Information Processing Systems*, 17, 1161–1168.

*IEEE Transactions on Pattern Analysis and Machine Intelligence*, 31 (5), 824–840.

*Neural Computation*, 12 (5), 1207–1245.

*Vision Research*, 19, 1303–1314.

*Bulletin of the Psychonomic Society*, 21 (6), 456–458.

*Nature Neuroscience*, 4, 819–825.

*IEEE Transactions on Pattern Analysis and Machine Intelligence*, 29 (3), 411–426.

*IEEE Transactions on Circuits and Systems for Video Technology*, 5 (1), 52–56.

*IEEE Transactions on Image Processing*, 15 (2), 430–444.

*Proceedings of the European Conference on Computer Vision* (Vol. 5, pp. 746–760). Berlin, Heidelberg: Springer-Verlag.

*Proceedings of SPIE, Wavelet Applications in Signal and Image Processing VII*, 3813, 188–195.

*IEEE International Conference on Image Processing*, 3, 444–447.

*Annual Review of Neuroscience*, 24 (1), 1193–1216.

*IEEE Global Conference on Signal and Information Processing*, 373–377.

*IEEE Transactions on Image Processing*, 22 (6), 2259–2274.

*Proceedings of SPIE, Human Vision and Electronic Imaging XIX*, 9014.

*Proceedings of the IEEE International Conference on Acoustic, Speech and Signal Processing*, 5362–5366.

*IEEE Signal Processing Letters*, 22 (1), 21–25.

*IEEE Transactions on Image Processing*, 24 (5), 1685–1699.

*Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition*, 305–312.

*IEEE Transactions on Pattern Analysis and Machine Intelligence*, 24 (24), 1226–1238.

*Nature*, 251, 140–142.

*Vision Research*, 15 (5), 583–590.

*Vergence eye movements: Basic and clinical aspects* (pp. 199–295). Oxford, UK: Butterworth-Heinemann.

*Probabilistic models of the brain: Perception and neural function* (pp. 203–222). Cambridge, MA: MIT Press.

*IEEE Signal Processing Magazine*, 28 (6), 29–40.

*IEEE Transactions on Pattern Analysis and Machine Intelligence*, 21 (8), 690–706.

*Journal of the Society for Information Display*, 5 (1), 61–63.