Open Access
Article  |   June 2018
Depth variation and stereo processing tasks in natural scenes
Author Affiliations
  • Arvind V. Iyer
    Department of Psychology, University of Pennsylvania, Philadelphia, PA, USA
    arvindiy@sas.upenn.edu
  • Johannes Burge
    Department of Psychology, University of Pennsylvania, Philadelphia, PA, USA
    Neuroscience Graduate Group, University of Pennsylvania, Philadelphia, PA, USA
    Bioengineering Graduate Group, University of Pennsylvania, Philadelphia, PA, USA
    jburge@sas.upenn.edu
    http://burgelab.psych.upenn.edu
Journal of Vision June 2018, Vol.18, 4. doi:https://doi.org/10.1167/18.6.4
Abstract

Local depth variation is a distinctive property of natural scenes, but its effects on perception have only recently begun to be investigated. Depth variation in natural scenes is due to depth edges between objects and surface nonuniformities within objects. Here, we demonstrate how natural depth variation impacts performance in two fundamental tasks related to stereopsis: half-occlusion detection and disparity detection. We report the results of a computational study that uses a large database of natural stereo-images and coregistered laser-based distance measurements. First, we develop a procedure for precisely sampling stereo-image patches from the stereo-images and then quantify the local depth variation in each patch by its disparity contrast. Next, we show that increased disparity contrast degrades half-occlusion detection and disparity detection performance and changes the size and shape of the spatial integration areas (“receptive fields”) that optimize performance. Then, we show that a simple image-computable binocular statistic predicts disparity contrast in natural scenes. Finally, we report the most likely spatial patterns of disparity variation and disparity discontinuities (half-occlusions) in natural scenes. Our findings motivate computational and psychophysical investigations of the mechanisms that underlie stereo processing tasks in local regions of natural scenes.

Introduction
An ultimate goal of perception science and systems neuroscience is to understand how sensory-perceptual processing works in natural conditions. In recent years, interest has increased in using natural stimuli for computational, psychophysical, and neurophysiological investigations (Adams et al., 2016; Burge, Fowlkes, & Banks, 2010; Burge & Geisler, 2011; Burge & Geisler, 2012; Burge & Geisler, 2014; Burge & Geisler, 2015; Burge & Jaini, 2017; Burge, McCann, & Geisler, 2016; Cooper & Norcia, 2015; Felsen & Dan, 2005; Field, 1987; Geisler & Perry, 2009; Geisler & Ringach, 2009; Geisler, Najemnik, & Ing, 2009; Hibbard, 2008; Hibbard & Bouzit, 2005; Jaini & Burge, 2017; Liu, Bovik, & Cormack, 2008; Maiello, Chessa, Solari, & Bex, 2014; Olshausen & Field, 1996; Potetz & Lee, 2003; Scharstein & Szeliski, 2003; Sebastian, Burge, & Geisler, 2015; Sprague, Cooper, Tosic, & Banks, 2015; van Hateren & van der Schaaf, 1998; Wilcox & Lakra, 2007; Yang & Purves, 2003). This burgeoning interest has been fueled by at least three factors. First, high-fidelity natural stimulus databases are becoming available for widespread scientific use. Second, powerful statistical, computational, and psychophysical methods are making natural stimuli increasingly tractable to work with. Third, and most importantly, the science requires it. Models of sensory and perceptual processing, from retina to behavior, that predict neurophysiological and behavioral performance with artificial stimuli often generalize poorly to natural stimuli (Felsen & Dan, 2005; Foster, 2011; Heitman et al., 2016; Kim & Burge, 2018; Talebi & Baker, 2012). High-quality measurements of natural scenes and images are needed to ground models in the data that visual systems evolved to process. 
How the visual system estimates the three-dimensional structure of the environment is one of the most intensely studied questions in vision. The paradigmatic depth cue is binocular disparity, our most precise depth cue. Stereopsis is the perception of depth based on binocular disparity (Cumming & DeAngelis, 2001; Gonzalez & Perez, 1998). In the vision community, stereopsis and the estimation of binocular disparity (i.e., solving the correspondence problem) have been investigated primarily with artificial images (but see also Burge & Geisler, 2014; Hibbard, 2008). Researchers are developing psychophysical paradigms for using natural stimuli to investigate stereopsis, and computational analyses for uncovering the disparity processing mechanisms that optimize performance. Several natural stereo-image databases, some of which are accompanied by groundtruth distance measurements, have been released in recent years (Adams et al., 2016; Burge et al., 2016; Canessa et al., 2017; Scharstein & Szeliski, 2002). Research with natural stimuli is aided by methods for assigning accurate groundtruth labels to sampled stimuli. Sampling accuracy and precision must meet or exceed the precision of the human visual system; otherwise, observed performance limits may be confounded with inaccuracies in the sampling procedure. 
The primary aim of this manuscript is to determine the impact of local depth variation on half-occlusion detection and disparity detection, two tasks fundamentally related to stereopsis. These tasks are equivalent to (a) determining whether a given point in one eye's image has or lacks a corresponding point in the other eye's image (i.e., half-occlusion detection) and (b) if the point is binocularly visible, determining whether the second eye is foveating the same scene point as the first (i.e., disparity detection). Accurate performance in these tasks supports perception of depth order, da Vinci stereopsis, and fine stereo-depth discrimination (Blakemore, 1970; Cormack, Stevenson, & Schor, 1991; Kaye, 1978; Nakayama & Shimojo, 1990; Wilcox & Lakra, 2007). First, we develop a high-fidelity procedure for sampling stereo-image patches from natural stereo-images; we estimate that the procedure is as precise as the human visual system for all but the most sensitive conditions (Blakemore, 1970; Cormack et al., 1991). A MATLAB implementation of the procedure is available at http://www.github.com/BurgeLab/StereoImageSampling. Second, we show that local depth variation systematically degrades performance in both tasks and changes the size and shape of the integration area that optimizes performance in both tasks. Then, we examine how luminance and disparity covary in natural scenes and show how local depth variation can be estimated directly from stereo-images. Finally, we report the most likely spatial patterns of disparity variation and disparity discontinuities (half-occlusions) in natural scenes. 
Results
To analyze the impact of natural depth variation on half-occlusion detection and binocular disparity detection in natural scenes, it is useful to sample a large collection of binocular image patches with groundtruth depth information. In most stereo-photographs of natural scenes, groundtruth information about the 3D-coordinates of the imaged surfaces is unavailable. In most computer-graphics-generated scenes, groundtruth information about the 3D scene is available, but it is unknown whether those scenes accurately reflect all task-relevant aspects of natural scenes and images. Therefore, it is important to obtain natural stereo-image databases accompanied by the 3D-coordinates of each imaged surface. Provided the 3D scene data are of sufficiently high quality, groundtruth binocular disparities (and corresponding points) can be computed from the 3D data using trigonometric instead of image-based methods. 
Recently, Burge et al. (2016) published a large database of calibrated stereo-images of natural scenes with precisely coregistered (±1 pixel) laser-based measurements of the groundtruth distances to the imaged objects in the scene. The laser-based distance measurements were obtained with a range scanner. During acquisition of each eye's view of the scene, the nodal points of the camera and the range scanner were positioned at identical locations. This feature of the data acquisition process ensured that each pixel in each eye's photographic image had a matched pixel in the associated range image, and vice versa. The current manuscript uses this dataset. 
Interpolating binocular corresponding points from groundtruth distance data
In this section, we introduce a new interpolation-based procedure for precisely sampling binocular image patches from stereo-images of natural scenes. The same procedure can also be used to determine whether a given point in one eye's image has, or lacks, a corresponding point in the other eye's image. Left- and right-eye image points are corresponding image points if they correspond to the same surface point in a 3D scene. Accurate, precise determination of corresponding image points is necessary for accurate, precise sampling of binocular image patches. In natural stereo-images, corresponding image points are usually estimated via image-based methods such as local cross-correlation (Banks, Gepshtein, & Landy, 2004; Cormack et al., 1991; Tyler & Julesz, 1978). We use our new procedure, along with the Burge et al. (2016) dataset, to determine groundtruth corresponding points directly from the coregistered distance data. Importantly, this procedure does not rely on image-based matching routines. 
To obtain binocular image patches such that the center pixel of each eye's patch coincides with corresponding image points, a two-stage interpolation procedure is required. First, corresponding image point locations are interpolated using ray-tracing techniques. Second, to protect against the effects of binocular sampling error, the luminance and range images are interpolated to obtain stereo-image patches in which the center pixels of the left- and right-eye images coincide with corresponding image point locations. 
Sampling a pixel center from either the left- or the right-eye luminance image initializes the interpolation procedure. The eye from whose image the pixel center is first chosen is the anchor eye. Each pixel is located in a frontoparallel projection plane 3 m from the cyclopean eye (i.e., the midpoint of the interocular axis). Left-eye (LE) and right-eye (RE) lines of sight through the centers of these pixels define a set of intersection points in 3D space (Figure 1A). These intersection points are the sampled 3D scene points. When a point on a 3D surface coincides with a sampled 3D scene point, the left- and right-eye lines of sight to this point intersect the projection plane at pixel centers (Figure 1A). However, most sampled 3D scene points do not have a 3D surface passing through them, and most 3D surface points do not coincide with sampled 3D scene points. Thus, corresponding image points do not generally coincide with pixel centers in the projection plane. The goal of our interpolation procedure is to interpolate 3D surface points and corresponding image points so that the postinterpolation pixel centers coincide with corresponding image points. 
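The geometry just described can be made concrete with a short sketch. The following Python snippet (a minimal illustration, not the published MATLAB implementation) computes a sampled 3D scene point as the intersection of the left- and right-eye lines of sight through two pixel centers, restricted to the 2D horizontal (top-view) plane of Figure 1A; the 65 mm interocular distance and 3 m projection-plane distance are taken from the text and figure, and all variable names are illustrative.

```python
# Minimal top-view sketch (cf. Figure 1A); illustrative, not the published MATLAB code.
# Works in the 2D horizontal plane: x = lateral position (m), z = distance ahead (m).
import numpy as np

IPD = 0.065                      # human interocular distance (m), from Figure 1
LE = np.array([-IPD / 2, 0.0])   # left-eye nodal point
RE = np.array([+IPD / 2, 0.0])   # right-eye nodal point
Z_PLANE = 3.0                    # projection-plane distance from the cyclopean eye (m)

def sampled_scene_point(x_pix_L, x_pix_R):
    """Intersection of the LE line of sight through pixel x_pix_L and the RE line
    of sight through pixel x_pix_R (pixel positions are lateral coordinates, in m,
    within the projection plane)."""
    dL = np.array([x_pix_L, Z_PLANE]) - LE   # LE line-of-sight direction
    dR = np.array([x_pix_R, Z_PLANE]) - RE   # RE line-of-sight direction
    # Solve LE + t*dL = RE + s*dR for (t, s).
    t, s = np.linalg.solve(np.column_stack([dL, -dR]), RE - LE)
    return LE + t * dL                       # sampled 3D scene point (x, z)

# Example: crossed lines of sight intersect in front of the projection plane.
print(sampled_scene_point(0.005, -0.005))    # approximately (0.0, 2.6)
```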
Figure 1
 
Stereo 3D sampling geometry, corresponding image-points, and interpolation procedure. (A) Top-view of 3D sampling geometry. Left-eye (LE) and right-eye (RE) luminance and range images are captured one human interocular distance apart (65 mm). Sampled 3D scene points (white squares) occur at the intersections of LE and RE lines of sight (thin lines) and usually do not lie on 3D surfaces. Samples in the projection plane (i.e., pixel centers) are a subset of these sampled 3D scene points. Sampled 3D surface points (white dots) occur at the intersections of LE or RE lines of sight with 3D surfaces (thick black curve) in the scene. Small arrows along lines of sight represent light reflected from sampled 3D surface points that determine the pixel values in the luminance and range images for each eye. Occasionally, sampled 3D surface points coincide with sampled 3D scene points (large dashed circles). Light rays from these points intersect the projection plane at pixel centers. (B) Procedure to obtain corresponding image point locations: Sample a pixel location (1) in the anchor eye's image (here, the left eye). Locate the corresponding sampled left eye 3D surface point (2). Find the right eye projection (3) from sampled 3D surface point by ray tracing. Select nearest pixel center (4) in right eye image. Locate the corresponding sampled right eye 3D surface point (5). Find sampled 3D scene point (6) nearest the left- and right-eye sampled 3D surface points. This sampled 3D scene point is the intersection point of the left- and right-eye lines of sight through the sampled 3D surface points. Find interpolated 3D surface point (7) by linear interpolation (i.e., the location of the intersection of cyclopean line of sight with chord joining sampled 3D surface points; see inset). Dashed light rays from this interpolated 3D surface point define corresponding point locations (8) in the projection plane. 
The vergence demand \(\theta\) of the interpolated scene point is the angle between the left- and right-eye lines of sight required to fixate the point. (C) Sampling error before interpolation in arcmin. Dashed vertical lines indicate the expected sampling error assuming surface point locations are uniformly distributed between sampled 3D scene points. (D) Estimated sampling error after interpolation in arcsec.
Figure 1B illustrates how interpolated 3D surface and corresponding image point locations are obtained. Consider a pair of LE and RE pixel centers that correspond to a sampled 3D scene point. Sampled 3D scene points (Figure 1B, open squares in scene) do not generally coincide with sampled 3D surface points (Figure 1B, open circles). Thus, the luminance information in these pixels (Figure 1B, open squares in projection plane) does not generally correspond to a single point on a 3D surface. The interpolated surface point (Figure 1B, black circle) occurs at the intersection between the cyclopean line of sight and a line segment connecting sampled 3D surface points. This interpolated 3D surface point, unlike the sampled 3D scene point, lies on (or extremely near) a 3D surface. The LE and RE lines of sight to the interpolated 3D surface point intersect the projection plane at corresponding image points (Figure 1B, black squares). 
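As a rough illustration of this interpolation step (again in the 2D top-view plane, with illustrative names; not the authors' implementation), the interpolated 3D surface point can be computed as the intersection of the cyclopean line of sight with the chord joining the two sampled 3D surface points:

```python
# Illustrative 2D (top-view) sketch of the interpolation step; not the authors' code.
import numpy as np

def interpolate_surface_point(cyclopean_eye, pixel, surf_L, surf_R):
    """Intersect the cyclopean line of sight through `pixel` (in the projection
    plane) with the chord joining the LE- and RE-sampled 3D surface points."""
    d = pixel - cyclopean_eye              # cyclopean line-of-sight direction
    chord = surf_R - surf_L                # chord joining the sampled surface points
    # Solve cyclopean_eye + t*d = surf_L + u*chord for (t, u).
    t, u = np.linalg.solve(np.column_stack([d, -chord]), surf_L - cyclopean_eye)
    assert 0.0 <= u <= 1.0, "intersection should lie between the sampled surface points"
    return cyclopean_eye + t * d           # interpolated 3D surface point
```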
This interpolation procedure is necessary to ensure that binocular sampling errors are below human disparity detection thresholds. Under optimal conditions, human disparity detection thresholds are approximately 5 arcsec (Blakemore, 1970; Cormack et al., 1991). Failing to interpolate would result in ±35 arcsec binocular sampling errors (i.e., erroneous fixation disparities), which are large relative to disparity detection thresholds. Assuming that surfaces are uniformly distributed between sampled 3D scene points, the vergence demand difference between the interpolated 3D surface point and the nearest sampled 3D scene point should be uniformly distributed. Figure 1C confirms this prediction; the vergence demand differences indeed tend to lie between ±35 arcsec, indicating that the assumptions of the interpolation procedure are valid. (Vergence demand \(\theta\) is the angle between the LE and RE lines of sight required to fixate a given 3D point; vergence demand difference \(\Delta \theta = {\theta _2} - {\theta _1}\) is the difference between two vergence demands.) 
Unfortunately, interpolated corresponding image points returned by this procedure are not guaranteed to be true corresponding image points. If a sampled surface point is half-occluded, then corresponding image points do not exist, and the procedure returns invalid points. We screen for invalid points by performing the interpolation procedure twice, with a different eye as the anchor eye on each repeat. When the interpolated 3D surface points from the two anchor eyes match, their associated vergence demands also match, indicating that the interpolated corresponding points are valid. Figure 1D shows that after interpolation, approximately 80% of interpolated 3D surface points had vergence demand differences of less than ±5 arcsec across repeats. For subsequent analyses of binocularly visible scene points, interpolated points with vergence demand differences larger than ±5 arcsec are discarded, ensuring that residual sampling errors are smaller than human stereo-detection thresholds for all but the most sensitive conditions (Blakemore, 1970; Cormack et al., 1991). Visual inspection of hundreds of interpolated points corroborates the numerical results. 
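A minimal sketch of this screening criterion, under the assumption that vergence demand is computed as the angle between the two lines of sight to a candidate 3D point (function and variable names are illustrative):

```python
# Sketch of the screening criterion; function names and threshold handling are illustrative.
import numpy as np

IPD = 0.065
LE = np.array([-IPD / 2, 0.0, 0.0])   # left-eye nodal point (m)
RE = np.array([+IPD / 2, 0.0, 0.0])   # right-eye nodal point (m)

def vergence_demand(point_3d):
    """Vergence demand: angle (radians) between the LE and RE lines of sight to a 3D point."""
    vL, vR = point_3d - LE, point_3d - RE
    cos_theta = vL @ vR / (np.linalg.norm(vL) * np.linalg.norm(vR))
    return np.arccos(np.clip(cos_theta, -1.0, 1.0))

def is_valid_corresponding_point(surf_left_anchor, surf_right_anchor, criterion_arcsec=5.0):
    """Keep a sample only if the vergence demands of the interpolated surface points
    obtained with the left and right eyes as anchor agree to within the criterion."""
    d_theta = vergence_demand(surf_left_anchor) - vergence_demand(surf_right_anchor)
    return abs(np.degrees(d_theta) * 3600.0) < criterion_arcsec
```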
To understand why half-occluded points can yield invalid corresponding image points, and why vergence demand differences can help screen for them, consider the scenario depicted in Figure 2A. When the left eye is the anchor eye, the left-eye image point is associated with a far surface point having vergence demand \({\theta _L}\), and the right-eye image point returned by the interpolation is invalid because no true corresponding point exists. When the right eye is the anchor eye, the same right-eye image point is associated with a near surface point having vergence demand \({\theta _R}\) not equal to \({\theta _L}\). In other words, the vergence demand difference \(\Delta \theta = {\theta _R} - {\theta _L}\) does not equal zero. Also note that when the right eye is the anchor eye, the left-eye image point (middle black square) returned by the procedure does not match the original left-eye image point. For cases in which the surface point is binocularly visible, both repeats of the interpolation procedure yield the same vergence demands, surface points, and interpolated image points (Figure 2B). The vergence demand of a surface point is computed in the epipolar plane defined by the surface point and the left- and right-eye nodal points (Figure 2C). 
Figure 2
 
Half-occluded scene points, binocularly visible scene points, and vergence demand. (A) Half-occluded 3D surface point. The scene point on the far surface (black circle) is visible to the left eye and occluded from the right eye. Arrows indicate the ray tracing performed by the interpolation routine (see Figure 1). Squares represent interpolated image points returned by the interpolation procedure. When 3D surface points are half-occluded, the interpolation procedure returns invalid points. (B) Binocularly visible surface point (black circle) and corresponding image points (black squares) in the projection plane. When the scene point is binocularly visible, the vergence demand \(\theta \) of the surface point is the same, regardless of which eye is used as the anchor eye. (C) Vergence demand is computed within the epipolar plane defined by a 3D surface point and the left- and right-eye nodal points.
Results of the sampling and interpolation procedure are depicted for two natural scenes (Figure 3A and B). Left- and right-eye luminance images (upper row) and range images (lower row) are shown. Five hundred randomly sampled corresponding image points, associated with 500 scene points, are overlaid onto each stereo-image; 250 were sampled with the left eye as the anchor eye, and 250 were sampled with the right eye as the anchor eye. Divergently-fuse the left two images or cross-fuse the right two images to see the scene and the corresponding points in stereo-3D. True corresponding image points (yellow) lie on the imaged surfaces in the 3D scene. Invalid interpolated points (red) are also shown. To protect against eye-specific biases in the subsequent analyses, surface points are sampled symmetrically about the sagittal plane of the head. 
Figure 3
 
Corresponding points overlaid on stereo-images (upper row) and coregistered groundtruth distance data (lower row) for two different scenes, (A) and (B). Divergently-fuse the left two images or cross-fuse the right two images to see the imaged scene in stereo-3D. True corresponding points (yellow dots) lie on imaged 3D surfaces. Candidate corresponding points that are half-occluded or are otherwise invalid (red dots) are also shown. For reference, the yellow boxes in (A) and (B) indicate 3° and 1° areas, respectively.
After corresponding points are determined, luminance and range values are interpolated on a uniform grid of pixels centered at the corresponding points. Left- and right-eye luminance and range stereo-patches are then cropped from the images. Maps of groundtruth disparity, relative to the center pixel, are then computed directly from the range images. 
Quantifying local depth variation with disparity contrast
The patterns of binocular disparities encountered by a behaving organism depend on the properties of objects in the environment and how the organism interacts with those objects. When an organism fixates a point on an object in a 3D scene, the point's image is formed on the left- and right-eye foveas. These images are the inputs to the organism's foveal disparity processing mechanisms. To a first approximation, if the fixated point lies on a planar frontoparallel surface, then the disparities of nearby points will be zero. However, when the fixated point lies on a curved, bumpy, and/or slanted surface, the disparities of nearby points will vary more substantially. When a depth edge is near the fixated point, dramatic changes in disparity can occur in the neighborhood of the fovea. 
To quantify local depth variation, we compute the disparity contrast associated with each stereo-pair that is centered on a binocularly visible scene point. Disparity contrast is the root-mean-squared (RMS) disparity relative to the center pixel in a local neighborhood  
\begin{equation}\tag{1} C_\delta = \sqrt{\frac{1}{N}\sum_{\mathbf{x} \in A} \delta(\mathbf{x})^2} \end{equation}
where \(\delta\) is the groundtruth relative disparity, \(A\) is the local spatial integration area, \(\mathbf{x}\) is the spatial location, and \(N\) is the total number of pixels in the local area. Example stereo-images with different amounts of disparity contrast are shown in Figure 4. The upper row shows the luminance stereo-image. The lower row shows the groundtruth disparity map, computed directly from the laser-measured distance data at each stereo-image pixel. Thus, each stereo-image patch corresponds to the left- and right-eye retinal image that would be formed if an observer fixated the surface point in the scene. The distribution of disparity contrast in natural scenes is shown in Supplementary Figure S1.  
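For concreteness, a minimal implementation of Equation 1 might look like the following; the optional weighting window is an assumption of this sketch, standing in for the cosine window described in the Methods.

```python
# Minimal sketch of Equation 1 (RMS disparity relative to the center pixel).
import numpy as np

def disparity_contrast(disparity_map, window=None):
    """RMS of the disparity (relative to the center pixel) within the integration area.
    disparity_map: relative disparities (arcmin), zero at the center pixel.
    window: nonnegative weights defining the integration area A (uniform if None)."""
    if window is None:
        window = np.ones_like(disparity_map)
    w = window / window.sum()
    return np.sqrt(np.sum(w * disparity_map ** 2))
```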
Figure 4
 
Natural stereo-image patches and corresponding groundtruth disparity maps, sampled from natural scenes. Free-fuse to see in stereo-3D. (A–D) Local disparity contrast \({C_\delta }\) (i.e., local depth variation) increases in the subplots from left to right. Groundtruth disparity at each pixel (bottom row) is computed directly from groundtruth distance data. Disparity contrast is computed under a window that defines the spatial integration area (see Methods). The colorbar indicates the disparity in arcmin relative to the disparity (i.e., vergence demand) at the center pixel.
Half-occlusion detection in natural stereo-images
Half-occlusion detection is the task of detecting if a particular scene point visible to one eye is occluded to the other eye. This task is equivalent to determining whether a given point in one eye's image lacks or has a corresponding point in the other eye. Half-occlusion detection is important because disparity is defined only when a given point is binocularly visible, and because half-occluded points can mediate da Vinci stereopsis (Harris & Wilcox, 2009; Kaye, 1978; Nakayama & Shimojo, 1990). 
What image cues provide information about whether scene points are half-occluded or binocularly visible, and how does local depth variation impact this information? First, consider half-occluded scene points (Figure 5A). If the eyes are verged on (i.e., pointed towards) a half-occluded point (see Figure 2A), the scene point at the center of one eye's image differs from the scene point at the center of the other eye's image, so the left- and right-eye images should be very different (Figure 5B). Now, consider binocularly visible scene points. If the eyes are verged (i.e., fixated) on a binocularly visible scene point, the left- and right-eye images should be very similar. However, if local depth variation near a binocularly visible scene point is high, left- and right-eye images centered on that point should be less similar. 
Figure 5
 
Effect of disparity contrast on half-occlusion performance. (A) Three example stereo-image patches centered on scene points that are half-occluded to the left eye, binocularly visible, and half-occluded to the right eye. Spatial integration areas of different sizes (1° and 3°) are shown as dashed circles. (B) The half-occlusion detection task is to distinguish half-occluded versus binocularly visible points with 0.0 arcmin of disparity. Performance is compared for scene points with low, medium, and high disparity contrasts. (C) Conditional probability distributions of the decision variable (i.e., the binocular correlation of the left- and right-eye image patches). The dashed black curve represents the distribution of the decision variable for half-occluded points. Solid curves show the decision variable distributions for patches with binocularly visible centers having low (blue; 0.05–1.00 arcmin), medium (red; 0.2–4.0 arcmin), and high (green; 0.75–15.0 arcmin) disparity contrasts. Binocular image correlation and disparity contrast are computed with spatial integration areas of 1.0° (i.e., 0.5° at half-height). (D) Receiver operating characteristic (ROC) curves for the half-occlusion task. Higher disparity contrasts decrease half-occlusion detection performance. (E) Half-occlusion detection sensitivity (d′) as a function of spatial integration area for different disparity contrasts. Arrows mark the spatial integration area at half-height for which half-occlusion detection performance is optimized.
To examine the impact of local disparity variation on half-occlusion detection in natural scenes, we first sampled 10,000 stereo-image patches from the natural scene database using the procedure discussed above. We found that 86.5% of the sampled stereo-image pairs were centered on binocularly visible scene points, and that 13.5% were centered on half-occluded scene points. We classified a patch center as half-occluded directly from the range measurements if the horizontal disparity gradient (\(DG = \Delta \theta /\Delta X\)) between that center and any other point was 2.0 or higher (a disparity gradient of 2.0 corresponds to Panum's limiting case; see Bülthoff, Fahle, & Wegmann, 1991). The disparity gradient in the half-occlusion scenario depicted in Figure 2A is somewhat larger than 2.0. Second, to quantify local depth variation, we computed the disparity contrast of all patches with binocularly visible centers. For all analyses, disparity contrast was computed over a local integration area of 1.0° (0.5° full-width at half-height; see Equation 1); results are robust to this choice (see Supplementary Figure S1). Third, the similarity of the left- and right-eye image patches was quantified with the correlation coefficient  
\begin{equation}\tag{2} \rho_{LR} = \frac{\sum_{\mathbf{x} \in A} \mathbf{c}_L^W(\mathbf{x})\,\mathbf{c}_R^W(\mathbf{x})}{\left\| \mathbf{c}_L^W(\mathbf{x}) \right\|\,\left\| \mathbf{c}_R^W(\mathbf{x}) \right\|} \end{equation}
where \(\mathbf{c}_L^W\) and \(\mathbf{c}_R^W\) are the windowed left- and right-eye Weber contrast images (see Methods) and where \(\left\| \mathbf{c}(\mathbf{x}) \right\| = \sqrt{\sum_{\mathbf{x} \in A} \mathbf{c}(\mathbf{x})^2}\) is the L2 norm of the contrast image in a local integration area \(A\). The integration area within which binocular correlation is computed is determined by the size of a cosine windowing function \(W\) (see Methods). Fourth, under the assumption that the correlation coefficient is the decision variable, we used standard methods from signal detection theory to determine how well half-occlusions can be detected in natural images. Specifically, we determined the conditional probability of the decision variable given (a) that the center pixel was binocularly visible for each disparity contrast \(p\left( \rho_{LR}|bino,C_\delta \right)\) and (b) that the center pixel was half-occluded \(p\left( \rho_{LR}|mono \right)\) (Figure 5C), swept out an ROC curve (Figure 5D), computed the area underneath it to determine percent correct, and then converted to d′. Finally, we repeated these steps for different spatial integration areas. Half-occlusion detection performance (d′) changes significantly as a function of the spatial integration area for each of several disparity contrasts (Figure 5E). Clearly, local depth variation reduces how well binocularly visible points can be discriminated from half-occluded points. (Note that the same procedure could be adapted to work in the retinal periphery with one straightforward extension. For any given patch in one eye's image, a cross-correlation could be performed to determine the peripheral locations in the other eye to compare. The correlation of the two patches yielding the maximum correlation could then be used as input to the procedure described above.)  
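The following sketch illustrates the decision variable (Equation 2) and the conversion from samples of the decision variable to d′ via the area under the ROC curve. It is a simplified stand-in for the analysis described above, not the authors' code; the equal-variance Gaussian assumption in the final conversion is standard but is our assumption here.

```python
# Simplified stand-in for the half-occlusion detection analysis; not the authors' code.
import numpy as np
from scipy.stats import norm

def binocular_correlation(cL, cR, window):
    """Equation 2: correlation of the windowed LE and RE Weber contrast images."""
    wL, wR = cL * window, cR * window
    return np.sum(wL * wR) / (np.linalg.norm(wL) * np.linalg.norm(wR))

def dprime_from_samples(dv_bino, dv_mono):
    """Percent correct = area under the ROC curve, estimated here as
    P(decision variable | binocular > decision variable | half-occluded);
    converted to d' assuming equal-variance Gaussian distributions."""
    dv_bino = np.asarray(dv_bino)[:, None]
    dv_mono = np.asarray(dv_mono)[None, :]
    auc = np.mean(dv_bino > dv_mono) + 0.5 * np.mean(dv_bino == dv_mono)
    return np.sqrt(2.0) * norm.ppf(auc)
```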
Figure 6A summarizes half-occlusion detection performance with the integration area that optimizes performance, for more finely spaced bins of disparity contrast (also see Supplementary Figure S2A). Figure 6B summarizes how increasing disparity contrast decreases the size of the spatial integration area that optimizes performance. Compared to when the best-fixed integration area is used across all stimuli, d′ is 8% higher when the optimal spatial integration area is used for each stimulus (see Methods). Thus, in half-occlusion detection, the visual system would benefit from mechanisms that adapt their spatial integration areas to the local depth variation in the scene. 
Figure 6
 
Effect of local disparity variation on optimal processing size for half-occlusion detection. (A) Sensitivity as a function of disparity contrast, assuming the optimal size of the integration area. Sensitivity decreases monotonically with disparity contrast. For each disparity contrast, sensitivities were measured with the optimal integration area. (B) Optimal window size as a function of disparity contrast. The optimal window size decreases approximately linearly as disparity contrast increases on a log-log scale. Results are highly robust to changes in the bin width.
Disparity detection in natural stereo-images
Binocular disparity is our most precise depth cue. Binocular disparity detection is the task of detecting whether a particular binocularly visible point is perfectly foveated (i.e., fixated) or not. When a target point is fixated accurately, the point is imaged on the foveas of both eyes. When a target point is not fixated accurately, the point's image will not fall on the foveas, and nonzero disparities occur (Figure 7A). Just as local depth variation impacts the ability to detect whether a point in one eye's image is half-occluded or binocularly visible, it should also impact the detection of nonzero binocular disparities in natural scenes (Figure 7B). 
Figure 7
 
Effect of disparity contrast on disparity detection performance. (A) Stereo-image patches centered on binocularly visible scene points with 0.0 and 1.0 arcmin of fixation disparity. The eyes are fixated 1 arcmin in front of the target in the right image. (B) The disparity detection task simulated here is to distinguish scene points with 0.0 arcmin versus 1.0 arcmin of fixation disparity. Performance is compared for scene points with low, medium, and high disparity contrasts. (C) Conditional probability distributions of the decision variable. The decision-variable is the disparity that maximizes the local cross-correlation function (Equation 3). Results are presented for a spatial integration area of size 1.0°. The solid and dashed curves show the decision variable for scene points fixated with 0.0 and 1.0 arcmin of disparity, respectively, for patches having low (blue), medium (red), and high (green) disparity contrasts. (D) ROC curves for disparity detection. (E) Disparity detection sensitivity (i.e., d′) as a function of spatial integration area for different disparity contrasts.
Local windowed cross-correlation is the standard model of disparity estimation (Banks et al., 2004; Cormack et al., 1991; Tyler & Julesz, 1978). Under this model, the estimated disparity is the disparity that maximizes the interocular correlation between a reference patch in one eye's image and a test patch in the other eye's image.  
\begin{equation}\tag{3} \hat{\delta} = \arg\max_{\delta} \left[ \frac{\sum_{\mathbf{x} \in A} \mathbf{c}_L^W(\mathbf{x})\,\mathbf{c}_R^W(\mathbf{x} - \delta)}{\left\| \mathbf{c}_L^W(\mathbf{x}) \right\|\,\left\| \mathbf{c}_R^W(\mathbf{x} - \delta) \right\|} \right] \end{equation}
where \(\hat{\delta}\) is the disparity estimate, and \(\delta\) is the disparity between a patch in the anchor eye and a patch in the other eye. Equation 3 is written assuming that the left eye is the anchor eye.  
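A minimal sketch of Equation 3 for integer-pixel horizontal shifts is shown below; the edge wrap-around of np.roll and the lack of sub-pixel interpolation are simplifications of this sketch, not features of the standard model.

```python
# Minimal sketch of Equation 3 (windowed cross-correlation disparity estimation).
import numpy as np

def estimate_disparity(cL, cR, window, max_shift_px):
    """Return the horizontal shift (pixels) of the RE contrast image that maximizes
    the windowed correlation with the LE (anchor-eye) contrast image."""
    wL = cL * window
    best_rho, best_shift = -np.inf, 0
    for shift in range(-max_shift_px, max_shift_px + 1):
        wR = np.roll(cR, shift, axis=1) * window   # horizontal shift, then window
        rho = np.sum(wL * wR) / (np.linalg.norm(wL) * np.linalg.norm(wR))
        if rho > best_rho:
            best_rho, best_shift = rho, shift
    return best_shift
```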
To examine the impact of local depth variation on disparity detection thresholds, we performed an analysis that is nearly identical to the half-occlusion detection analysis presented above. First, we randomly sampled 10,000 stereo-image patches having zero absolute disparity at the center pixel. Second, we estimated disparity from the stereo-image patches using local windowed cross-correlation (Equation 3). This disparity estimate is the decision variable for the disparity detection task. The conditional probability of the disparity estimates for each disparity contrast \(p\left( \hat \delta |\delta = 0,C_\delta \right)\) is symmetric and centered at zero (Figure 7C). Third, assuming that the distribution of estimates for small nonzero disparities \(p\left( \hat \delta |\delta \ne 0,C_\delta \right)\) is a shifted version of the distribution for zero disparities, we swept out an ROC curve (Figure 7D), computed the area underneath it to determine percent correct, and then converted to d′. For each disparity contrast, we computed sensitivity (i.e., d′) for detecting a target with 1.0 arcmin of disparity as a function of different local integration areas (Figure 7E). The integration area is defined to be the width of the cosine window used for computing the windowed cross-correlation. 
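The shifted-distribution assumption can be illustrated with a few lines of code (names are illustrative; the equal-variance conversion from area-under-ROC to d′ is our assumption):

```python
# Sketch of the shifted-distribution assumption used for disparity detection.
import numpy as np
from scipy.stats import norm

def detection_dprime(null_estimates_arcmin, target_disparity_arcmin=1.0):
    """d' for detecting a small nonzero disparity, assuming the decision-variable
    distribution for the target is the zero-disparity distribution shifted by the
    target disparity (equal-variance conversion from the area under the ROC curve)."""
    null = np.asarray(null_estimates_arcmin, dtype=float)
    signal = null + target_disparity_arcmin
    auc = np.mean(signal[:, None] > null[None, :])   # area under the ROC curve
    return np.sqrt(2.0) * norm.ppf(auc)
```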
Results for the disparity detection task are similar to results for the half-occlusion task. Local depth variation reduces disparity detection sensitivity (Figure 7D and E; Figure 8A; Supplementary Figure S2B), and decreases the size of the spatial integration area that optimizes performance (Figure 7E; Figure 8B; Supplementary Figure S2B). When the integration area is too large, the depth variation within the integration area prevents reliable estimates. When the integration area is too small, the luminance variation within the integration area is insufficient to obtain a reliable estimate. Thus, the visual system should adapt its spatial integration area to the local depth variation in the scene. Compared to when the best-fixed integration area is used across all stimuli, d′ is 12% higher when the optimal spatial integration area is used (see Methods). Unlike half-occlusion detection, however, the optimal integration area for disparity detection shrinks and then plateaus (Figure 8B), and does not decrease below 0.4° (0.2° at half-height). 
Figure 8
 
Effect of local disparity variation on size of optimal integration area for disparity detection. (A) Sensitivity as a function of disparity contrast. Sensitivity drops monotonically with disparity contrast. For each disparity contrast, sensitivities were measured with the optimal integration area. (B) Optimal integration area as a function of disparity contrast. The optimal integration area decreases as disparity contrast increases and then plateaus at a minimum value (0.4° window; 0.2° at half-height).
Interestingly, these results are closely related to the literature on human stereopsis. Local depth variation hurts human performance in depth discrimination, disparity detection, and stereo-resolution tasks (Banks et al., 2004; Ernst & Banks, 2002; Harris, McKee, & Smallman, 1997; Kane, Guan, & Banks, 2014). (Spatial stereo-resolution tasks measure the finest detectable spatial modulation of binocular disparity.) Two separate groups have argued that human spatial stereo-resolution is limited by the smallest disparity selective receptive fields available to the human visual system (Banks et al., 2004; Harris et al., 1997). Harris et al. estimated that the smallest disparity-selective receptive fields available to the human visual system are 0.07°–0.13° in diameter (Harris et al., 1997). Banks et al. estimated that the smallest receptive fields are 0.13° in diameter (Banks et al., 2004). 
Our estimate of the smallest useful disparity receptive field in natural scenes (0.4° integration area, 0.2° width at half-height) is within a factor of two of the psychophysical estimates of the smallest receptive field available to the human visual system (∼0.1°). Thus, just as the sampling rate of the foveal cone photoreceptors is determined by the cut-off spatial frequency of the human optical point-spread function, the smallest disparity-selective receptive fields available to the human visual system may be determined by the smallest receptive fields that are useful for estimating disparity in natural binocular images. The logic is that there is little point in developing receptive fields that select for information that is not useful or available. 
Effect of depth variation on optimal shape of integration region
The previous sections demonstrate that the spatial integration area that optimizes half-occlusion and disparity detection performance decreases in size with increases in disparity contrast. Does disparity contrast also impact the shape of the spatial integration areas that optimize performance? To check, we performed the following steps, starting with the half-occlusion task. First, for a given disparity contrast, we found the optimally sized integration area (see Figures 6B and 8B). Second, we varied the aspect ratio of the integration area while holding the size of the integration area fixed (Figure 9A), computed the task-relevant decision variable (Equation 2) for each aspect ratio, and determined d′ using the procedures described above. Third, we repeated the previous steps across different disparity contrasts and plotted sensitivity. Performance is optimized by vertically elongated integration areas at high disparity contrasts, and by (slightly) horizontally elongated areas at low disparity contrasts (Figure 9B and C). We repeated this analysis for the disparity detection task and found that the same patterns hold (Figure 9D through F). The secondary effect of optimizing aspect ratio is modest (∼0.1 in d′ units) compared to the primary effect of optimizing size. However, given that evolution tends to push organisms towards the optimal solutions in critical tasks, one might expect biological systems to have developed mechanisms that adapt both the size and the shape of their receptive fields to the local depth structure of stimuli. Indeed, receptive fields in the visual cortex span the range of sizes and shapes necessary to optimize performance in natural scenes (Harris et al., 1997; Ringach, 2002). Investigating whether visual systems have developed such mechanisms will be an interesting topic for future research. 
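As an illustration of this manipulation, the following sketch constructs raised-cosine windows whose horizontal and vertical extents trade off against each other so that the overall integration area stays fixed; the specific window form and parameterization are assumptions, not the paper's exact implementation.

```python
# Illustrative raised-cosine windows with fixed area but variable aspect ratio.
import numpy as np

def cosine_window(height_px, width_px, px_per_deg, size_deg, aspect_ratio):
    """Raised-cosine window whose horizontal x vertical extent equals size_deg^2.
    aspect_ratio > 1 yields vertically elongated windows; < 1, horizontally elongated."""
    w_deg = size_deg / np.sqrt(aspect_ratio)     # horizontal extent (deg)
    h_deg = size_deg * np.sqrt(aspect_ratio)     # vertical extent (deg)
    x = (np.arange(width_px) - width_px / 2) / px_per_deg
    y = (np.arange(height_px) - height_px / 2) / px_per_deg
    X, Y = np.meshgrid(x, y)
    wx = 0.5 * (1 + np.cos(np.pi * np.clip(X / (w_deg / 2), -1, 1)))
    wy = 0.5 * (1 + np.cos(np.pi * np.clip(Y / (h_deg / 2), -1, 1)))
    return wx * wy

# Example: three windows with the same area but different aspect ratios.
windows = [cosine_window(128, 128, 64, 1.0, ar) for ar in (0.5, 1.0, 2.0)]
```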
Figure 9
 
Effect of local depth variation on the shape of the spatial integration area that optimizes performance. (A) Integration areas with the same size but different aspect ratios within which to compute the decision variable for the half-occlusion task (i.e., binocular image correlation). (B) Change in half-occlusion detection sensitivity (i.e., d′) as a function of aspect ratio for different disparity contrasts. Arrows indicate the aspect ratio that maximizes half-occlusion detection performance. The maxima were determined using a polynomial fit (not shown) to the raw data. Aspect ratios less than 1.0 are horizontally elongated. Aspect ratios larger than 1.0 are vertically elongated. Colors indicate low (blue; 0.05–1.00 arcmin), medium (red; 0.2–4.0 arcmin), and high (green; 0.75–15.0 arcmin) disparity contrasts. (C) Optimal aspect ratio as a function of disparity contrast. The optimal window for half-occlusion detection is more vertically elongated for higher disparity contrasts. The best fixed aspect ratio across all disparity contrasts is also shown. (D) Same as (A), but for the disparity detection task. (E–F) Same as (B–C), but for the disparity detection task.
Why does the shape of the optimal integration area change with disparity contrast? From visual inspection of numerous individual examples, we speculate that, at high disparity contrasts, vertical elongation improves performance because large disparity contrasts are most often caused by vertically oriented depth edges (e.g., Figure 9A). For such cases, vertically oriented integration areas increase the number of pooled spatial locations over which the disparity is more nearly constant (Kanade & Okutomi, 1994). It is less clear why, at low disparity contrasts, slightly horizontally elongated integration areas improve performance. We speculate that this is because low disparity contrasts are often associated with the ground plane (e.g., Figure 9D), and horizontally oriented integration areas maximize the number of pooled spatial locations with the same disparity. 
Estimating disparity contrast
Local depth variation in the region around fixation makes disparity-related tasks more difficult (Figures 5 through 8). A visual system with access to estimates of local disparity contrast can, in principle, improve half-occlusion and disparity detection performance by adapting the size and shape of its receptive fields to local disparity contrast. How might the visual system estimate disparity contrast from information in the left- and right-eye images? One approach is to estimate disparity at each spatial location (pixel) in a local area with generic receptive fields, compute the contrast (i.e., local root-mean-squared disparity) of those estimates, and then re-estimate the disparities with optimized receptive fields. A second, more direct approach is to compute a simple binocular image statistic that predicts disparity contrast at each spatial location, and then estimate the disparities with optimized receptive fields. 
Interestingly, the contrast \({C_B}\) of the binocular difference image is a good predictor of disparity contrast \({C_\delta }\). The binocular difference image is the pixel-wise difference between the left- and right-eye Weber contrast images \({\bf{c}}_B^W\left( {\bf{x}} \right) = {\bf{c}}_R^W\left( {\bf{x}} \right) - {\bf{c}}_L^W\left( {\bf{x}} \right)\). The binocular difference image has featured in previously proposed stereo-coding schemes (Li & Atick, 1994). Figure 10A shows a stereo-image patch with low disparity contrast and low binocular difference image contrast. Figure 10B shows a stereo-image patch with high disparity contrast and high binocular difference image contrast. Figure 10C shows that difference image contrast predicts disparity contrast across thousands of patches (n = 10,000). 
Figure 10
 
Joint statistics of disparity contrast and binocular difference image contrast in natural scenes. (A) Stereo-image with low groundtruth disparity contrast and low binocular difference image contrast. The upper row shows the stereo-image; the circle indicates the 1° spatial integration area from which the statistics were computed. The lower row shows the groundtruth disparities and the binocular difference image. (B) Stereo-image with high groundtruth disparity contrast and high binocular difference image contrast. (C) Disparity contrast and binocular difference image contrast in natural scenes are jointly distributed as a log-Gaussian and are significantly correlated. Points labeled in yellow indicate the disparity contrast and binocular difference image contrast of the stereo-images in (A) and (B). Statistics were computed for a spatial integration area of 1.0° (0.5° width at half-height). Similar results hold for other spatial integration areas.
Binocular difference image contrast and disparity contrast are jointly log-Gaussian distributed and are strongly correlated (\(r = 0.60\); Figure 10C). The correlation is nearly independent of viewing distance, although the most likely disparity contrasts decrease as distance increases. The relationship is well fit by a line in the log domain and a power law \({C_\delta } = \alpha C_B^p\) in the linear domain, where \(p\) is the power and \(\alpha \) is a proportionality constant (i.e., a Weber fraction); the best-fit power is \(p \cong 2.0\). Thus, the visual system could obtain a relatively precise estimate of local disparity variation (and local depth variation) directly from the local contrast of the binocular difference image, and use this estimate to select the integration area that is best suited for a given level of disparity contrast (Chen & Qian, 2004). These findings motivate a series of investigations into the mechanisms for estimating local disparity variation, and into its impact on disparity detection performance in natural scenes. 
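A minimal sketch of the direct approach follows, assuming per-patch measurements of binocular difference image contrast and groundtruth disparity contrast are available; the function names are our own, and the fitted values of \(\alpha\) and \(p\) come from the regression rather than being hard-coded.

```python
import numpy as np

def fit_power_law(C_B, C_delta):
    """Fit C_delta = alpha * C_B**p by linear regression in the log domain.

    C_B:     binocular difference image contrast of each sampled stereo-image patch.
    C_delta: groundtruth disparity contrast of each patch.
    """
    log_B, log_D = np.log(C_B), np.log(C_delta)
    p, log_alpha = np.polyfit(log_B, log_D, 1)       # slope = power, intercept = log(alpha)
    r = np.corrcoef(log_B, log_D)[0, 1]              # correlation in the log-log domain
    return p, np.exp(log_alpha), r

def predict_disparity_contrast(C_B, alpha, p):
    """Image-computable prediction of local disparity contrast from difference image contrast."""
    return alpha * C_B ** p
```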
Natural disparity statistics: Surface-based variation
Local disparity variation negatively impacts half-occlusion and disparity detection performance. What is the most likely spatial pattern of disparity variation in natural scenes? Assuming that fixations occur only on 3D surface points, foveal disparities are always zero. At nonfoveal retinal locations, disparities vary with the depth structure (and distance) of the fixated stimulus. Here, we characterize depth-induced disparity variation in two ways. First, we compute the most likely spatial pattern of disparity variation in the region near the fovea. Second, we compute the probability of half-occluded points in the region near the fovea. Together, these two computations quantify disparity variation occurring within individual surfaces, and depth variation occurring between surfaces separated by depth edges. Both sources of local depth variation are important for developing optimal methods for disparity detection and estimation in spatially varying natural scenes. 
Pooling signals over regions with low variance and high correlation will result in more reliable disparity detection and estimation. Figure 11 shows how the pattern of disparity variation for binocularly visible surfaces changes with retinal eccentricity near the fovea (±0.5°). Disparities are zero at the foveas, by definition, and become more variable at retinal positions farther from fixation. Across all binocularly visible surface points, the region of minimum variation is vertically elongated (Figure 11A). Disparity contrast decreases the size and changes the shape of this low variance region (Figure 11B). Stimuli with low disparity contrast tend to have large horizontally oriented regions of least disparity variation around the fovea. Stimuli with high disparity contrast tend to have small more vertically oriented regions of least disparity variation around the fovea. 
Figure 11
 
Disparity variation associated with binocularly visible surfaces. (A) The standard deviation of natural disparity signals increases systematically with retinal eccentricity. Disparities are more variable at retinal locations farther from fixated points. (B) Same as (A), but conditioned on five different disparity contrast bins: 0.1–1.0, 0.2–2.0, 0.4–4.0, 0.75–7.5, 1.5–15.0 arcmin. At low disparity contrasts, disparities are nearly homogeneous within 1° of the fovea. At high disparity contrasts, disparity variation increases rapidly with eccentricity, and the region of low variability is smaller and more vertically elongated. Ellipses (fit by hand) indicate iso-disparity-variation contours. (C) Disparity correlation as a function of retinal position. (D) Same as (C), but conditioned on the disparity contrast bins in (B). At low disparity contrasts, disparities are more highly correlated across space. At high disparity contrasts, the region of high correlation is smaller and more vertically elongated. (E) and (F) Horizontal and vertical slices through plots in (C) and (D), solid and dashed curves, respectively.
A closely related patch-based disparity correlation analysis yields similar results. For a given disparity map, the disparity correlation at location \({\bf{x}}\) is given by the cosine similarity of the patch centered at \({\bf{x}}\) with the patch centered at \({{\bf{x}}_0}\). The patch size is an open parameter that we fixed to the optimal integration area for disparity detection (0.5°; see Figure 8). The spatial pattern of near-foveal disparity correlations in Figure 11C represents the average of the spatial correlations computed for each of a large sample of patches. Figure 11D shows how disparity correlation changes with disparity contrast. Far from the fovea, disparities are weakly correlated with the disparities at the fovea, and the region of high correlation decreases in size and becomes more vertically elongated with disparity contrast. Figure 11E and F shows vertical and horizontal slices through the plots in Figure 11C and D. Similar results are obtained with pixel-based analyses (Supplementary Figure S3). These results justify the systematic changes in the size and shape of the task-optimal spatial integration areas with disparity contrast (Figures 5 through 9). 
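The patch-based correlation analysis can be sketched as follows; the function name and looping strategy are illustrative, and in practice the resulting maps are averaged over a large sample of patches (and optionally binned by disparity contrast) to produce Figure 11C and D.

```python
import numpy as np

def disparity_correlation_map(disparity_patch, window_px):
    """Cosine similarity between the disparity sub-patch at each retinal location and
    the sub-patch centered on the fovea (the center of the disparity map).

    disparity_patch: 2D array of groundtruth disparities for one sampled patch.
    window_px:       side length (pixels) of the sub-patch; here it would be set to the
                     optimal integration area for disparity detection (0.5 deg).
    """
    half = window_px // 2
    cy, cx = disparity_patch.shape[0] // 2, disparity_patch.shape[1] // 2
    ref = disparity_patch[cy - half:cy + half + 1, cx - half:cx + half + 1].ravel()
    H, W = disparity_patch.shape
    corr = np.full((H, W), np.nan)
    for y in range(half, H - half):
        for x in range(half, W - half):
            sub = disparity_patch[y - half:y + half + 1, x - half:x + half + 1].ravel()
            denom = np.linalg.norm(ref) * np.linalg.norm(sub)
            if denom > 0:
                corr[y, x] = np.dot(ref, sub) / denom
    return corr
```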
Natural disparity statistics: Discontinuity-based variation
Local disparity variation in natural scenes is due to both continuous variation within a surface, and to the occurrence of depth discontinuities (edges) between surfaces. Half-occluded points are reliable indicators of many, but not all, depth discontinuities. Thus, the statistics of half-occluded points contribute to an understanding of the statistics of natural depth variation. For every stereo-image in the dataset, we identified all half-occluded points directly from the range data (Figure 12A, bottom row). First, we computed the groundtruth horizontal disparity gradient (\(DG = {{\Delta \theta }/{\Delta X}}\)) for all pairs of points in a given epipolar plane (see Figure 2C). Next, we labeled a given point as half-occluded if the horizontal disparity gradient between it and any other point was 2.0 or higher. For each retinal location, we computed the half-occlusion probability; i.e., the proportion of stereo-image patches with a half-occlusion at that retinal location (Figure 12B). The impact of disparity contrast on half-occlusion probability is similar to the impact of disparity contrast on disparity standard deviation (Figure 12C; cf. Figure 11B); the spatial region where half-occlusion probabilities are lowest decreases in size and becomes more vertically elongated with disparity contrast. The sizes and shapes of the optimal integration region for disparity estimation (Figures 5 through 9) are compatible with the statistics of both surface-based and discontinuity-based depth variation in natural scenes. 
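A minimal sketch of the half-occlusion labeling rule, applied to one epipolar line at a time; variable names are our own, and the pairwise computation is written for clarity rather than efficiency.

```python
import numpy as np

def half_occlusion_labels(disparities, positions, criterion=2.0):
    """Label points along one epipolar line as half-occluded.

    disparities: disparity of each scene point along the epipolar line (deg).
    positions:   corresponding visual directions along the line (deg).
    A point is labeled half-occluded if the magnitude of the horizontal disparity
    gradient (delta theta / delta X) between it and ANY other point on the line
    meets or exceeds the criterion.
    """
    d = np.asarray(disparities, dtype=float)
    x = np.asarray(positions, dtype=float)
    dd = np.abs(d[:, None] - d[None, :])              # pairwise disparity differences
    dx = np.abs(x[:, None] - x[None, :])              # pairwise position differences
    with np.errstate(divide='ignore', invalid='ignore'):
        DG = dd / dx                                  # pairwise disparity gradients
    np.fill_diagonal(DG, 0.0)                         # ignore self-comparisons
    return np.nanmax(DG, axis=1) >= criterion
```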
Figure 12
 
Half-occlusion statistics in natural scenes. (A) Example natural stereo-image (top) and binocular visibility map (bottom). Points in one eye's image that are invisible in the other eye's image (i.e., half-occluded points) are shown in black; binocularly visible points are shown in white. Inset shows a stereo-image patch with half-occluded points overlaid in black. (B) Half-occlusion probability at each spatial location near the fovea. (C) Same as (B), but conditioned on five different disparity contrast bins: 0.1–1.0, 0.2–2.0, 0.4–4.0, 0.75–7.5, 1.5–15.0 arcmin. At low disparity contrasts, half-occlusion probability is near-zero throughout the 1° region near the fovea. At high disparity contrasts, half-occlusion probability increases dramatically with eccentricity, and the region of low probability is smaller and more vertically elongated. (D) Distribution of horizontal sizes of contiguous binocularly visible and half-occluded regions in natural scenes (solid and dashed curves, respectively). The sizes of contiguous binocularly visible and half-occluded regions are approximately distributed as power laws with mean horizontal sizes of 0.44° and 0.06°, respectively.
Over what visual angles do binocularly visible surfaces typically extend in natural scenes? How frequently do half-occlusions occur within a given visual angle in natural scenes? To address these questions, we measured the statistics of contiguous binocularly visible regions and contiguous half-occluded regions in natural scenes. Figure 12A shows an example natural scene (upper row) with corresponding binocularly visible and half-occluded points (lower row; white and black pixels, respectively). We measured the size (in visual angle) of each contiguous horizontal region of binocularly visible points. We also measured the size of each contiguous region of half-occluded points. The distributions of these sizes are shown in Figure 12D. Both distributions are well described by a power law for sizes larger than a certain critical size: larger than 0.3° for binocularly visible regions, and larger than 0.1° for half-occluded regions (see Supplementary Figure S4). Power laws have previously provided good descriptions of size statistics in a variety of contexts (Lu & Hamilton, 1991; Reed & McKelvey, 2002). 
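The region-size measurement reduces to computing horizontal run lengths in the binocular visibility map. A sketch, assuming a boolean visibility map and the dataset's approximate sampling rate of 52 pix/deg:

```python
import numpy as np

def run_lengths(binary_row):
    """Lengths (in pixels) of contiguous runs of True values along one image row."""
    padded = np.concatenate(([0], binary_row.astype(int), [0]))
    changes = np.flatnonzero(np.diff(padded))
    return changes[1::2] - changes[::2]

def contiguous_region_sizes(visibility_map, pix_per_deg=52.0):
    """Horizontal angular sizes of contiguous binocularly visible and half-occluded regions.

    visibility_map: 2D boolean array, True where a pixel is binocularly visible.
    Returns two arrays of sizes (deg): visible-region sizes and half-occluded-region sizes.
    """
    visible, occluded = [], []
    for row in visibility_map:
        visible.append(run_lengths(row) / pix_per_deg)
        occluded.append(run_lengths(~row) / pix_per_deg)
    return np.concatenate(visible), np.concatenate(occluded)
```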
The distribution of binocularly visible region sizes (Figure 12D) bounds the distribution of binocularly visible surface sizes in natural scenes. (A contiguous binocularly visible region may comprise multiple binocularly visible surfaces.) Information about the distribution of surface sizes (in visual angle) is important for determining the optimal rules for segmenting surfaces in depth in real-world scenes. Optimal grouping of local disparity estimates from the same surface, and optimal segmentation of local disparity estimates from different surfaces, depend on this information. 
Natural disparity statistics: Dependence on distance
The pattern of disparity variation near the fovea depends not just on the local depth structure of the fixated object in the environment, but also on the distance of the fixated object. Here, we examine how the foveal pattern of disparity variation and half-occlusion probabilities changes with viewing distance. Figure 13A shows the most likely pattern of disparity variation in our dataset (data identical to Figure 11A). Outside the central ±1/8°, disparity variance grows linearly with retinal eccentricity, and increases more rapidly with changes in azimuth than with changes in elevation (Figure 13B). Figure 13C through E shows that disparity variance increases less rapidly with eccentricity at far than at near distances. This effect occurs because, for a given depth difference, the magnitude of the disparity signal decreases with the square of distance. Thus, when a far surface is fixated, the disparities in the immediate neighborhood of the fovea are more likely to be near zero. Given that disparity variability decreases with distance (also see Supplementary Figure S5), and given that the Burge et al. (2016) dataset only contains objects 3 m and farther away, the estimates of disparity variability that we report are likely to be conservative. Figure 13F shows the spatial pattern of half-occlusion probabilities near the fovea (same data as Figure 12B). Figure 13G shows how these probabilities change with distance. At large distances, the central region of least half-occlusion probability shrinks and is vertically elongated. 
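The inverse-square dependence follows from a small-angle approximation (our notation): with interocular separation \(I\), the vergence demand of a point at distance \(d\) is approximately \(I/d\) radians, so a depth difference \(\Delta d \ll d\) produces a disparity 
\begin{equation}\delta = \gamma \left( d \right) - \gamma \left( {d + \Delta d} \right) \approx {I \over d} - {I \over {d + \Delta d}} = {{I\Delta d} \over {d\left( {d + \Delta d} \right)}} \approx {{I\Delta d} \over {{d^2}}}\end{equation}
For example, with \(I\) = 0.065 m, a 0.25 m depth step produces a disparity of roughly 3.5 arcmin at a 4 m viewing distance, but only about 0.14 arcmin at 20 m. 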
Figure 13
 
Near-foveal disparities as a function of viewing distance and spatial integration region. (A) Disparity standard deviation across all patches in the database (data identical to Figure 11A). (B) Disparity variance as a function of azimuth and elevation. Disparity variance increases linearly outside the central ±1/8°. Variance increases more rapidly in azimuth \(C_\delta ^2 = 60.1a + 7.0\) than in elevation \(C_\delta ^2 = 47.4e + 7.6\), where \(a\) and \(e\) are azimuth and elevation in degrees, respectively. Curves correspond to the squared standard deviation along horizontal and vertical slices through the plot in (A). (C) Disparity standard deviation at each retinal location, conditioned on five different viewing distances (4.0–20.0 m). For each viewing distance, data are pooled in 0.1 diopter bins centered on the viewing distance. For far distances, disparities near the fovea are more likely to be small. (D) and (E) Disparity variance in azimuth and elevation as a function of distance (colors). Best-fit lines in azimuth range from \(C_\delta ^2 = 83.4a + 17.8\) to \(C_\delta ^2 = 47.4a + 4.8\) at viewing distances from 4.0 m to 20.0 m, and best-fit lines in elevation range from \(C_\delta ^2 = 83.8e + 11.5\) to \(C_\delta ^2 = 36.0e + 5.0\). Variance increases more rapidly in the upper than in the lower visual field. (F) Half-occlusion probability as a function of retinal location (data identical to Figure 12B). (G) Half-occlusion probability conditioned on viewing distance. For far distances, the region of least half-occlusion probability shrinks to a vertically elongated zone centered on the fovea.
The generality of the above conclusions may be limited because the most likely pattern of disparity variation depends not just on the depth structure of natural scenes but also on which scene points are fixated. One weakness of the Burge et al. (2016) dataset, upon which this manuscript is based, is that it has no information about human eye movements. Other datasets do (Gibaldi, Canessa, & Sabatini, 2017; Liu et al., 2008; Sprague et al., 2015). At near distances, humans preferentially fixate scene points that are nearer than randomly selected points. At far distances (i.e., beyond 3 m), human fixations and random fixations are hard to distinguish (Sprague et al., 2015). Thus, the results presented in the current manuscript are likely to be representative of disparity variability for human fixations when objects are farther than 3 m. However, the results are also likely to underestimate disparity variability across all distances encountered in natural viewing (also see Supplementary Figure S1 and Supplementary Figure S6). 
Discussion
We developed a high-precision stereo-image sampling procedure, and used it along with a recently published dataset (Burge et al., 2016), to demonstrate how natural depth variation impacts performance in two tasks fundamentally related to stereopsis. In the first set of analyses, we analyzed natural binocular images and determined the receptive field sizes and shapes that optimize performance in half-occlusion detection and disparity detection in natural scenes. In the second set of analyses, we analyzed groundtruth range data and determined how disparity statistics and half-occlusion probabilities change as a function of retinal eccentricity. The latter analyses justify the findings of the former. In this section, we discuss connections to other topics in the literature, limitations of the current results, and directions for future work. 
Relationship to previous work
The dataset leveraged in this manuscript has some advantages and some disadvantages compared to other recently published datasets. We compare four recently published datasets, and consider the advantages and disadvantages of each with respect to six factors: (a) the presence or absence of eye movements, (b) the presence or absence of groundtruth half-occlusions and groundtruth disparities, (c) the spatial resolution of the images, (d) the range of object distances represented in the dataset, (e) the diversity of the sampled scenes, and (f) the appropriateness of the dataset for use in psychophysical experiments. Each dataset was collected with a different purpose (or set of purposes) in mind, and each is limited by choices made by the researchers and by the technology used to collect the data. 
Sprague et al. (2015) fitted human observers with a mobile binocular eye tracker and collected binocular image movies of natural scenes as the observers performed everyday tasks around the University of California, Berkeley. The dataset contains objects ranging in distance from 0.5 m to infinity. The principal aim in collecting the dataset was to estimate the prior probability distribution of binocular disparities encountered by humans in natural viewing. Absolute disparity depends on the 3D structure of the scene, where the observer is located in the scene, and where the observer gazes in the scene. Collecting stereo-images in concert with matched binocular eye movements is therefore necessary to estimate the distribution of disparities encountered by humans, and the dataset is well suited for this aim. There are two primary disadvantages associated with the dataset. The first disadvantage is that groundtruth disparities and groundtruth occlusions are not known. Disparities were instead estimated from the left- and right-eye images via image-based routines. A second disadvantage is that the stereo-images have low spatial resolution (∼9 pix/deg). Thus, while this dataset is well suited for estimating disparity statistics in natural viewing, it is ill suited for examining the accuracy of disparity estimation algorithms, for investigating the impact of local disparity variation on disparity estimation performance, or for obtaining natural stimuli for use in psychophysical experiments. 
Gibaldi et al. (2017) tracked binocular eye movements of head-fixed human observers viewing two computer-generated 3D scenes from different viewpoints on a stereo-display. The dataset contains objects ranging in distance from only 0.5 to 1.5 m. This paper also had the aim of characterizing disparity statistics in natural viewing. Gibaldi et al.'s computer-generated scenes afford access to groundtruth disparities and groundtruth occlusions. The rendered images had comparatively high spatial resolution (∼44 pix/deg) and, with appropriate calibration, could be suitable for use in psychophysical experiments. All of these features represent important improvements on the weaknesses of the Sprague et al. (2015) dataset. The first disadvantage of the Gibaldi et al. dataset is that the eye movements were not collected during observer interaction with the environment; eye movements were instead collected during free viewing of static disparity-specified scenes, presented on a haploscope in a laboratory. A second disadvantage is that the dataset contains only two types of scenes (an office desk and a kitchen table), raising the specter of statistical undersampling. A third disadvantage is that the images were constructed and rendered in software. Although the authors undertook a heroic effort to map natural textures onto high-resolution 3D models of real objects, the possibility remains that the resulting stimuli do not accurately capture all relevant aspects of real scenes. Those caveats aside, this dataset has tremendous potential value, and it provides computer-generated stimuli for both computational and psychophysical studies, especially if it can be expanded. 
Adams et al. (2016) collected multiple stereo-image pairs, wide-field (i.e., 360°) laser range scans, and high-dynamic-range images of 76 outdoor scenes in Hampshire, UK. The dataset contains objects ranging in distance from 1 m to infinity. The authors also expended considerable effort to ensure that the imaged scenes were sampled randomly throughout the English countryside. This dataset was collected with the immediate aim of characterizing the statistics of 3D surface orientation as a function of viewing elevation in natural scenes, and the authors developed a sophisticated procedure for estimating local surface orientation from the distance data. The dataset is also very well suited for other applications not relevant to the topic of this manuscript. The stereo-images have very high spatial resolution (∼160 pix/deg). One disadvantage of this dataset is that it does not include eye movement data, so the impact of natural eye movements on disparity statistics cannot be estimated. A second disadvantage is that only one range scan was captured per scene. With only one range scan, stereo-parallax precludes precise pixel-wise coregistration of the groundtruth distance data with the left- and right-eye photographic images. Thus, although groundtruth disparity could be computed from the distance data, it is impossible to precisely coregister the stereo-image data with the range data at each pixel in both the left- and right-eye images. 
Burge et al. (2016) collected 99 stereo-images of natural scenes around the University of Texas at Austin campus, with laser range scans coregistered to each eye's photographic image. The dataset contains objects ranging in distance from 3 m to infinity. A robotic gantry aligned the nodal points of the camera and the scanner during data acquisition. As a result, every pixel in each eye's photographic image contains groundtruth distance data from the corresponding range scan, from which groundtruth disparities and groundtruth occlusions can be directly computed. The images in the dataset also have comparatively high spatial resolution (∼52 pix/deg). These features make the dataset particularly well suited for analyses of the impact of local disparity variation on disparity estimation. The primary disadvantage of this dataset is that it does not contain eye movement data, although the technique used by Gibaldi et al. (2017) could be applied to obtain comparable data (also see Liu, Cormack, & Bovik, 2010). However, because the dataset has high spatial resolution and coregistered groundtruth distance information, it should prove useful as a source of stimuli for perceptual experiments and for future computational studies. 
Adaptive filtering in psychophysics and neuroscience
The computational results reported here predict that human performance in disparity-related tasks can benefit from adapting the size and shape of receptive fields to the disparity contrast of each stimulus. Is stimulus-based adaptive filtering neurophysiologically plausible? Yes. Increases in luminance contrast are associated with decreases in the spatial size of receptive fields in macaque V1 (Cavanaugh, Bair, & Movshon, 2002; Sceniak, Ringach, Hawken, & Shapley, 1999), and increases in luminance contrast are associated with decreases in the temporal integration period in macaque V1 and MT (Bair & Movshon, 2004). Developing psychophysical paradigms that can address this issue is an important direction for future work. 
The influence of priors in perception
In recent years, the impact of stimulus priors on perceptual biases (Burge et al., 2010; Burge, Peterson, & Palmer, 2005; Girshick, Landy, & Simoncelli, 2011; Kim & Burge, 2018; Parise, Knorre, & Ernst, 2014; Stocker & Simoncelli, 2006; Weiss, Simoncelli, & Adelson, 2002) and on the design of neural systems (Liu et al., 2008; Sprague et al., 2015) has been extensively investigated. However, Bayesian estimation theory predicts that priors should significantly impact perceptual estimates only when measurements are highly unreliable (Knill & Richards, 1996). In many (most?) viewing situations, factors other than the prior are likely to be more important determinants of performance (Burge & Jaini, 2017). 
Psychophysics is principally concerned with understanding the lawful relationships between stimulus properties and human performance in critical tasks. Human performance in natural tasks varies from stimulus to stimulus because stimuli differ in their task-relevant properties. The prior probability distribution alone cannot account for stimulus-to-stimulus performance variation. For example, the median stimulus in the natural scene database is near-planar (Supplementary Figure S7), and performance with near-planar stimuli is quite good, but not representative of performance with stimuli having more depth variation (Figure 4). Thus, it is necessary to characterize stimulus variability and develop models that predict its impact on psychophysical performance. A great deal of previous work has examined the impact of external noise on performance in simple tasks (Geisler & Davila, 1985; Pelli, 1985). Comparatively little work has examined the impact of natural stimulus variability on performance in critical tasks (but also see Burge & Geisler, 2011; Burge & Geisler, 2014; Burge & Geisler, 2015; Geisler & Perry, 2009; Hibbard, 2008). The current paper examines the impact of natural stimulus variability on two tasks fundamental to stereopsis. 
Stereo-image patch sampling for psychophysics
Task-specific computational analyses, like those presented here, are useful for determining the optimal solutions to sensory-perceptual problems, and for developing targeted hypotheses about the processing rules of biological visual systems. However, to determine whether computational results, like those presented here, are in fact relevant to biological visual systems, psychophysical experiments are ultimately required. The stereo-image sampling and interpolation procedure developed here can be used to obtain an abundant supply of test stimuli with known groundtruth disparities for future experiments on human disparity processing and stereopsis with natural stimuli. 
Change-point statistics for optimal grouping and segregation
A grand problem in perception and neuroscience research is to understand the principles that drive how noisy local estimates are grouped across space and time into more accurate global estimates (Yuille & Grzywacz, 1988). The spatial patterns of estimates, the precision (i.e., reliability) of those estimates, and the change-point statistics of natural scenes play important roles in determining the optimal rules for grouping and segmenting local estimates. (In this context, change-point statistics quantify the probability that spatially adjacent locations correspond to the same or different surfaces; Figure 12, Supplementary Figure S4.) Probability-based modeling frameworks, and the careful compilation of natural image and scene statistics, should provide a strong foundation for understanding the principles that drive local-global processing in natural scenes. 
Conclusion
In this manuscript, we developed a high-fidelity stereo-image sampling and interpolation procedure and then used it to investigate the impact of natural depth variation on two tasks fundamental to stereopsis: half-occlusion detection and disparity detection. Local depth variation decreases the size and changes the shape of the spatial integration area that optimizes performance in both tasks. We also showed how disparity variation and half-occlusion probability change as a function of retinal eccentricity, and presented the first data on the distributions of half-occluded and binocularly visible region sizes in natural scenes. The tools reported here can facilitate the use of natural stimuli in psychophysical studies of stereovision, and supply a strong empirical foundation for the future development of models of optimal grouping of disparity signals in natural scenes. 
Methods
Contrast images and binocular difference images
The inputs to the human visual system are the left- and right-eye retinal images. Disparity-processing mechanisms are widely modeled to operate on local contrast signals, the output of luminance normalization mechanisms in the retina. The Weber contrast image \({\bf{c}}\) is obtained from a luminance image \(I\) by subtracting off and dividing by the mean  
\begin{equation}\tag{4}{\bf{c}}\left( {\bf{x}} \right) = {{I\left( {\bf{x}} \right) - \bar I} \over {\bar I}}\end{equation}
where \(\bar I\) is the local windowed mean and \({{\bf{x}}_0} = \left( {{x_0},{y_0}} \right)\) is the location of the central pixel. The local windowed mean is given by  
\begin{equation}\tag{5}\bar I = {{\sum\limits_{{\bf{x}} \in A} {I\left( {\bf{x}} \right)W\left( {\bf{x}} \right)} } \over {\sum\limits_{{\bf{x}} \in A} {W\left( {\bf{x}} \right)} }}\end{equation}
where \(W\left( {\bf{x}} \right)\) is a spatial windowing function. We have used Gaussian or raised-cosine windowing functions; results are highly robust to the specific type of window. The windowed Weber contrast image  
\begin{equation}{{\bf{c}}^W}\left( {\bf{x}} \right) = {\bf{c}}\left( {\bf{x}} \right)W\left( {\bf{x}} \right)\end{equation}
is obtained by point-wise multiplying the Weber contrast image by the window.  
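A compact sketch of Equations 4 and 5, assuming the luminance patch and window are given as arrays (names are our own):

```python
import numpy as np

def windowed_weber_contrast(I, W):
    """Windowed Weber contrast image (Equations 4 and 5).

    I: luminance image patch (2D array).
    W: spatial windowing function (2D array, same shape), e.g., Gaussian or raised cosine.
    """
    I_bar = np.sum(I * W) / np.sum(W)                 # local windowed mean (Equation 5)
    c = (I - I_bar) / I_bar                           # Weber contrast image (Equation 4)
    return c * W                                      # windowed contrast image
```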
Binocular difference image contrast
The binocular difference image is given by the point-wise difference of the two retinal images  
\begin{equation}\tag{6}{\bf{c}}_B^W\left( {\bf{x}} \right) = {\bf{c}}_R^W\left( {\bf{x}} \right) - {\bf{c}}_L^W\left( {\bf{x}} \right)\end{equation}
where \({\bf{c}}_L^W\) and \({\bf{c}}_R^W\) are the windowed left- and right-eye contrast images, whose center pixels are centered on candidate corresponding points (see Figure 2A and B). Thus, the binocular difference image is the point-wise difference of the left- and right-eye contrast images. The RMS contrast of the binocular difference image \({C_B}\) is given by  
\begin{equation}\tag{7}{C_B} = \sqrt {{{\sum\limits_{{\bf{x}} \in A} {{{{{\left( {{\bf{c}}_B^W\left( {\bf{x}} \right)} \right)}^2}} \over {W\left( {\bf{x}} \right)}}} } \over {\sum\limits_{{\bf{x}} \in A} {W\left( {\bf{x}} \right)} }}} \end{equation}
where \(W\left( {\bf{x}} \right)\) is the window that imposes the spatial integration area.  
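A sketch of Equations 6 and 7, under the same assumed array representation:

```python
import numpy as np

def binocular_difference_contrast(cW_L, cW_R, W):
    """RMS contrast of the binocular difference image (Equations 6 and 7).

    cW_L, cW_R: windowed left- and right-eye Weber contrast images, centered on
                candidate corresponding points.
    W:          the spatial window that imposes the integration area.
    """
    cW_B = cW_R - cW_L                                # binocular difference image (Equation 6)
    inside = W > 0                                    # restrict to the window's support
    return np.sqrt(np.sum(cW_B[inside] ** 2 / W[inside]) / np.sum(W))
```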
Binocular disparity contrast
Our sampling procedure ensures that the center of each sampled stereo-image patch corresponds to the same surface point in the scene, assuming that the surface point is binocularly visible. If the surface point is half-occluded (i.e., visible to only one eye), its image only falls on the fovea of the anchor eye and a point on the occluding surface will be imaged at the fovea of the other eye. Disparity is undefined at half-occluded points, so we compute disparity contrast only for binocularly visible points. 
To compute disparity, a point of reference must be assumed. We compute disparity relative to the center pixel of the anchor eye's image patch (see Results). This computation is equivalent to computing absolute disparity, assuming that the center pixel of the anchor eye's image corresponds to a binocularly visible scene point and that the eyes are fixating it. It is also equivalent to computing relative disparity where the point of reference is the center pixel of the anchor eye's image. To compute the groundtruth disparity pattern from groundtruth distance, we first compute the vergence demand at each pixel from the distance data, and then subtract the vergence demand at the central pixel from the vergence demand of every other pixel in the patch. All vergence angles are computed in the epipolar plane. The result is the pattern of absolute near-foveal disparities \(\delta \left( {\bf{x}} \right)\) that would result from fixating the surface point in the scene corresponding to the center pixel of the anchor eye's image. 
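A simplified sketch of this computation; for illustration, vergence demand is approximated as if each point lay on the mid-sagittal line, whereas the actual analysis computes the vergence angles within each epipolar plane.

```python
import numpy as np

def groundtruth_disparity_map(distance_map, ipd_m=0.065):
    """Absolute disparity (arcmin) at each pixel relative to the patch's central pixel.

    distance_map: groundtruth distance (m) to the surface point imaged at each pixel.
    Simplifying assumption: vergence demand is approximated as 2*atan(IPD / (2*d)),
    which is exact only for points on the mid-sagittal line.
    """
    vergence = 2.0 * np.arctan2(ipd_m / 2.0, distance_map)      # radians
    cy, cx = distance_map.shape[0] // 2, distance_map.shape[1] // 2
    disparity = vergence - vergence[cy, cx]                     # subtract central vergence demand
    return np.degrees(disparity) * 60.0                         # convert to arcmin
```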
Root-mean-squared (RMS) disparity contrast is a scalar measure of variation about the mean in a local spatial area. The RMS disparity contrast is given by  
\begin{equation}\tag{8}{C_\delta } = \sqrt {{{\sum\limits_{\bf{x}} {{\bf{c}}_\delta ^2\left( {\bf{x}} \right)W\left( {\bf{x}} \right)} } \over {\sum\limits_{\bf{x}} {W\left( {\bf{x}} \right)} }}} \end{equation}
where \({{\bf{c}}_\delta }\left( {\bf{x}} \right) = \delta \left( {\bf{x}} \right) - \bar \delta \) is the mean-centered disparity map.  
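A sketch of Equation 8, again assuming array inputs:

```python
import numpy as np

def disparity_contrast(disparity_map, W):
    """RMS disparity contrast (Equation 8) within a spatial integration area.

    disparity_map: groundtruth disparities (arcmin) at binocularly visible pixels.
    W:             spatial window defining the integration area (same shape).
    """
    delta_bar = np.sum(disparity_map * W) / np.sum(W)           # windowed mean disparity
    c_delta = disparity_map - delta_bar                         # mean-centered disparity map
    return np.sqrt(np.sum(c_delta ** 2 * W) / np.sum(W))
```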
Comparing detection sensitivities for fixed and adaptive spatial integration areas
In a two-presentation forced choice task, proportion correct \(P\) is given by the area under the ROC curve (e.g., Figure 5D). The corresponding sensitivity \(d^{\prime} \) is given by  
\begin{equation}\tag{9}d^{\prime} = \sqrt 2 \,{\Phi ^{ - 1}}\left( P \right)\end{equation}
where \({\Phi ^{ - 1}}\left( \cdot \right)\) is the inverse cumulative normal. With fixed spatial integration areas, window size is fixed to maximize sensitivity across all stimuli regardless of disparity contrast. With adaptive filtering, window size changes to optimize sensitivity at each disparity contrast. Overall proportion correct with adaptive filtering is given by a weighted average of the proportion correct \({P_i}\) in each nonoverlapping disparity-contrast bin.  
\begin{equation}\tag{10}{P_{adaptive}} = {1 \over N}\sum\limits_i {{N_i}{P_i}} \end{equation}
where \({N_i}\) is the number of stimuli in disparity-contrast bin \(i\).  
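A sketch of Equations 9 and 10; the function names are our own, and the inverse cumulative normal is taken from SciPy.

```python
import numpy as np
from scipy.stats import norm

def dprime(P):
    """Sensitivity in a two-presentation forced-choice task (Equation 9)."""
    return np.sqrt(2.0) * norm.ppf(P)

def adaptive_proportion_correct(P_bins, N_bins):
    """Overall proportion correct with adaptive filtering (Equation 10): a weighted
    average of the per-bin proportions correct, weighted by the number of stimuli
    in each nonoverlapping disparity-contrast bin."""
    P_bins = np.asarray(P_bins, dtype=float)
    N_bins = np.asarray(N_bins, dtype=float)
    return np.sum(N_bins * P_bins) / np.sum(N_bins)
```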
Acknowledgments
This work was supported by startup funds to JB from the University of Pennsylvania, and by NIH grant R01-EY028571 to JB from the National Eye Institute and the Office of Behavioral and Social Sciences Research. 
Commercial relationships: none. 
Corresponding author: Arvind V. Iyer. 
Address: Department of Psychology, University of Pennsylvania, Philadelphia, PA, USA. 
References
Adams, W. J., Elder, J. H., Graf, E. W., Leyland, J., Lugtigheid, A. J., & Muryy, A. (2016). The Southampton-York Natural Scenes (SYNS) dataset: Statistics of surface attitude. Scientific Reports, 6, 35805, https://doi.org/10.1038/srep35805.
Bair, W., & Movshon, J. A. (2004). Adaptive temporal integration of motion in direction-selective neurons in macaque visual cortex. Journal of Neuroscience, 24 (33), 7305–7323, https://doi.org/10.1523/JNEUROSCI.0554-04.2004.
Banks, M. S., Gepshtein, S., & Landy, M. S. (2004). Why is spatial stereoresolution so low? Journal of Neuroscience, 24 (9), 2077–2089, https://doi.org/10.1523/JNEUROSCI.3852-02.2004.
Blakemore, C. (1970). The range and scope of binocular depth discrimination in man. The Journal of Physiology, 211 (3), 599–622.
Burge, J., Fowlkes, C. C., & Banks, M. S. (2010). Natural-scene statistics predict how the figure-ground cue of convexity affects human depth perception. Journal of Neuroscience, 30 (21), 7269–7280, https://doi.org/10.1523/JNEUROSCI.5551-09.2010.
Burge, J., & Geisler, W. S. (2011). Optimal defocus estimation in individual natural images. Proceedings of the National Academy of Sciences, USA, 108 (40), 16849–16854, https://doi.org/10.1073/pnas.1108491108.
Burge, J., & Geisler, W. S. (2012). Optimal defocus estimates from individual images for autofocusing a digital camera. Proceedings of the IS&T/SPIE 47th Annual Meeting, https://doi.org/10.1117/12.912066.
Burge, J., & Geisler, W. S. (2014). Optimal disparity estimation in natural stereo images. Journal of Vision, 14 (2): 1, 1–18, https://doi.org/10.1167/14.2.1. [PubMed] [Article]
Burge, J., & Geisler, W. S. (2015). Optimal speed estimation in natural image movies predicts human performance. Nature Communications, 6: 7900, https://doi.org/10.1038/ncomms8900.
Burge, J., & Jaini, P. (2017). Accuracy maximization analysis for sensory-perceptual tasks: Computational improvements, filter robustness, and coding advantages for scaled additive noise. PLoS Computational Biology, 13 (2), e1005281, https://doi.org/10.1371/journal.pcbi.1005281.
Burge, J., McCann, B. C., & Geisler, W. S. (2016). Estimating 3D tilt from local image cues in natural scenes. Journal of Vision, 16 (13): 2, 1–25, https://doi.org/10.1167/16.13.2. [PubMed] [Article]
Burge, J., Peterson, M. A., & Palmer, S. E. (2005). Ordinal configural cues combine with metric disparity in depth perception. Journal of Vision, 5 (6): 5, 534–542, https://doi.org/10.1167/5.6.5. [PubMed] [Article]
Bülthoff, H., Fahle, M., & Wegmann, M. (1991). Perceived depth scales with disparity gradient. Perception, 20 (2), 145–153, https://doi.org/10.1068/p200145.
Canessa, A., Gibaldi, A., Chessa, M., Fato, M., Solari, F., & Sabatini, S. P. (2017). A dataset of stereoscopic images and ground-truth disparity mimicking human fixations in peripersonal space. Scientific Data, 4: 170034, https://doi.org/10.1038/sdata.2017.34.
Cavanaugh, J. R., Bair, W., & Movshon, J. A. (2002). Nature and interaction of signals from the receptive field center and surround in macaque V1 neurons. Journal of Neurophysiology, 88 (5), 2530–2546, https://doi.org/10.1152/jn.00692.2001.
Chen, Y., & Qian, N. (2004). A coarse-to-fine disparity energy model with both phase-shift and position-shift receptive field mechanisms. Neural Computation, 16 (8), 1545–1577, https://doi.org/10.1162/089976604774201596.
Cooper, E. A., & Norcia, A. M. (2015). Predicting cortical dark/bright asymmetries from natural image statistics and early visual transforms. PLoS Computational Biology, 11 (5): e1004268, https://doi.org/10.1371/journal.pcbi.1004268.
Cormack, L. K., Stevenson, S. B., & Schor, C. M. (1991). Interocular correlation, luminance contrast and cyclopean processing. Vision Research, 31 (12), 2195–2207.
Cumming, B. G., & DeAngelis, G. C. (2001). The physiology of stereopsis. Annual Review of Neuroscience, 24, 203–238, https://doi.org/10.1146/annurev.neuro.24.1.203.
Ernst, M. O., & Banks, M. S. (2002, January 24). Humans integrate visual and haptic information in a statistically optimal fashion. Nature, 415 (6870), 429–433, https://doi.org/10.1038/415429a.
Felsen, G., & Dan, Y. (2005). A natural approach to studying vision. Nature Neuroscience, 8 (12), 1643–1646, https://doi.org/10.1038/nn1608.
Field, D. J. (1987). Relations between the statistics of natural images and the response properties of cortical cells. Journal of the Optical Society of America. A, Optics and Image Science, 4 (12), 2379–2394.
Foster, D. H. (2011). Color constancy. Vision Research, 51 (7), 674–700, https://doi.org/10.1016/j.visres.2010.09.006.
Geisler, W. S., & Davila, K. D. (1985). Ideal discriminators in spatial vision: Two-point stimuli. Journal of the Optical Society of America. A, Optics and Image Science, 2 (9), 1483–1497.
Geisler, W. S., Najemnik, J., & Ing, A. D. (2009). Optimal stimulus encoders for natural tasks. Journal of Vision, 9 (13): 17, 1–16, https://doi.org/10.1167/9.13.17. [PubMed] [Article]
Geisler, W. S., & Perry, J. S. (2009). Contour statistics in natural images: Grouping across occlusions. Visual Neuroscience, 26 (1), 109–121, https://doi.org/10.1017/S0952523808080875.
Geisler, W. S., & Ringach, D. (2009). Natural systems analysis. Visual Neuroscience, 26 (1), 1–3, https://doi.org/10.1017/S0952523808081005.
Gibaldi, A., Canessa, A., & Sabatini, S. P. (2017). The active side of stereopsis: Fixation strategy and adaptation to natural environments. Scientific Reports, 7: 44800, https://doi.org/10.1038/srep44800.
Girshick, A. R., Landy, M. S., & Simoncelli, E. P. (2011). Cardinal rules: Visual orientation perception reflects knowledge of environmental statistics. Nature Neuroscience, 14 (7), 926–932, https://doi.org/10.1038/nn.2831.
Gonzalez, F., & Perez, R. (1998). Neural mechanisms underlying stereoscopic vision. Progress in Neurobiology, 55 (3), 191–224.
Harris, J. M., McKee, S. P., & Smallman, H. S. (1997). Fine-scale processing in human binocular stereopsis. Journal of the Optical Society of America. A, Optics, Image Science, and Vision, 14 (8), 1673–1683.
Harris, J. M., & Wilcox, L. M. (2009). The role of monocularly visible regions in depth and surface perception. Vision Research, 49 (22), 2666–2685, https://doi.org/10.1016/j.visres.2009.06.021.
Heitman, A., Brackbill, N., Greschner, M., Sher, A., Litke, A. M., & Chichilnisky, E. J. (2016). Testing pseudo-linear models of responses to natural scenes in primate retina. bioRxiv, https://doi.org/10.1101/045336.
Hibbard, P. B. (2008). Binocular energy responses to natural images. Vision Research, 48 (12), 1427–1439, https://doi.org/10.1016/j.visres.2008.03.013.
Hibbard, P. B., & Bouzit, S. (2005). Stereoscopic correspondence for ambiguous targets is affected by elevation and fixation distance. Spatial Vision, 18 (4), 399–411.
Jaini, P., & Burge, J. (2017). Linking normative models of natural tasks to descriptive models of neural response. Journal of Vision, 17 (12): 16, 1–26, https://doi.org/10.1167/17.12.16. [PubMed] [Article]
Kanade, T., & Okutomi, M. (1994). A stereo matching algorithm with an adaptive window: Theory and experiment. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16 (9), 920–932.
Kane, D., Guan, P., & Banks, M. S. (2014). The limits of human stereopsis in space and time. Journal of Neuroscience, 34 (4), 1397–1408, https://doi.org/10.1523/JNEUROSCI.1652-13.2014.
Kaye, M. (1978). Stereopsis without binocular correlation. Vision Research, 18 (8), 1013–1022.
Kim, S., & Burge, J. (2018). The lawful imprecision of human surface tilt estimation in natural scenes. eLife, 7: e31448, https://doi.org/10.7554/eLife.31448.
Knill, D. C., & Richards, W. (1996). Perception as Bayesian inference. New York: Cambridge University Press.
Li, Z., & Atick, J. J. (1994). Efficient stereo coding in the multiscale representation. Network: Computation in Neural Systems, 5, 157–174.
Liu, Y., Bovik, A. C., & Cormack, L. K. (2008). Disparity statistics in natural scenes. Journal of Vision, 8 (11): 19, 1–14, https://doi.org/10.1167/8.11.19. [PubMed] [Article]
Liu, Y., Cormack, L. K., & Bovik, A. C. (2010). Dichotomy between luminance and disparity features at binocular fixations. Journal of Vision, 10 (12): 23, 1–17, https://doi.org/10.1167/10.12.23. [PubMed] [Article]
Lu, E. T., & Hamilton, R. J. (1991). Avalanches and the distribution of solar-flares. Astrophysical Journal, 380 (2), L89–L92.
Maiello, G., Chessa, M., Solari, F., & Bex, P. J. (2014). Simulated disparity and peripheral blur interact during binocular fusion. Journal of Vision, 14 (8): 13, 1–14, https://doi.org/10.1167/14.8.13. [PubMed] [Article]
Nakayama, K., & Shimojo, S. (1990). da Vinci stereopsis: Depth and subjective occluding contours from unpaired image points. Vision Research, 30 (11), 1811–1825.
Olshausen, B. A., & Field, D. J. (1996, June 13). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381 (6583), 607–609, https://doi.org/10.1038/381607a0.
Parise, C. V., Knorre, K., & Ernst, M. O. (2014). Natural auditory scene statistics shapes human spatial hearing. Proceedings of the National Academy of Sciences, 111 (16), 6104–6108. https://doi.org/10.1073/pnas.1322705111.
Pelli, D. G. (1985). Uncertainty explains many aspects of visual contrast detection and discrimination. Journal of the Optical Society of America. A, Optics and Image Science, 2 (9), 1508–1532.
Potetz, B., & Lee, T. S. (2003). Statistical correlations between two-dimensional images and three-dimensional structures in natural scenes. Journal of the Optical Society of America. A, Optics, Image Science, and Vision, 20 (7), 1292–1303.
Reed, W. J., & McKelvey, K. S. (2002). Power-law behaviour and parametric models for the size-distribution of forest fires. Ecological Modelling, 150 (3), 239–254.
Ringach, D. L. (2002). Spatial structure and symmetry of simple-cell receptive fields in macaque primary visual cortex. Journal of Neurophysiology, 88 (1), 455–463, https://doi.org/10.1152/jn.00881.2001.
Sceniak, M. P., Ringach, D. L., Hawken, M. J., & Shapley, R. (1999). Contrast's effect on spatial summation by macaque V1 neurons. Nature Neuroscience, 2 (8), 733–739, https://doi.org/10.1038/11197.
Scharstein, D., & Szeliski, R. (2002). A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision, 47 (1-3), 7–42.
Scharstein, D., & Szeliski, R. (2003). High-accuracy stereo depth maps using structured light. In Proceedings of the 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Vol. 1). IEEE.
Sebastian, S., Burge, J., & Geisler, W. S. (2015). Defocus blur discrimination in natural images with natural optics. Journal of Vision, 15 (5): 16, 1–17, https://doi.org/10.1167/15.5.16. [PubMed] [Article]
Sprague, W. W., Cooper, E. A., Tosic, I., & Banks, M. S. (2015). Stereopsis is adaptive for the natural environment. Science Advances, 1 (4), e1400254–e1400254, https://doi.org/10.1126/sciadv.1400254.
Stocker, A. A., & Simoncelli, E. P. (2006). Noise characteristics and prior expectations in human visual speed perception. Nature Neuroscience, 9 (4), 578–585, https://doi.org/10.1038/nn1669.
Talebi, V., & Baker, C. L. (2012). Natural versus synthetic stimuli for estimating receptive field models: A comparison of predictive robustness. Journal of Neuroscience, 32 (5), 1560–1576, https://doi.org/10.1523/JNEUROSCI.4661-12.2012.
Tyler, C. W., & Julesz, B. (1978). Binocular cross-correlation in time and space. Vision Research, 18 (1), 101–105.
van Hateren, J. H., & van der Schaaf, A. (1998). Independent component filters of natural images compared with simple cells in primary visual cortex. Proceedings of the Royal Society of London. Series B: Biological Sciences, 265 (1394), 359–366, https://doi.org/10.1098/rspb.1998.0303.
Weiss, Y., Simoncelli, E. P., & Adelson, E. H. (2002). Motion illusions as optimal percepts. Nature Neuroscience, 5 (6), 598–604, https://doi.org/10.1038/nn858.
Wilcox, L. M., & Lakra, D. C. (2007). Depth from binocular half-occlusions in stereoscopic images of natural scenes. Perception, 36 (6), 830–839, https://doi.org/10.1068/p5708.
Yang, Z., & Purves, D. (2003). Image/source statistics of surfaces in natural scenes. Network, 14 (3), 371–390.
Yuille, A. L., & Grzywacz, N. M. (1988, May 5). A computational theory for the perception of coherent visual motion. Nature, 333 (6168), 71–74, https://doi.org/10.1038/333071a0.
Figure 1
 
Stereo 3D sampling geometry, corresponding image-points, and interpolation procedure. (A) Top-view of 3D sampling geometry. Left-eye (LE) and right-eye (RE) luminance and range images are captured one human interocular distance apart (65 mm). Sampled 3D scene points (white squares) occur at the intersections of LE and RE lines of sight (thin lines) and usually do not lie on 3D surfaces. Samples in the projection plane (i.e., pixel centers) are a subset of these sampled 3D scene points. Sampled 3D surface points (white dots) occur at the intersections of LE or RE lines of sight with 3D surfaces (thick black curve) in the scene. Small arrows along lines of sight represent light reflected from sampled 3D surface points that determine the pixel values in the luminance and range images for each eye. Occasionally, sampled 3D surface points coincide with sampled 3D scene points (large dashed circles). Light rays from these points intersect the projection plane at pixel centers. (B) Procedure to obtain corresponding image point locations: Sample a pixel location (1) in the anchor eye's image (here, the left eye). Locate the corresponding sampled left eye 3D surface point (2). Find the right eye projection (3) from sampled 3D surface point by ray tracing. Select nearest pixel center (4) in right eye image. Locate the corresponding sampled right eye 3D surface point (5). Find sampled 3D scene point (6) nearest the left- and right-eye sampled 3D surface points. This sampled 3D scene point is the intersection point of the left- and right-eye lines of sight through the sampled 3D surface points. Find interpolated 3D surface point (7) by linear interpolation (i.e., the location of the intersection of cyclopean line of sight with chord joining sampled 3D surface points; see inset). Dashed light rays from this interpolated 3D surface point define corresponding point locations (8) in the projection plane. 
The vergence demand \(\theta\) of the interpolated scene point is the angle between the left- and right-eye lines of sight required to fixate the point. (C) Sampling error before interpolation in arcmin. Dashed vertical lines indicate the expected sampling error assuming surface point locations are uniformly distributed between sampled 3D scene points. (D) Estimated sampling error after interpolation in arcsec.
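The interpolation step in (B), intersecting the cyclopean line of sight with the chord joining the two sampled 3D surface points (step 7), reduces to solving a small linear system. Below is a minimal top-view (x–z plane) sketch in Python; the coordinates, function names, and example surface points are illustrative assumptions, not the authors' code (only the 65-mm interocular distance comes from the caption).

```python
import numpy as np

IPD = 0.065                       # interocular distance in meters (65 mm)
LE = np.array([-IPD / 2, 0.0])    # left-eye nodal point (x, z), top view
RE = np.array([+IPD / 2, 0.0])    # right-eye nodal point
CE = (LE + RE) / 2                # cyclopean eye

def interpolate_surface_point(p_left, p_right, cyclopean_dir):
    """Intersect the cyclopean line of sight with the chord joining the two
    sampled 3D surface points (step 7 of the procedure in panel B)."""
    d = np.asarray(cyclopean_dir, float)                          # line-of-sight direction
    c = np.asarray(p_right, float) - np.asarray(p_left, float)   # chord direction
    # Solve CE + t*d = p_left + s*c, a 2x2 linear system in (t, s).
    t, s = np.linalg.solve(np.column_stack([d, -c]),
                           np.asarray(p_left, float) - CE)
    return CE + t * d

# Example: left- and right-eye sampled surface points straddling a cyclopean
# line of sight pointing straight ahead (+z).
p_le = np.array([-0.01, 3.02])
p_re = np.array([+0.02, 2.96])
print(interpolate_surface_point(p_le, p_re, [0.0, 1.0]))   # approx. [0.0, 3.0]
```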
Figure 2
 
Half-occluded scene points, binocularly visible scene points, and vergence demand. (A) Half-occluded 3D surface point. The scene point on the far surface (black circle) is visible to the left eye and occluded from the right eye. Arrows indicate the ray tracing performed by the interpolation routine (see Figure 1). Squares represent interpolated image points returned by the interpolation procedure. When 3D surface points are half-occluded, the interpolation procedure returns invalid points. (B) Binocularly visible surface point (black circle) and corresponding image points (black squares) in the projection plane. When the scene point is binocularly visible, its vergence demand \(\theta\) is identical whether the left or the right eye is used as the anchor eye. (C) Vergence demand is computed within the epipolar plane defined by a 3D surface point and the left- and right-eye nodal points.
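For a binocularly visible point, the vergence demand described in (B) and (C) is simply the angle between the two lines of sight to the point. A minimal sketch, with illustrative function names and coordinates:

```python
import numpy as np

def vergence_demand_deg(point, le_nodal, re_nodal):
    """Vergence demand (degrees) of a 3D point: the angle between the left- and
    right-eye lines of sight to the point, measured in their common (epipolar) plane."""
    v_l = np.asarray(point, float) - np.asarray(le_nodal, float)
    v_r = np.asarray(point, float) - np.asarray(re_nodal, float)
    cosang = v_l @ v_r / (np.linalg.norm(v_l) * np.linalg.norm(v_r))
    return np.degrees(np.arccos(np.clip(cosang, -1.0, 1.0)))

# A point 3 m straight ahead, with a 65 mm interocular distance:
ipd = 0.065
theta = vergence_demand_deg([0.0, 0.0, 3.0], [-ipd / 2, 0, 0], [+ipd / 2, 0, 0])
print(round(theta, 3), "deg")   # ~1.241 deg = 2*atan(0.0325/3)
```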
Figure 3
 
Corresponding points overlaid on stereo-images (upper row) and coregistered groundtruth distance data (lower row) for two different scenes, (A) and (B). Wall-fuse the left two images or cross-fuse the right two images to see the imaged scene in stereo-3D. True corresponding points (yellow dots) lie on imaged 3D surfaces. Candidate corresponding points that are half-occluded or are otherwise invalid (red dots) are also shown. For reference, the yellow boxes in (A) and (B) indicate 3° and 1° areas, respectively.
Figure 4
 
Natural stereo-image patches and corresponding groundtruth disparity maps, sampled from natural scenes. Free-fuse to see in stereo-3D. (A–D) Local disparity contrast \({C_\delta }\) (i.e., local depth variation) increases in the subplots from left to right. Groundtruth disparity at each pixel (bottom row) is computed directly from groundtruth distance data. Disparity contrast is computed under a window that defines the spatial integration area (see Methods). The colorbar indicates the disparity in arcmin relative to the disparity (i.e., vergence demand) at the center pixel.
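The disparity-contrast statistic \(C_\delta\) that orders these patches can be sketched as a windowed root-mean-square of the groundtruth disparities relative to the center-pixel disparity. In the sketch below, the Gaussian window and the pixels-per-degree factor are assumptions for illustration; the exact window is defined in Methods.

```python
import numpy as np

def disparity_contrast(disparity_map_arcmin, fwhm_deg, pix_per_deg):
    """Windowed RMS disparity (arcmin) relative to the center pixel.
    fwhm_deg is the window's full width at half-height (an assumed Gaussian)."""
    d = np.asarray(disparity_map_arcmin, float)
    h, w = d.shape
    yy, xx = np.mgrid[:h, :w]
    cy, cx = (h - 1) / 2, (w - 1) / 2
    sigma_pix = fwhm_deg * pix_per_deg / 2.355          # FWHM -> Gaussian sigma
    win = np.exp(-((xx - cx) ** 2 + (yy - cy) ** 2) / (2 * sigma_pix ** 2))
    win /= win.sum()
    rel = d - d[int(cy), int(cx)]                       # disparity relative to center
    return np.sqrt(np.sum(win * rel ** 2))

# A fronto-parallel patch has zero disparity contrast; a patch with a depth edge does not.
flat = np.zeros((65, 65))
edge = np.zeros((65, 65)); edge[:, 33:] = 5.0           # 5 arcmin disparity step
print(disparity_contrast(flat, 0.5, 64), disparity_contrast(edge, 0.5, 64))
```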
Figure 5
 
Effect of disparity contrast on half-occlusion performance. (A) Three example stereo-image patches centered on scene points that are half-occluded to the left eye, binocularly visible, and half-occluded to the right eye. Spatial integration areas of different sizes (1° and 3°) are shown as dashed circles. (B) The half-occlusion detection task is to distinguish half-occluded versus binocularly visible points with 0.0 arcmin of disparity. Performance is compared for scene points with low, medium, and high disparity contrasts. (C) Conditional probability distributions of the decision variable (i.e., the binocular correlation of the left- and right-eye image patches). The dashed black curve represents the distribution of the decision variable for half-occluded points. Solid curves show the decision variable distributions for patches with binocularly visible centers having low (blue; 0.05–1.00 arcmin), medium (red; 0.2–4.0 arcmin), and high (green; 0.75–15.0 arcmin) disparity contrasts. Binocular image correlation and disparity contrast are computed with spatial integration areas of 1.0° (i.e., 0.5° at half-height). (D) Receiver operating characteristic (ROC) curves for the half-occlusion task. Higher disparity contrasts decrease half-occlusion detection performance. (E) Half-occlusion detection sensitivity (d′) as a function of spatial integration area for different disparity contrasts. Arrows mark the spatial integration area at half-height for which half-occlusion detection performance is optimized.
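A minimal sketch of the decision variable in (C), the windowed correlation of the left- and right-eye image patches, together with a d′ computed from its two conditional distributions. The flat window, the equal-variance d′ formula, and the toy patches below are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def windowed_correlation(img_left, img_right, window):
    """Weighted Pearson correlation of two image patches under a spatial window."""
    w = window / window.sum()
    l, r = img_left.astype(float), img_right.astype(float)
    lm, rm = np.sum(w * l), np.sum(w * r)
    cov = np.sum(w * (l - lm) * (r - rm))
    return cov / np.sqrt(np.sum(w * (l - lm) ** 2) * np.sum(w * (r - rm) ** 2))

def dprime(dv_binocular, dv_halfoccluded):
    """Sensitivity from samples of the decision variable (equal-variance assumption)."""
    pooled_sd = np.sqrt(0.5 * (np.var(dv_binocular) + np.var(dv_halfoccluded)))
    return (np.mean(dv_binocular) - np.mean(dv_halfoccluded)) / pooled_sd

# Toy example: correlated (binocularly visible) vs. uncorrelated (half-occluded) patch pairs.
rng = np.random.default_rng(0)
win = np.ones((33, 33))
dv_bino = [windowed_correlation(p, p + 0.1 * rng.standard_normal(p.shape), win)
           for p in rng.standard_normal((200, 33, 33))]
dv_occl = [windowed_correlation(rng.standard_normal((33, 33)),
                                rng.standard_normal((33, 33)), win)
           for _ in range(200)]
print(dprime(np.array(dv_bino), np.array(dv_occl)))
```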
Figure 6
 
Effect of local disparity variation on optimal processing size for half-occlusion detection. (A) Sensitivity as a function of disparity contrast. Sensitivity decreases monotonically with disparity contrast; for each disparity contrast, sensitivity was measured with the optimal integration area. (B) Optimal window size as a function of disparity contrast. On log-log axes, the optimal window size decreases approximately linearly as disparity contrast increases. Results are highly robust to changes in the bin width.
Figure 7
 
Effect of disparity contrast on disparity detection performance. (A) Stereo-image patches centered on binocularly visible scene points with 0.0 and 1.0 arcmin of fixation disparity. The eyes are fixated 1 arcmin in front of the target in the right image. (B) The disparity detection task simulated here is to distinguish scene points with 0.0 arcmin versus 1.0 arcmin of fixation disparity. Performance is compared for scene points with low, medium, and high disparity contrasts. (C) Conditional probability distributions of the decision variable. The decision variable is the disparity that maximizes the local cross-correlation function (Equation 3). Results are presented for a spatial integration area of 1.0°. The solid and dashed curves show the decision variable for scene points fixated with 0.0 and 1.0 arcmin of disparity, respectively, for patches having low (blue), medium (red), and high (green) disparity contrasts. (D) ROC curves for disparity detection. (E) Disparity detection sensitivity (i.e., d′) as a function of spatial integration area for different disparity contrasts.
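The decision variable in (C) can be sketched as the horizontal shift that maximizes a windowed, normalized cross-correlation between the two eyes' patches, in the spirit of the local cross-correlation in Equation 3; the exact normalization, window, and sub-pixel handling used in the paper are not reproduced here.

```python
import numpy as np

def correlation(a, b, window):
    """Weighted Pearson correlation of two patches under a spatial window."""
    w = window / window.sum()
    am, bm = np.sum(w * a), np.sum(w * b)
    cov = np.sum(w * (a - am) * (b - bm))
    return cov / np.sqrt(np.sum(w * (a - am) ** 2) * np.sum(w * (b - bm) ** 2))

def estimate_disparity(img_left, img_right, window, max_shift_pix):
    """Integer-pixel shift of the right image that best matches the left image."""
    shifts = np.arange(-max_shift_pix, max_shift_pix + 1)
    scores = [correlation(img_left, np.roll(img_right, s, axis=1), window)
              for s in shifts]
    return shifts[int(np.argmax(scores))]

# Toy example: the right image is the left image shifted 3 pixels to the left.
rng = np.random.default_rng(1)
left = rng.standard_normal((65, 65))
right = np.roll(left, -3, axis=1)
print(estimate_disparity(left, right, np.ones((65, 65)), max_shift_pix=8))  # -> 3
```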
Figure 8
 
Effect of local disparity variation on size of optimal integration area for disparity detection. (A) Sensitivity as a function of disparity contrast. Sensitivity drops monotonically with disparity contrast. For each disparity contrast, sensitivities were measured with the optimal integration area. (B) Optimal integration area as a function of disparity contrast. The optimal integration area decreases as disparity contrast increases and then plateaus at a minimum value (0.4° window; 0.2° at half-height).
Figure 9
 
Effect of local depth variation on the shape of the spatial integration area that optimizes performance. (A) Integration areas with the same size but different aspect ratios within which to compute the decision variable for the half-occlusion task (i.e., binocular image correlation). (B) Change in half-occlusion detection sensitivity (i.e., d′) as a function of aspect ratio for different disparity contrasts. Arrows indicate the aspect ratio that maximizes half-occlusion detection performance. The maxima were determined using a polynomial fit (not shown) to the raw data. Aspect ratios less than 1.0 are horizontally elongated. Aspect ratios larger than 1.0 are vertically elongated. Colors indicate low (blue; 0.05–1.00 arcmin), medium (red; 0.2–4.0 arcmin), and high (green; 0.75–15.0 arcmin) disparity contrasts. (C) Optimal aspect ratio as a function of disparity contrast. The optimal window for half-occlusion detection is more vertically elongated for higher disparity contrasts. The best fixed aspect ratio across all disparity contrasts is also shown. (D) Same as (A), but for the disparity detection task. (E–F) Same as (B–C), but for the disparity detection task.
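The fixed-size, variable-aspect-ratio integration areas in (A) and (D) can be sketched as 2D Gaussian windows whose horizontal and vertical widths trade off while their product is held constant. The Gaussian form and parameter names below are assumptions for illustration.

```python
import numpy as np

def aspect_ratio_window(size_pix, base_sigma_pix, aspect_ratio):
    """2D Gaussian window with fixed 'area' (sigma_x * sigma_y held constant)
    and a given vertical:horizontal aspect ratio."""
    sigma_y = base_sigma_pix * np.sqrt(aspect_ratio)   # > 1 -> vertically elongated
    sigma_x = base_sigma_pix / np.sqrt(aspect_ratio)   # < 1 -> horizontally elongated
    c = (size_pix - 1) / 2
    yy, xx = np.mgrid[:size_pix, :size_pix]
    win = np.exp(-0.5 * (((xx - c) / sigma_x) ** 2 + ((yy - c) / sigma_y) ** 2))
    return win / win.sum()

# Windows with aspect ratios 0.5 (horizontal), 1.0 (isotropic), and 2.0 (vertical)
# integrate roughly the same image area; sweeping the decision variable over such
# windows identifies the aspect ratio that maximizes d-prime, as in (B) and (E).
for ar in (0.5, 1.0, 2.0):
    w = aspect_ratio_window(65, 8.0, ar)
    print(ar, round(w.max(), 5))
```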
Figure 10
 
Joint statistics of disparity contrast and binocular difference image contrast in natural scenes. (A) Stereo-image with low groundtruth disparity contrast and low binocular difference image contrast. The upper row shows the stereo-image; the circle indicates the 1° spatial integration area from which the statistics were computed. The lower row shows the groundtruth disparities and the binocular difference image. (B) Stereo-image with high groundtruth disparity contrast and high binocular difference image contrast. (C) Disparity contrast and binocular difference image contrast in natural scenes are jointly distributed as a log-Gaussian and are significantly correlated. Points labeled in yellow indicate the disparity contrast and binocular difference image contrast of the stereo-images in (A) and (B). Statistics were computed for a spatial integration area of 1.0° (0.5° width at half-height). Similar results hold for other spatial integration areas.
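A minimal sketch of the image-computable statistic: the contrast of the binocular difference image (left-eye minus right-eye contrast image) within the spatial integration area. The Weber-contrast normalization and flat window below are illustrative assumptions, not the authors' exact recipe.

```python
import numpy as np

def contrast_image(luminance, window):
    """Windowed Weber-contrast image: (L - mean L) / mean L."""
    w = window / window.sum()
    mean_lum = np.sum(w * luminance)
    return (luminance - mean_lum) / mean_lum

def difference_image_contrast(lum_left, lum_right, window):
    """RMS contrast of the left-minus-right contrast image under the window."""
    w = window / window.sum()
    diff = contrast_image(lum_left, window) - contrast_image(lum_right, window)
    return np.sqrt(np.sum(w * diff ** 2))

# Toy example: identical left and right patches give zero difference-image contrast;
# a patch pair containing a monocular (half-occluded) region does not.
rng = np.random.default_rng(2)
win = np.ones((65, 65))
lum_l = 100 + 10 * rng.standard_normal((65, 65))
lum_r_same = lum_l.copy()
lum_r_occl = lum_l.copy()
lum_r_occl[:, :10] = 100 + 10 * rng.standard_normal((65, 10))
print(difference_image_contrast(lum_l, lum_r_same, win),
      difference_image_contrast(lum_l, lum_r_occl, win))
```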
Figure 11
 
Disparity variation associated with binocularly visible surfaces. (A) The standard deviation of natural disparity signals increases systematically with retinal eccentricity. Disparities are more variable at retinal locations farther from fixated points. (B) Same as (A), but conditioned on five different disparity contrast bins: 0.1–1.0, 0.2–2.0, 0.4–4.0, 0.75–7.5, 1.5–15.0 arcmin. At low disparity contrasts, disparities are nearly homogeneous within 1° of the fovea. At high disparity contrasts, disparity variation increases rapidly with eccentricity, and the region of low variability is smaller and more vertically elongated. Ellipses (fit by hand) indicate iso-disparity-variation contours. (C) Disparity correlation as a function of retinal position. (D) Same as (C), but conditioned on the disparity contrast bins in (B). At low disparity contrasts, disparities are more highly correlated across space. At high disparity contrasts, the region of high correlation is smaller and more vertically elongated. (E) and (F) Horizontal and vertical slices through plots in (C) and (D), solid and dashed curves, respectively.
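The maps in (A) and (C) summarize, across many patches, how disparity (expressed relative to the center pixel) varies and covaries with retinal location. The sketch below illustrates the two computations on synthetic patches; the toy random-walk disparity maps and the reference location used for the correlation map are assumptions, whereas the paper computes these maps from patches sampled from the natural-scene database.

```python
import numpy as np

rng = np.random.default_rng(3)
n, size = 500, 33
# Toy disparity maps with some horizontal spatial correlation (random walk along x).
patches = np.cumsum(rng.standard_normal((n, size, size)), axis=2)
center = (size - 1) // 2
rel = patches - patches[:, center:center + 1, center:center + 1]   # relative disparity

# Per-location standard deviation of relative disparity (analogue of panel A).
std_map = rel.std(axis=0)

# Per-location correlation with the disparity one pixel right of the center
# (analogue of panel C; the reference location is an illustrative choice).
ref = rel[:, center, center + 1]
ref_c = ref - ref.mean()
rel_c = rel - rel.mean(axis=0)
corr_map = (rel_c * ref_c[:, None, None]).mean(axis=0) / (
    rel.std(axis=0) * ref.std() + 1e-12)

print(std_map[center, center], corr_map[center, center + 1])  # 0 at center; ~1 at reference
```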
Figure 12
 
Half-occlusion statistics in natural scenes. (A) Example natural stereo-image (top) and binocular visibility map (bottom). Half-occluded points (i.e., points in one eye's image that are invisible in the other eye's image) are shown in black; binocularly visible points are shown in white. Inset shows a stereo-image patch with half-occluded points overlaid in black. (B) Half-occlusion probability at each spatial location near the fovea. (C) Same as (B), but conditioned on five different disparity contrast bins: 0.1–1.0, 0.2–2.0, 0.4–4.0, 0.75–7.5, 1.5–15.0 arcmin. At low disparity contrasts, half-occlusion probability is near zero throughout the 1° region near the fovea. At high disparity contrasts, half-occlusion probability increases dramatically with eccentricity, and the region of low probability is smaller and more vertically elongated. (D) Distribution of horizontal sizes of contiguous binocularly visible and half-occluded regions in natural scenes (solid and dashed curves, respectively). The sizes of contiguous binocularly visible and half-occluded regions are approximately distributed as power laws, with mean horizontal sizes of 0.44° and 0.06°, respectively.
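The region sizes in (D) can be measured by collecting horizontal run lengths from the binary visibility map: walk each row and record the lengths of contiguous runs of binocularly visible and half-occluded pixels. A minimal sketch with an illustrative pixels-per-degree factor:

```python
import numpy as np

def horizontal_run_lengths(visibility_map, pix_per_deg):
    """Return (visible_sizes_deg, occluded_sizes_deg) of contiguous horizontal runs
    in a binary map where True = binocularly visible and False = half-occluded."""
    visible, occluded = [], []
    for row in np.asarray(visibility_map, bool):
        # Indices where the row changes value mark the ends of runs.
        change = np.flatnonzero(np.diff(row.astype(int)))
        starts = np.concatenate(([0], change + 1))
        ends = np.concatenate((change + 1, [row.size]))
        for s, e in zip(starts, ends):
            (visible if row[s] else occluded).append((e - s) / pix_per_deg)
    return np.array(visible), np.array(occluded)

# Toy example: one row with a 12-pixel half-occluded stripe in a visible background.
vmap = np.ones((1, 100), bool)
vmap[0, 40:52] = False
vis, occ = horizontal_run_lengths(vmap, pix_per_deg=52)
print(vis, occ)   # visible runs of 40 and 48 px, one occluded run of 12 px (in deg)
```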
Figure 13
 
Near-foveal disparities as a function of viewing distance and spatial integration region. (A) Disparity standard deviation across all patches in the database (data identical to Figure 11A). (B) Disparity variance as a function of azimuth and elevation. Disparity variance increases linearly outside the central ±1/8°. Variance increases more rapidly in azimuth \(C_\delta^2 = 60.1a + 7.0\) than in elevation \(C_\delta^2 = 47.4e + 7.6\), where \(a\) and \(e\) are azimuth and elevation in degrees, respectively. Curves correspond to the squared standard deviation along horizontal and vertical slices through the plot in 11A. (C) Disparity standard deviation at each retinal location, conditioned on five different viewing distances (4.0–20.0 m). For each viewing distance, data are pooled in 0.1-diopter bins centered on the viewing distance. For far distances, disparities near the fovea are more likely to be small. (D) and (E) Disparity variance in azimuth and elevation as a function of distance (colors). Best-fit lines in azimuth range from \(C_\delta^2 = 83.4a + 17.8\) to \(C_\delta^2 = 47.4a + 4.8\) at viewing distances from 4.0 m to 20.0 m, and best-fit lines in elevation range from \(C_\delta^2 = 83.8e + 11.5\) to \(C_\delta^2 = 36.0e + 5.0\). Variance increases more rapidly in the upper than in the lower visual field. (F) Half-occlusion probability as a function of retinal location (data identical to Figure 12B). (G) Half-occlusion probability conditioned on viewing distance. For far distances, the region of least half-occlusion probability shrinks to a vertically elongated zone centered on the fovea.
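The linear fits reported in (B), (D), and (E) amount to regressing disparity variance on absolute eccentricity outside the central ±1/8°. A minimal sketch of that fitting step on toy data; the function name, exclusion zone handling, and synthetic data are illustrative assumptions.

```python
import numpy as np

def fit_variance_slope(ecc_deg, disparity_var, exclude_deg=0.125):
    """Least-squares line var = m*|ecc| + b over eccentricities beyond exclude_deg."""
    ecc = np.abs(np.asarray(ecc_deg, float))
    keep = ecc > exclude_deg
    m, b = np.polyfit(ecc[keep], np.asarray(disparity_var, float)[keep], deg=1)
    return m, b

# Toy data with the same form as the azimuth fit C_delta^2 = 60.1a + 7.0:
a = np.linspace(-0.5, 0.5, 101)
var = 60.1 * np.abs(a) + 7.0 + 0.5 * np.random.default_rng(4).standard_normal(a.size)
print(fit_variance_slope(a, var))   # slope ~60, intercept ~7
```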
Supplementary Material
Supplementary Figure S1
Supplementary Figure S2
Supplementary Figure S3
Supplementary Figure S4
Supplementary Figure S5
Supplementary Figure S6
Supplementary Figure S7