**Abstract**
**A great challenge of systems neuroscience is to understand the computations that underlie perceptual constancies, the ability to represent behaviorally relevant stimulus properties as constant even when irrelevant stimulus properties vary. As signals proceed through the visual system, neural states become more selective for properties of the environment, and more invariant to irrelevant features of the retinal images. Here, we describe a method for determining the computations that perform these transformations optimally, and apply it to the specific computational task of estimating a powerful depth cue: binocular disparity. We simultaneously determine the optimal receptive field population for encoding natural stereo images of locally planar surfaces and the optimal nonlinear units for decoding the population responses into estimates of disparity. The optimal processing predicts well-established properties of neurons in cortex. Estimation performance parallels important aspects of human performance. Thus, by analyzing the photoreceptor responses to natural images, we provide a normative account of the neurophysiology and psychophysics of absolute disparity processing. Critically, the optimal processing rules are not arbitrarily chosen to match the properties of neurophysiological processing, nor are they fit to match behavioral performance. Rather, they are dictated by the task-relevant statistical properties of complex natural stimuli. Our approach reveals how selective invariant tuning—especially for properties not trivially available in the retinal images—could be implemented in neural systems to maximize performance in particular tasks.**

**r** (see Figure 2a), is determined by the luminance (Figure 1b) and depth structure of natural scenes, projective viewing geometry, and the optics, sensors, and noise in the vision system. We model each of these factors for the human visual system (see Methods for details). We generate a large number of noisy, sampled stereo-images of small (1° × 1°), fronto-parallel surface patches for each of a large number of disparity levels within Panum's fusional range (Panum, 1858) (−16.875 to 16.875 arcmin in 1.875 arcmin steps). This range covers approximately 80% of the disparities that occur at or near the fovea in natural viewing (Liu, Bovik, & Cormack, 2008), and these image patches represent the information available to the vision system for processing. Although we focus first on fronto-parallel patches, we later show that the results are robust to surface slant and are likely to be robust to other depth variations occurring in natural scenes. We emphasize that in this paper we simulate retinal images by projecting natural image patches onto planar surfaces (Figure 1c), rather than simulating retinal images from stereo photographs. Later, we evaluate the effects of this simplification (see Discussion).
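The disparity sampling described above can be sketched in a few lines. This is an illustrative sketch based only on the numbers given in the text (19 levels spanning ±16.875 arcmin in 1.875-arcmin steps), not the stimulus-generation code used in the study.

```python
# Disparity levels used to label the training stimuli: 19 levels spanning
# -16.875 to +16.875 arcmin in 1.875-arcmin steps (within Panum's fusional range).
def disparity_levels(max_arcmin=16.875, step_arcmin=1.875):
    n = int(round(2 * max_arcmin / step_arcmin)) + 1   # 19 levels
    return [-max_arcmin + i * step_arcmin for i in range(n)]

levels = disparity_levels()
print(len(levels), levels[0], levels[9], levels[-1])   # 19 -16.875 0.0 16.875
```

With 400 natural inputs per level, this grid yields the 7,600 training stimuli mentioned later in the text.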

*α*_{L} and *α*_{R} are the angles between the retinal projections of a target and fixation point in the left and right eyes. This is the definition of absolute retinal disparity. The specific pattern of binocular disparities depends on the distance between the eyes, the distance and direction that the eyes are fixated, and the distance and depth structure of the surfaces in the scene. We consider a viewing situation in which the eyes are separated by 6.5 cm (a typical interocular separation) and are fixated on a point 40 cm straight ahead (a typical arm's length) (Figure 1c).

*c*(**x**) by subtracting off and dividing by the mean and then multiplying by a raised cosine of 0.5° at half height. The window limits the maximum possible size of the binocular filters; it places no restriction on minimum size. The size of the cosine window approximately matches the largest V1 binocular receptive field sizes near the fovea (Nienborg et al., 2004). Next, each eye's sampled image patch is averaged vertically to obtain what are henceforth referred to as left and right eye signals. Vertical averaging is tantamount to considering only vertically oriented filters, because this is the operation that vertically oriented filters perform on images. All orientations can provide information about binocular disparity (Chen & Qian, 2004; DeAngelis et al., 1991); however, because canonical disparity receptive fields are vertically oriented, we focus our analysis on them. An example stereo pair and corresponding left and right eye signals are shown in Figure 1d. Finally, the signals are contrast normalized to a vector magnitude of 1.0: *c*_{norm}(**x**) = *c*(**x**)/‖*c*(**x**)‖. This normalization is a simplified version of the contrast normalization seen in cortical neurons (Albrecht & Geisler, 1991; Albrecht & Hamilton, 1982; Heeger, 1992) (see Supplement). Seventy-six hundred normalized left and right eye signals (400 natural inputs × 19 disparity levels) constituted the training set for AMA.
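The signal preparation pipeline just described (contrast conversion, cosine windowing, vertical averaging, and normalization to unit vector magnitude) can be sketched as follows. The 32 × 32 patch size, the toy random luminance patch, and the one-dimensional form of the raised-cosine window are simplifying assumptions for illustration.

```python
import numpy as np

def prepare_eye_signal(patch):
    """Convert a luminance patch to a normalized 1-D eye signal:
    Weber contrast -> raised-cosine window -> vertical average -> unit norm."""
    c = (patch - patch.mean()) / patch.mean()        # contrast image c(x)
    x = np.linspace(-0.5, 0.5, patch.shape[1])       # horizontal position (deg)
    window = 0.5 * (1 + np.cos(2 * np.pi * x))       # 0.5 deg width at half height
    c = c * window                                   # limits maximum filter size
    signal = c.mean(axis=0)                          # vertical average
    return signal / np.linalg.norm(signal)           # c_norm = c / ||c||

rng = np.random.default_rng(0)
patch = 100.0 + rng.standard_normal((32, 32))        # toy luminance patch
signal = prepare_eye_signal(patch)
print(signal.shape, round(float(np.linalg.norm(signal)), 6))   # (32,) 1.0
```

The window equals 0.5 at ±0.25° and 0 at the patch edges, so its full width at half height is the 0.5° stated in the text.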

*should* cover for maximally accurate disparity estimation. Some filters are excited by nonzero disparities, some are excited by zero (or near-zero) disparities, and still others are suppressed by zero disparity. The left and right eye filter components are approximately log-Gabor (Gaussian on a log spatial frequency axis). The spatial frequency selectivity of each filter's left and right eye components is similar, but differs between filters (Figure 3b). Filter tuning and bandwidth range between 1.2–4.7 c/° and 0.9–4.2 c/°, respectively, with an average bandwidth of approximately 1.5 octaves. Thus, each filter's spatial extent (see Figure 3a) is inversely related to its spatial frequency tuning: As the tuned frequency increases, the spatial extent of the filter decreases. The filters also exhibit a mixture of phase and position coding (Figure 3c), suggesting that a mixture of phase and position coding is optimal. Similar filters result (Figure S1a–c) when the training set contains surfaces having a distribution of different slants (see Discussion). (Note that the cosine windows bias the filters more toward phase than position encoding. Additional analyses have nevertheless shown that windows having a nonzero position offset, i.e., position disparity, do not qualitatively change the filters.)
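Filters of this type are easy to construct directly. The sketch below builds a log-Gabor amplitude spectrum (a Gaussian on a log spatial-frequency axis); the 2.4 c/° preferred frequency is chosen from the tuning range reported above and the 1.5-octave bandwidth from the reported average, but the construction itself is illustrative, not the AMA learning procedure.

```python
import numpy as np

def log_gabor_spectrum(freqs, f0, bandwidth_octaves=1.5):
    """Amplitude spectrum that is Gaussian on a log spatial-frequency axis,
    peaked at f0 (c/deg), with the given full width at half height in octaves."""
    sigma = bandwidth_octaves * np.log(2) / (2 * np.sqrt(2 * np.log(2)))
    amp = np.zeros_like(freqs, dtype=float)
    nz = freqs > 0                                   # log-Gabors pass no DC
    amp[nz] = np.exp(-0.5 * (np.log(freqs[nz] / f0) / sigma) ** 2)
    return amp

freqs = np.linspace(0.0, 10.0, 1001)
amp = log_gabor_spectrum(freqs, f0=2.4)              # within the 1.2-4.7 c/deg range
print(float(freqs[np.argmax(amp)]))                  # peak near f0 = 2.4 c/deg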

**f**_{5} (see Figure 3, Supplementary Figure S5) does not respond because anticorrelated intensity patterns in the left- and right-eye images are very unlikely in natural images. A complex cell with this binocular receptive field would produce a disparity tuning curve similar to the tuning curve produced by a classic tuned-inhibitory cell (Figure S5; Poggio, Gonzalez, & Krause, 1988).

*δ*_{k}. (The dot product between each filter and a contrast normalized left- and right-eye signal gives each filter's response. The filter responses are represented by the vector **R**; see Figure 2b.) The conditional response distributions *p*(**R**|*δ*_{k}) are approximately Gaussian (83% of the marginal distributions conditioned on disparity are indistinguishable from Gaussian; K-S test, *p* > 0.01, see Figure 4a). (The approximately Gaussian form is largely due to the contrast normalization of the training stimuli; see Discussion.) Each of these distributions was fit with a multidimensional Gaussian *gauss*(**R**; **u**_{k}, **C**_{k}) estimated from the sample mean vector **û**_{k} and the sample covariance matrix **Ĉ**_{k}. Figure 4a shows sample filter response distributions (conditioned on several disparity levels) for the first two AMA filters. Much of the information about disparity is contained in the covariance between the filter responses, indicating that a nonlinear decoder is required for optimal performance.

*p*(*δ*) cancel out. (This assumption yields a lower bound on performance; performance will increase somewhat in natural conditions when the prior probabilities are not flat, a scenario that is considered later.)
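The Gaussian fitting and likelihood evaluation described above can be sketched as follows, with synthetic two-dimensional responses standing in for AMA filter responses. The example is built so the two disparity levels share a mean and differ only in covariance, illustrating why a nonlinear (covariance-sensitive) decoder is needed.

```python
import numpy as np

def fit_conditionals(responses_by_level):
    """Fit gauss(R; u_k, C_k) to filter responses conditioned on each disparity,
    using the sample mean vector and sample covariance matrix."""
    return [(R.mean(axis=0), np.cov(R, rowvar=False)) for R in responses_by_level]

def log_likelihoods(r, params):
    """ln gauss(r; u_k, C_k) for each disparity level k."""
    lls = []
    for u, C in params:
        d = r - u
        _, logdet = np.linalg.slogdet(C)
        lls.append(-0.5 * (d @ np.linalg.solve(C, d) + logdet
                           + d.size * np.log(2 * np.pi)))
    return np.array(lls)

rng = np.random.default_rng(1)
# Two synthetic "disparity levels": identical means, different response covariance.
R0 = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=2000)
R1 = rng.multivariate_normal([0, 0], [[1.0, -0.8], [-0.8, 1.0]], size=2000)
params = fit_conditionals([R0, R1])
ll = log_likelihoods(np.array([1.0, 1.0]), params)
print(ll[0] > ll[1])   # True: covariance alone identifies the level
```

A linear readout of the mean responses could not distinguish these two levels at all; the quadratic Mahalanobis term does.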

**u**_{k} and **C**_{k}, which are the mean and covariance matrices of the AMA filter responses to a large collection of natural stereo-image patches having disparity *δ*_{k} (see Figure 4a). By multiplying through and collecting terms, Equation 4 can be expressed as

*LL*_{k}(**R**) = Σ_{i} *w*_{ik}*R*_{i} + Σ_{i} *w*_{iik}*R*_{i}² + Σ_{i<j} *w*_{ijk}(*R*_{i} + *R*_{j})² (Equation 5)

where *R*_{i} is the response of the *i*th AMA filter, and *w*_{ik}, *w*_{iik}, and *w*_{ijk} are the weights for disparity *δ*_{k}. The weights are simple functions of the mean and covariance matrices from Equation 4 (see Supplement). Thus, a neuron which responds according to the log likelihood of a given disparity (an LL neuron)—that is, ln[*gauss*(**R**|*δ*_{k})]—can be obtained by weighted summation of the linear (first term), squared (second term), and pairwise-sum-squared (third term) AMA filter responses (Equation 5). The implication is that a large collection of LL neurons, each with a different preferred disparity, can be constructed from a small, fixed set of linear filters simply by changing the weights on the linear filter responses, squared-filter responses, and pairwise-sum-squared filter responses.

**R**^{LL} in Figure 2c) must be read out (see Figure 2d). The optimal read-out rule depends on the observer's goal (the cost function). A common goal is to pick the disparity having the maximum a posteriori probability. If the prior probability of the different possible disparities is uniform, then the optimal MAP decoding rule reduces to finding the LL neuron with the maximum response. Nonuniform prior probabilities can be taken into account by adding a disparity-dependent constant to each LL neuron response before finding the peak response. There are elegant proposals for how the peak of a population response can be computed in noisy neural systems (Jazayeri & Movshon, 2006; Ma et al., 2006). Other commonly assumed cost functions (e.g., MMSE) yield similar performance.

*c*_{norm}(**x**) = *c*(**x**)/√(‖*c*(**x**)‖² + *n*·*c*_{50}²), where *n* is the dimensionality of the vector and *c*_{50} is the half-saturation constant (Albrecht & Geisler, 1991; Albrecht & Hamilton, 1982; Heeger, 1992). (The half-saturation constant is so-called because, in a neuron with an output squaring nonlinearity, the response rate will equal half its maximum when the contrast of the stimulus equals the value of *c*_{50}.)

*c*_{50}. The conditional response distributions are not. For large values of *c*_{50} the response distributions have tails much heavier than Gaussian. When *c*_{50} = 0.0 (as it was throughout this paper), the distributions are well approximated by Gaussians but are somewhat lighter-tailed. The distributions (e.g., see Figure 4a) are most Gaussian on average when *c*_{50} = 0.1 (Figure S9). On the basis of this finding, we hypothesize a new function for cortical contrast normalization in addition to the many already proposed (Carandini & Heeger, 2012): Contrast normalization may help create conditional filter response distributions that are Gaussian, thereby simplifying the encoding and decoding of high-dimensional subspaces of retinal image information.
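The claim that an LL neuron can be built by weighted summation of linear, squared, and pairwise-sum-squared filter responses (Equation 5) can be verified numerically. The weight formulas below are one consistent choice derived from the multidimensional Gaussian; the paper's Supplement may parameterize them differently, and the mean vector and covariance matrix here are arbitrary stand-ins for the fitted **û**_{k} and **Ĉ**_{k}.

```python
import numpy as np

def ll_weights(u, C):
    """One consistent set of weights expressing ln gauss(R; u, C) as a weighted
    sum of linear, squared, and pairwise-sum-squared responses plus a constant."""
    A = np.linalg.inv(C)                             # inverse covariance
    n = len(u)
    w_lin = A @ u                                    # weights on R_i
    off_diag_sums = A.sum(axis=1) - np.diag(A)
    w_sq = -0.5 * (np.diag(A) - off_diag_sums)       # weights on R_i**2
    w_pair = {(i, j): -0.5 * A[i, j]                 # weights on (R_i + R_j)**2
              for i in range(n) for j in range(i + 1, n)}
    const = -0.5 * (u @ A @ u + np.linalg.slogdet(C)[1] + n * np.log(2 * np.pi))
    return w_lin, w_sq, w_pair, const

def ll_neuron(R, w_lin, w_sq, w_pair, const):
    """LL neuron response: weighted sum of the three response types."""
    out = w_lin @ R + w_sq @ R**2 + const
    for (i, j), w in w_pair.items():
        out += w * (R[i] + R[j]) ** 2
    return out

u = np.array([0.2, -0.1, 0.3])
C = np.array([[1.0, 0.5, 0.2],
              [0.5, 1.0, 0.1],
              [0.2, 0.1, 1.0]])
R = np.array([0.4, 0.1, -0.2])
direct = -0.5 * ((R - u) @ np.linalg.solve(C, R - u)
                 + np.linalg.slogdet(C)[1] + 3 * np.log(2 * np.pi))
print(np.isclose(ll_neuron(R, *ll_weights(u, C)), direct))   # True
```

With one such weight set per disparity level, MAP readout under a flat prior reduces to taking the argmax over the LL neuron responses; a nonuniform prior adds a disparity-dependent constant to each response before the argmax.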

*d*_{fixation} is the fixation distance to the surface center, *IPD* is the interpupillary distance, and Δ is the depth. Depth is given by Δ = *d*_{surface} − *d*_{fixation}, where *d*_{surface} is the distance to the surface. The disparity of other points on the surface patch varies slightly across the patch, and this variation is greater when the surface patches are slanted.

*e*. *I*_{e}(**x**) represents an eye-specific luminance image of the light striking the sensor array at each location **x** = (*x*, *y*) in a hypothetical optical system that causes zero degradation in image quality. Each eye's optics degrade the idealized image; the optics are represented by a polychromatic point-spread function *psf*_{e}(**x**) that contains the effects of defocus, chromatic aberration, and photopic wavelength sensitivity. The sensor arrays are represented by a spatial sampling function *samp*(**x**). Finally, the sensor responses are corrupted by noise *η*; the noise level was set just high enough to remove retinal image detail that is undetectable by the human visual system (Williams, 1985). (In pilot studies, we found that the falloff in contrast sensitivity at low frequencies had a negligible effect on results; for simplicity, we did not model its effects.) Note that Equation 7 represents a luminance (photopic) approximation of the images that would be captured by the photoreceptors. This approximation is sufficiently accurate for the present purposes (Burge & Geisler, 2011).
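The viewing geometry above implies a simple mapping from surface distance to absolute disparity. The sketch below computes the vergence-angle difference for a point on the midline, using the paper's viewing parameters (6.5 cm interpupillary distance, 40 cm fixation distance); the sign convention (crossed disparities positive) is an assumption, not taken from the text.

```python
import math

def disparity_arcmin(d_fixation_cm, d_surface_cm, ipd_cm=6.5):
    """Absolute disparity (arcmin) of a midline point at d_surface while the eyes
    fixate a midline point at d_fixation: the difference in vergence angles.
    Sign convention (an assumption): crossed (near) positive, uncrossed negative."""
    vergence = lambda d: 2.0 * math.atan(ipd_cm / (2.0 * d))   # radians
    return math.degrees(vergence(d_surface_cm) - vergence(d_fixation_cm)) * 60.0

print(disparity_arcmin(40.0, 40.0))        # 0.0 at the fixation point
print(disparity_arcmin(40.0, 35.0) > 0)    # nearer surface -> crossed (positive)
print(disparity_arcmin(40.0, 45.0) < 0)    # farther surface -> uncrossed (negative)
```

At a 40 cm fixation distance, the mapping is steep: depth steps of roughly a millimeter already produce disparities on the order of an arcminute, which is why the ±16.875 arcmin training range corresponds to a thin shell of depths around fixation.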

*Visual Neuroscience*, 7(6), 531–546.

*Journal of Neurophysiology*, 48(1), 217–237.

*Journal of Neurophysiology*, 82(2), 891–908.

*Journal of the Optical Society of America A*, 2(7), 1211–1215.

*Journal of Neuroscience*, 24(9), 2077–2089.

*Proceedings of the National Academy of Sciences, USA*, 108(11), 4423.

*The Journal of Physiology*, 211(3), 599–622.

*Proceedings of the National Academy of Sciences, USA*, 108(40), 16849–16854.

*Optimal defocus estimates from individual images for autofocusing a digital camera*. Proceedings of the IS&T/SPIE 47th Annual Meeting, January, 2012, Burlingame, CA.

*Nature Reviews Neuroscience*, 13(1), 51–62.

*Neural Computation*, 16(8), 1545–1577.

*Vision Research*, 31(12), 2195–2207.

*Nature*, 418(6898), 633–636.

*Annual Review of Neuroscience*, 24, 203–238.

*Perception*, 27(11), 1367–1377.

*The Journal of Comparative Neurology*, 292(4), 497–523.

*Vision Research*, 22(5), 545–559.

*Nature*, 352(6331), 156–159.

*Annual Review of Neuroscience*, 23(1), 441–471.

*The Journal of Physiology*, 137(3), 488–508.

*Vision Research*, 36(12), 1839–1857.

*Advances in Neural Information Processing Systems (NIPS)*, 23, 658–666.

*Journal of Vision*, 9(13):17, 1–16, http://www.journalofvision.org/content/9/13/17, doi:10.1167/9.13.17.

*Nature Neuroscience*, 14(7), 926–932.

*Advances in Neural Information Processing Systems*, 23, 1–9.

*Neuron*, 57(1), 147–158.

*Journal of the Optical Society of America A, Optics, Image Science, and Vision*, 14(8), 1673–1683.

*The Journal of General Physiology*, 25(6), 819–840.

*Visual Neuroscience*, 9(2), 181–197.

*Vision Research*, 48(12), 1427–1439.

*Network: Computation in Neural Systems*, 11(3), 191–210.

*Nature Neuroscience*, 9(5), 690–696.

*Vision Research*, 18(8), 1013–1022.

*Perception & Psychophysics*, 59(2), 219–231.

*Journal of Neuroscience*, 10(7), 2281–2299.

*Vision Research*, 51(1), 48–57.

*Nature Neuroscience*, 5(4), 356–363.

*Network: Computation in Neural Systems*, 5, 157–174.

*Journal of Vision*, 8(11):19, 1–14, http://www.journalofvision.org/content/8/11/19, doi:10.1167/8.11.19.

*Nature Neuroscience*, 9(11), 1432–1438.

*Vision Research*, 40(18), 2437–2447.

*Science*, 194(4262), 283–287.

*Vision Research*, 30(11), 1763–1779.

*Vision Research*, 30(11), 1811–1825.

*Journal of Neuroscience*, 24(9), 2065–2076.

*Journal of Experimental Psychology*, 44(4), 253–259.

*Current Opinion in Neurobiology*, 8(4), 509–515.

*Science*, 249(4972), 1037–1041.

*Journal of Neurophysiology*, 77(6), 2879–2909.

*Nature*, 381(6583), 607–609.

*Physiologische Untersuchungen über das Sehen mit zwei Augen* [Translation: *Physiological investigations on seeing with two eyes*]. Kiel: Schwerssche Buchhandlung.

*Journal of Neuroscience*, 8(12), 4531–4550.

*Neuron*, 18(3), 359–368.

*Vision Research*, 37(13), 1811–1827.

*Journal of Neurophysiology*, 90(5), 2795–2817.

*Nature Neuroscience*, 10(10), 1322–1328.

*Visual Neuroscience*, 19(6), 735–753.

*Annual Review of Neuroscience*, 24, 1193–1216.

*Vision Research*, 31(7-8), 1079–1086.

*Vision Research*, 32(9), 1685–1694.

*Vision Research*, 40(13), 1711–1737.

*Journal of Neuroscience*, 28(44), 11304–11314.

*Journal of Neuroscience*, 31(22), 8295–8305.

*Applied Optics*, 31(19), 3594–3600.

*Neuron*, 38(1), 103–114.

*Vision Research*, 15(5), 583–590.

*Vision Research*, 18(1), 101–105.

*Journal of Neuroscience*, 31(27), 9814–9818.

*Journal of the Optical Society of America A*, 2(7), 1087–1093.

*Color science: Concepts and methods, quantitative data and formulas*. New York: John Wiley & Sons.