Although the AMA filters appear similar to binocular simple cells (Figure 3, Figure S1), it may not be obvious how the optimal Bayesian rule for combining their responses is related to processing in visual cortex. Here, we show that the optimal computations can be implemented with neurally plausible operations: linear excitation, linear inhibition, and simple static nonlinearities (thresholding and squaring). Appropriately weighted summation of binocular simple and complex cell population responses can result in a new population of neurons having tightly tuned, unimodal disparity tuning curves that are largely invariant (see Figure 2c).
The key step in implementing a Bayes-optimal estimation rule is to compute the likelihood, or equivalently the log likelihood, of the optimal filter responses, conditioned on each disparity level. For the present case of a uniform prior, the optimal MAP estimate is simply the disparity with the greatest log likelihood. Given that the likelihoods are Gaussian, the log likelihoods are quadratic:

$$\ln\left[\mathrm{gauss}(\mathbf{R} \mid \delta_k)\right] = -\tfrac{1}{2}\,(\mathbf{R} - \mathbf{u}_k)^{T}\,\mathbf{C}_k^{-1}\,(\mathbf{R} - \mathbf{u}_k) + \mathrm{const} \tag{4}$$

where $\mathbf{u}_k$ and $\mathbf{C}_k$ are the mean vector and covariance matrix of the AMA filter responses to a large collection of natural stereo-image patches having disparity $\delta_k$ (see Figure 4a). By multiplying through and collecting terms, Equation 4 can be expressed as

$$\ln\left[\mathrm{gauss}(\mathbf{R} \mid \delta_k)\right] = \sum_{i} w_{ik} R_i + \sum_{i} w_{iik} R_i^{2} + \sum_{i}\sum_{j>i} w_{ijk}\,(R_i + R_j)^{2} + \mathrm{const}' \tag{5}$$

where $R_i$ is the response of the $i$th AMA filter, and $w_{ik}$, $w_{iik}$, and $w_{ijk}$ are the weights for disparity $\delta_k$. The weights are simple functions of the mean and covariance matrices from Equation 4 (see Supplement). Thus, a neuron that responds according to the log likelihood of a given disparity (an LL neuron), that is, $R_k^{LL} = \ln[\mathrm{gauss}(\mathbf{R} \mid \delta_k)]$, can be obtained by weighted summation of the linear (first term), squared (second term), and pairwise-sum-squared (third term) AMA filter responses (Equation 5). The implication is that a large collection of LL neurons, each with a different preferred disparity, can be constructed from a small, fixed set of linear filters simply by changing the weights on the linear filter responses, squared-filter responses, and pairwise-sum-squared filter responses.
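The step from Equation 4 to Equation 5 can be checked numerically. The sketch below is illustrative only: the weight formulas in `ll_weights` are one algebraic expansion of the quadratic form worked out for this example (they are not copied from the Supplement), the response-independent constant of Equation 4 is dropped, and the mean, covariance, and response vector are random placeholders rather than measured AMA filter statistics.

```python
import numpy as np

def ll_weights(u, C):
    """Equation 5 weights from the mean u and covariance C of Equation 4.

    Assumed expansion (worked out for this sketch, with A = C^{-1}):
      w_ik  = (A u)_i
      w_iik = -A_ii + 0.5 * sum_j A_ij
      w_ijk = -0.5 * A_ij   for i < j
    """
    A = np.linalg.inv(C)                      # precision matrix C_k^{-1}
    A = 0.5 * (A + A.T)                       # enforce symmetry against round-off
    w_lin = A @ u                             # weights on linear responses R_i
    w_sq = -np.diag(A) + 0.5 * A.sum(axis=1)  # weights on squared responses R_i^2
    w_pair = -0.5 * np.triu(A, k=1)           # weights on (R_i + R_j)^2, i < j
    const = -0.5 * u @ A @ u                  # offset that does not depend on R
    return w_lin, w_sq, w_pair, const

def ll_quadratic(R, u, C):
    """Equation 4 without its additive constant."""
    d = R - u
    return -0.5 * d @ np.linalg.solve(C, d)

def ll_weighted_sum(R, w_lin, w_sq, w_pair, const):
    """Equation 5: linear + squared + pairwise-sum-squared terms."""
    pair_sq = (R[:, None] + R[None, :]) ** 2  # (R_i + R_j)^2 for every pair
    return w_lin @ R + w_sq @ R**2 + np.sum(w_pair * pair_sq) + const

# Placeholder statistics for one disparity level (hypothetical values).
rng = np.random.default_rng(0)
n_filters = 8
u = rng.normal(size=n_filters)
M = rng.normal(size=(n_filters, n_filters))
C = M @ M.T + n_filters * np.eye(n_filters)   # symmetric positive definite
R = rng.normal(size=n_filters)

lhs = ll_quadratic(R, u, C)
rhs = ll_weighted_sum(R, *ll_weights(u, C))
print(np.isclose(lhs, rhs))
```

Only the weights change from one preferred disparity to the next; the filter responses `R` are shared, which is the sense in which a fixed filter set supports a whole population of LL neurons.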
A potential concern is that these computations could not be implemented in cortex. AMA filters are strictly linear and produce positive and negative responses (Figure 6a), whereas real cortical neurons produce only positive responses. However, the response of each AMA filter could be obtained by subtracting the outputs of two half-wave rectified simple cells that are “on” and “off” versions of each AMA filter (see Supplement, Figure S3a). The squared and pairwise-sum-squared responses are modeled as resulting from a linear filter followed by a static squaring nonlinearity (Figure 6b); these responses could be obtained from cortical binocular complex cells (see Supplement). (Note that our “complex cell” differs from the definition given by some prominent expositors of the disparity energy model [Figure 6d; Cumming & DeAngelis, 2001; Ohzawa, 1998; Qian, 1997].) The squared responses could be obtained by summing and squaring the responses of on and off simple cells (see Supplement, Figure S3a, b). Finally, the LL neuron response could be obtained via a weighted sum of the AMA filter and model complex cell responses (Figure 6c). Thus, all the optimal computations are biologically plausible.
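The on/off construction described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the model in the Supplement: the filters `f_i` and `f_j` and the stimulus `patch` are random placeholders, and the helper names are hypothetical. It shows that the difference of two half-wave rectified outputs recovers the signed linear response, and that summing and squaring the same two outputs yields the squared response.

```python
import numpy as np

def half_wave(x):
    """Half-wave rectification: negative drive produces no response."""
    return np.maximum(x, 0.0)

def simple_cell_pair(f, patch):
    """Nonnegative responses of the 'on' and 'off' versions of filter f."""
    drive = f @ patch
    return half_wave(drive), half_wave(-drive)

def ama_response(on, off):
    """Signed linear filter response recovered as on minus off."""
    return on - off

def complex_response(on, off):
    """Squared filter response from summing, then squaring, on and off."""
    return (on + off) ** 2

# Placeholder filters and stimulus (hypothetical values).
rng = np.random.default_rng(1)
f_i, f_j, patch = rng.normal(size=(3, 64))

on, off = simple_cell_pair(f_i, patch)
R_i = ama_response(on, off)              # equals f_i @ patch
R_i_squared = complex_response(on, off)  # equals (f_i @ patch) ** 2

# A pairwise-sum-squared response uses the summed filter f_i + f_j.
on_p, off_p = simple_cell_pair(f_i + f_j, patch)
pair_sum_squared = complex_response(on_p, off_p)  # equals (R_i + R_j) ** 2
```

Because at most one of `on` and `off` is nonzero for any stimulus, their sum is the absolute value of the linear drive, so squaring it gives the same result as squaring the signed response.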
Figure 6 shows processing schematics, disparity tuning curves, and response variability for the three filter types implied by our analysis: an AMA filter, a model complex cell, a model LL neuron, and, for comparison, a standard disparity energy neuron (Cumming & DeAngelis, 2001). (The model complex cells are labeled “complex” because in general they exhibit temporal frequency doubling and a fundamental-to-mean response ratio that is less than 1.0; Skottun et al., 1991.) Response variability due to retinal image features that are irrelevant for estimating disparity is indicated by the gray area; the smaller the gray area, the more invariant the neural response to irrelevant image features. The AMA filter is poorly tuned to disparity and gives highly variable responses to different stereo-image patches with the same disparity (Figure 6a). The model complex cell is better tuned, but it is not unimodal and its responses also vary severely with irrelevant image information (Figure 6b). (The disparity tuning curves for all model complex cells are shown in Figure S5.) In contrast, the LL neuron is sharply tuned, is effectively unimodal, and is strongly response invariant (Figure 6c). That is, it responds similarly to all natural image patches of a given disparity. The disparity tuning curves for a range of LL neurons are shown in Figure 7a. When slant varies, similar but broader LL neuron tuning curves result (Figure S1d, e).
These results show that it is potentially misleading to refer to canonical V1 binocular simple and complex cells as disparity tuned because their responses are typically as strongly modulated by variations in contrast pattern as they are by variations in disparity (gray area, Figure 6a, b). The LL neurons, on the other hand, are tuned to a narrow range of disparities and respond largely independently of spatial frequency content and contrast.
The LL neurons have several interesting properties. First, their responses are determined almost exclusively by the model complex cell inputs because the weights on the linear responses (Equation 5; Figure 6a, c) are generally near zero (see Supplement). In this regard, the LL neurons are consistent with the predictions of the standard disparity energy model (Cumming & DeAngelis, 2001; Ohzawa, 1998). However, standard disparity energy neurons are not as narrowly tuned or as invariant (Figure 6d).
Second, each LL neuron receives strong inputs from multiple complex cells (Figure 7b). In this regard, the LL neurons are inconsistent with the disparity energy model, which proposes that disparity-tuned cells are constructed from two binocular subunits. The potential value of more than two subunits has been previously demonstrated (Qian & Zhu, 1997). Recently, it has been shown that some disparity-tuned V1 cells are modulated by more than two binocular subunits. Indeed, as many as 14 subunits can drive the activity of a single disparity-selective cell (Tanabe, Haefner, & Cumming, 2011).
Third, the weights on the model complex cells (Figure 7b), which are determined by the conditional response distributions (Figure 4a), specify how information at different spatial frequencies should be combined.
Fourth, as preferred disparity increases, the number of strong weights on the complex cell inputs decreases (Figure 7b). This occurs because high spatial frequencies are less useful for encoding large disparities (see below). Thus, if cells exist in cortex that behave similarly to LL neurons, the number and spatial frequency tuning of binocular subunits driving their response should decrease as a function of their preferred disparity.
Fifth, for all preferred disparities, the excitatory (positive) and inhibitory (negative) weights are in a classic push-pull relationship (Ferster & Miller, 2000; Figure 7b), consistent with the fact that disparity-selective neurons in visual cortex contain both excitatory and suppressive subunits (Tanabe et al., 2011).
Sixth, the LL neuron disparity tuning curves are approximately log-Gaussian in shape (i.e., Gaussian on a log-disparity axis). Because their standard deviations are approximately constant on a log-disparity axis, their disparity bandwidths increase linearly with the preferred disparity (Figure 7c). In this respect, many cortical neurons behave like LL neurons; the disparity tuning functions of cortical neurons typically have more low-frequency power than is predicted from the standard energy model (Ohzawa, DeAngelis, & Freeman, 1997; Read & Cumming, 2003). Additionally, psychophysically estimated disparity channels in humans exhibit a similar relationship between bandwidth and preferred disparity (Stevenson et al., 1992). Human disparity channels, however, differ in that they have inhibitory side-lobes (Stevenson et al., 1992); that is, they have the center-surround organization that is a hallmark of retinal ganglion cell, LGN, and V1 receptive fields. Understanding the basis of this center-surround organization is an important direction for future work.
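The link between a constant log-axis standard deviation and a bandwidth that grows linearly with preferred disparity can be made explicit. The sketch below assumes an idealized log-Gaussian tuning curve over positive disparity magnitudes; the function names and the value of `sigma_log` are hypothetical and are not fitted to the LL neuron tuning curves.

```python
import numpy as np

def log_gaussian_tuning(disparity, preferred, sigma_log):
    """Idealized tuning curve: Gaussian on a log-disparity axis (peak = 1)."""
    z = (np.log(disparity) - np.log(preferred)) / sigma_log
    return np.exp(-0.5 * z**2)

def fwhm_linear(preferred, sigma_log):
    """Full width at half maximum, measured on the linear disparity axis.

    The half-maximum points lie at preferred * exp(+a) and preferred * exp(-a),
    with a = sigma_log * sqrt(2 ln 2), so the width scales with `preferred`.
    """
    a = sigma_log * np.sqrt(2.0 * np.log(2.0))
    return preferred * (np.exp(a) - np.exp(-a))

sigma_log = 0.5                                  # hypothetical log-axis SD
preferred = np.array([1.0, 2.0, 4.0, 8.0])       # hypothetical preferred disparities
bandwidths = fwhm_linear(preferred, sigma_log)
ratios = bandwidths / preferred                  # constant across the population
```

With `sigma_log` fixed, `ratios` is the same for every preferred disparity, which is the linear bandwidth relationship described in the text.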
V1 binocular neurons are unlikely to have receptive fields exactly matching those of the simple and complex cells implied by the optimal AMA filters, but V1 neurons are likely to span the subspace spanned by the optimal filters. It is thus plausible that excitatory and inhibitory synaptic weights could develop so that a subset of the neurons in V1, or in other cortical areas, signal the log likelihood of different specific disparities (see Figure 7b). Indeed, some cells in cortical areas V1, V2, and V3/V3a exhibit sharp tuning to the disparity of random dot stimuli (Cumming, 2002; Ohzawa et al., 1997; G. F. Poggio et al., 1988; Read & Cumming, 2003).
Computational models of estimation from neural populations often rely on the assumption that each neuron is invariant and unimodally tuned to the stimulus property of interest (Girshick et al., 2011; Jazayeri & Movshon, 2006; Lehky & Sejnowski, 1990; Ma et al., 2006). However, it is often not discussed how invariant unimodal tuning arises. For example, the binocular neurons (i.e., complex cells) predicted by the standard disparity energy model do not generally exhibit invariant unimodal tuning (Figure 6d). Our analysis shows that neurons with unimodal tuning to stimulus properties not trivially available in the retinal images (e.g., disparity) can result from appropriate linear combination of nonlinear filter responses.
To obtain optimal disparity estimates, the LL neuron population response (represented by the vector $\mathbf{R}^{LL}$ in Figure 2c) must be read out (see Figure 2d). The optimal read-out rule depends on the observer's goal (the cost function). A common goal is to pick the disparity having the maximum a posteriori probability. If the prior probability of the different possible disparities is uniform, then the optimal MAP decoding rule reduces to finding the LL neuron with the maximum response. Nonuniform prior probabilities can be taken into account by adding a disparity-dependent constant to each LL neuron response before finding the peak response. There are elegant proposals for how the peak of a population response can be computed in noisy neural systems (Jazayeri & Movshon, 2006; Ma et al., 2006). Other commonly assumed cost functions (e.g., minimum mean squared error, MMSE) yield similar performance.
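The read-out rules described above can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used in the paper: the LL responses are taken to be log likelihoods up to an additive term common to all neurons, the disparity grid and the population response are made-up placeholders, and the function names are hypothetical.

```python
import numpy as np

def map_readout(r_ll, disparities, log_prior=None):
    """MAP estimate: preferred disparity of the most active LL neuron.

    A nonuniform prior enters as a disparity-dependent constant (the log
    prior) added to each LL response before the peak is located.
    """
    r = np.asarray(r_ll, dtype=float)
    if log_prior is not None:
        r = r + np.asarray(log_prior, dtype=float)
    return disparities[np.argmax(r)]

def mmse_readout(r_ll, disparities, log_prior=None):
    """MMSE estimate: posterior mean over the discrete disparity levels."""
    r = np.asarray(r_ll, dtype=float)
    if log_prior is not None:
        r = r + np.asarray(log_prior, dtype=float)
    post = np.exp(r - r.max())   # subtract the max for numerical stability
    post /= post.sum()           # normalize to a posterior over disparities
    return post @ disparities

# Placeholder population: 31 LL neurons with evenly spaced preferred disparities.
disparities = np.linspace(-15.0, 15.0, 31)
r_ll = -0.5 * ((disparities - 3.0) / 2.0) ** 2        # made-up response peaking at 3

map_hat = map_readout(r_ll, disparities)
mmse_hat = mmse_readout(r_ll, disparities)

# A made-up prior concentrated near -5 shifts the MAP estimate.
log_prior = -0.5 * ((disparities + 5.0) / 0.5) ** 2
map_hat_prior = map_readout(r_ll, disparities, log_prior)
```

For a symmetric, unimodal population response such as this placeholder, the two read-outs nearly coincide, in line with the remark that other cost functions yield similar performance.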
In sum, our analysis has several implications. First, it suggests that optimal disparity estimation is best understood in the context of a population code. Second, it shows how to linearly sum nonlinear neural responses to construct cells with invariant unimodal tuning curves. Third, it suggests that the eclectic mixture of binocular receptive field properties in cortex may play a functional role in disparity estimation. Fourth, it provides a principled hypothesis for how neurons may compute the posterior probabilities of stimulus properties (e.g., disparity, defocus, motion) not trivially available in the retinal image(s). Thus, our analysis provides a recipe for how to increase selectivity and invariance in the visual processing stream: invariance to image content variability and selectivity for the stimulus property of interest.