Abstract
Estimating the motion of objects in depth is important for behavior and is strongly supported by binocular visual cues. However, our understanding of how the brain does, and should, estimate motion-in-depth from binocular signals is incomplete. Here, using an image-computable ideal observer, we show how to optimally estimate the 3D speed of surface patches in the environment from 250 ms naturalistic binocular video clips. First, the model applies a small set of local spatio-temporal linear filters to a binocular video (analogous to simple cells). Then, local 3D speed is non-linearly decoded from the filter population response. The filters and the Bayes-optimal decoder are learned to optimize performance in the task. Interestingly, the joint distributions of filter responses, conditioned on 3D speed, are well approximated by Gaussian distributions. Optimal decoding of 3D speed therefore requires quadratic combination of the filter responses. Thus, the natural statistics of the filter responses dictate that the normative computations for this task are a biologically plausible generalization of the widely studied energy model: linear filtering followed by quadratic combination of the responses. Also, consistent with human psychophysical behavior, the model learned to use both the time-derivatives of matching binocular features (changing disparity over time; CDOT) and binocular comparisons of the time-derivatives of monocular features (interocular velocity differences; IOVD). Like humans, the model weights CDOT cues more heavily at slow speeds and IOVD cues more heavily at fast speeds. Finally, using the observer model and natural disparity statistics, we propose the novel hypothesis that IOVD cues are weighted more strongly in human peripheral vision in part because, during natural viewing, disparities in the retinal periphery are more variable than those near the fovea.
Our results suggest that many characteristics of 3D motion processing are accounted for by near-optimal information processing in the early visual system.
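The decoding computation summarized above can be sketched in a few lines: when filter responses conditioned on each candidate 3D speed are (zero-mean) Gaussian, the log-likelihood of a response vector under each speed is a quadratic function of the responses, so Bayes-optimal decoding reduces to quadratic combination of filter outputs. The sketch below is a minimal illustration under assumed statistics; the filter count, candidate speeds, and randomly generated covariances are all hypothetical stand-ins, not the statistics measured from naturalistic binocular videos in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: n_filt linear filter responses whose joint distribution,
# conditioned on each of n_speeds candidate 3D speeds, is zero-mean Gaussian.
n_filt, n_speeds = 4, 5
speeds = np.linspace(-0.5, 0.5, n_speeds)  # candidate 3D speeds (arbitrary units)

# Assumed speed-specific covariances (random symmetric positive-definite
# matrices standing in for empirically measured response statistics).
covs = []
for _ in range(n_speeds):
    A = rng.normal(size=(n_filt, n_filt))
    covs.append(A @ A.T + n_filt * np.eye(n_filt))

def log_likelihoods(r):
    """Gaussian log-likelihood of response vector r under each candidate speed.

    With zero-mean Gaussian conditionals, each term is a quadratic form in r:
    -0.5 * r^T C_k^{-1} r plus a response-independent constant.
    """
    lls = np.empty(n_speeds)
    for k, C in enumerate(covs):
        _, logdet = np.linalg.slogdet(C)
        quad = r @ np.linalg.solve(C, r)          # quadratic combination of responses
        lls[k] = -0.5 * (quad + logdet + n_filt * np.log(2.0 * np.pi))
    return lls

# Draw a response from the true-speed conditional and decode by maximum
# posterior under a flat prior over candidate speeds.
true_k = 3
r = rng.multivariate_normal(np.zeros(n_filt), covs[true_k])
lls = log_likelihoods(r)
posterior = np.exp(lls - lls.max())
posterior /= posterior.sum()
decoded_speed = speeds[np.argmax(posterior)]
```

Because the conditional means are zero, all speed information is carried by the covariances, which is why the optimal readout is quadratic rather than linear; this is the sense in which the normative computation generalizes the energy model.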