**Depth perception requires an internal model of eye-head geometry to infer distance from binocular retinal images and extraretinal 3D eye-head information, particularly ocular vergence. Similarly, for the perception of motion in depth, gaze angle is required to correctly interpret the spatial direction of motion from retinal images; however, it is unknown whether the brain can make adequate use of extraretinal version and vergence information to correctly transform binocular retinal motion into 3D spatial coordinates. Here we tested this by asking participants to reconstruct the spatial trajectory of an isolated disparity stimulus moving in depth, either peri-foveally or peripherally, while their gaze was oriented at different vergence and version angles. We found large systematic errors in the perceived motion trajectory that reflected an intermediate reference frame between a purely retinal interpretation of binocular retinal motion (not accounting for veridical vergence and version) and the spatially correct motion. We quantified these errors with a 3D reference frame model that accounts for target, eye, and head position at the time the motion percept is encoded. This model captured the behavior well, revealing that participants tended to underestimate their version by up to 17%, overestimate their vergence by up to 22%, and underestimate the overall change in retinal disparity by up to 64%, and that the use of extraretinal information depended on retinal eccentricity. Since such large perceptual errors are not observed in everyday viewing, we suggest that both monocular retinal cues and binocular extraretinal signals are required for accurate real-world motion-in-depth perception.**

*moving* objects in depth is unclear. Harris (2006) and Welchman and colleagues (2009) found psychophysical evidence supporting the use of binocular extraretinal signals (both static and dynamic) for motion-in-depth perception, but their relative contributions to the spatial 3D percept remain unknown. Here, we attempt to answer this question by asking participants to reconstruct motion-in-depth trajectories from binocular depth cues alone, across various vergence and horizontal version angles, and then use 3D geometric modeling to compare these reconstructions directly to the motion-in-depth perception predicted by relative disparity.

*intermediate* coordinates, irrespective of retinal eccentricity. Extended to real-world viewing, we infer that motion-in-depth estimation is an eccentricity-dependent process that explicitly requires the use of *both* binocular and monocular depth cues for accuracy.

*x-z*) plane. All LEDs had a physical diameter of 5 mm. At the end of the target motion, participants were instructed to reconstruct the motion of this target using a stylus on the touchscreen in front of them. On each trial, the FT was reflected through a mirror oriented at 45° and positioned at the level of the eyes, such that the participant perceived the FT as located in the same lateral depth plane as the MT. Other key elements of the physical setup included a stationary Chronos C-ETD 3D video-based eye tracker (Chronos Vision, Berlin, Germany) with an attached bite-bar for head stabilization, ensuring stable fixation on the FT during target motion. This physical arrangement allowed us to present FTs in the MT plane while avoiding physical collisions (panel B), with FTs positioned at nine different locations spaced in a polar grid. The FTs' spatial positions corresponded to three horizontal version angles: −30°, 0°, and 30°. The FTs were also positioned at three metric distances corresponding to three vergence angles (assuming a nominal interocular distance of 6.5 cm): 42 cm (∼8.8° vergence angle), 78 cm (∼4.8° vergence angle), and 124 cm (∼3° vergence angle). The MTs moved around these points according to 18 different motion trajectories (six orientations spaced equally from 0° to 180°, with three possible curvatures), purely in the lateral depth plane. At the chosen depths, the FT LEDs subtended a maximum visual angle of 4 arcmin and a minimum of 1.3 arcmin, while the MT LED subtended a maximum visual angle of 5.5 arcmin and a minimum of 1.1 arcmin. The trajectories of the MTs were scaled with distance such that they subtended the same overall retinal angle, traversing a maximum of 22 cm in depth when centered on the near FTs, 40 cm when centered on the mid-distance FTs, and 65 cm when centered on the far FTs.
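The mapping from fixation distance to vergence angle follows from simple trigonometry. The minimal sketch below (ours, not the authors' code) assumes the nominal 6.5 cm interocular distance and symmetric fixation straight ahead, and reproduces the vergence angles quoted above:

```python
import math

def vergence_deg(distance_cm, iod_cm=6.5):
    """Vergence angle (deg) for symmetric binocular fixation at a given
    distance: each eye turns by atan((IOD/2)/distance), so the angle
    between the two lines of sight is twice that."""
    return 2.0 * math.degrees(math.atan((iod_cm / 2.0) / distance_cm))

# The three fixation distances used for the FTs:
for d_cm in (42, 78, 124):
    print(f"{d_cm} cm -> {vergence_deg(d_cm):.1f} deg")  # 8.8, 4.8, 3.0
```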

*α*) assumed to be 5° (Blohm & Crawford, 2007), we implemented the binocular extension of Listing's law using the following relationship between vergence angle and the "saloon door"-like tilt around the vertical axis of each eye's Listing's plane (Blohm et al., 2008), in which *θ*_{v} represents the tilt around the vertical axis and the Listing's plane tilt gain *μ*_{v} is assumed to be 0.25 (Blohm et al., 2008). We then computed the shortest rotation from primary position *Q*_{L2} to the normalized current eye-centered, head-fixed gaze position *G*_{EH}.

*p*_{CH}, with corresponding quaternion representation *Q*_{p,CH}. The transformation to each eye was given by a translation based on interocular distance, yielding *Q*_{p} and the eye-centered positions *p*_{L} and *p*_{R}, using the Fick convention (Blohm et al., 2008). This gave us the eye-centered, eye-fixed target direction of the MT, *D*_{MT,EE}, in which *θ* and *ϕ* refer to the horizontal and vertical target projection angles, respectively.
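The "shortest rotation" mapping the primary gaze direction onto the current gaze direction has a standard quaternion construction (axis from the cross product, angle from the dot product). The sketch below is a generic, illustrative Python implementation (the original analysis scripts are Matlab; the function names are ours):

```python
import math

def shortest_rotation(a, b):
    """Unit quaternion (w, x, y, z) for the minimal rotation taking unit
    vector a onto unit vector b (degenerate if a = -b)."""
    cx = a[1] * b[2] - a[2] * b[1]
    cy = a[2] * b[0] - a[0] * b[2]
    cz = a[0] * b[1] - a[1] * b[0]
    w = 1.0 + sum(p * q for p, q in zip(a, b))  # encodes the half-angle
    n = math.sqrt(w * w + cx * cx + cy * cy + cz * cz)
    return (w / n, cx / n, cy / n, cz / n)

def rotate(q, v):
    """Rotate vector v by unit quaternion q (computes q v q*)."""
    w, x, y, z = q
    tx = 2 * (y * v[2] - z * v[1])
    ty = 2 * (z * v[0] - x * v[2])
    tz = 2 * (x * v[1] - y * v[0])
    return (v[0] + w * tx + (y * tz - z * ty),
            v[1] + w * ty + (z * tx - x * tz),
            v[2] + w * tz + (x * ty - y * tx))

# Map straight-ahead gaze (+z) onto a 90-degree leftward gaze (+x):
q = shortest_rotation((0.0, 0.0, 1.0), (1.0, 0.0, 0.0))
print(rotate(q, (0.0, 0.0, 1.0)))  # ~(1, 0, 0)
```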

*g*_{vs}) and vergence gain (*g*_{vg}). This operation represented the motion trajectory after the eyes had moved to the inverse estimates of the version and vergence angles during the encoding of retinal motion (inverse modeling stage). We then back-projected the retinal motion trajectory points into space for each eye by computing the cyclopean-centered, head-fixed transformation using the left and right eye rotation and translation quaternions representing Listing's law and interocular distance, respectively, as described above. We computed the 3D location of the rays' intersection, representing the decoded depth (spatial decoding stage). At this stage, we applied the depth gain (*g*_{d}) to the depth component only, representing any lateral compression effects (Fulvio et al., 2015; Rokers et al., 2018). We present this modeling framework in Figure 2 from left to right. In panel A, first we represent the retinal motion encoding stage, which we computed based on the actual geometry of the eyes and head.
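Computing the 3D location where two back-projected eye rays "intersect" is commonly done via the closest point between two (possibly skew) lines. The sketch below is illustrative, not the authors' implementation; the eye positions and target are made-up numbers:

```python
def ray_midpoint(p1, d1, p2, d2):
    """Midpoint of the shortest segment between 3D lines p1 + t*d1 and
    p2 + s*d2 -- a standard stand-in for the 'intersection' of two
    back-projected eye rays, which in general do not meet exactly."""
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    r = [b - a for a, b in zip(p1, p2)]
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(r, d1), dot(r, d2)
    denom = a * c - b * b  # zero only for parallel rays
    t = (d * c - e * b) / denom
    s = (d * b - e * a) / denom
    q1 = [p + t * u for p, u in zip(p1, d1)]
    q2 = [p + s * u for p, u in zip(p2, d2)]
    return [(u + v) / 2 for u, v in zip(q1, q2)]

# Eyes at +/-3.25 cm on the interocular axis, both rays aimed at a
# point 42 cm straight ahead (illustrative numbers only):
point = ray_midpoint([-3.25, 0, 0], [3.25, 0, 42],
                     [3.25, 0, 0], [-3.25, 0, 42])
print(point)  # [0.0, 0.0, 42.0]
```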

*g*_{vs}) and vergence (*g*_{vg}). Each parameter accounted for a different aspect of the trajectory (shown in Figure 2 insets). To produce the retinal prediction, we set the version gain to 0 and used a constant vergence gain of 1. Importantly, this retinal prediction arbitrarily assumes that vergence is 100% accounted for. Note that, because our model computes the spatial intersection of the binocular back-projections, vergence gains had to be greater than 0 (otherwise the back-projections would be parallel). Second, we represent the inverse modeling stage, where we varied the contributions of extraretinal signals. Third, we represent the spatial motion output stage, where we project the retinal motion, transformed by some proportion of extraretinal signals (depending on the values of *g*_{vs}, *g*_{vg}, and *g*_{d}), back into space. In Figure 2B, we show a sample transformation for the retinal case, where version is unaccounted for but vergence and depth are accounted for (i.e., *g*_{vs} = 0, *g*_{vg} = 1, and *g*_{d} = 1).

*g*_{vs}, *g*_{vg}, and *g*_{d} around the initialized parameters (eight linearly spaced values for each parameter). We performed this exact optimization procedure separately for each vergence angle to avoid confounding vergence effects. In total, we computed the fits of 3 × (8,000 + 512) = 25,536 parameter combinations. Given the number of potential parameter combinations and their complex interactions with the 3D geometry, our optimization strategy was more computationally tractable than more sophisticated error gradient-based methods. This optimization provided parameter estimates that consistently accounted for behavioral variability across participants and motion conditions (see Table 1). To ensure we did not artificially introduce effects between motion conditions, we also carried out this fitting procedure with the full datasets (merged across foveal and peripheral conditions) and found qualitatively identical reference frame results. Custom Matlab scripts for generating model predictions, together with all data, are available on the Open Science Framework website: https://osf.io/pvz97/.
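The two-stage grid search can be sketched as follows. The error function is a made-up quadratic stand-in for the model's trajectory error, and the assumption that the coarse stage used 20 values per parameter (20³ = 8,000 fits) is ours, inferred from the 8,000 + 512 count:

```python
import itertools

TRUE_GAINS = (0.9, 1.2, 0.5)  # illustrative "best" (g_vs, g_vg, g_d)

def error(g_vs, g_vg, g_d):
    # Stand-in for the model's fit error at one gain combination.
    return sum((g - t) ** 2 for g, t in zip((g_vs, g_vg, g_d), TRUE_GAINS))

def linspace(lo, hi, n):
    return [lo + i * (hi - lo) / (n - 1) for i in range(n)]

# Coarse stage (assumed 20 values per parameter; vergence gain kept > 0):
coarse = [linspace(0.0, 2.0, 20), linspace(0.05, 2.0, 20), linspace(0.0, 2.0, 20)]
best = min(itertools.product(*coarse), key=lambda g: error(*g))

# Fine stage: eight linearly spaced values per parameter around the
# coarse optimum (8**3 = 512 fits), as described above.
fine = [linspace(b - 0.1, b + 0.1, 8) for b in best]
best = min(itertools.product(*fine), key=lambda g: error(*g))

# Repeating both stages for each of the three vergence angles:
print(3 * (20 ** 3 + 8 ** 3))  # 25536 fits in total
```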

*t* tests. We performed paired *t* tests when appropriate for comparing parameters across conditions, and we tested for differences in the residuals of the different model outputs using one- and two-tailed *F* tests for equal variance. The rest of the statistical treatment of the data consisted primarily of computing correlation coefficients and regression analyses.
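The test statistics referred to here have simple closed forms; the sketch below (with made-up numbers, not data from this study) shows the paired *t* statistic and the variance-ratio *F* statistic:

```python
import math
from statistics import mean, variance

def paired_t(x, y):
    """Paired t statistic (df = n - 1) for matched samples x and y."""
    d = [a - b for a, b in zip(x, y)]
    return mean(d) / math.sqrt(variance(d) / len(d))

def f_ratio(residuals_1, residuals_2):
    """F statistic for an equal-variance test: ratio of sample variances."""
    return variance(residuals_1) / variance(residuals_2)

# Made-up matched samples (e.g., one parameter in two motion conditions):
x = [1.0, 2.0, 3.0, 4.0]
y = [0.5, 1.0, 2.5, 3.0]
print(round(paired_t(x, y), 3))  # 5.196
```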

*g*_{vs}, *g*_{vg}, and *g*_{d} inverse model parameters for the behavioral trajectories, performing separate optimizations for foveally and peripherally presented motion to observe how the parameters depend on retinal eccentricity (see Materials and methods for a detailed explanation of the optimization procedure). The results of this optimization are shown in Figure 5, at both the single-participant level (Figure 5A) and the group level (Figure 5B), for both the foveal and peripheral motion conditions.

*SD*) of horizontal version during foveal motion, compared to 96% ± 10% during peripheral motion: paired *t* test, *t*(35) = −5.22, *p* < 0.01. Given that version compensation during foveal motion was incomplete, the apparent full compensation during peripheral motion could have resulted from the visual system using the retinal location of the stimulus as a cue to current horizontal eye orientation, effectively bypassing an explicit need for extraretinal signals. Next, we found that the foveal depth gains accounted for 54% ± 13% of depth speed, significantly greater than the 36% ± 14% observed for peripheral motion: paired *t* test, *t*(35) = 8.70, *p* < 0.01, indicating that motion in depth was perceived to be faster when foveal. Finally, participants used a foveal vergence gain of 1.22 ± 0.18, in contrast to a significantly smaller (and more accurate) peripheral vergence gain of 0.98 ± 0.15: paired *t* test, *t*(35) = 6.30, *p* < 0.01.

*R*^{2} values in Table 1. In this table, the computed *R*^{2} values represent the variance accounted for under the spatial and retinal hypotheses for each participant in each motion condition (foveal and peripheral). Also shown are the optimized model gain parameters (*g*_{vs}, *g*_{vg}, *g*_{d}), the corresponding model *R*^{2} values, and the results of a one-tailed *F* test for equal variance (comparing the model residuals to those of the retinal and spatial hypotheses). In general, the model fit the data well, yielding a larger *R*^{2} value than either the retinal or spatial prediction in 41 of 48 possible comparisons. Statistically, the model provided a better fit than the retinal hypothesis in all comparisons (all *F* statistics significantly greater than 1) and a better fit than the spatial hypothesis in 19 of 24 comparisons, though in one case we did not detect a significant difference in residual variability (participant 3, peripheral motion condition, spatial hypothesis). To be sure we did not bias these results by splitting motion into the foveal and peripheral conditions, we also carried out the model optimization with the full datasets, merged across motion conditions, and observed qualitatively identical results. These findings suggest that an underestimation of 3D eye orientation signals during the transformation from retinal to spatial coordinates is responsible for the observed distortions of motion-in-depth perception.

*I*_{T}, represented by Equation 6: *I*_{T} = (*D*_{R} − *D*_{S})/(*D*_{R} + *D*_{S}), where *D*_{R} and *D*_{S} are the Euclidean distances of each set of gain parameters from the retinal and spatial hypotheses, respectively. For example, a purely spatial set of gain parameters would be represented by [*g*_{vs} *g*_{vg} *g*_{d}] = [1 1 1], corresponding to a *D*_{S} = 0 and a *D*_{R} = 1; subsequently, *I*_{T} = (1 − 0)/(1 + 0) = 1. By the same logic, for a purely retinal set of gains, *I*_{T} = −1. We present the distributions of these gain parameters for each participant, separated for foveal and peripheral motion and merged across vergence fits, in Figure 6.
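Equation 6 is straightforward to evaluate directly. In the sketch below, the spatial reference gains are [1 1 1] as stated, and the retinal reference is taken to be [0 1 1] (version gain 0, vergence gain fixed at 1), matching the retinal prediction described in Materials and methods; the intermediate example gains are illustrative values in the range fitted here:

```python
import math

SPATIAL = (1.0, 1.0, 1.0)  # full use of extraretinal signals
RETINAL = (0.0, 1.0, 1.0)  # version ignored; vergence gain fixed at 1

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def transformation_index(gains):
    """I_T = (D_R - D_S)/(D_R + D_S): +1 purely spatial, -1 purely retinal."""
    d_r, d_s = dist(gains, RETINAL), dist(gains, SPATIAL)
    return (d_r - d_s) / (d_r + d_s)

print(transformation_index(SPATIAL))  # 1.0
print(transformation_index(RETINAL))  # -1.0
# An intermediate, illustrative set of gains (g_vs, g_vg, g_d):
print(round(transformation_index((0.83, 1.22, 0.54)), 2))  # 0.29
```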

*I*_{T} was significantly greater than 0 for both foveal motion (mean ± *SD*, 0.28 ± 0.13; *t*(35) = 13.19, *p* < 0.01) and peripheral motion (mean ± *SD*, 0.29 ± 0.09; *t*(35) = 19.13, *p* < 0.01), suggesting that, in both cases, the average reconstructed trajectory was coded according to an intermediate coordinate frame that was more similar to spatial than to retinal.

*both* monocular retinal cues and extraretinal signals for an accurate perception of 3D motion, as neither retinal (Blohm et al., 2008) nor extraretinal signals alone can support geometrically correct stereoscopic perception. This claim extends earlier work (Blohm et al., 2008; Harris, 2006; Welchman et al., 2009) with a model that directly quantifies the extraretinal contribution to motion-in-depth perception.

*only* a disparity stimulus to the observer regardless of retinal location, and the relative contribution of disparity to depth perception decreases with eccentricity (Held et al., 2012). The observed percept of compressed motion in the periphery is in line with the idea of a lower weighted contribution of disparity cues (Held et al., 2012), while changes in the defocus blur of the point stimulus were likely negligible. Moreover, the tendency of disparity-tuned neurons to disproportionately prefer disparities <1° (DeAngelis & Uka, 2003) is another clue that motion in depth at the fovea is represented differently in the visual system than motion in depth in the periphery, where disparity magnitudes are much larger (Blohm et al., 2008). In agreement with this idea, psychophysical findings reveal that such a lateral compression can be captured using Bayesian probabilities for visual target motion (Rokers, Fulvio, Pillow, & Cooper, 2018; Welchman, Lam, & Bulthoff, 2008), arising from the greater relative uncertainty in the estimate of the depth motion component for motion in the periphery (Fulvio et al., 2015; Rokers et al., 2018). Determining whether motion-in-depth perception is based on such a statistically optimal combination of disparity, retinal defocus blur, and extraretinal cues therefore represents a potential extension of this work.

*static* eye orientation signals in interpreting a *dynamic, moving* stimulus, although in natural viewing our eyes and head are often moving as well. Both disparity and eye movements contribute to depth perception, but the precise nature of these contributions, and how they might depend on one another, is unclear. For example, vergence angle corresponds to perceived depth during the kinetic depth effect (Ringach, Hawken, & Shapley, 1996), but artificially inducing disparity changes between correlated (and anticorrelated) random-dot stimuli can cause the eyes to rapidly converge (or diverge) without any perception of depth (Masson, Busettini, & Miles, 1997). At the neural level, disparity is coded in V1 without a necessary perception of depth (Cumming & Parker, 1997). Psychophysics work has shown that vergence eye movements are beneficial, though not sufficient, for judging the relative depth (Foley & Richards, 1972) and depth motion of stimuli (Harris, 2006; Welchman et al., 2009), but to our knowledge, before this report, no one had quantified the 3D geometric extent to which these signals are used to form a continuous perception of motion in depth.

*Frontiers in Computational Neuroscience*, 9 (July), 1–9, https://doi.org/10.3389/fncom.2015.00082.

*Vision Research*, 39 (6), 1143–1170, https://doi.org/10.1016/S0042-6989(98)00139-4.

*Vision Research*, 38 (2), 187–194, https://doi.org/10.1016/S0042-6989(97)00179-X.

*Science*, 285 (5425), 257–260. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/10398603

*Journal of Vision*, 7 (5): 4, 1–22, https://doi.org/10.1167/7.5.4. [PubMed] [Article]

*Journal of Vision*, 8 (16): 3, 1–23, https://doi.org/10.1167/8.16.3. [PubMed] [Article]

*Journal of Neurophysiology*, 104 (4), 2103–2115, https://doi.org/10.1152/jn.00728.2009.

*Journal of Neuroscience*, 35 (46), 15430–15441, https://doi.org/10.1523/jneurosci.3189-15.2015.

*Frontiers in Human Neuroscience*, 4 (December), 1–15, https://doi.org/10.3389/fnhum.2010.00221.

*Journal of Vision*, 11 (9): 3, 1–9, https://doi.org/10.1167/11.9.3. [PubMed] [Article]

*Neuron*, 64 (5), 744–755, https://doi.org/10.1016/j.neuron.2009.11.005.

*Nature*, 389 (6648), 280–283, https://doi.org/10.1038/38487.

*Journal of Neurophysiology*, 89 (2), 1094–1111, https://doi.org/10.1152/jn.00717.2002.

*Stereoscopic displays and virtual reality systems XI* (Vol. 5291, pp. 36–46). Bellingham, WA: SPIE, https://doi.org/10.1117/12.529999.

*Experimental Brain Research*, 203 (3), 485–498, https://doi.org/10.1007/s00221-010-2251-1.

*Perception & Psychophysics*, 11 (6), 423–427, https://doi.org/10.3758/BF03206284.

*Scientific Reports*, 7 (1): 16009, https://doi.org/10.1038/s41598-017-16161-3.

*Attention, Perception, and Psychophysics*, 77 (5), 1685–1696, https://doi.org/10.3758/s13414-015-0881-x.

*Philosophical Transactions of the Royal Society B: Biological Sciences*, 371 (20150253), 1–15, https://doi.org/10.1098/rstb.2015.0253.

*Journal of Vision*, 6 (8): 2, 777–790, https://doi.org/10.1167/6.8.2. [PubMed] [Article]

*Current Biology*, 22 (5), 426–431, https://doi.org/10.1016/j.cub.2012.01.033.

*The Journal of Neuroscience*, 18 (4), 1583–1594.

*Journal of Neurophysiology*, 110 (8), 1945–1957, https://doi.org/10.1152/jn.00130.2013.

*Frontiers in Behavioral Neuroscience*, 7 (February): 7, https://doi.org/10.3389/fnbeh.2013.00007.

*Nature*, 389, 283–286.

*Nature Neuroscience*, 12 (8), 1056–1061, https://doi.org/10.1038/nn.2357.

*Journal of Vision*, 3 (9): 125, https://doi.org/10.1167/3.9.125. [Abstract]

*Perception*, 28 (2), 167–181, https://doi.org/10.1068/p2737.

*Ergonomics*, 43 (3), 391–404, https://doi.org/10.1080/001401300184486.

*Experimental Brain Research*, 133 (3), 407–413, https://doi.org/10.1007/s002210000410.

*Journal of Neurophysiology*, 110, 732–747, https://doi.org/10.1152/jn.00991.2012.

*Perception*, 39 (6), 727–744, https://doi.org/10.1068/p6284.

*Frontiers in Psychology*, 1 (October), 1–8, https://doi.org/10.3389/fpsyg.2010.00155.

*Current Biology: CB*, 18 (18), R845–R850, https://doi.org/10.1016/j.cub.2008.07.006.

*Vision Research*, 36 (10), 1479–1492, https://doi.org/10.1016/0042-6989(95)00285-5.

*Journal of Vision*, 18 (3): 23, 1–23, https://doi.org/10.1167/18.3.23. [PubMed] [Article]

*Journal of Neurophysiology*, 97, 4203–4214, https://doi.org/10.1152/jn.00160.2007.

*Journal of Neuroscience*, 23 (18), 6982–6992.

*Vision Research*, 49 (7), 782–789, https://doi.org/10.1016/j.visres.2009.02.014.

*Proceedings of the National Academy of Sciences, USA*, 105 (33), 12087–12092, https://doi.org/10.1073/pnas.0804378105.

*Journal of Vision*, 16 (6): 17, 1–25, https://doi.org/10.1167/16.6.17. [PubMed] [Article]

*Vision Research*, 51 (20), 2186–2197, https://doi.org/10.1016/j.visres.2011.08.014.