Pulfrich phenomena are a class of depth illusions generated by an interocular time delay. This may be demonstrated with continuously moving stimuli, stroboscopic displays undergoing apparent motion, or dynamic noise patterns. Previous studies suggest that neurons jointly tuned to motion and disparity may be responsible for the phenomena. Model cells with such joint coding can explain all Pulfrich phenomena in a unified way (N. Qian & R. A. Andersen, 1997). However, the joint-coding idea has been challenged by recent models (J. C. Read & B. G. Cumming, 2005a, 2005c) that focus on the S-shaped functions of perceived disparity in the stroboscopic Pulfrich effect (M. J. Morgan, 1979). Here we demonstrate fundamental problems with the recent models in terms of causality, physiological plausibility, and definitions for joint and separate coding, and we compare the two coding schemes under physiologically plausible assumptions. We show that joint coding of disparity and either unidirectional or bidirectional motion selectivity can account for the S curves, but unidirectional selectivity is required to explain direction-depth contingency in Pulfrich effects. In contrast, separate coding can explain neither the S curves nor the direction-depth contingency. We conclude that Pulfrich phenomena are logically accounted for by joint encoding of unidirectional motion and disparity.

$T$) and interflash distance ($X$) in proportion to keep the speed constant. For each $T$, we varied the interocular time delay from 0 to $T$.

$x$) and the temporal dimension ($t$) in all our simulations. The $x$ dimension of the left and right receptive fields of a binocular simple cell was modeled as Gabor functions (Ohzawa et al., 1990):

$\sigma$ determines the receptive field size, $\omega_x$ is the preferred horizontal spatial frequency, and $\phi_l$ and $\phi_r$ are the phase parameters that determine the shifts of the ON/OFF subregions within the Gaussian envelope. For our simulations, we let $\sigma = 0.32^\circ$ (16 pixels), $\omega_x/2\pi = 0.031$ cycles/pixel (32 pixels/cycle, or about 1.56 cycles/deg at the stated 50 pixels/deg), and we uniformly sampled 8 values of ($\phi_l - \phi_r$) in $[-\pi, \pi)$ to cover the full range of preferred disparities under a given $\omega_x$ (Qian, 1994). The values for the $\phi_l$ and $\phi_r$ pairs were the same as those in Qian (1994).
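The spatial profiles described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the simulation code of the study: the unnormalized Gabor form, the symmetric split $\phi_l = +\Delta\phi/2$, $\phi_r = -\Delta\phi/2$, and all Python names are assumptions made here (the actual phase pairs follow Qian, 1994, and are not listed in the text).

```python
import numpy as np

SIGMA_PX = 16.0                   # 0.32 deg at 50 pixels/deg
PERIOD_PX = 32.0                  # one spatial cycle, in pixels
OMEGA_X = 2 * np.pi / PERIOD_PX   # preferred spatial frequency (rad/pixel)

def gabor(x, phase, sigma=SIGMA_PX, omega=OMEGA_X):
    """Gaussian envelope times a cosine carrier (unnormalized, assumed form)."""
    return np.exp(-x**2 / (2 * sigma**2)) * np.cos(omega * x + phase)

# Eight phase differences sampled uniformly in [-pi, pi).
phase_diffs = -np.pi + 2 * np.pi * np.arange(8) / 8

# Assumed split of each phase difference between the two eyes.
x = np.arange(-48, 49, dtype=float)   # +/- 3 sigma, in pixels
left_rfs = np.array([gabor(x, +d / 2) for d in phase_diffs])
right_rfs = np.array([gabor(x, -d / 2) for d in phase_diffs])

# Phase-difference model: preferred disparity ~ (phi_l - phi_r) / omega_x.
preferred_disparity_px = phase_diffs / OMEGA_X
```

Under these assumptions the eight cells tile preferred disparities from −16 to +12 pixels in steps of 4 pixels, which is one full spatial period.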

$\tau$ is the time constant for the gamma envelope and $\alpha$ determines its degree of skewness. The cosine term with frequency $\omega_t$ generates multiphasic temporal kernels, and the phase $\phi_t$ can be adjusted to allow the first and second half cycles of the kernels to have different durations, as illustrated in Figure 1. The kernels are zero for negative $t$ to ensure causality. With an appropriate choice of parameters, this function can closely mimic the multiphasic temporal kernels of real visual cells. The green curve in Figure 1 was obtained with $\alpha = 2.5$, $\tau = 22.5$ ms (4.5 pixels), $\omega_t/2\pi = 8.3$ cycles/sec (24 pixels/cycle), and $\phi_t = -0.2\pi$. The solid red curve in the figure was obtained with the same parameters, but with the cosine replaced by a sine in the equation. This is very close to the Hilbert transform (dashed red curve) of the green curve, so the green and solid red curves, together with the spatial Gabor filters, are well suited for constructing spatiotemporally oriented (directionally selective) filters in Equations 4 and 5 below (Adelson & Bergen, 1985; Watson & Ahumada, 1985). The gamma envelope is shown as the blue dotted curve. To generate a range of speed preferences for our simulations, we scaled the green and solid red curves (without changing their shapes) by dividing $\tau$ and multiplying $\omega_t$ by a factor taken from the list (0.67, 1.0, 1.5, 2.25); these numbers form a geometric sequence with a common ratio of 1.5.
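As a rough illustration of the temporal kernels and their scaling, the sketch below builds a causal kernel from a gamma envelope multiplied by a cosine (or sine) carrier. The exact envelope normalization is an assumption, since the defining equation is not reproduced here, and the function and argument names are hypothetical.

```python
import numpy as np
from math import gamma

SCALE_FACTORS = (0.67, 1.0, 1.5, 2.25)   # speed-preference scaling from the text

def temporal_kernel(t_ms, scale=1.0, alpha=2.5, tau_ms=22.5,
                    freq_hz=8.3, phase=-0.2 * np.pi, carrier=np.cos):
    """Causal gamma-envelope kernel; carrier=np.sin gives the quadrature partner."""
    tau = tau_ms / scale                             # divide tau by the factor
    omega = 2 * np.pi * freq_hz * scale / 1000.0     # multiply omega_t (rad/ms)
    t = np.asarray(t_ms, dtype=float)
    tp = np.clip(t, 0.0, None)                       # avoid powers of negative t
    # Assumed normalization: gamma probability density with shape alpha.
    envelope = tp**(alpha - 1) * np.exp(-tp / tau) / (gamma(alpha) * tau**alpha)
    return np.where(t >= 0, envelope * carrier(omega * tp + phase), 0.0)

t = np.arange(-50.0, 400.0, 5.0)                     # 5 ms per temporal pixel
kernels = {s: temporal_kernel(t, scale=s) for s in SCALE_FACTORS}
```

With this normalization, scaling by a factor $s$ compresses the kernel in time by $s$ and multiplies its amplitude by $s$, so the shape is unchanged, as the text requires.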

$g$ and $h$ functions by replacing all the cosine terms by the sine terms. The weighting factor $\eta$ determines the directional selectivity of the cells, with $\eta = 0$ for spatiotemporally separable receptive fields (bidirectional) and $\eta = \pm 1$ for spatiotemporally oriented receptive fields (unidirectional). For $|\eta|$ between 0 and 1, intermediate levels of directional selectivity can be created. For simulations in this paper, we use $\eta = 0, \pm 1$ to consider both unidirectional and bidirectional tuning. For each $\eta$, we generate four different speed preferences by scaling the same temporal kernels by the four factors noted above.

$\eta = \pm 1$), we also have 8 motion preferences (4 speed preferences for each of the two opposite directions). Therefore, at each stimulus location there are a total of 8 × 8 = 64 complex cells jointly tuned to disparity and motion. For the bidirectional case ($\eta = 0$), the two opposite directions are combined, yielding a total of 32 complex cells at each location. The responses of a complex cell are obtained from a quadrature pair of simple cells according to the standard disparity energy model (Ohzawa et al., 1990). The simple cell responses are obtained through spatial correlation and temporal convolution between the cell's binocular spatiotemporal receptive fields and the stimulus (Qian & Andersen, 1997).
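The quadrature-pair construction can be illustrated with a purely spatial, one-dimensional sketch of the disparity energy model. It omits the temporal dimension entirely and is not the simulation code of the study; the Gabor form, the symmetric phase split, the white-noise stimulus, and all Python names are assumptions for illustration.

```python
import numpy as np

def gabor(x, sigma, omega, phase):
    """Unnormalized Gabor profile (assumed form)."""
    return np.exp(-x**2 / (2 * sigma**2)) * np.cos(omega * x + phase)

def disparity_energy(left, right, sigma=16.0, period=32.0, n_phase=8):
    """Mean complex-cell energy for each interocular phase difference."""
    omega = 2 * np.pi / period
    x = np.arange(-3 * sigma, 3 * sigma + 1)
    dphi = -np.pi + 2 * np.pi * np.arange(n_phase) / n_phase
    energy = np.zeros(n_phase)
    for k, d in enumerate(dphi):
        total = 0.0
        for quad in (0.0, np.pi / 2):                 # quadrature pair of simple cells
            g_left = gabor(x, sigma, omega, +d / 2 + quad)
            g_right = gabor(x, sigma, omega, -d / 2 + quad)
            # Binocular simple cell: sum of the two monocular spatial correlations.
            simple = (np.correlate(left, g_left, mode="same")
                      + np.correlate(right, g_right, mode="same"))
            total = total + simple**2                 # energy: sum of squares
        energy[k] = total.mean()                      # average over RF positions
    return dphi, energy

rng = np.random.default_rng(0)
left_image = rng.standard_normal(4096)
right_image = np.roll(left_image, 4)                  # right image shifted by 4 pixels
dphi, energy = disparity_energy(left_image, right_image)
best_disparity_px = dphi[np.argmax(energy)] / (2 * np.pi / 32.0)
```

In this sketch the most active cell is the one whose phase difference corresponds to the imposed 4-pixel shift, which is the sense in which a population of such cells encodes disparity.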

$\sigma = 40$ ms, $\omega/2\pi = 6.3$ cycles/sec, and $\phi = 0$ to generate a temporal kernel and its sine counterpart. We then scale these kernels by the same four factors as above to create four different speed preferences for each direction of motion. For each $\sigma$, we let the total kernel duration be $5\sigma$. We then let the leftmost point of the kernels represent $t = 0$ so that the kernels are zero for negative $t$ to ensure causality.

$X$ and $T$ are the spatial and temporal step sizes of the apparent motion, $\Delta t$ is the interocular time delay, $jX$ is the disparity between the middle blue dot and the $j$th red dot, and $w(jT + \Delta t)$ is the weighting factor as a Gaussian function of the temporal separation between the two dots in the match.
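To make the conceptual model concrete, the sketch below evaluates a normalized, Gaussian-weighted average of the match disparities $jX$ with weights $w(jT + \Delta t)$. The normalized form is an assumption about Equation 6, the Gaussian width of 20 ms is a hypothetical value chosen only for illustration, and the sign of the result follows from the indexing convention used here.

```python
import numpy as np

def conceptual_disparity(delay_ms, T_ms, X=1.0, sigma_ms=20.0, n_matches=50):
    """Weighted average of disparities j*X with Gaussian weights w(j*T + delay).

    sigma_ms is a hypothetical width for the weighting function.
    """
    j = np.arange(-n_matches, n_matches + 1)
    separation = j * T_ms + delay_ms                 # temporal separation of each match
    w = np.exp(-separation**2 / (2 * sigma_ms**2))   # Gaussian weighting factor
    return np.sum(j * X * w) / np.sum(w)

# Short interflash interval: the magnitude follows the continuous-motion value.
short_T = conceptual_disparity(delay_ms=1.25, T_ms=5.0)
# Long interflash interval at the same fractional delay: the magnitude collapses.
long_T = conceptual_disparity(delay_ms=30.0, T_ms=120.0)
```

For a delay of $T/4$, the short-interval case gives a magnitude close to $X/4$, whereas the long-interval case gives a value near zero, which is the lower limb of the S curve.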

$t$, then the dots presented after time $t$ have not yet appeared, and the disparities of the matches involving any future dots should not be included in the summation to estimate the currently perceived disparity.

$\tau$ for determining the Gaussian weighting factors in Equation 6, then contributions of the future matches are negligible. Thus, summation to the infinite future in the conceptual model (Equation 6) can be replaced by summation up to the current time, and all past dots' positions and times can be assumed to be stored in memory and available for disparity calculations. Unfortunately, when this solution to the causality problem is applied, it immediately creates a new problem. The model only considers matches involving a special dot (the middle blue dot in Figure 4A). Other dots seen by the same eye as the special dot (other blue dots in Figure 4A) are ignored. The new problem occurs when the current time is not aligned with the special dot: the special dot is just one of many that appeared in the past and there should not be anything special about it. Therefore, one has to include all the previously ignored matches (Figure 4B).

$S_{LR}$) but their separations from the current time ($S_{CT}$) are very different. It certainly does not make sense to weight them equally regardless of how large $S_{CT}$ is. The weighting factor in Equation 6 of the conceptual model proposal (Read & Cumming, 2005c) is only a function of $S_{LR}$. The effect of $S_{CT}$ is not considered.

$t$. If the two eyes' patterns have the same contrast polarity, one does not see depth for $\Delta t > 50$ ms. Interestingly, for larger values of $\Delta t$, depth perception can be restored when the two eyes' patterns have opposite polarities (Cogan et al., 1993). This observation may be explained computationally (Grunewald & Grossberg, 1998) by use of typical *biphasic* temporal kernels of visual cells (DeAngelis, Ohzawa, & Freeman, 1993a, 1993b; Hawken, Shapley, & Grosof, 1996). A flash of light in a V1 cell's ON region, for example, generates an initial excitatory response, followed by a longer inhibitory response (cf. the green and solid red curves in Figure 1). The full temporal response lasts for about 100 to 200 ms. When the two retinal images have the same contrast polarity, $\Delta t$ has to be less than 50 ms to allow an overlap between the same-signed responses evoked through the two eyes and thus enable stereo matching. If $\Delta t$ is greater than 50 ms, there is only an interocular overlap between the opposite-signed responses and stereovision fails. When the two retinal images have opposite polarities and $\Delta t$ is greater than 50 ms, the overlapping responses evoked through the two eyes have the same sign again, and stereovision is restored.

$t$) should drop to zero at about $\Delta t = 50$ ms and then reverse its sign. Obviously, the model will not work with such a dramatic change of the weighting function (Read & Cumming, 2005a). Even if we assume that the weighting function stays at zero for $\Delta t > 50$ ms, the model will still not work. For example, if the sum of the interflash interval $T$ and the interocular delay $\Delta t$ exceeds 50 ms, the match indicated by the green arrow in Figure 2 will have zero weight.

$D$ and $\Delta t$, respectively. Since there is only one match between the eyes, Equation 6 reduces to:

$t$ and $D$, the predicted disparity is always $D$. This prediction is clearly incorrect because, as we mentioned above, the perceived disparity should decrease to zero when $\Delta t$ increases to 50 ms. Note that the conceptual model critically depends on this prediction. If the model is revised to eliminate this incorrect prediction, it will no longer explain the perceived disparity in the stroboscopic Pulfrich effect.
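The single-match case can be written out explicitly. Assuming that Equation 6 is a normalized weighted average of the match disparities (an assumption consistent with the statement that the prediction is always $D$), and writing $\hat{d}$ for the predicted disparity (a symbol introduced here), a single match with disparity $D$ and interocular delay $\Delta t$ gives:

```latex
\hat{d} \;=\; \frac{D \, w(\Delta t)}{w(\Delta t)} \;=\; D ,
\qquad \text{for any } \Delta t \text{ with } w(\Delta t) \neq 0 .
```

The weight cancels between numerator and denominator, so the prediction does not depend on $\Delta t$ however small $w(\Delta t)$ becomes.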

*identical* motion tuning (we thank Dr. Read for clarifying this implementation detail). The problem is that cells with identical motion tuning cannot encode motion, just as cells that all have identical orientation tuning curves cannot encode orientation and a visual system with only red (L) cones is color blind. To encode a stimulus property, cells preferring a *range* of that property are required. Thus, the recent “joint” motion-disparity coding model (Read & Cumming, 2005a) cannot encode motion and does not qualify as a joint coding model. Since both the “joint” and separate coding models in the recent study (Read & Cumming, 2005a) can only code disparity but not motion, they are really different versions of separate coding according to our definition (see Methods).

*separate* motion and disparity cells is sufficient to account for the dynamic-noise Pulfrich effect (Read & Cumming, 2005a). But how is this correlation computed? In the model (Read & Cumming, 2005a), this correlation is computed artificially and not by the separate motion and disparity cells. If the brain computes this correlation, then the cells involved must link motion and disparity responses in some way (Spang & Morgan, 2008). Although, a priori, the linkage could take different forms, the presence of jointly tuned cells in V1 and MT makes it unnecessary to propose other mechanisms.

$d$ excites photoreceptors at $x$ on the left retina and at $x + d$ on the right retina, there is correlation between the responses of these two sets of photoreceptors. The logical flaw here is that although one can compute these correlations artificially, the photoreceptors cannot. All the information the visual system receives, including various correlations across space, time, and eyes, is already present at the retina. The presence of a correlation provides an opportunity for encoding, but that is not the same as actual encoding.

*by chance* when there are more spikes. For this reason, studies on synchronization remove spurious synchronization via shuffle correction or randomization between data and experimental conditions (Castelo-Branco, Neuenschwander, & Singer, 1998; Fries, Womelsdorf, Oostenveld, & Desimone, 2008). If spurious synchronization were allowed as a coding mechanism, one would end up with a logical fallacy similar to the one discussed above, namely that all vision problems are solved by retinal ganglion cells.

$T$ (see Equation 12 of Read and Cumming, 2005a), or equivalently an integer multiple of $T$. A key step of the proof, Equation A3 of Read and Cumming (2005a), fails without this assumption. However, the assumption is non-physiological because it implies that the neural integration time has to equal the arbitrary interflash intervals of stroboscopic Pulfrich stimuli. If the assumption is replaced by a generally applicable and physiologically plausible one, such as integration over a 200 ms window of V1 responses (as we did in our simulations; see Methods), the proof fails and it is not clear whether the simulations in that study (Read & Cumming, 2005a) can still work. A related ambiguity in the study (Read & Cumming, 2005a) is that although it claims to use the disparity energy model (Ohzawa et al., 1990), there is no indication of using the standard quadrature pair construction, or its equivalent (Qian & Mikaelian, 2000), in either the proof or the simulations.

$\pi$ range. The range of motion preference is obtained by scaling the same temporal kernels by four different factors so that there are four different preferred temporal frequencies and thus four different preferred speed ranges. In addition, we include preferences for opposite directions of motion so that there are 8 motion preferences (2 directions, each with 4 speed ranges). Overall, there are a total of 8 × 8 = 64 complex cells at each location to jointly code disparity and motion. The responses of the underlying simple cells at all locations as a function of time were obtained via spatial correlation and temporal convolution (Qian & Andersen, 1997), and the responses of the complex cells were computed via the energy method (Ohzawa et al., 1990). Cells with different temporal kernels have different response time courses. To determine the equivalent disparity at each spatial location, we first integrate the temporal responses of each complex cell at that location over the past 200 ms. For disparity estimation, we also pool responses across different motion preferences. We then locate the peak along the disparity dimension and use it to represent the perceived equivalent disparity in exactly the same way as for previous models (Qian, 1994; Qian & Andersen, 1997). As the stimulus is flashed at successive locations, the equivalent disparity quickly (<200 ms) builds up to a steady-state value, which is used in our plots.
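The readout steps just described (integrate over the past 200 ms, pool across motion preferences, take the peak along the disparity dimension) can be sketched as follows. The array layout, the function name, and the use of a plain `argmax` instead of any interpolation around the peak are assumptions made for illustration.

```python
import numpy as np

def equivalent_disparity(responses, preferred_disparities,
                         dt_ms=5.0, window_ms=200.0):
    """Read out disparity from complex-cell responses at one location.

    responses: array of shape (n_disparity, n_motion, n_time), with the last
    time sample taken as the current time (hypothetical layout).
    """
    n_window = int(round(window_ms / dt_ms))
    # 1. Integrate each cell's response over the past 200 ms.
    integrated = responses[:, :, -n_window:].sum(axis=2) * dt_ms
    # 2. Pool across motion preferences.
    pooled = integrated.sum(axis=1)
    # 3. Locate the peak along the disparity dimension.
    return preferred_disparities[np.argmax(pooled)]

# Synthetic example: 8 disparity x 8 motion preferences, 100 time samples.
preferred = np.arange(-16, 16, 4)
responses = np.zeros((8, 8, 100))
responses[2, :, :50] = 10.0    # strong but older than 200 ms, so ignored
responses[5, :, 70:] = 1.0     # recent activity in the cells preferring +4 pixels
estimate = equivalent_disparity(responses, preferred)
```

Restricting the integration to a fixed recent window keeps the readout causal: activity older than 200 ms has no influence on the current estimate.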

$d$) as a fraction of the interflash distance ($X$) is plotted against interocular time delay ($\Delta t$) as a fraction of the interflash time interval ($T$) for several values of $T$. To understand what is presented here, first note that for continuously moving stimuli, $d = (X/T)\,\Delta t$, or $d/X = \Delta t/T$; this corresponds to the diagonal line in the figure. When $T$ is 30 ms (or smaller), the results follow the diagonal line, indicating that the equivalent disparities are equal to the values for a stimulus moving continuously. When $T$ is larger than 30 ms, the model reproduces the S curves (Morgan, 1979).

$T$ is relatively large and $\Delta t$ is between 0 and $T/2$, the equivalent disparity is smaller than expected from the continuous motion case. To understand why the joint-coding model can explain this finding, first consider the case when $T$ is much smaller than the cells' temporal response durations, so that the apparent motion of the stimulus is almost as strong as in the continuous motion case. Here, the cells whose motion preference matches the stimulus motion will respond far more vigorously than other cells. In this case, pooling across different motion preferences can be well approximated by using only the cells tuned to the stimulus motion. This simplification is used in the original Pulfrich model (Qian & Andersen, 1997). The computed equivalent disparity is equal to that of the continuous motion case. When $T$ is comparable to the cells' temporal response durations, the apparent motion is very weak and the Pulfrich effect will disappear. In this case, cells with different motion preferences will respond equally, and since different motion preferences predict different equivalent disparities (including different signs), the pooled response predicts a near-zero equivalent disparity (Qian & Andersen, 1997). For intermediate $T$ values, the equivalent disparity is not zero but smaller than that for the continuous motion case, as shown in Figure 5.

$t$ is larger than $T/2$, the S curves of Figure 5 are above the diagonal line. This is because the $n$th dot in the delayed eye image is temporally closer to the $(n + 1)$st dot than to the $n$th dot in the other eye, so that the retinal disparity of this match is equal to the interflash distance instead of 0.

$\eta = 0$ in Equations 4 and 5) and the 8 motion preferences used above are collapsed into 4 motion preferences because each cell treats the two opposite directions equally.

$t$ and its equivalent disparity $d$ (Equation 10 in Qian & Andersen, 1997)

$v = -\omega_t/\omega_x$, whereas a corresponding bidirectional cell is tuned to two opposite velocities ($\pm v$). When the Pulfrich stimulus contains a strong motion signal, either the $v$ or the $-v$ component of the bidirectional cell will be strongly activated and an interocular time delay $\Delta t$ will be treated as an equivalent disparity of $v\Delta t$ or $-v\Delta t$. As the motion signal gets weaker, the dominance of one component over the other in the bidirectional cell will become weaker and the equivalent disparity will be between $v\Delta t$ and $-v\Delta t$.
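As a worked example of these quantities, the sketch below computes the preferred speed $|v| = \omega_t/\omega_x$ for each temporal scaling factor and the corresponding bounds $\pm |v|\Delta t$ on the equivalent disparity. The unit conversions (50 pixels/deg from 0.32° = 16 pixels, and 5 ms per temporal pixel from 22.5 ms = 4.5 pixels) are inferred from the stated parameters, and the function names are hypothetical.

```python
SPATIAL_FREQ_CPD = 50.0 / 32.0            # 32 pixels/cycle at 50 pixels/deg
TEMPORAL_FREQ_HZ = 1000.0 / (24 * 5.0)    # 24 pixels/cycle at 5 ms/pixel
SCALE_FACTORS = (0.67, 1.0, 1.5, 2.25)    # temporal scaling factors from the text

def preferred_speed_deg_per_s(scale):
    """|v| = omega_t / omega_x, with omega_t multiplied by the scaling factor."""
    return scale * TEMPORAL_FREQ_HZ / SPATIAL_FREQ_CPD

def equivalent_disparity_bounds_deg(scale, delay_ms):
    """Range (-|v| dt, +|v| dt) within which a bidirectional cell's estimate falls."""
    d = preferred_speed_deg_per_s(scale) * delay_ms / 1000.0
    return -d, d

speeds = [preferred_speed_deg_per_s(s) for s in SCALE_FACTORS]
bounds = equivalent_disparity_bounds_deg(scale=1.0, delay_ms=15.0)
```

With the unscaled kernel the preferred speed comes out at about 5.3 deg/s, so a 15 ms delay corresponds to an equivalent disparity of at most 0.08° in either direction.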

*bidirectional* motion can also produce the S curves, we emphasize that Pulfrich phenomena are best explained by joint coding of disparity and *unidirectional* motion. The reason is that for Pulfrich phenomena, the perceived depth and direction of motion are contingent (e.g., when the left eye's view is delayed, dots moving to the left and right have near and far disparity, respectively). Bidirectional cells cannot determine the direction of motion and thus cannot explain the direction-depth contingency, while unidirectional cells can (Qian & Andersen, 1997).

*unidirectional* motion. Joint coding between disparity and bidirectional motion in monkey V1 can only extract stimulus disparity and speed but not direction.

*unidirectional* motion is also needed to explain the dynamic noise Pulfrich effect (Qian & Andersen, 1997). Here the percept is a volume revolving in depth. When the left eye's view is delayed with respect to that of the right eye, the near and far halves of the volume rotate to the left and right, respectively. When the right eye's view is delayed, the directions of motion reverse. Bidirectional motion preferences cannot distinguish between the two opposite directions of motion and thus cannot fully explain the perception. An assertion in the recent computational model (Read & Cumming, 2005a) is that dynamic noise Pulfrich effects can be explained via a correlation between pure disparity and pure motion responses. However, this correlation has to be computed by joint motion-disparity cells, as was done in a previous Pulfrich model (Qian & Andersen, 1997). Otherwise, such correlation is simply a reflection of the same correlation in the stimulus. It does not represent coding, in the same sense that correlation among photoreceptors in response to a bar does not represent orientation coding. The joint coding of disparity and unidirectional motion by MT cells and a small fraction of V1 cells naturally represents the correlation between disparity and motion in dynamic noise Pulfrich stimuli and explains the perception (Qian & Andersen, 1997).

*large* interflash intervals in stroboscopic Pulfrich stimuli. Since under this condition no motion preference dominates the responses, we pooled across different motion preferences before estimating disparity. We further found that either unidirectional or bidirectional motion selectivity can be used in the joint coding model to explain the S curves. This is not surprising because the S curves do not involve perception of motion direction. However, one may debate whether joint coding of bidirectional motion and disparity could also be viewed as a version of separate coding. We do not think so because the model covers a range of motion-speed preferences and a range of disparity preferences combinatorially, and thus can jointly encode speed and disparity. By reducing the range of speed preferences to a single speed preference, we produced truly separate coding models and showed that they cannot explain the S curves. In any case, the debate is not particularly interesting because only joint coding of *unidirectional* motion and disparity can explain both the S curves and the direction-depth contingency in Pulfrich effects.