Estimating the velocity at which external objects move is essential for survival in most animal species, as it underlies vital abilities such as catching prey or avoiding collisions. A significant number of studies have been devoted to understanding how the visual system processes speed information, revealing several important properties of these computations (e.g., Perrone & Thiele, 2001). However, current knowledge is mostly limited to how the brain extracts velocity information from single sensory modalities, mostly vision, and largely ignores the potential contribution of other sensory channels carrying speed information about distal stimuli. Whether, and if so how, the different sensory sources of velocity information influence each other during the perception of speed is therefore a question that has yet to be systematically addressed (but see Manabe & Riquimaroux, 2000).
In everyday life situations, the diverse sensory signals originating from a single object are usually highly correlated (e.g., an approaching car produces an expanding image on the retina as well as a rising sound intensity at the ear). Generally, these intersensory correlations are exploited by the brain to create robust representations of the environment, especially under impoverished input conditions (e.g., Ernst & Banks, 2002; Stein & Meredith, 1993). Motion is no exception, and several studies have already revealed strong mutual influences between vision and audition in the perception of motion direction. On the one hand, detection performance can improve when auditory and visual motion signals are available together. For example, Wuerger, Hofbauer, and Meyer (
2003) reported an improvement in motion detection for bimodal stimuli that could be predicted by a probability summation model and interpreted this result as indicating that the visual and auditory signals are integrated at a decision stage. Alais and Burr (2004) also provided evidence for decreased motion detection thresholds when sound and vision were presented together, a result consistent with both probability summation and maximum-likelihood integration. These two studies therefore support the general view that estimating a physical attribute (e.g., the speed of an audiovisual object) improves when more than one cue to that attribute is available. Although there is some debate about the processing level at which audiovisual interactions occur (see Soto-Faraco, Kingstone, & Spence, 2003), some authors have favored the view of an audiovisual integration mechanism operating after the stimuli have been processed in unimodal pathways (Alais & Burr, 2004; Burr & Alais, 2006) over earlier interactions such as those predicted by linear summation models.
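These two classes of models make different quantitative predictions. As a point of reference, the sketch below (in Python, with illustrative numbers that are assumptions rather than values from the studies cited) implements the textbook forms of probability summation for detection and maximum-likelihood, reliability-weighted combination for estimation:

```python
import numpy as np

def probability_summation(p_vis, p_aud):
    """Bimodal detection probability if the two unimodal detectors are
    independent and either one suffices (probability summation)."""
    return 1.0 - (1.0 - p_vis) * (1.0 - p_aud)

def mle_combination(s_vis, sigma_vis, s_aud, sigma_aud):
    """Maximum-likelihood (reliability-weighted) fusion of two unimodal
    speed estimates; returns the combined estimate and its standard deviation."""
    w_vis = sigma_aud**2 / (sigma_vis**2 + sigma_aud**2)  # weight ~ relative reliability
    w_aud = 1.0 - w_vis
    s_hat = w_vis * s_vis + w_aud * s_aud
    sigma_hat = np.sqrt((sigma_vis**2 * sigma_aud**2) / (sigma_vis**2 + sigma_aud**2))
    return s_hat, sigma_hat

# Illustrative numbers only (not taken from the cited experiments)
print(probability_summation(0.60, 0.55))      # 0.82: bimodal detection exceeds either cue alone
print(mle_combination(10.0, 2.0, 12.0, 4.0))  # fused estimate closer to the more reliable (visual) cue
```

Under the maximum-likelihood rule, the combined estimate is never less reliable than the best unimodal one, whereas probability summation predicts a detection benefit even without fusion of the underlying estimates.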
Another successful approach to studying multisensory influences on motion perception (and on other attributes) has been the use of intersensory conflict, which asks whether incongruent information in one sensory modality can influence the perception of motion in another modality. For example, vision can strongly affect the perception of auditory direction (e.g., Mays & Schirillo,
2005; Soto-Faraco, Spence, & Kingstone,
2004; for a review, see Soto-Faraco et al.,
2003). Likewise, suprathreshold auditory motion can bias the perception of near-threshold visual motion direction in a way that is consistent with the direction of the auditory motion (Meyer & Wuerger,
2001). Here, we focus specifically on this approach, namely, studying the effects of one modality (vision) on the other (audition), to address the nature and level of processing involved in the multisensory integration of motion speed. It is worth noting that this approach departs from the question of how the estimate of an audiovisual object's speed improves when multisensory cues are combined and focuses instead on gaining a better understanding of where and how auditory and visual information interact to compute speed.
The rationale of the present experiment builds on how visual motion can be decomposed. In particular, visual research has successfully addressed velocity processing by decomposing it into spatial frequency (SF) and temporal frequency (TF) (Watson & Ahumada, 1983), two frequency domains that can be conveniently separated using sinusoidal moving gratings. The velocity (v) of a grating is given by the ratio (TF/SF) between its TF (in Hz) and its SF (in cycles per degree of visual angle). This spatiotemporal definition of stimulus space has been used to characterize the spectral receptive fields of neurons at various levels of the visual system in the monkey (e.g., Perrone & Thiele,
2001). For instance, many neurons in the middle temporal cortex (MT) encode velocity, as their preferred stimuli lie on an elongated region oriented along an isovelocity line in the space defined by SF and TF. That is, these neurons are tuned to a given velocity and not to a particular value of TF or SF. Psychophysical evidence for a velocity-tuned mechanism has also been reported in humans (Reisbeck & Gegenfurtner, 1999). Unlike MT neurons, motion-sensitive neurons found at earlier stages of the visual system, such as V1, do not show an invariant response across stimuli moving at the same velocity but instead show a response profile tuned to TF. This is regarded as evidence that V1 neurons are tuned not to speed but to local temporal frequencies.
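To make the SF/TF decomposition concrete, the following sketch (a minimal Python illustration with assumed parameter values, not code from the cited studies) computes v = TF/SF for a drifting sinusoidal grating and shows that different (SF, TF) pairs lying on the same isovelocity line share a single speed, the invariance attributed to MT but not to V1 neurons:

```python
import numpy as np

def grating_velocity(tf_hz, sf_cpd):
    """Speed (deg/s) of a drifting sinusoidal grating: v = TF / SF."""
    return tf_hz / sf_cpd

def drifting_grating(sf_cpd, tf_hz, t_s, x_deg):
    """Luminance profile of a 1-D grating of spatial frequency sf_cpd
    (cycles/deg) drifting at tf_hz (Hz), sampled at time t_s (s) and
    positions x_deg (deg of visual angle)."""
    return np.sin(2.0 * np.pi * (sf_cpd * x_deg - tf_hz * t_s))

# Three (SF, TF) pairs on one isovelocity line: all move at 8 deg/s,
# yet their local spatial and temporal frequencies differ widely.
for sf, tf in [(0.5, 4.0), (1.0, 8.0), (2.0, 16.0)]:
    print(f"SF={sf} c/deg, TF={tf} Hz -> v={grating_velocity(tf, sf)} deg/s")

# A sample frame of one such grating at t = 50 ms
x = np.linspace(0.0, 4.0, 5)          # positions in deg of visual angle
print(drifting_grating(1.0, 8.0, 0.05, x))
```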
Several plausible models have been put forward to explain how MT neurons integrate information from motion-related activity at lower levels of the visual system (such as V1) to compute speed (e.g., Perrone & Thiele,
2002; Priebe, Lisberger, & Movshon,
2006). However, the mechanisms underlying the combination of speed cues from a multisensory object remain largely unknown. Are speed signals conveyed by different sensory systems actually integrated and, if so, does this multisensory integration of speed depend on the local spatial structure of the moving stimuli? An affirmative answer could be regarded as evidence for very early audiovisual interactions, occurring before the velocity-tuned mechanisms in MT have combined the signals from motion detectors sensitive to the stimulus spatial structure (presumably separable mechanisms in V1). In contrast, if the integration of audiovisual speed information depends on visual velocity independently of the particular values of SF and TF, then one can conclude that these interactions take place later, only after the MT computations have been carried out.
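The contrast between these two outcomes can be summarized schematically. In the sketch below, the effect functions are purely hypothetical placeholders (they do not come from the cited models); the point is only that an early, pre-MT interaction would make the audiovisual effect vary with TF (or SF) even at a fixed velocity, whereas a late, post-MT interaction predicts the same effect for every (SF, TF) pair sharing one v = TF/SF:

```python
def early_interaction_effect(sf_cpd, tf_hz, gain=0.1):
    """Hypothetical early (pre-MT) prediction: the audiovisual effect
    tracks the local temporal frequency, regardless of velocity."""
    return gain * tf_hz

def late_interaction_effect(sf_cpd, tf_hz, gain=0.1):
    """Hypothetical late (post-MT) prediction: the audiovisual effect
    depends only on velocity v = TF / SF."""
    return gain * (tf_hz / sf_cpd)

# Two gratings matched in velocity (8 deg/s) but differing in SF and TF
for sf, tf in [(0.5, 4.0), (2.0, 16.0)]:
    print(early_interaction_effect(sf, tf),  # 0.4 vs 1.6: differs across the pair
          late_interaction_effect(sf, tf))   # 0.8 vs 0.8: identical across the pair
```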