Abstract
Objects and events in the environment typically produce correlated input to several sensory modalities at once. It is important to understand the conditions under which the different sensory streams are integrated and the mechanisms that support integration. We ask whether there is crossmodal binding of non-speech auditory and visual stimuli, and how and where it is realized physiologically. Do the pitch of a sound and the location of a visual object have some crossmodal correspondence that might provide a basis for their integration (as suggested by the shared use of verbal labels such as “high” and “low”)? In two studies, participants made speeded discrimination responses to one modality of a bimodal audiovisual stimulus, with congruent, incongruent, or neutral pairing of features. RTs to either the location of the object or the pitch of the sound were significantly faster when the bimodal stimuli were congruent (a high-pitched sound with an object in the upper location) than when they were incongruent (a low-pitched sound with an object in the upper location). The second study asked whether the enhancement was due to perceptual integration by using a discrimination that was orthogonal to the congruent or incongruent features and so could not be enhanced by shared response activation. We found faster RTs for congruent stimuli even when the task was to discriminate the object’s shape or the instrument’s sound. The advantage of the congruent over the incongruent condition in this orthogonal task must reflect a crossmodal perceptual process rather than crossmodally induced shifts in response criteria. Because synchrony and spatial proximity were matched across conditions, the bimodal interaction was based primarily on featural or content correspondence. We conclude that the pitch of a sound and the spatial location of a visual object have a natural correspondence or mapping and may be automatically integrated. An fMRI study explored the neural basis of this crossmodal interaction.