Abstract
Events in the world are inherently multimodal. A bouncing ball provides correlated auditory and visual information to the senses. How are such events neurally represented? One possibility is that these distinct sources are integrated into a coherent percept of the event. Alternatively, auditory and visual information may be represented separately but linked via semantic knowledge or their correlated temporal structure. We investigated this question using event-related fMRI. Participants viewed and/or heard 2.5-s environmental events (e.g., paper ripping, a door knocking) in two unimodal and three multimodal conditions:
Auditory only (ripping sound)
Visual only (movie of paper ripping)
Congruent Auditory/Visual (sound + movie of the same instance)
Semantically Incongruent A/V (ripping sound + movie of knocking)
Temporally Incongruent A/V (ripping sound + movie of a different ripping instance)
Of key interest is the encoding of Congruent and Semantically Incongruent A/V events. The integrated account predicts that sensory brain regions will respond differentially to semantic incongruence, whereas under the separate-representation account there should be no difference. Critically, this multimodal response must be stronger than the responses to the unimodal stimuli. We also ask whether integration operates at the level of semantic congruity or at a fine-grained temporal level that binds sound to vision. The comparison of Congruent and Temporally Incongruent events addresses whether integrated multimodal responses arise simply because the auditory and visual information come from the same semantic category, or because a single event produces highly correlated onsets, offsets, and temporal structure across modalities. Such correlated information may be the “glue” that allows the brain to combine perceptually distinct information into coherent representations of events. Preliminary results support the perceptual integration of auditory and visual information originating from a common source, that is, one in which the temporal structure is correlated across modalities.