Abstract
What we see and what we hear carry different physical properties, yet we are able to integrate this distinct information into a coherent percept. Cross-modal integration has been observed in many brain regions, including primary and non-primary sensory areas as well as higher-level cortical areas. Most previous studies of audiovisual integration used flash/tone or image/sound pairs, which allow easy manipulation of experimental conditions but lack ecological relevance. However, where and when different levels of information are processed and integrated across brain areas and over time during the perception of more naturalistic audiovisual events remains less well investigated. To address this, we selected sixty 1-second naturalistic videos with representative visuals and sounds from three categories: animals, objects, and scenes. We recorded both functional magnetic resonance imaging (fMRI) and electroencephalography (EEG) data as participants (N = 19) viewed the videos and listened to the accompanying sounds while performing an orthogonal oddball-detection task. Using multivariate pattern analysis and a representational similarity approach, we found that visual and acoustic features were processed almost simultaneously, with an onset at ~60 ms and a first peak at ~100 ms. Acoustic information was represented not only in auditory areas but also in visual areas, including the primary visual cortex and high-level visual regions, demonstrating early cross-modal interactions. In contrast, visual features were represented only in visual cortices, suggesting asymmetric neural representations of modality-specific information during multisensory perception. High-level categorical and semantic information emerged later, with an onset at ~120 ms and a peak at ~210 ms, and was observed in higher-order visual and association areas as well as parietal and frontal cortices. By fusing representations from fMRI and EEG, we further resolved the neural processing underlying audiovisual perception at each voxel and each millisecond.