Events in the real world constitute an overwhelming source of sensory signals; thus, the ability to flexibly integrate or combine different sources of information plays a fundamental role in our perception. The integration of acoustic and visual information is one of the most important issues in cross-modal studies of perception and attention (Burr & Alais,
2006; Driver & Spence,
2004; Ernst & Bülthoff,
2004; Vroomen & de Gelder,
2000).
Cross-modal stimulation affects performance in visual detection and spatial discrimination (Driver & Spence,
2004; McDonald, Teder-Sälejärvi, & Hillyard,
2000) as well as in covert attention tasks (Driver & Spence,
1998; McDonald & Ward,
2000) and may generate misrepresentations of some visual stimulus features, leading to perceptual illusions (Alais & Burr,
2003; McGurk & MacDonald,
1976; Shams, Kamitani, & Shimojo,
2000,
2002).
Moreover, it has been shown that, in tasks requiring the detection of visual targets, a sound presented in synchrony with the visual stimulus enhances sensitivity to specific visual features such as contrast (Lippert, Logothetis, & Kayser,
2007), intensity (Stein, London, Wilkinson, & Price,
1996), and pattern configuration (Vroomen & de Gelder,
2000).
Some recent contributions have focused on the level at which the audio–visual interaction occurs (Mishra, Martinez, Sejnowski, & Hillyard,
2007; Shams et al.,
2002; Wallace, Carriere, Perrault, Vaughan, & Stein,
2006). In particular, Lippert et al. (
2007), using vertical Gabor gratings at variable contrast, compared the effect of a synchronous sound presented alone (“sound informative” condition) or combined with a visual cue (a gray frame surrounding the target, “sound uninformative” condition) in a contrast detection task. They found that the cross-modal facilitation of visual contrast detection disappeared when the sound was redundant with the visual display. The authors interpreted this finding in terms of a cognitive, high-level interaction, ruling out the possibility of the low-level interaction that had been suggested by previous studies (Marks, Ben-Artzi, & Lakatos,
2003; Odgaard, Arieh, & Marks,
2003).
On the other hand, Mishra et al. (
2007) and Shams et al. (
2002) interpreted the robustness of the “sound-induced flash illusion”, in which a single flash presented with multiple beeps is perceived as multiple flashes, as evidence for an interaction occurring within the main sensory circuitry, an interpretation in favor of a low-level neural integration of audio–visual signals. This view is supported by the multisensory activation observed in both visual and auditory primary cortices (Kayser & Logothetis,
2007; Martuzzi et al.,
2007) and by the presence of multisensory neurons in the superior colliculus (Stein, Meredith, & Wallace,
1993; Stein, Stanford, Ramachandran, Perrault, & Rowland,
2009).
The aim of the present study is to probe the mechanisms of acoustic facilitation of visual detection through the use of the Classification Images technique (Ahumada,
2002; Ahumada & Lovell,
1971), also referred to as Psychophysical Reverse Correlation. This method is based on the analysis of the characteristics of the visual noise that lead to specific observer responses, and it has proved very useful in revealing the perceptual templates exploited by an observer in visual tasks such as Vernier acuity (Beard & Ahumada,
1999), disparity discrimination (Neri, Parker, & Blakemore,
1999), illusory-contour perception (Gold, Murray, Bennett, & Sekuler,
2000), orientation discrimination (Solomon,
2002) as well as spatially cued detection (Eckstein, Shimozaki, & Abbey,
2002). In particular, a recent spatio-temporal version of this technique (Neri & Heeger,
2002) has provided a powerful tool for probing behaviorally the mechanisms involved in visual detection and discrimination across both space and time. Using spatio-temporally modulated white noise and a classification image analysis of both the 1st- and 2nd-order kernels, corresponding to the mean and variance templates of the noise respectively, the authors were able to dissociate two processing stages: an early ‘detection’ stage, in which strong noise variations occurring early in the trial engage automatic, exogenous mechanisms of attentional capture, and a later ‘identification’ stage that follows detection by about 100 ms and is characterized by the use of image intensities to identify the luminance polarity of the signal (a bright or dark bar).
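The logic of this kernel analysis can be sketched as follows. In a minimal two-class form, which simplifies the trial-sorting scheme actually used by Neri and Heeger (2002), the 1st-order kernel is the difference between the mean noise fields associated with the two response classes, and the 2nd-order kernel is the corresponding difference between the noise variances. The function and variable names below are illustrative only.

```python
import numpy as np

def classification_kernels(noise_fields, responses):
    """Minimal sketch of 1st- and 2nd-order kernel estimation.

    noise_fields : array, shape (n_trials, n_space, n_time)
        Zero-mean external noise added on each trial.
    responses    : bool array, shape (n_trials,)
        True where the observer reported 'signal present'.
    """
    noise_fields = np.asarray(noise_fields, dtype=float)
    responses = np.asarray(responses, dtype=bool)

    yes = noise_fields[responses]    # noise on 'yes' trials
    no = noise_fields[~responses]    # noise on 'no' trials

    # 1st-order kernel: difference of mean noise (the classical classification image)
    k1 = yes.mean(axis=0) - no.mean(axis=0)

    # 2nd-order kernel: difference of noise variance, indexing sensitivity to noise energy
    k2 = yes.var(axis=0) - no.var(axis=0)

    return k1, k2


# Usage with simulated data: 2000 trials, 9 spatial positions, 12 time frames
rng = np.random.default_rng(0)
noise = rng.normal(0.0, 1.0, size=(2000, 9, 12))
resp = rng.random(2000) < 0.5            # placeholder responses
k1, k2 = classification_kernels(noise, resp)
print(k1.shape, k2.shape)                # (9, 12) (9, 12)
```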
In the present study, we use this paradigm during a visual detection task in a Unimodal (visual-only) and a Bimodal (audio–visual) condition, in order to investigate the nature of the interaction between the auditory and the visual system in response to cross-modal stimulation. We hypothesized that, if the facilitation of detection induced by a sound synchronous with the signal depends on the same low-level mechanisms that underlie visual detection per se, then the improvement should be reflected in the pattern of activation of the 2nd-order kernels. The results confirmed our predictions, showing that the effect of the sound is reflected in the non-linear stage probed by the noise variance, providing novel insights into how a sound interacts with a visual stimulus to make it more detectable.