Our nervous system typically processes signals from multiple sensory modalities at any given moment and is therefore posed with two important problems: which of the signals are caused by a common event, and how to combine those signals. We investigated human perception in the presence of auditory, visual, and tactile stimulation in a numerosity judgment task. Observers were presented with stimuli in one, two, or three modalities simultaneously and were asked to report their percepts in each modality. The degree of congruency between the modalities varied across trials. For example, a single flash was paired in some trials with two beeps and two taps. Cross-modal illusions were observed in most conditions in which there was incongruence among the two or three stimuli, revealing robust interactions among the three modalities in all directions. The observers' bimodal and trimodal percepts were remarkably consistent with a Bayes-optimal strategy of combining the evidence in each modality with the prior probability of the events. These findings provide evidence that the combination of sensory information among three modalities follows optimal statistical inference for the entire spectrum of conditions.

[*V* = 1, *A* = 2, *T* = 1] is shown in Figure 1.

$Z_V$, $Z_A$, $Z_T$ (denoting visual, auditory, and tactile stimuli, respectively). We assume that the sensory signals $s_V$, $s_A$, $s_T$ are conditionally independent given the events $Z_V$, $Z_A$, $Z_T$. This common assumption is based on the fact that the signals of the different modalities are processed in separate pathways (up to the point where interactions occur) and are therefore corrupted by independent noise processes. A Bayesian observer would make the best possible guess about the causes given the sensory signals and the prior probability of the causes. The best possible guess is achieved by optimal inference about the posterior probabilities $P(Z_V, Z_A, Z_T \mid s_V, s_A, s_T)$ using Bayes' rule, which given the assumptions of the model described above simplifies to

$$P(Z_V, Z_A, Z_T \mid s_V, s_A, s_T) = \frac{P(s_V \mid Z_V)\, P(s_A \mid Z_A)\, P(s_T \mid Z_T)\, P(Z_V, Z_A, Z_T)}{P(s_V, s_A, s_T)}. \tag{1}$$

The likelihoods $P(s_V \mid Z_V)$, $P(s_A \mid Z_A)$, and $P(s_T \mid Z_T)$ were each modeled as a normally distributed sensory signal centered about the true stimulus and corrupted by independent unbiased Gaussian noise with standard deviations $\sigma_V$, $\sigma_A$, $\sigma_T$, respectively. The joint prior probability $P(Z_V, Z_A, Z_T)$ was modeled as a multivariate normal distribution centered about its vector mean $\mu_{\mathrm{prior}}$ with covariance matrix $\Sigma_{\mathrm{prior}}$.
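The Results state that the mean, variance, and covariance of the prior are assumed equal across modalities, leaving three free parameters. Under that assumption, the trimodal prior would take the form below. This is a reconstruction from the stated assumption rather than the original display, and $\sigma^2_{\mathrm{prior}}$ is taken here to be the $\mathrm{var}_{\mathrm{prior}}$ entry of Table 2.

```latex
\mu_{\mathrm{prior}} =
\begin{pmatrix}
  \mu_{\mathrm{prior}} \\ \mu_{\mathrm{prior}} \\ \mu_{\mathrm{prior}}
\end{pmatrix},
\qquad
\Sigma_{\mathrm{prior}} =
\begin{pmatrix}
  \sigma^2_{\mathrm{prior}} & \mathrm{cov}_{\mathrm{prior}} & \mathrm{cov}_{\mathrm{prior}} \\
  \mathrm{cov}_{\mathrm{prior}} & \sigma^2_{\mathrm{prior}} & \mathrm{cov}_{\mathrm{prior}} \\
  \mathrm{cov}_{\mathrm{prior}} & \mathrm{cov}_{\mathrm{prior}} & \sigma^2_{\mathrm{prior}}
\end{pmatrix}
```

For unimodal and bimodal trials, the corresponding $N \times 1$ and $N \times N$ sub-blocks would be used.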

$\mu_{\mathrm{prior}}$, $\sigma_{\mathrm{prior}}$, $\mathrm{cov}_{\mathrm{prior}}$.

$\sigma_A$, $\sigma_V$, and $\sigma_T$ for auditory, visual, and tactile modalities, respectively. The posterior is calculated according to Equation 1, using a prior distribution characterized by the three free parameters, $\mu_{\mathrm{prior}}$, $\sigma_{\mathrm{prior}}$, and $\mathrm{cov}_{\mathrm{prior}}$. The appropriate $N \times 1$ prior mean vector and $N \times N$ prior covariance matrix are used in calculating the posterior distribution, where $N$ is the number of stimulus modalities presented for that given simulated trial. This makes the assumption that subjects do not report any pulsations for an absent stimulus, i.e., no hallucinations or synesthesia. This assumption was indeed confirmed by the data. Although we allowed reporting a non-zero number for an absent stimulus, subjects did so in only 2% of all possible unimodal or bimodal experimental trials, likely due to motor or memory errors.

We assume that the observer tries to minimize the mean squared error (i.e., a least squares loss function), and thus the optimal response would be the mean of the posterior distribution. Note that as we use normal distributions for the likelihood and priors, the posterior is also normally distributed, and taking the mean is equivalent to finding the maximum of the distribution. To produce a response based on this optimal estimate, for each modality, we then choose the response category (0, 1, or 2) nearest to the optimal estimate. These simulations result in a response distribution for each of the stimulus conditions. An optimization search is used to find the parameters of the prior distribution that minimize the mean squared error between the simulated responses and the responses of human observers.
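The decision rule described above can be sketched in a few lines of Python. For a Gaussian prior and Gaussian likelihoods, the posterior mean has the standard closed form $\mu_{\mathrm{post}} = \mu_{\mathrm{prior}} + \Sigma_{\mathrm{prior}}(\Sigma_{\mathrm{prior}} + \Sigma_s)^{-1}(s - \mu_{\mathrm{prior}})$, where $\Sigma_s$ is the diagonal matrix of sensory noise variances. The function names (`solve`, `posterior_mean`, `nearest_category`) are hypothetical, the Table 2 entries are assumed to be standard deviations for $\sigma$ and a variance for $\mathrm{var}_{\mathrm{prior}}$, and the example uses noise-free signals, whereas the simulations in the text draw noisy ones. It is an illustration under those assumptions, not the authors' implementation.

```python
def solve(a, b):
    """Solve the small dense system a @ x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def posterior_mean(signals, sigmas, mu_prior, var_prior, cov_prior):
    """Posterior mean for N presented modalities (N = len(signals)).

    Assumes a shared prior mean, variance, and covariance across modalities.
    """
    n = len(signals)
    prior_cov = [[var_prior if i == j else cov_prior for j in range(n)]
                 for i in range(n)]
    # Sigma_prior + Sigma_s, where Sigma_s is diagonal (independent noise).
    total = [[prior_cov[i][j] + (sigmas[i] ** 2 if i == j else 0.0)
              for j in range(n)] for i in range(n)]
    weights = solve(total, [s - mu_prior for s in signals])
    return [mu_prior + sum(prior_cov[i][j] * weights[j] for j in range(n))
            for i in range(n)]


def nearest_category(estimate, categories=(0, 1, 2)):
    """Map a continuous estimate to the nearest allowed response."""
    return min(categories, key=lambda c: abs(c - estimate))


# Group values from Table 2 (order: V, A, T); condition [V = 1, A = 2, T = 2].
estimates = posterior_mean(signals=[1.0, 2.0, 2.0], sigmas=[0.45, 0.21, 0.25],
                           mu_prior=1.93, var_prior=0.25, cov_prior=0.21)
responses = [nearest_category(e) for e in estimates]
```

With these parameter values, the visual estimate is pulled toward the two-pulse auditory and tactile signals, which is the qualitative pattern of a fission illusion.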

$R^2 = 1 - \mathrm{SS}_E/\mathrm{SS}_T$, as the measure of goodness of fit between the model and the data, as well as the Bayesian Information Criterion (BIC; Burnham & Anderson, 2002).
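As a minimal sketch, the two fit measures could be computed as follows. `r_squared` follows the formula in the text directly. `bic_least_squares` assumes the common least-squares form $\mathrm{BIC} = n \ln(\mathrm{SS}_E/n) + k \ln n$, which may differ from the exact variant used for Table 1. Both function names are hypothetical.

```python
import math


def r_squared(observed, predicted):
    """Coefficient of determination, R^2 = 1 - SS_E / SS_T."""
    mean_obs = sum(observed) / len(observed)
    ss_e = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_t = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_e / ss_t


def bic_least_squares(observed, predicted, n_params):
    """BIC under Gaussian errors (assumed form): n ln(SS_E / n) + k ln(n)."""
    n = len(observed)
    ss_e = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return n * math.log(ss_e / n) + n_params * math.log(n)
```

Lower BIC values indicate a better trade-off between fit and the number of free parameters, so a model with more parameters has to reduce $\mathrm{SS}_E$ enough to offset the $k \ln n$ penalty.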

[*V* = 1, *A* = 0, *T* = 0] and bimodal condition [*V* = 1, *A* = 2, *T* = 0] may either be statistically insignificant or correspond to a statistically significant modulation of visual perception by sound. Therefore, to find which of the changes between two conditions correspond to a statistically significant perceptual interaction, we performed the following analysis. We calculated the *d*′ discriminability index (Smith, 1982) for each of the modalities in each of the stimulus conditions and examined whether the change in sensitivity (*d*′) between two conditions is statistically significant. Figure 4 shows the corrected *t*-statistics for each unimodal condition given in the graph title compared with all conditions that differ in the other two modalities. Test statistics were calculated by comparing the *d*′ across subjects for perception in each modality between the unisensory condition (e.g., visual percept in [*V* = 1, *A* = 0, *T* = 0]) and the corresponding bi-sensory (e.g., visual percept in [*V* = 1, *A* = 2, *T* = 0]) or tri-sensory condition (e.g., visual percept in [*V* = 1, *A* = 2, *T* = 2]) using a two-tailed paired *t*-test (*α* = 0.05, Bonferroni corrected for 48 tests). The *p*-values are provided in gray scale with darker squares corresponding to lower *p*-values. Statistically significant tests are highlighted in red squares, and all were found to be in the right tail, shifting away from the veridical percept for that given modality (positive *t*-statistics). The first row of Figure 4 provides a statistical examination of illusory fission effects, in which the percept of a single pulse in one modality (e.g., a single flash, beep, or tap) is changed into two pulses (two flashes, two beeps, or two taps) when paired with two pulsations in one or both of the other modalities. The second row provides a statistical examination of illusory fusion effects, in which the percept of two pulses in one modality (two flashes, two beeps, or two taps) is changed into one when paired with one pulse in one or both of the other modalities.
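The analysis above can be illustrated with a short standard-library sketch. It assumes the textbook definition $d' = z(\text{hit rate}) - z(\text{false-alarm rate})$, which is not necessarily the exact procedure of Smith (1982), and it leaves the definition of a "hit" (for example, reporting two pulses when two were presented) to the caller. The function names are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def d_prime(hit_rate, false_alarm_rate, eps=1e-3):
    """d' = z(hit) - z(false alarm); rates are clipped away from 0 and 1."""
    def clip(p):
        return min(max(p, eps), 1.0 - eps)

    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(clip(hit_rate)) - z(clip(false_alarm_rate))


def paired_t(x, y):
    """Paired t-statistic for per-subject values x and y (same subject order)."""
    diffs = [a - b for a, b in zip(x, y)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


def bonferroni_alpha(alpha=0.05, n_tests=48):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests
```

Here `paired_t` would receive one *d*′ value per subject for each of the two conditions being compared. Converting the statistic to a two-tailed *p*-value requires the *t* distribution with `len(x) - 1` degrees of freedom, which this sketch omits.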

[*V* = 1, *A* = 2, *T* = 0], the sound-induced flash illusion (Shams et al., 2000) can be seen in the visual responses, in which subjects report two flashes in a large fraction of trials (Figure 3, top plot in orange box) due to the introduction of two beeps (*p* < 0.001; Figure 4A). A visual fusion illusion is found in the [*V* = 2, *A* = 1, *T* = 0] condition, in which subjects report seeing one flash in a large fraction of trials as a result of pairing with one beep (Figure 3, bottom plot in orange box, and Figure 4D). Similar fission (*p* < 0.001) and fusion (*p* < 0.001) touch-induced visual illusions (Violentyev et al., 2005) are found in the [*V* = 1, *T* = 2, *A* = 0] and [*V* = 2, *T* = 1, *A* = 0] conditions (Figure 3, brown box, and Figures 4A and 4D). In addition to these previously reported illusions, we find weaker but statistically significant auditory and tactile illusions in some of the bimodal conditions. For example, in condition [*V* = 0, *A* = 1, *T* = 2], a touch-induced double-beep illusion occurs (*p* < 0.001, Figure 4B). Visually induced and sound-induced fission touch illusions also occur in conditions [*V* = 2, *A* = 0, *T* = 1] (*p* < 0.001, Figure 4C) and [*V* = 0, *A* = 2, *T* = 1] (*p* < 0.001, Figure 4C), respectively.

*p*-value associated with the change in *d*′ from the original bimodal responses while showing the original bimodal pair along the vertical axis. Again, a paired two-tailed *t*-test was performed for each comparison, and *p*-values are provided in gray scale with darker squares corresponding to lower *p*-values. It is interesting to note that some of the significant effects are a result of the third modality introducing an illusory effect, similar to those in Figure 4. These are shown by red squares. The statistically significant changes that fall within the left tail (negative *t*-statistics) are shown in blue. The blue squares represent changes where the addition of a third modality resulted in a *decrease* in the initial illusion, i.e., a shift toward the veridical percept for that given modality. For example, comparing visual and tactile responses in [*V* = 2, *T* = 1, *A* = 0] (Figure 3, top plot in brown box) vs. [*V* = 2, *T* = 1, *A* = 2] (Figure 3, green box), we find that the modulatory effect of a single tap on vision (leading to the visual fusion effect) seen in the former condition is significantly reduced in the latter due to the introduction of the double beeps, while the rate of tactile fusion effects increases. The red square in the second column of Figure 5F indicates that adding two beeps to the [*V* = 2, *T* = 1, *A* = 0] condition (shown in the title) results in a significant increase in illusory tactile percepts (first row), and the blue square shows a significant decrease in the visual illusion (second row).

*congruent* bimodal conditions (see Figures 5G–5L) shows that the double pulses in the third modality consistently lead to fission effects in one or both of the congruent modalities (Figures 5G–5I), whereas the addition of a third single-pulse event to congruent bimodal events only leads to a tactile fusion effect (Figure 5L).

$R^2 = 0.95$). However, this measure of goodness of fit may be somewhat inflated since over half of the conditions are either unimodal or congruent stimulus presentations where the responses are close to veridical. As mentioned above, most interactions occur in incongruent conditions, and thus we are particularly interested in examining how the model would account for these data. The $R^2$ value for these conditions is presented in Table 1, along with the comparison values for three alternative models: independence, forced fusion, and veridical.

| Model | $R^2$ (incongruent) | BIC |
|---|---|---|
| Bayesian inference | 0.8884 | −455 |
| Cross-validation | 0.8746 | −443 |
| Independent | 0.6792 | −355 |
| Forced fusion | 0.5791 | −326 |
| Veridical | 0.4340 | −294 |

$R^2$ of this cross-validation averaged over the two partitions of the subjects is also shown in Table 1. The calculated unimodal variances and the optimized parameter values for the prior distribution are shown in Table 2 for group and individual subject fits. The relationship between the parameter values and the observed behavior will be explored in the Discussion section. The comparison of the model with individual subjects' data also resulted in a good fit ($R^2 = 0.85 \pm 0.015$). This goodness of fit is not as high as that of the group data; however, this is to be expected given the relatively small number of trials per condition for each subject.

| | $\sigma_V$ | $\sigma_A$ | $\sigma_T$ | $\mu_{\mathrm{prior}}$ | $\mathrm{var}_{\mathrm{prior}}$ | $\mathrm{cov}_{\mathrm{prior}}$ |
|---|---|---|---|---|---|---|
| Group | 0.45 | 0.21 | 0.25 | 1.93 | 0.25 | 0.21 |
| Individual | 0.32 ± 0.031 | 0.13 ± 0.031 | 0.11 ± 0.030 | 1.38 ± 0.146 | 0.26 ± 0.012 | 0.17 ± 0.014 |

$R^2 = 0.97$ vs. $R^2 = 0.95$). Importantly, the similar parameter values across the modalities in the 9-parameter model (Table A1) confirm our assumption of equal values across modalities for the mean, variance, and covariance, resulting in 3 free parameters (see Methods section).

| | $\mu_{\mathrm{prior}}$ | $\mathrm{var}_{\mathrm{prior}}$ | $\mathrm{cov}_{\mathrm{prior}}$ |
|---|---|---|---|
| Vision | 1.92 | 0.29 | 0.19 (VA) |
| Audition | 1.85 | 0.22 | 0.22 (AT) |
| Tactile | 1.85 | 0.24 | 0.21 (TV) |