Abstract
Audiovisual speech integration combines information from auditory speech (talker's voice) and visual speech (talker's mouth movements) to improve perceptual accuracy. However, if the auditory and visual speech emanate from different talkers, integration decreases accuracy. Therefore, a key step in audiovisual speech perception is deciding whether auditory and visual speech have the same cause, a process known as causal inference. A primary cue for this decision is the disparity between the auditory and visual speech content, with lower disparity indicating a single cause. A well-known multisensory illusion, the McGurk Effect, consists of incongruent audiovisual speech, such as auditory "ba" + visual "ga" (AbaVga), that is integrated to produce a fused percept ("da"). This illusion raises at least two questions: first, given the disparity between auditory and visual speech, why are they integrated; and second, why does the McGurk Effect occur for some syllables (e.g., AbaVga) but not other, ostensibly similar, syllables (e.g., AgaVba)? We describe a Bayesian model of causal inference in multisensory speech perception (CIMS2) that calculates the percept resulting from assuming common vs. separate causes; computes the likelihood of common vs. separate causes using content disparity; averages the common and separate cause percepts weighted by their likelihood; and finally applies a decision rule to categorize the averaged percept. We apply this model to behavioral data collected from 265 subjects perceiving two incongruent speech stimuli, AbaVga and AgaVba. The CIMS2 model successfully predicted the integration (McGurk Effect) observed when human subjects were presented with AbaVga and the lack of integration (no McGurk Effect) for AgaVba. Without the causal inference step, the model predicted integration for both stimuli. Our results demonstrate a fundamental role for causal inference in multisensory speech perception, and provide a computational framework for studying speech perception in conditions of varying audiovisual disparity.
Meeting abstract presented at VSS 2016
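
The abstract describes the CIMS2 computation as four steps: form the common-cause and separate-cause percepts, infer the probability of a common cause from the audiovisual content disparity, average the two percepts weighted by that probability, and categorize the result. Below is a minimal numerical sketch of those steps using a standard Gaussian causal-inference formulation on a hypothetical one-dimensional speech-content axis. The category locations, noise standard deviations, prior width, and prior probability of a common cause are placeholder assumptions for illustration, not the fitted CIMS2 parameters, so this sketch does not by itself reproduce the AbaVga/AgaVba asymmetry reported in the abstract.

```python
import numpy as np

# Hypothetical 1-D "speech content" axis; category locations are
# illustrative assumptions, not values from the abstract.
CATEGORIES = {"ba": -1.0, "da": 0.0, "ga": 1.0}


def gauss(x, mu, var):
    """Gaussian density with mean mu and variance var."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)


def causal_inference_percept(x_aud, x_vis,
                             sigma_aud=0.3,    # assumed auditory noise (SD)
                             sigma_vis=0.6,    # assumed visual noise (SD)
                             sigma_prior=2.0,  # assumed prior width over content
                             p_common=0.5):    # assumed prior P(common cause)
    """Return the reported category and posterior P(common cause) for one
    trial with internal audiovisual representations x_aud and x_vis."""
    var_a, var_v, var_p = sigma_aud ** 2, sigma_vis ** 2, sigma_prior ** 2

    # Step 1a: percept assuming a common cause -- precision-weighted
    # fusion of the auditory and visual evidence (zero-mean prior).
    fused = (x_aud / var_a + x_vis / var_v) / (1 / var_a + 1 / var_v + 1 / var_p)

    # Step 1b: percept assuming separate causes -- the auditory evidence
    # alone (plus the prior) determines the reported speech content.
    separate = (x_aud / var_a) / (1 / var_a + 1 / var_p)

    # Step 2: likelihood of the audiovisual evidence under each causal
    # structure, then the posterior probability of a common cause given
    # the content disparity.
    denom = var_a * var_v + var_a * var_p + var_v * var_p
    like_common = np.exp(-0.5 * ((x_aud - x_vis) ** 2 * var_p +
                                 x_aud ** 2 * var_v +
                                 x_vis ** 2 * var_a) / denom) / (2 * np.pi * np.sqrt(denom))
    like_separate = gauss(x_aud, 0.0, var_a + var_p) * gauss(x_vis, 0.0, var_v + var_p)
    post_common = (like_common * p_common /
                   (like_common * p_common + like_separate * (1 - p_common)))

    # Step 3: model averaging -- mix the two candidate percepts, weighted
    # by the posterior probability of each causal structure.
    averaged = post_common * fused + (1 - post_common) * separate

    # Step 4: decision rule -- report the nearest speech category.
    report = min(CATEGORIES, key=lambda c: abs(CATEGORIES[c] - averaged))
    return report, post_common


if __name__ == "__main__":
    # Noise-free stand-ins for the two stimuli (illustrative only); with the
    # symmetric placeholder parameters above the two give mirrored results.
    print("AbaVga:", causal_inference_percept(CATEGORIES["ba"], CATEGORIES["ga"]))
    print("AgaVba:", causal_inference_percept(CATEGORIES["ga"], CATEGORIES["ba"]))
```

In this sketch, removing the causal inference step corresponds to fixing post_common = 1, which forces the fused percept for every stimulus; this mirrors the comparison in the abstract, where the model without causal inference predicted integration for both AbaVga and AgaVba.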