Abstract
Purpose: Normal-hearing observers can typically understand speech to some degree when it is presented in the visual-only modality, without an accompanying auditory signal. However, talkers vary in how easily they can be understood through visual-only speech perception. It has been unclear whether this variability in talker intelligibility is due to differences in the amount of physical information available in the visual speech signal or to human perceptual strategies that are better suited to some talkers than others. We investigated this issue by comparing human performance with that of an ideal observer constrained only by the availability of information in visual-only speech.
Methods: Eight talkers (4 male, 4 female) were videotaped saying 10 monosyllabic English words equated for frequency. The visual portions of the movies were presented to observers in a 1-interval, 10-alternative identification task that was blocked by talker. On each trial, dynamic Gaussian pixel noise was added to a randomly chosen word movie. The contrast of the movies was varied across trials using a staircase procedure to obtain each observer's 71%-correct word-identification threshold for each talker. Ideal-observer thresholds for each talker were measured using Monte Carlo simulations.
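The staircase procedure can be sketched as follows. The abstract does not specify the staircase rule, so the 2-down-1-up rule, step sizes, and simulated observer below are illustrative assumptions; a 2-down-1-up rule is a standard choice because it converges near the 70.7%-correct level, close to the 71% threshold cited above.

```python
# Illustrative 2-down-1-up contrast staircase. The step size, trial count,
# and observer model are assumptions, not the study's actual parameters.
import random

def run_staircase(p_correct_at, start_contrast=1.0, step=0.1, n_trials=200):
    """Track stimulus contrast with a 2-down-1-up rule.

    p_correct_at: function mapping contrast -> probability of a correct
    response (stands in for a human or simulated observer).
    Returns a threshold estimate: the mean contrast over the final half
    of trials, after the staircase has settled.
    """
    contrast = start_contrast
    consecutive_correct = 0
    history = []
    for _ in range(n_trials):
        correct = random.random() < p_correct_at(contrast)
        history.append(contrast)
        if correct:
            consecutive_correct += 1
            if consecutive_correct == 2:   # two correct in a row -> harder
                contrast = max(contrast - step, 0.0)
                consecutive_correct = 0
        else:                              # one error -> easier
            contrast += step
            consecutive_correct = 0
    return sum(history[n_trials // 2:]) / (n_trials // 2)
```

With a simulated observer whose probability of a correct response grows with contrast (floored at the 10% chance level of a 10-alternative task), the track oscillates around the contrast yielding roughly 71% correct.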
Results & Conclusions: Although the ideal observer's thresholds varied somewhat across talkers, human observer thresholds showed a different pattern and a much wider range of variability. Pilot data from 2 human observers indicated that word-recognition efficiencies (the ratio of ideal to human thresholds) varied by as much as a factor of 30 across talkers. This variability in efficiency suggests that the differences across talkers in human visual speech recognition are not due to differences in the amount of physical information available in visual speech patterns, but instead to differences in the relative suitability of human perceptual strategies for different talkers.
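The efficiency measure above is a simple ratio, sketched below with made-up threshold values; the talker labels and numbers are hypothetical placeholders, not the study's data.

```python
# Efficiency = ideal-observer threshold / human threshold, per talker.
# All threshold values here are invented for illustration only.
ideal_thresholds = {"talker_A": 0.02, "talker_B": 0.03}   # hypothetical
human_thresholds = {"talker_A": 0.20, "talker_B": 0.90}   # hypothetical

efficiencies = {t: ideal_thresholds[t] / human_thresholds[t]
                for t in ideal_thresholds}

# A wide spread in this ratio across talkers (the study reports up to a
# factor of 30) implicates perceptual strategy rather than signal content.
spread = max(efficiencies.values()) / min(efficiencies.values())
```

Because the ideal observer is limited only by the physical information in the stimulus, a constant efficiency across talkers would indicate that human variability simply mirrors the signal; a large spread instead points to perceptual strategies that suit some talkers better than others.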