Abstract
Visual speech animation techniques that are "videorealistic" (potentially indistinguishable from real recorded video) are starting to become available. We describe here a perceptual evaluation scheme and its application to a new videorealistic visual-speech animation system called Mary101.
Two types of experiments were performed: a) distinguishing visually between real and synthetic image-sequences of the same utterances ("Turing tests"), and b) lip-reading real and synthetic image-sequences of the same utterances ("intelligibility tests"). In the explicit perceptual discrimination task (a), subjects classified each stimulus directly as real or synthetic by detecting any visible difference between the two kinds of image-sequence. The implicit perceptual discrimination task (b) compared visual speech-recognition performance on real versus synthetic image-sequences.
Subjects performed at chance level in the explicit discrimination task (a). However, in the implicit lip-reading discrimination task (b), the same subjects performed significantly better with real image-sequences than with synthetic ones. This advantage held for recognition of whole words, syllables, and phonemes alike, suggesting that the lip-reading task is a more sensitive method for discriminating between synthetic and real image-sequences.
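To make the "chance level" claim in task (a) concrete, the sketch below shows one standard way such a result can be assessed: a two-sided binomial test of a subject's real-vs-synthetic classification accuracy against the guessing rate of 0.5. This is an illustrative example only, not the authors' analysis; the trial counts are hypothetical.

```python
# Minimal sketch (hypothetical data): is a subject's real-vs-synthetic
# classification accuracy distinguishable from guessing?
from scipy.stats import binomtest

n_trials = 100   # hypothetical number of image-sequences judged
n_correct = 54   # hypothetical number classified correctly

# Under the null hypothesis the subject guesses, so P(correct) = 0.5.
result = binomtest(n_correct, n_trials, p=0.5, alternative="two-sided")
print(f"accuracy = {n_correct / n_trials:.2f}, p = {result.pvalue:.3f}")
# A large p-value here is consistent with chance-level performance,
# i.e. the synthetic sequences were not reliably distinguishable.
```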