Abstract
Seeing the face of a talker aids speech perception, especially for noisy speech. Advances in computer graphics have made encounters with synthetic faces more frequent, but little is known about their perceptual properties. We examined the benefit to noisy speech perception of two types of synthetic faces: one that used the facial action coding system (FACS) to simulate the musculature underlying jaw and lip movements during speech production, and one generated with a deep neural network (DNN). Audiovisual recordings of 64 single words were combined with pink noise at a signal-to-noise ratio of −12 dB. The words were presented in four formats: noisy auditory-only (An), noisy audiovisual with a real face (AnV:Real), and noisy audiovisual with a synthetic face (AnV:FACS or AnV:DNN). Sixty participants recruited from Amazon Mechanical Turk attempted to identify each word. Within participants, each word was presented in only a single format, and counterbalancing across participants ensured that every word was presented in every format. Seeing the real talker’s face improved the intelligibility of noisy auditory words (accuracy of 59% for AnV:Real vs. 10% for An). Synthetic faces also improved intelligibility, but by a smaller amount (accuracy of 29% for AnV:FACS and 30% for AnV:DNN vs. 10% for An). A mixed-effects model showed that real faces provided more benefit than synthetic faces (p < 10⁻¹⁶), but there was no difference between the two synthetic face types (t = 0.2, p = 0.99). The accuracy difference between real and synthetic faces was more pronounced for some speech tokens than others, and was largest for /th/ and /f/ tokens. These data show that synthetic faces may provide a useful experimental tool for studying audiovisual integration during speech perception and suggest ways to improve the verisimilitude of synthetic faces.
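For readers who want a concrete sense of how a mixed-effects comparison of the four formats could be set up, the sketch below fits a simplified linear mixed-effects model of trial-level accuracy with a per-participant random intercept. It is a minimal illustration, not the authors' analysis code; the file name, column names, and model structure are assumptions for demonstration purposes.

```python
# Minimal sketch of a mixed-effects analysis of word-identification accuracy.
# Assumed long-format data with one row per trial and columns:
# participant, word, format ("An", "AnV:FACS", "AnV:DNN", "AnV:Real"), correct (0/1).
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("trial_data.csv")  # hypothetical data file

# Make the noisy auditory-only condition the reference level so the format
# coefficients estimate the visual benefit of each face type relative to An.
df["format"] = pd.Categorical(
    df["format"], categories=["An", "AnV:FACS", "AnV:DNN", "AnV:Real"]
)

# Linear mixed model with a random intercept per participant. A fuller analysis
# of binary accuracy would use a generalized (logistic) mixed model with crossed
# random effects for participant and word.
model = smf.mixedlm("correct ~ format", data=df, groups=df["participant"])
result = model.fit()
print(result.summary())
```

In this simplified setup, the fixed-effect contrasts between the AnV:Real and synthetic-face coefficients correspond to the real-versus-synthetic comparison reported above, while the AnV:FACS versus AnV:DNN contrast tests for a difference between the two synthetic face types.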