Abstract
Most studies of infant language acquisition have focused on purely linguistic information as the central constraint. However, several researchers (e.g., Baldwin) have suggested that non-linguistic information, such as vision and speakers' attention, also plays a major role in language acquisition. In light of this, we implemented an embodied language learning system that explores the computational role of attention in building visually grounded lexicons. The central idea is to use visual perception as contextual information to facilitate word spotting, and to use eye and head movements as deictic references for discovering temporal correlations among data from multiple modalities. In our experiments, subjects performed three kinds of everyday activities (pouring water, stapling papers, and taking a lid off) while providing natural language descriptions of their behaviors. We collected speech data together with user-centric multisensory information from non-speech modalities: visual data from a head-mounted camera, gaze positions, head directions, and hand movements. A multimodal learning algorithm identified the sound patterns of words and built grounded lexicons by associating object names and action verbs with attentional objects and hand motions. We compared our approach with a baseline that does not use eye and head movements to infer referential intentions. The attention-based method performs substantially better (word spotting accuracy of 86% vs. 28%, and word learning accuracy of 79% vs. 23%). The difference arises because the environment affords a multitude of co-occurrences between words and candidate meanings, and inferring speakers' attention from their body movements indicates which co-occurrences are relevant. This work provides a computational account of infant language acquisition and suggests the importance of visual attention in embodied language learning.
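To illustrate the general idea of attention-filtered cross-modal association, the following minimal sketch (in Python) counts co-occurrences between spotted word tokens and candidate referents only within attention-defined time windows, then keeps the most frequent pairing for each word. The data structures and names here (AttentionEvent, SpokenWord, build_lexicon, the min_count threshold) are illustrative assumptions for exposition, not the paper's actual algorithm or implementation.

```python
from collections import defaultdict
from dataclasses import dataclass

# Illustrative sketch only: types, names, and thresholds are assumptions,
# not the system described in the paper.

@dataclass
class AttentionEvent:
    """A time window in which gaze/head cues indicate one attended referent."""
    start: float      # seconds
    end: float        # seconds
    referent: str     # e.g., an attended object ("cup") or a hand motion ("pour")

@dataclass
class SpokenWord:
    """A word-like sound pattern spotted in the speech stream."""
    onset: float      # seconds
    offset: float     # seconds
    token: str        # phonetic or orthographic label

def overlaps(word: SpokenWord, event: AttentionEvent) -> bool:
    """True if the spoken word falls inside the attention window."""
    return word.onset < event.end and word.offset > event.start

def build_lexicon(words, events, min_count=2):
    """Accumulate word-referent co-occurrences only inside attention windows,
    then keep the most frequent referent for each word token."""
    counts = defaultdict(lambda: defaultdict(int))
    for w in words:
        for e in events:
            if overlaps(w, e):
                counts[w.token][e.referent] += 1
    lexicon = {}
    for token, refs in counts.items():
        referent, n = max(refs.items(), key=lambda kv: kv[1])
        if n >= min_count:
            lexicon[token] = referent
    return lexicon

if __name__ == "__main__":
    # Toy data: two attention windows and four spotted word tokens.
    events = [AttentionEvent(0.0, 2.0, "cup"), AttentionEvent(2.0, 4.0, "pour")]
    words = [SpokenWord(0.5, 0.9, "cup"), SpokenWord(1.2, 1.6, "cup"),
             SpokenWord(2.4, 2.9, "pouring"), SpokenWord(3.1, 3.5, "pouring")]
    print(build_lexicon(words, events))  # {'cup': 'cup', 'pouring': 'pour'}
```

Restricting the co-occurrence counts to attention windows is what this sketch shares with the approach summarized above: without that restriction, every word would co-occur with every visible object and motion, and the counts alone could not single out the relevant pairings.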