October 2003, Volume 3, Issue 9
Vision Sciences Society Annual Meeting Abstract
A formal model of visual attention in embodied language acquisition
Author Affiliations
  • Chen Yu
    Department of Computer Science, University of Rochester, USA
  • Dana H. Ballard
    Department of Computer Science, University of Rochester, USA
Journal of Vision October 2003, Vol.3, 309. doi:https://doi.org/10.1167/3.9.309
Most studies of infant language acquisition have focused on purely linguistic information as the central constraint. However, several researchers (e.g., Baldwin) have suggested that non-linguistic information, such as vision and speakers' attention, also plays a major role in language acquisition. In light of this, we implemented an embodied language learning system that explores the computational role of attention in building visually grounded lexicons. The central idea is to use visual perception as contextual information to facilitate word spotting, and to use eye and head movements as deictic references to discover temporal correlations among data from multiple modalities. In the experiments, subjects performed three kinds of everyday activities (pouring water, stapling papers, and taking a lid off) while providing natural-language descriptions of their behavior. We collected speech data in concert with user-centric multisensory information from non-speech modalities: visual data from a head-mounted camera, gaze positions, head directions, and hand movements. A multimodal learning algorithm identified the sound patterns of words and built grounded lexicons by associating object names and action verbs with attended objects and hand motions. We compared our approach with one that does not use eye and head movements to infer referential intentions. The attention-based method performed substantially better (word-spotting accuracy 86% vs. 28%; word-learning accuracy 79% vs. 23%). The difference arises because there are a multitude of co-occurrences between words and possible meanings grounded in the environment, and inferring speakers' attention from their body movements indicates which co-occurrences are relevant. This work provides a computational account of infant language acquisition and suggests the importance of visual attention in embodied language learning.
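The core contrast in the abstract, between attention-filtered and unfiltered co-occurrence learning, can be illustrated with a minimal sketch. This is not the authors' actual multimodal algorithm (which also segments continuous speech and tracks hand motion); the data format, function name, and toy utterances below are all hypothetical. The sketch assumes each utterance comes with the set of visible objects and the single attended object; without attention, every visible object co-occurs with every word, while with attention only the attended object does.

```python
from collections import defaultdict

def learn_lexicon(utterances, use_attention=True):
    """Associate words with candidate meanings by co-occurrence counting.

    Each utterance is a tuple (words, visible_objects, attended_object).
    With attention, only the attended object counts as a candidate meaning;
    without it, every visible object co-occurs with every spoken word.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for words, visible, attended in utterances:
        candidates = [attended] if use_attention else visible
        for word in words:
            for meaning in candidates:
                counts[word][meaning] += 1
    # For each word, pick the meaning with the highest co-occurrence count.
    return {word: max(meanings, key=meanings.get)
            for word, meanings in counts.items()}

# Toy data: (spoken words, visible objects, attended object) per utterance.
utterances = [
    (["pour", "water"], ["cup", "stapler", "lid"], "cup"),
    (["staple", "paper"], ["cup", "stapler", "lid"], "stapler"),
    (["pour", "water"], ["cup", "stapler", "lid"], "cup"),
]
```

With `use_attention=False`, every word here co-occurs equally often with every visible object, so the mapping is ambiguous; the attention signal is what breaks those ties, which is the intuition behind the reported accuracy gap.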

Yu, C., & Ballard, D. H. (2003). A formal model of visual attention in embodied language acquisition [Abstract]. Journal of Vision, 3(9):309, 309a, http://journalofvision.org/3/9/309/, doi:10.1167/3.9.309.
