September 2011
Volume 11, Issue 11
Vision Sciences Society Annual Meeting Abstract  |   September 2011
Detecting synchrony in degraded audio-visual streams
Author Affiliations
  • Keshav Dhandhania
    Department of Brain and Cognitive Sciences, MIT, USA
  • Jonas Wulff
    RWTH Aachen University, Germany
  • Pawan Sinha
    Department of Brain and Cognitive Sciences, MIT, USA
Journal of Vision September 2011, Vol.11, 800. doi:
  • Views
  • Share
  • Tools
    • Alerts
      This feature is available to authenticated users only.
      Sign In or Create an Account ×
    • Get Citation

      Keshav Dhandhania, Jonas Wulff, Pawan Sinha; Detecting synchrony in degraded audio-visual streams. Journal of Vision 2011;11(11):800.

      Download citation file:

      © ARVO (1962-2015); The Authors (2016-present)

  • Supplements

Even 8–10 week old infants, when presented with two dynamic faces and a speech stream, look significantly longer at the ‘correct’ talking person (Patterson & Werker, 2003). This is true even though their reduced visual acuity prevents them from utilizing high spatial frequencies. Computational analyses in the field of audio/video synchrony and automatic speaker detection (e.g. Hershey & Movellan, 2000), in contrast, usually depend on high-resolution images. Therefore, the correlation mechanisms found in these computational studies are not directly applicable to the processes through which we learn to integrate the modalities of speech and vision. In this work, we investigated the correlation between speech signals and degraded video signals. We found a high correlation persisting even with high image degradation, resembling the low visual acuity of young infants. Additionally (in a fashion similar to Graf et al., 2002) we explored which parts of the face correlate with the audio in the degraded video sequences. Perfect synchrony and small offsets in the audio were used while finding the correlation, thereby detecting visual events preceding and following audio events. In order to achieve a sufficiently high temporal resolution, high-speed video sequences (500 frames per second) of talking people were used. This is a temporal resolution unachieved in previous studies and has allowed us to capture very subtle and short visual events. We believe that the results of this study might be interesting not only to vision researchers, but, by revealing subtle effects on a very fine timescale, also to people working in computer graphics and the generation and animation of artificial faces.


This PDF is available to Subscribers Only

Sign in or purchase a subscription to access this content. ×

You must be signed into an individual account to use this feature.