Vision Sciences Society Annual Meeting Abstract  |   October 2020
Spoken Moments: A Large Scale Dataset of Audio Descriptions of Dynamic Events in Video
Author Affiliations
  • Mathew Monfort
  • SouYoung Jin
  • David Harwath
  • Rogerio Feris
    IBM Research
  • James Glass
  • Aude Oliva
Journal of Vision October 2020, Vol.20, 1447. doi:https://doi.org/10.1167/jov.20.11.1447
When people observe events, they abstract key information and build concise summaries of what is happening. These summaries include contextual and semantic information describing the important high-level details (what, where, who, and how) of the observed event and exclude background information deemed unimportant by the observer. With this in mind, the descriptions people generate for videos of different dynamic events can greatly improve our understanding of the key information of interest in each video. They provide expanded attributes for video labeling (e.g., actions, objects, scenes, sentiment) while allowing us to gain new insight into what people find important or necessary to summarize specific events. In this vein, we present a new dataset of audio descriptions collected for a set of 500K short videos depicting a broad range of dynamic events. We collect the descriptions as audio recordings to ensure that they remain as natural and concise as possible, and we provide verified text transcriptions of each recording. We additionally present a multi-modal audio-visual model for jointly learning a shared representation between the videos and their audio descriptions, and we show how this learned representation can be applied to a number of tasks in video understanding.
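A shared video-audio representation of the kind described above is commonly trained with a contrastive objective that pulls each video embedding toward the embedding of its own spoken description and pushes it away from the descriptions of other videos in the batch. The abstract does not specify the authors' training objective, so the following is only a minimal sketch of such a symmetric contrastive (InfoNCE-style) loss, assuming the two encoders already produce fixed-size embedding vectors; all function names here are illustrative, not from the paper.

```python
import numpy as np

def l2_normalize(x):
    # Unit-normalize each row so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def contrastive_loss(video_emb, audio_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    video_emb, audio_emb: (B, D) arrays where row i of each is a matching
    video / spoken-description pair. Matching pairs lie on the diagonal of
    the (B, B) similarity matrix; the loss treats the diagonal as the
    classification target in both the video-to-audio and audio-to-video
    directions.
    """
    v = l2_normalize(video_emb)
    a = l2_normalize(audio_emb)
    sim = v @ a.T / temperature          # (B, B) scaled cosine similarities
    n = sim.shape[0]
    diag = np.arange(n)
    loss_v2a = -log_softmax(sim, axis=1)[diag, diag].mean()
    loss_a2v = -log_softmax(sim, axis=0)[diag, diag].mean()
    return 0.5 * (loss_v2a + loss_a2v)
```

As a sanity check, perfectly aligned embeddings should score a lower loss than randomly paired ones, since their similarity matrix is dominated by the diagonal. Averaging the two directions keeps the objective symmetric, so neither modality's encoder is privileged during training.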
