September 2018
Volume 18, Issue 10
Open Access
Vision Sciences Society Annual Meeting Abstract | September 2018
A Large Scale Video Dataset for Event Recognition
Author Affiliations
  • Mathew Monfort
    CSAIL, MIT
  • Bolei Zhou
    CSAIL, MIT
  • Sarah Bargal
    Computer Science, Boston University
  • Alex Andonian
    CSAIL, MIT
  • Kandan Ramakrishnan
    CSAIL, MIT
  • Carl Vondrick
    CSAIL, MIT
  • Aude Oliva
    CSAIL, MIT
Journal of Vision September 2018, Vol.18, 753. doi:10.1167/18.10.753
© ARVO (1962-2015); The Authors (2016-present)
Abstract

"The best things in life are not things, they are moments" of raining, walking, splashing, resting, laughing, crying, jumping, etc. Moments happening in the world can unfold at time scales from a second to minutes, occur in different places, and involve people, animals, objects, and natural phenomena, like rain, wind, or just silence. Of particular interest are moments of a few seconds: they represent an ecosystem of changes in our surroundings that conveys enough temporal information to interpret the auditory and visually dynamic world. We present the Moments in Time Dataset, a large-scale human-annotated collection of one million videos corresponding to dynamic events unfolding within 3 seconds. These short temporal events correspond to the average duration of human working memory (a short-term memory-in-action buffer specialized in representing temporally dynamic information). Importantly, 3 seconds is a temporal envelope that holds meaningful actions between people, objects, and phenomena (e.g., wind blowing, objects falling on the floor, picking something up) or between actors (e.g., greeting someone, shaking hands, playing with a pet). Each action involves a common transformation in space and time, among agents and/or objects, that humans associate with the semantic meaning of that action despite a large amount of visual and auditory variance across the events belonging to it. The challenge is to develop models that recognize these transformations in a way that allows them to discriminate between different actions, yet generalize to other agents and settings within the same action. This dataset, designed to have broad coverage and diversity of events in both visual and auditory modalities, can serve as a new challenge to develop models that scale to the level of complexity and abstract reasoning that humans process on a daily basis.
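As a concrete illustration of the data format such 3-second events imply, the clips could be decoded into fixed-size frame arrays and stacked into batches for a recognition model. This is a minimal sketch, not the dataset's actual pipeline: the abstract fixes only the 3-second duration, so the frame rate and spatial resolution below are illustrative assumptions.

```python
import numpy as np

# Illustrative parameters -- only the 3-second clip length comes from
# the abstract; the sampling rate and resolution are assumptions.
CLIP_SECONDS = 3
FPS = 16                      # assumed sampling rate
FRAMES = CLIP_SECONDS * FPS   # 48 frames per clip
HEIGHT, WIDTH = 224, 224      # assumed spatial resolution

def clips_to_batch(clips):
    """Stack a list of (FRAMES, H, W, 3) uint8 clips into one float32
    batch of shape (N, FRAMES, H, W, 3), scaled to [0, 1]."""
    return np.stack(clips).astype(np.float32) / 255.0

# Two synthetic clips standing in for decoded video frames.
rng = np.random.default_rng(0)
clips = [rng.integers(0, 256, size=(FRAMES, HEIGHT, WIDTH, 3), dtype=np.uint8)
         for _ in range(2)]
batch = clips_to_batch(clips)
print(batch.shape)  # (2, 48, 224, 224, 3)
```

A model trained for the challenge described above would map each such clip tensor (possibly paired with its audio track) to one of the dataset's action labels.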

Meeting abstract presented at VSS 2018
