Abstract
"The best things in life are not things, they are moments" of raining, walking, splashing, resting, laughing, crying, jumping, etc. Moments happening in the world can unfold at time scales from a second to minutes, occur in different places, and involve people, animals, objects, and natural phenomena, like rain, wind, or just silence. Of particular interest are moments of a few seconds: they represent an ecosystem of changes in our surroundings that conveys enough temporal information to interpret the auditorily and visually dynamic world. We present the Moments in Time Dataset, a large-scale human-annotated collection of one million videos corresponding to dynamic events unfolding within 3 seconds. These short temporal events correspond to the average duration of human working memory (a short-term memory-in-action buffer specialized in representing temporally dynamic information). Importantly, 3 seconds is a temporal envelope that holds meaningful actions between people, objects, and phenomena (e.g. wind blowing, objects falling on the floor, picking something up) or between actors (e.g. greeting someone, shaking hands, playing with a pet). A common transformation that occurs in space and time, involving agents and/or objects, allows humans to associate an event with the semantic meaning of an action despite a large amount of visual and auditory variance among the events belonging to that action. The challenge is to develop models that recognize these transformations in a way that allows them to discriminate between different actions, yet generalize to other agents and settings within the same action. This dataset, designed to provide broad coverage and diversity of events in both visual and auditory modalities, can serve as a new challenge to develop models that scale to the level of complexity and abstract reasoning that humans carry out on a daily basis.
Meeting abstract presented at VSS 2018