Abstract
Visual scene understanding involves processing and integrating information from visual tasks at different levels, including the recognition of objects, actions and interactions. Here we study the dynamics of scene understanding, in particular the time trajectory of scene interpretation, by controlling exposure time with perceptual masking. A total of 140 MTurk participants were instructed to provide a detailed free-recall description of 14 stimulus images portraying various interactions between animate agents (humans and pets) and other agents and objects. They were instructed to report the types of objects and agents in the image, together with their properties and inter-relations. For each image, participants were assigned to one of seven exposure conditions: 50, 75, 100, 125, 200, 500 and 2000 ms, followed by a mask. A fixation cross at the center of the image frame appeared prior to image display. Participants had 15 minutes to complete the task. The participants’ responses were evaluated by four scorers, who followed a detailed analysis protocol that minimized subjective judgements. Preliminary results indicate consistent trends in the time evolution of scene perception: (i) human agents are reported earlier than objects and global scene descriptions, even when objects appear at the center of fixation (e.g. ‘two men’ before ‘a park bench’); (ii) actions are reported earlier than the acted-upon objects (e.g. ‘drinking’ before ‘cup’); (iii) for human agents, the number of agents is reported early, followed by age, and gender is reported later on average (e.g. ‘two people’ before ‘two kids’, and then ‘two boys’). These findings are interesting from a modeling perspective, since they do not fit the common scene understanding paradigm in computer vision, in which objects are first detected and only then are their inter-relations processed. We will consider scene perception schemes that are more consistent with the human dynamics of scene perception than current approaches.
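
As an illustration of the design described above, the following is a minimal sketch of one way the seven exposure conditions could be distributed over 140 participants and 14 images. The abstract does not state how conditions were actually assigned; the rotation scheme, the function name `assign_exposures` and the helper `counts_per_image` are assumptions introduced here for illustration only.

```python
from collections import Counter

# Values taken from the abstract.
EXPOSURES_MS = [50, 75, 100, 125, 200, 500, 2000]
N_PARTICIPANTS = 140
N_IMAGES = 14


def assign_exposures(n_participants=N_PARTICIPANTS,
                     n_images=N_IMAGES,
                     exposures=EXPOSURES_MS):
    """Return schedule[p][i] = exposure (ms) for participant p on image i.

    ASSUMPTION: a simple rotation (Latin-square style) is used here;
    the actual assignment procedure is not described in the abstract.
    """
    k = len(exposures)
    return [[exposures[(p + i) % k] for i in range(n_images)]
            for p in range(n_participants)]


def counts_per_image(schedule, image_index):
    """Count how many participants saw the given image at each exposure."""
    return Counter(row[image_index] for row in schedule)


schedule = assign_exposures()
```

Under this hypothetical rotation, every image is viewed by the same number of participants at each exposure, and every participant sees each exposure equally often across the 14 images. Other schemes, such as independent random assignment per image, would also match the description in the abstract.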