Abstract
PURPOSE. To establish the temporal dynamics of the human ability to extract meaning from scenes. METHODS. EXP 1: 384 color images with emotional valence from the IAPS set were presented (masked) once to each of 96 subjects, at durations ranging from one video frame (13 ms) to 1710 ms. Subjects rated the valence of each image on a 9-point scale. We calculated mean ratings per exposure duration and derived hazard functions for the different valence categories. EXP 2: Three pairs of image categories were tested in a blocked design: positive/negative images, landscapes/cityscapes, and animals/vehicles. Each image was presented (masked) for 13-50 ms. Subjects categorized the images in a 2AFC design, and categorization accuracy was calculated per exposure duration. RESULTS. EXP 1: Valence was reliably discriminated after a single video frame, and discrimination asymptoted at ∼1 s. The derived hazard functions show that categorization rates for positive and negative images are the same, with a transient peak at ∼50 ms and a sharp decline by 200 ms. EXP 2: Performance remained constant at ∼95% for landscapes/cityscapes and animals/vehicles at all exposures; performance for emotional scenes improved from ∼60% at a one-frame exposure to ∼75% at a 50 ms exposure. To determine whether low-level features could be responsible for these results, we built a statistical model consisting of 24 low-level measurements of luminance and spatial frequency. A linear classifier separated the landscapes/cityscapes and the animals/vehicles almost perfectly, but was unable to separate the valence categories. CONCLUSION. Image meaning is available at exposures as brief as one video frame. While rapid categorization of some image classes could exploit differences in low-level image properties, no such differences seem to be available for emotional scenes, and yet image meaning can be extracted from them reliably and quickly. This suggests a true act of object recognition, dependent on mechanisms operating on similarly fast time scales.
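The abstract does not spell out how the hazard functions were derived from the exposure-duration data; a standard discrete-time formulation is given below for illustration, where the symbols T, f, and S are introduced here (not in the original), with T the exposure duration at which image meaning is first reliably extracted.

\[
h(t_i) \;=\; \Pr(T = t_i \mid T \ge t_i) \;=\; \frac{f(t_i)}{S(t_i)},
\qquad
S(t_i) \;=\; \Pr(T \ge t_i) \;=\; 1 - \sum_{j<i} f(t_j),
\]

where \(f(t_i)\) is the estimated probability that meaning is first extracted at exposure \(t_i\) and \(S(t_i)\) is the corresponding survivor function. Under this reading, the reported transient peak at ∼50 ms appears as a maximum of \(h\).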
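The abstract likewise does not list the 24 luminance and spatial-frequency measurements or name the linear classifier used. The sketch below is a minimal stand-in under assumed choices: four luminance statistics plus twenty radial spatial-frequency band energies (24 features total), classified with scikit-learn's LinearDiscriminantAnalysis. The function names and feature definitions are illustrative, not the authors'.

```python
# Minimal sketch (not the authors' pipeline): 24 assumed low-level features
# -- 4 luminance statistics + 20 radial spatial-frequency band energies --
# fed to a linear classifier, using NumPy and scikit-learn.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

N_FREQ_BANDS = 20  # hypothetical split: 4 luminance + 20 frequency = 24 features


def low_level_features(img: np.ndarray) -> np.ndarray:
    """Return a 24-element feature vector for a grayscale image in [0, 1]."""
    # Luminance statistics.
    lum = [img.mean(), img.std(), img.min(), img.max()]

    # Power spectrum, binned into N_FREQ_BANDS radial frequency bands.
    power = np.abs(np.fft.fftshift(np.fft.fft2(img))) ** 2
    h, w = img.shape
    yy, xx = np.indices((h, w))
    radius = np.hypot(yy - h / 2, xx - w / 2)
    edges = np.linspace(0, radius.max(), N_FREQ_BANDS + 1)
    bands = [
        np.log(power[(radius >= lo) & (radius < hi)].mean() + 1e-12)
        for lo, hi in zip(edges[:-1], edges[1:])
    ]
    return np.array(lum + bands)


def linear_separability(images: list[np.ndarray], labels: np.ndarray) -> float:
    """Cross-validated accuracy of a linear classifier on the 24 features."""
    X = np.stack([low_level_features(im) for im in images])
    clf = LinearDiscriminantAnalysis()
    return cross_val_score(clf, X, labels, cv=5).mean()
```

With labels marking, for example, landscape vs. cityscape, the pattern reported above would show up as high cross-validated accuracy for the landscape/cityscape and animal/vehicle contrasts and near-chance accuracy for the positive/negative contrast.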
NIH EY13155 to V. Maljkovic