Abstract
When engaging in natural tasks, the human visual system processes a highly dynamic visual data stream. The retina, which performs the very first steps of this processing, is thought to be adapted to exploit low-level signal regularities, such as the autocorrelation function or power spectrum, to produce a more efficient encoding of the data (Atick & Redlich, 1992). Previous work examined the joint spatiotemporal power spectrum of handheld camera videos and Hollywood movies, showing that power falls as an inverse power-law function of spatial and temporal frequency, with an inseparable relationship between the two (Dong & Atick, 1995). However, these data fall short of a true characterization of "a day in the life of the retina," because they do not capture the body, head, and eye motion of an active observer. In addition, the distribution of natural tasks will influence the statistics of this signal. Here, we aim to characterize the statistics of natural vision using a custom device consisting of a head-mounted eye tracker coupled with high frame-rate world cameras and orientation sensors. Using video data captured with this setup, we analyze the joint spatiotemporal power spectrum for three conditions: 1) a static camera viewing a natural task being performed, 2) a head-mounted camera worn by a subject engaged in a natural task, and 3) videos simulating the dynamic retinal image, created by overlaying the subject's eye motions on the head-mounted camera video stream. Results suggest that, compared to a static camera, body and head motion boost high temporal frequencies. Eye motion enhances this effect, particularly at mid to high spatial frequencies, causing this portion of the spectrum to deviate from the power law and become nearly flat. These data will be important for developing efficient coding models relevant to natural vision.
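The joint spatiotemporal power spectrum described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' analysis pipeline: it assumes a grayscale video stored as a (time, height, width) array, applies a Hanning window to reduce spectral leakage (an assumed choice), and averages power over spatial orientation so the result is a function of spatial-frequency magnitude and temporal frequency.

```python
import numpy as np

def joint_power_spectrum(video, n_sf_bins=16):
    """Estimate power as a function of (temporal frequency, spatial frequency
    magnitude) for a grayscale video cube of shape (T, H, W).

    Illustrative sketch only; bin counts and windowing are assumptions,
    not choices taken from the abstract.
    """
    t, h, w = video.shape
    # Separable Hanning window over time, height, and width.
    win = (np.hanning(t)[:, None, None]
           * np.hanning(h)[None, :, None]
           * np.hanning(w)[None, None, :])
    power = np.abs(np.fft.fftn(video * win)) ** 2
    # Frequency coordinates in cycles per sample.
    ft = np.fft.fftfreq(t)
    fy = np.fft.fftfreq(h)
    fx = np.fft.fftfreq(w)
    # Spatial-frequency magnitude at each (fy, fx) location.
    sf = np.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2)
    # Assign each spatial-frequency location to a magnitude bin.
    sf_edges = np.linspace(0.0, sf.max(), n_sf_bins + 1)
    sf_idx = np.clip(np.digitize(sf, sf_edges) - 1, 0, n_sf_bins - 1)
    counts = np.bincount(sf_idx.ravel(), minlength=n_sf_bins)
    # For each temporal frequency, average power within each spatial bin.
    spectrum = np.empty((t, n_sf_bins))
    for i in range(t):
        spectrum[i] = (np.bincount(sf_idx.ravel(),
                                   weights=power[i].ravel(),
                                   minlength=n_sf_bins)
                       / np.maximum(counts, 1))
    return np.abs(ft), sf_edges, spectrum
```

Plotting `spectrum` on log-log axes against temporal and spatial frequency would reveal the inverse power-law falloff, and any flattening at mid to high spatial frequencies, that the abstract describes.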