Abstract
Computer vision algorithms can estimate 3-D structure from image streams either from feature correspondences across views or from patterns of motion signals. We ask whether feature correspondence is sufficient for human observers, or whether motion extraction is necessary. We measured the minimum velocity required to reliably identify 3-D shapes from monocular motion cues, and compared it to the minimum velocity at which the direction of motion energy can be reliably identified. For 3-D shape identification, we presented a vertical sinusoidal corrugation translating for a half cycle across a central fixation point. Observers indicated whether the half cycle was concave, convex, right-slanted or left-slanted (4AFC). The corrugation was covered with a starry-night texture that does not convey 3-D information in static displays (Zabulis & Backus, 2004). Measured by a method of constant stimuli, thresholds for three observers were between 0.44 and 1.02 deg/sec. For the identification of motion-energy direction, we presented a moving sinusoidal grating added to one of the component sinusoidal gratings of a static orthogonal plaid of the same spatial frequency. The grating could move in either direction along either of the two axes. The contrast of the plaid was three times the contrast of the moving grating, so that accurate direction identification was possible only if the motion energy of the grating could be extracted from the compound spatio-temporal spectrum (Lu & Sperling, 1995; Zaidi & DeBonet, 2000). Measured by a 4AFC method of constant stimuli, thresholds for direction of motion for three observers were between 0.65 and 0.78 deg/sec. Punctate, spatially random sampling of the stimulus did not increase thresholds. Reliable identification of 3-D shapes thus occurs at roughly the same velocities at which motion energy is extracted. These results indicate that the human visual system uses motion signals per se to estimate 3-D shapes from image streams.
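As an illustration of the motion-energy stimulus described above, the following is a minimal sketch of a drifting sinusoidal grating superimposed on a static orthogonal plaid of the same spatial frequency. It is not the authors' stimulus code. The function name, the default spatial frequency, speed and grating contrast are hypothetical, and the sketch assumes that the 3:1 contrast ratio applies to each plaid component, which the abstract does not specify.

```python
import math


def compound_modulation(x, y, t, sf=1.0, speed=0.7, c_grating=0.05,
                        plaid_ratio=3.0, axis="x", direction=1):
    """Contrast modulation around mean luminance at (x, y) in deg, time t in s.

    sf is in cycles/deg and speed in deg/sec. All default values are
    hypothetical and chosen only for illustration.
    """
    if axis not in ("x", "y"):
        raise ValueError("axis must be 'x' or 'y'")
    if direction not in (1, -1):
        raise ValueError("direction must be +1 or -1")

    k = 2.0 * math.pi * sf  # angular spatial frequency (rad/deg)

    # Static orthogonal plaid: two stationary gratings along x and y.
    # Assumption: each component has plaid_ratio times the grating contrast.
    c_plaid = plaid_ratio * c_grating
    plaid = c_plaid * (math.sin(k * x) + math.sin(k * y))

    # Moving grating: same spatial frequency, drifting along one plaid axis.
    pos = x if axis == "x" else y
    moving = c_grating * math.sin(k * (pos - direction * speed * t))

    return plaid + moving


if __name__ == "__main__":
    # Sample one point of the compound stimulus for each of the four
    # possible motion directions (two axes, two signs).
    for axis in ("x", "y"):
        for direction in (1, -1):
            value = compound_modulation(0.3, 0.1, 0.5,
                                        axis=axis, direction=direction)
            print(axis, direction, round(value, 4))
```

Because the high-contrast static plaid dominates the image at every instant, the drift direction in this sketch is carried only by the low-contrast moving component, which is the property the task relies on.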