**Abstract**

Extraction of motion from visual input plays an important role in many visual tasks, such as separation of figure from ground and navigation through space. Several kinds of local motion signals have been distinguished based on mathematical and computational considerations (e.g., motion based on spatiotemporal correlation of luminance, and motion based on spatiotemporal correlation of flicker), but little is known about the prevalence of these different kinds of signals in the real world. To address this question, we first note that different kinds of local motion signals (e.g., Fourier, non-Fourier, and glider) are characterized by second- and higher-order correlations in slanted spatiotemporal regions. The prevalence of local motion signals in natural scenes can thus be estimated by measuring the extent to which each of these correlations is present in space-time patches and whether it is coherent across spatiotemporal scales. We apply this technique to several popular movies. The results show that all three kinds of local motion signals are present in natural movies. While the balance of the different kinds of motion signals varies from segment to segment during the course of each movie, the overall pattern of prevalence of the different kinds of motion and their subtypes, and the correlations between them, is strikingly similar across movies (but is absent from white-noise movies). In sum, naturalistic movies contain a diversity of local motion signals that occur with a consistent prevalence and pattern of covariation, indicating a substantial regularity of their high-order spatiotemporal image statistics.

*presence* of a pairwise spatiotemporal correlation (Adelson & Bergen, 1985; Reichardt, 1961) of luminance. (The reason that the term *Fourier motion* is used is that the set of pairwise correlations, the autocorrelation function, is the Fourier transform of the power spectrum, as is well known [Bracewell, 1999].) In contrast, other kinds of motion signals have been defined on the basis of perceptual phenomena that occur in the *absence* of such correlations. The best-known example of this is often called non-Fourier (NF) motion (Chubb & Sperling, 1988; Fleet & Langley, 1994), in which there is pairwise spatiotemporal correlation of a feature (e.g., a spatial edge or a temporal flicker edge) rather than of luminance. Moreover, motion perception can also occur in the absence of pairwise correlations of luminance (F motion) or of local features (NF motion), a phenomenon known as glider (G) motion (Fitzgerald, Katsov, Clandinin, & Schnitzer, 2011; Hu & Victor, 2010). However, the extent to which these mathematically distinct signals are present in naturalistic inputs is unknown. To address this question, a necessary first step is to formalize the notions of F, NF, and G motion signals (and their subtypes) in terms of specific mathematical transformations so that they can be compared on equal footing.
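The distinction between pairwise (F) and higher-order (G) correlations can be made concrete with a toy computation. The sketch below (Python, used here purely for illustration; the array layout, template shapes, and balanced random pattern are my assumptions, not the paper's pipeline) drifts a binary pattern rightward one check per frame, then measures a second-order correlation and a third-order (glider-like) correlation along slanted space-time offsets. Pure drift of a random pattern carries a maximal pairwise signal but essentially no third-order signal:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy movie: movie[t, y], values +/-1; each frame is the previous frame
# shifted rightward by one check (wrapping around at the edge).
row = rng.permutation(np.tile([-1, 1], 32))       # balanced 64-check pattern
movie = np.stack([np.roll(row, t) for t in range(64)])

# Second-order (Fourier-like) correlation on a slanted offset (dt=1, dy=1):
pairwise = np.mean(movie[:-1, :-1] * movie[1:, 1:])

# Third-order (glider-like) correlation over an L-shaped triple of checks:
triple = np.mean(movie[:-1, :-1] * movie[:-1, 1:] * movie[1:, 1:])

print(pairwise, triple)   # pairwise is exactly 1.0; |triple| is near 0
```

Because each check simply reappears one position to the right on the next frame, every slanted pair agrees exactly, while the odd-order product inherits the (balanced) randomness of the underlying pattern.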

*check* to represent the analysis unit (i.e., either a single pixel or a block of pixels that have been averaged). Since the original films had a landscape aspect ratio, each pixel in the database represented a rectangular region of the original film, larger in the horizontal direction than in the vertical. The specific movies selected were *The 39 Steps* (1935), *A Night at the Opera* (1935), *Anna Karenina* (1935), and *Mr. and Mrs. Smith* (2005). In designating check position, we used the matrix convention, in which the X-coordinate increases from top to bottom and the Y-coordinate increases from left to right. The analyses in the main text concern the YT plane (i.e., horizontal motion); parallel analyses in the XT plane (vertical motion) are in Supplement S1.

*B*, which is a set of spatiotemporal voxels in a specific relative position. We represent a template as a set of triplets [(*x*_{1}, *y*_{1}, *t*_{1}), (*x*_{2}, *y*_{2}, *t*_{2}), …, (*x*_{n}, *y*_{n}, *t*_{n})], in which each of the *x*_{i}, *y*_{i}, and *t*_{i} are integers and *n* is the number of elements in the template. Since the template is determined by the relative positions of its voxels, we require that min(*x*_{i}) = min(*y*_{i}) = min(*t*_{i}) = 0, where *i* = 1, …, *n*.
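This normalization convention can be sketched in a few lines (Python for illustration only; the function name is mine, not the paper's):

```python
def normalize(template):
    """Shift a template's (x, y, t) offsets so min(x_i) = min(y_i) = min(t_i) = 0."""
    xs, ys, ts = zip(*template)
    x0, y0, t0 = min(xs), min(ys), min(ts)
    return [(x - x0, y - y0, t - t0) for (x, y, t) in template]

# A three-element template given in absolute voxel positions:
print(normalize([(2, 3, 1), (2, 4, 1), (2, 4, 2)]))
# -> [(0, 0, 0), (0, 1, 0), (0, 1, 1)]
```

Only the relative positions of the voxels survive, which is exactly what makes two translated copies of the same template identical.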

*X*-dimension, which we denote as *B*^{X}, is the template in which each triplet (*x*_{i}, *y*_{i}, *t*_{i}) of *B* is replaced by [*L*_{X}(*B*) − *x*_{i}, *y*_{i}, *t*_{i}], where *L*_{X}(*B*) is the length of the template in the *X*-dimension, namely, max(*x*_{i}). Reversals along the *Y* and *T* dimensions are similarly defined. *B*^{YT}, for example, denotes a template that has been reversed along the *Y* dimension and then along the *T* dimension.
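The reversal operation can be sketched as follows (Python for illustration; the axis encoding is an assumption). Note that because max(*x*_{i}) maps to 0, a reversed template still satisfies the min = 0 convention:

```python
AXIS = {'X': 0, 'Y': 1, 'T': 2}

def reverse(template, axes):
    """Reverse a template along the given axes, e.g. 'YT' for B^{YT}.
    Each coordinate v is replaced by L - v, where L is the template's
    length (maximum coordinate) along that axis."""
    out = [list(v) for v in template]
    for a in axes:
        i = AXIS[a]
        length = max(v[i] for v in out)
        for v in out:
            v[i] = length - v[i]
    return [tuple(v) for v in out]

# Two checks on a rightward space-time diagonal:
rightward = [(0, 0, 0), (0, 1, 1)]
print(reverse(rightward, 'Y'))   # leftward diagonal: [(0, 1, 0), (0, 0, 1)]
print(reverse(rightward, 'YT'))  # reversing space and time restores the direction
```

The last line illustrates why opponent calculations use both spatial and temporal reversals: reversing along *Y* and *T* together leaves the diagonal's direction unchanged.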

*B* at the position (*x*, *y*, *t*) is defined as a product that involves all offsets contained in the glider:

*RawCorr*(*x*, *y*, *t*; *B*) = ∏_{i=1}^{n} [*I*(*x* + *x*_{i}, *y* + *y*_{i}, *t* + *t*_{i}) − *Ī*_{shot}],

where *I*(*x*, *y*, *t*) is the luminance of the image at the position (*x*, *y*, *t*) and *Ī*_{shot} is the median luminance across the shot. Finally, the local motion score at position (*x*, *y*, *t*) for motion type *B* in direction *Z* is defined by the double-opponent calculation

*M*(*x*, *y*, *t*; *B*, *Z*) = [*RawCorr*(*x*, *y*, *t*; *B*) − *RawCorr*(*x*, *y*, *t*; *B*^{Z})] − [*RawCorrRand*(*x*, *y*, *t*; *B*) − *RawCorrRand*(*x*, *y*, *t*; *B*^{Z})],

where *RawCorrRand* denotes the same product computed from *I*_{rand}, with *Ī*_{movie} in place of *Ī*_{shot}; *I*_{rand} is a movie in which checks are randomly permuted within a shot, and *Ī*_{movie} is the median luminance across the movie. The SM score for a shot, for motion type *B* in direction *Z*, is the local motion score, averaged over all positions (*x*, *y*, *t*) of the template within the shot, normalized by the corresponding quantity for a random movie.
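The raw-correlation product at the core of these scores is simple to state in code. The sketch below (Python for illustration; the array layout, template, and toy movie are my assumptions) implements only the *RawCorr* product; the opponent combination, random-movie comparison, and per-shot averaging described above are omitted:

```python
import numpy as np

def raw_corr(I, template, x, y, t, I_median):
    """Product of median-subtracted luminances over a template's offsets,
    with the template placed at position (x, y, t)."""
    p = 1.0
    for dx, dy, dt in template:
        p *= I[x + dx, y + dy, t + dt] - I_median
    return p

# Tiny synthetic movie, shape (x, y, t): a bright check moving in +y,
# one check per frame, on a dark background.
I = np.zeros((1, 4, 4))
for t in range(4):
    I[0, t, t] = 1.0

fourier = [(0, 0, 0), (0, 1, 1)]   # two-check slanted (F-type) template
med = np.median(I)                 # per-shot median, here 0.0

print(raw_corr(I, fourier, 0, 0, 0, med))  # on the bar's trajectory: 1.0
print(raw_corr(I, fourier, 0, 0, 1, med))  # off the trajectory: 0.0
```

Placements aligned with the moving feature yield large products; generic placements multiply in near-median values and yield scores near zero, which is the behavior the binarization step (below) is designed to tame.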

*separate* grids, and must be considered as such. That is, when the motion scores are computed, the template must be placed in generic positions on the stimulus and not just in register with the grid used for stimulus generation. This detail is critical. Without it, the present approach might fail to detect the motion signal in some of the drift-balanced stimuli of Chubb and Sperling (1988), but with it, the approach captures the motion in all of them. This is illustrated and further discussed in Supplement S3 (Figure S13).

*X*, *Y*, *T*]) are considered together. Each ROI is then scored to indicate the extent to which there was a coherent F, NF, or G motion signal throughout the patch. To simplify the process of defining and computing these scores, we first binarized the luminance values in each check: we replaced each luminance by +1 (black) or −1 (white), depending on how it compared with the median luminance within the shot. (Parallel analyses in Supplement S1 show that the results were robust with respect to the threshold used for binarization [Figures S7, S8, and S11] and that similar results were found for analysis in the XT plane [Figures S5 and S6]. Results in the main text are for the YT plane.) Note that this binarization can be considered as a form of dimension reduction. Prior to binarization, there are 256^{16} possibilities for the ways that a 16-check ROI can be colored; after binarization, there are only 2^{16} such combinations. Thus, binarization dramatically simplifies the process of defining, and then computing, a mapping from all of the possible ROIs to a motion score; this is our motivation for it.
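A minimal binarization sketch follows (Python for illustration; the +1-for-dark assignment mirrors the "+1 (black)" labeling above but is otherwise an arbitrary choice, and the random shot is a stand-in for real movie data):

```python
import numpy as np

def binarize(I, threshold):
    """Map each check to +1 (at or below threshold, i.e., dark) or -1 (above).
    The text uses the median luminance within the shot as the threshold."""
    return np.where(I <= threshold, 1, -1)

rng = np.random.default_rng(1)
shot = rng.integers(0, 256, size=(4, 4, 8))     # hypothetical 8-bit luminances
b = binarize(shot, np.median(shot))

# Dimension reduction for a 16-check ROI:
print(256 ** 16)   # colorings of a 16-check ROI before binarization
print(2 ** 16)     # colorings after binarization: 65536
```

The roughly 10^{34}-fold reduction in the number of distinct ROI colorings is what makes exhaustive enumeration over colorings (used for the PM score below) tractable.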

Formally, we replace *I*(*x*, *y*, *t*) by *I*^{binarized}(*x*, *y*, *t*), where *I*^{binarized}(*x*, *y*, *t*) is +1 or −1, according to whether *I*(*x*, *y*, *t*) is above or below a threshold (here, the median luminance within the shot). All of the above quantities can then be calculated from the binarized movie. We denote such quantities by *RawCorr*^{binarized}(*x*, *y*, *t*; *B*), *RawCorrRand*^{binarized}(*x*, *y*, *t*; *B*), etc. Once binarization replaces each luminance value with +1 or −1, the product of luminance values within a template reduces to determining whether there is an even or an odd number of checks of each color. All of the colorings that yield a product of +1 contribute positively to a rightward motion signal, and all of the colorings that yield a product of −1 contribute negatively. Thus, the configurations that contribute positively to the motion score can be enumerated in a library. This is shown in Figure 2A, using the four-check NF-S template as an example. Since all of the colorings in the library yield a product of +1, they have an even number of white and black checks distributed among the four positions (two checks at one time step and two checks at the next). Thus, if a coloring has a spatial edge at one time step (one black and one white check), it must have a spatial edge at the next; if it lacks a spatial edge at one time step (two blacks or two whites), it must lack a spatial edge at the next. These relationships capture the notion that NF-S corresponds to spatiotemporal correlation of the presence or absence of an edge.
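This parity argument can be verified mechanically. The brief sketch below (Python; the flat (a, b, c, d) encoding of the four checks, two per time step, is an assumption about layout, not the paper's notation) enumerates the library for a four-check template and confirms the edge-correspondence property:

```python
from itertools import product

# Four binarized checks: (a, b) at one time step, (c, d) at the next.
library = [c for c in product([-1, 1], repeat=4)
           if c[0] * c[1] * c[2] * c[3] == 1]
print(len(library))   # 8 of the 16 colorings yield a product of +1

def has_edge(u, v):
    """A spatial edge is present when two neighboring checks differ."""
    return u != v

# Every library entry has a spatial edge at both time steps or at neither.
print(all(has_edge(a, b) == has_edge(c, d) for a, b, c, d in library))
```

The property falls out of sign algebra: the four-way product equals (a·b)(c·d), so it is +1 exactly when the two within-time-step pairs have the same agree/disagree status.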

*B* and any slab-like ROI that can contain the template along either the *X* or *Y* dimension, and has its other spatial dimension equal to one. Since the RMO score is an opponent score, we first define its components: the RM score *RM*(*B*; *Z*; *ROI*). This is given by the total number of displacements (*x*_{i}, *y*_{i}, *t*_{i}) of the template within the ROI for which *RawCorr*^{binarized}(*x* + *x*_{i}, *y* + *y*_{i}, *t* + *t*_{i}; *B*) = 1, and thus is effectively a sum of *RawCorr*^{binarized} scores within the ROI. The RMO score, *RMO*(*B*; *Z*; *ROI*), is then

*RMO*(*B*; *Z*; *ROI*) = *RM*(*B*; *Z*; *ROI*) − *RM*(*B*^{Z}; *Z*; *ROI*).

The PM score *PM*(*B*; *Z*; *ROI*) is the minimum Hamming distance from the ROI to any region *K* for which every placement of the template in *K* yields a local motion score of +1. This minimum takes into account all possible colorings of *K*; this is one reason why the dimensionality reduction is important. Finally, the PMO score is given by

*PMO*(*B*; *Z*; *ROI*) = *PM*(*B*^{Z}; *Z*; *ROI*) − *PM*(*B*; *Z*; *ROI*).
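The RM and PM components can be sketched for a small binarized YT patch (Python for illustration; the 4 × 4 patch, the two-check template, and the use of the binarized product in place of the local motion score are my simplifying assumptions, and the opponent RMO/PMO combinations are not shown):

```python
from itertools import product

# Hypothetical two-check (dy, dt) template on a binarized YT patch.
TEMPLATE = [(0, 0), (1, 1)]

def placements(ny, nt, template):
    """All (y, t) positions at which the template fits in an ny-by-nt patch."""
    hy = max(dy for dy, _ in template)
    ht = max(dt for _, dt in template)
    return [(y, t) for y in range(ny - hy) for t in range(nt - ht)]

def rm_score(patch, template):
    """RM: number of placements whose product of binarized checks is +1."""
    count = 0
    for y, t in placements(len(patch), len(patch[0]), template):
        p = 1
        for dy, dt in template:
            p *= patch[y + dy][t + dt]
        count += (p == 1)
    return count

def pm_score(patch, template):
    """PM: minimum Hamming distance from the patch to any coloring K in
    which every placement of the template yields +1. Brute force over all
    2**16 colorings -- feasible only because of the binarization."""
    ny, nt = len(patch), len(patch[0])
    full = len(placements(ny, nt, template))
    best = ny * nt
    for bits in product([-1, 1], repeat=ny * nt):
        K = [bits[r * nt:(r + 1) * nt] for r in range(ny)]
        if rm_score(K, template) == full:
            best = min(best, sum(K[y][t] != patch[y][t]
                                 for y in range(ny) for t in range(nt)))
    return best

# Stripes drifting one check per frame: every diagonal pair agrees.
drift = [[1 if (y - t) % 2 == 0 else -1 for t in range(4)] for y in range(4)]
print(rm_score(drift, TEMPLATE), pm_score(drift, TEMPLATE))  # 9 0
```

Flipping a single check of `drift` drops RM from 9 to 8 and raises PM to 1, since the nearest fully consistent coloring is one flip away.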

^{6} (without binarization) to 10^{3} (with binarization to +1 and −1). Most of this compression is due to an increase in the lowest motion scores, since binarization eliminates the possibility of multiplication by values near zero. But the upper end of the distribution is also affected by binarization: Thresholding substantially reduces the highest values for NF-S and NF-T (Figure 5B and C) and slightly reduces the highest values for G (Figure 5D). The likely reason for this is that the NF scores reflect products of four values (since those templates have four checks), whereas the G scores reflect products of three values (since those templates have three checks). Hence, binarization results in a moderate reduction in the extreme high values that result from products of three luminance values (G) and a more severe reduction in the extreme high values that result from products of four values (NF). In line with the increasing range compression as the number of checks in the template increases, correlations of the log-scaled SM scores with and without binarization are largest for F motion (0.79), next-largest for G motion (0.69), and smallest for NF-S and NF-T motion (0.62 and 0.63, respectively; *p* < 0.001 in all cases).

Adelson, E. H., & Bergen, J. R. (1985). Spatiotemporal energy models for the perception of motion. *Journal of the Optical Society of America A*, 2 (2), 284–299. [CrossRef]

*Perception*, 26 (12), 1539–1548. [CrossRef] [PubMed]

*Proceedings of the National Academy of Sciences, USA*, 99 (8), 5661. [CrossRef]

Bracewell, R. N. (1999). *The Fourier transform and its applications* (3rd ed.). New York: McGraw-Hill Science/Engineering/Math.

*Vision Research*, 51 (13), 1431–1456. [CrossRef] [PubMed]

Chubb, C., & Sperling, G. (1988). Drift-balanced random stimuli: A general basis for studying non-Fourier motion perception. *Journal of the Optical Society of America A: Optics and Image Science*, 5, 1986–2007. [CrossRef]

*i-Perception*, 2 (6), 569–576, doi:10.1068/i0441aap. [CrossRef] [PubMed]

*Psychology of Aesthetics, Creativity, and the Arts*, 5 (2), 115. [CrossRef]

*Network: Computation in Neural Systems*, 6 (3), 345–358. [CrossRef]

*Journal of the Optical Society of America A: Optics and Image Science*, 4 (12), 2379–2394. [CrossRef]

Fitzgerald, J. E., Katsov, A. Y., Clandinin, T. R., & Schnitzer, M. J. (2011). Symmetries in stimulus statistics shape the form of visual motion estimators. *Proceedings of the National Academy of Sciences, USA*, 108 (31), 12909–12914, doi:10.1073/pnas.1015680108. [CrossRef]

Fleet, D. J., & Langley, K. (1994). Computational analysis of non-Fourier motion. *Vision Research*, 34 (22), 3057–3079. [CrossRef] [PubMed]

*Science*, 218 (4571), 486–487, doi:10.1126/science.7123249. [CrossRef] [PubMed]

*Annual Review of Neuroscience*, 27 (1), 649–677, doi:10.1146/annurev.neuro.27.070203.144220. [CrossRef] [PubMed]

*Perception and Psychophysics*, 55 (1), 48–120. [CrossRef] [PubMed]

*Journal of Cognitive Neuroscience*, 12 (5), 711–720. [CrossRef] [PubMed]

Hu, Q., & Victor, J. D. (2010). A set of high-order spatiotemporal stimuli that elicit motion and reverse-phi percepts. *Journal of Vision*, 10 (3): 9, 1–16, http://www.journalofvision.org/content/10/3/9, doi:10.1167/10.3.9. [PubMed] [Article]

*Attention, Perception, and Psychophysics*, 14 (2), 201–211. [CrossRef]

*Neurocomputing*, 52, 117–123. [CrossRef]

*Journal of the Optical Society of America A*, 8 (2), 377–385. [CrossRef]

*Journal of the Optical Society of America A*, 18 (9), 2331–2370, doi:10.1364/JOSAA.18.002331. [CrossRef]

*Vision Research*, 25 (5), 625–660. [CrossRef] [PubMed]

Reichardt, W. (1961). Autocorrelation, a principle for the evaluation of sensory information by the central nervous system. In *Sensory Communication*, 303–317.

*Journal of Comparative Physiology A: Neuroethology, Sensory, Neural, and Behavioral Physiology*, 161 (4), 533–547. [CrossRef]

*Network (Bristol, England)*, 10 (4), 341–350. [CrossRef] [PubMed]

*Journal of Neuroscience*, 18 (10), 3816–3830. [PubMed]

*Proceedings of the Royal Society*, 203 (1153), 405–426, doi:10.1098/rspb.1979.0006. [CrossRef]

*The interpretation of visual motion*. England: Oxford.

*Visual Neuroscience*, 5 (4), 353–369. [CrossRef] [PubMed]

*Proceedings of the Royal Society of London Series B: Biological Sciences*, 265 (1394), 359–366, doi:10.1098/rspb.1998.0303. [CrossRef]

*Journal of the Optical Society of America A*, 2 (2), 300–321. [CrossRef]