Our goal was to quantify different kinds of local motion signals (F, standard NF, and G) in a segment of a naturalistic movie. We did this by first measuring each kind of motion signal based on the luminance correlations within the appropriate spatiotemporal template (Figure 1) to obtain local motion scores, and then, for each kind of motion, we combined these scores across space in different ways.
We began by motivating the definition of each kind of motion signal. Typically, F motion is defined by pairwise spatiotemporal correlation of the luminance values in the image (Van Santen & Sperling, 1985). NF motion denotes the motion of a local feature, such as an edge or flicker, in the absence of pairwise spatiotemporal correlation of luminance. An example of NF motion is an object that is flickering randomly (thus eliminating pairwise correlations) while moving across a background of equal mean luminance (Chubb & Sperling, 1988). However, although several models for NF motion extraction have been proposed (Chubb & Sperling, 1988; Fleet & Langley, 1994), there is no single mathematical quantity (analogous to spatiotemporal correlation used for F motion) that is recognized as defining its strength. As will be shown later and in Supplement S3, our approach is able to capture the motion signals in these stimuli. Finally, G motion (Hu & Victor, 2010) encompasses third- or higher-order correlation in slanted spatiotemporal regions and occurs in the absence of pairwise spatiotemporal correlation of luminance (F motion) or simple features (NF motion).
These motion types have a fundamental similarity: They all depend on correlations within a slanted spatiotemporal region (Figure 1). For F motion, the correlation is pairwise, and the region consists of two checks, offset in space and time. For NF motion, the region consists of four checks, and the shape of the region depends on the subtype of NF motion. For NF motion of a spatial feature (NF-S), the region is a parallelogram consisting of two pairs of checks, and the pairs are in adjacent time-slices. Each pair of checks effectively detects the spatial feature (match vs. mismatch), and the combination of the two pairs detects whether this feature moves. For NF motion of a temporal feature (NF-T), the same region is rotated to interchange the roles of space and time. Each pair of checks detects whether there is local flicker, and the combination of the two pairs detects whether the feature moves. For the G motion types considered here, the region is a triplet of checks. Depending on the orientation of the triangle formed by the three checks, the region corresponds to either expansion or contraction over time. Thus, in all cases, the local motion signal corresponds to the correlations among a group of checks in a specific shape, i.e., the template (Figure 1). The templates shown in Figure 1A correspond to motion to the right; flipping them across the Y-axis corresponds to motion to the left.
To quantify the correlations within these templates, we calculated the product of the luminance values in their checks (after subtracting the mean luminance of each shot separately). To implement this for color movies, we first converted the color inputs to gray levels using Matlab's (The MathWorks, Inc., Natick, MA) rgb2gray function. (The numeric range of luminance is irrelevant because we later normalized our calculations by a parallel computation for a movie with spatial correlations removed; see next section for details.)
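The preprocessing described above can be sketched as follows. This is a minimal illustration rather than the code used in the study: it assumes that frames are given as rows of (r, g, b) tuples, that shot boundaries are already known, and that the gray-level conversion uses the weights documented for Matlab's rgb2gray (0.2989, 0.5870, 0.1140). The function names are hypothetical.

```python
# Weights documented for Matlab's rgb2gray (assumed here; rounding to an
# integer image type, which Matlab performs for uint8 input, is omitted).
LUMA_WEIGHTS = (0.2989, 0.5870, 0.1140)


def to_gray(frame_rgb):
    """Convert one frame, given as rows of (r, g, b) tuples, to gray levels."""
    wr, wg, wb = LUMA_WEIGHTS
    return [[wr * r + wg * g + wb * b for (r, g, b) in row] for row in frame_rgb]


def subtract_shot_mean(shot_gray):
    """Subtract the mean luminance of one shot (a list of gray-level frames)."""
    values = [v for frame in shot_gray for row in frame for v in row]
    mean = sum(values) / len(values)
    return [[[v - mean for v in row] for row in frame] for frame in shot_gray]
```

Each shot would be passed through `subtract_shot_mean` separately, so that the products computed later are products of deviations from that shot's own mean.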
Following Reichardt (1961) and many others, we noted that the raw correlation value (i.e., the product of the luminance contrasts) will contain spurious motion signals when a static spatial edge is present. As is standard for F motion, we removed this spurious signal by an opponent process in which correlations from left-facing and right-facing templates were subtracted (Figure 1). This strategy suffices for NF motion as well, but is insufficient for G motion (Figure 1). To eliminate this signal for G motion, we added a second opponent stage in which signals from forward- and backward-facing templates were subtracted. Fundamentally, this second opponent stage is needed because the glider for G motion lacks the symmetry of the templates for F and NF motion: for F and NF templates, left-versus-right spatial opponency is equivalent to forward-versus-backward temporal opponency. In other words, because of this symmetry for F and NF templates, the standard single-opponent calculation (space only) is equivalent to a double-opponent calculation (space and time), but for G templates, these two opponencies must be explicit. (Note that had we included only the forward-versus-backward opponency for G templates, then we also would not have eliminated spurious motion signals due to full-field flicker.)
Formally, the calculation of the local motion score is as follows. A motion type corresponds to a template, B, which is a set of spatiotemporal voxels in a specific relative position. We represent a template as a set of triplets [(x1,y1,t1), (x2,y2,t2), …, (xn,yn,tn)], in which each of the xi, yi, and ti are integers and n is the number of elements in the template. Since the template is determined by the relative positions of its voxels, we require that min(xi) = min(yi) = min(ti) = 0, where i = 1, …, n.
A template that is reversed along the X-dimension, which we denote as BX, is the template in which each triplet (xi,yi,ti) of B is replaced by [LX(B)−xi,yi,ti], where LX(B) is the length of the template in the X-dimension, namely, max(xi). Reversals along the Y and T dimensions are similarly defined. BYT, for example, denotes a template that has been reversed along the Y dimension and then along the T dimension.
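The template representation and the reversal operation can be sketched as below. The voxel coordinates are illustrative assumptions chosen to match the verbal descriptions (Figure 1 is not reproduced here), and the function names are hypothetical. The sketch also checks the symmetry property invoked earlier: for the F and NF templates, reversal along X gives the same voxel set as reversal along T, whereas for the G triplet it does not.

```python
def normalize(template):
    """Shift a template so that min(x) = min(y) = min(t) = 0."""
    mins = [min(v[k] for v in template) for k in range(3)]
    return frozenset(tuple(v[k] - mins[k] for k in range(3)) for v in template)


def reverse(template, axis):
    """Reverse along one axis (0 = X, 1 = Y, 2 = T): coordinate c becomes L - c,
    where L is the maximum coordinate of the template along that axis."""
    length = max(v[axis] for v in template)
    return frozenset(
        tuple(length - c if k == axis else c for k, c in enumerate(v))
        for v in template
    )


# Hypothetical (x, y, t) voxel sets for rightward motion.
F = frozenset({(0, 0, 0), (1, 0, 1)})                          # two checks
NF_S = frozenset({(0, 0, 0), (1, 0, 0), (1, 0, 1), (2, 0, 1)})  # parallelogram
NF_T = frozenset({(0, 0, 0), (0, 0, 1), (1, 0, 1), (1, 0, 2)})  # rotated version
G = frozenset({(0, 0, 0), (1, 0, 0), (1, 0, 1)})                # triplet

# Composed reversal: Y first, then T (the BYT of the text).
G_YT = reverse(reverse(G, 1), 2)


def space_equals_time_reversal(template):
    """True when left-right reversal and forward-backward reversal coincide."""
    return reverse(template, 0) == reverse(template, 2)
```

With these assumed coordinates, `space_equals_time_reversal` holds for `F`, `NF_S`, and `NF_T` but not for `G`, which is why a single spatial opponent stage is not enough for the G templates.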
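The raw correlation value for the template (glider) B at the position (x, y, t) is defined as a product that involves all offsets contained in the glider:

$$C_B(x,y,t) = \prod_{i=1}^{n} \left[ I(x+x_i,\, y+y_i,\, t+t_i) - \bar{I}_{\mathrm{shot}} \right]$$

where I(x, y, t) is the luminance of the image at the position (x, y, t) and Īshot is the mean luminance across the shot. Finally, the local motion score at position (x, y, t) for motion type B in direction Z is defined by the double-opponent calculation:

$$M_{B,Z}(x,y,t) = \left[ C_B(x,y,t) - C_{B^{Z}}(x,y,t) \right] - \left[ C_{B^{T}}(x,y,t) - C_{B^{ZT}}(x,y,t) \right]$$

Here $B^{Z}$, $B^{T}$, and $B^{ZT}$ are the reversed templates BZ, BT, and BZT defined above.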
Note that although our approach aims to capture specific kinds of local motion signals (F, NF-S, NF-T, and G), it can be easily modified to capture motion signals carried by correlations in other spatiotemporal configurations (e.g., Hu & Victor, 2010) by using the appropriate templates.
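The calculation of the local motion score can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the study: the movie is a nested list indexed as movie[t][y][x] with the shot mean already subtracted, the template coordinates are hypothetical, and the double-opponent score is taken to be the spatial opponent difference minus the same difference for the time-reversed templates, which is our reading of the verbal description.

```python
def reverse(template, axis):
    """Reverse a template along one axis (0 = X, 1 = Y, 2 = T)."""
    length = max(v[axis] for v in template)
    return [
        tuple(length - c if k == axis else c for k, c in enumerate(v))
        for v in template
    ]


def raw_correlation(movie, template, x, y, t):
    """Product of mean-subtracted luminances over all template offsets."""
    product = 1.0
    for dx, dy, dt in template:
        product *= movie[t + dt][y + dy][x + dx]
    return product


def motion_score(movie, template, x, y, t, axis=0):
    """Double-opponent score: spatial opponency, then temporal opponency."""
    b_z = reverse(template, axis)   # reversed along the direction of motion
    b_t = reverse(template, 2)      # reversed along time
    b_zt = reverse(b_z, 2)          # reversed along both

    def c(tpl):
        return raw_correlation(movie, tpl, x, y, t)

    return (c(template) - c(b_z)) - (c(b_t) - c(b_zt))


# Hypothetical templates for rightward motion, as (x, y, t) offsets.
F = [(0, 0, 0), (1, 0, 1)]
G = [(0, 0, 0), (1, 0, 0), (1, 0, 1)]

# One-row movies of two frames: a static edge and a pattern shifted right.
static_edge = [[[-1.0, -1.0, 1.0, 1.0]], [[-1.0, -1.0, 1.0, 1.0]]]
moving = [[[1.0, -1.0, -1.0, 1.0]], [[1.0, 1.0, -1.0, -1.0]]]
```

For the static edge, the raw correlation of the F template is nonzero, but the score is zero for both the F and G templates. For the rightward-shifted pattern, the F score at x = 1 is positive. The positions passed to `motion_score` must leave room for every template offset inside the movie; boundary handling is left out of this sketch.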