Shape and motion are two dominant cues for object recognition, but it can be difficult to investigate their relative quantitative contribution to the recognition process. In the present study, we combined shape and non-rigid motion morphing to investigate the relative contributions of both types of cues to the discrimination of dynamic objects. In Experiment 1, we validated a novel parameter-based motion morphing technique using a single-part three-dimensional object. We then combined shape morphing with the novel motion morphing technique to pairs of multipart objects to create a joint shape and motion similarity space. In Experiment 2, participants were shown pairs of morphed objects from this space and responded “same” on the basis of motion-only, shape-only, or both cues. Both cue types influenced judgments: When responding to only one cue, the other cue could be ignored, although shape cues were more difficult to ignore. When responding on the basis of both cues, there was an overall bias to weight shape cues more than motion cues. Overall, our results suggest that shape influences discrimination more than motion even when both cue types have been made quantitatively equivalent in terms of their individual discriminability.

*discrimination contours* through a joint shape and motion space to visualize the relative contribution of both types of cues to the task.

*both* shape and motion. As we and others have found (e.g., Cutzu & Edelman, 1996; Giese, Thornton, & Edelman, 2008; Jastorff et al., 2006; Lawson & Bülthoff, 2008; Pyles et al., 2007; Schultz et al., 2008; Vuong et al., 2009), perceptual similarity in the shape or motion domain can affect how well observers recognize objects.

*linear combination of prototypes* (Giese & Poggio, 2000; see Ullman, 1998, for a linear combination of 2D views to represent a 3D object). The prototypes serve as stored examples of an object class for which a 3D description is available. In the simple case, this description is the 3D position (i.e., *x*-, *y*-, and *z*-coordinates) of the set of vertices that define the 3D object model. New object models are synthesized by taking a linear combination of the prototypes, that is, by taking a weighted average of the 3D position at each corresponding vertex between prototypes. This linear combination is also referred to as *morphing*. On this view, given a set of prototypes ($P_1, \ldots, P_n$), a morph, $M$, can be created by the linear combination $M = c_1 P_1 + \ldots + c_n P_n$, where each weight $c_i$ represents the contribution of prototype $P_i$ to the morph, with the constraint that $c_1 + \ldots + c_n = 1$. Giese and Poggio (2000) extended the linear combination of prototypes to human actions (e.g., walking, running, marching), which added a temporal dimension to shape. In their spatiotemporal motion morphing technique, prototype actions were represented by trajectories of key parts and joints of the human body (e.g., forehead, shoulders, elbows, wrists, hips, pelvis, knees, and feet).
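The weighted average described above can be sketched in a few lines of Python. This is only an illustration of the formula $M = c_1 P_1 + \ldots + c_n P_n$: the function name `morph`, the list-of-tuples vertex layout, and the toy coordinates are assumptions made for the example, not the stimulus-generation code used in the study.

```python
def morph(prototypes, weights, tol=1e-9):
    """Linear combination of prototypes (hypothetical helper).

    prototypes: list of n models, each a list of (x, y, z) vertex tuples,
                with vertices in corresponding order across models.
    weights:    list of n weights c_1..c_n that must sum to 1.
    """
    if len(prototypes) != len(weights):
        raise ValueError("need exactly one weight per prototype")
    if abs(sum(weights) - 1.0) > tol:
        raise ValueError("weights must sum to 1")
    n_vertices = len(prototypes[0])
    if any(len(p) != n_vertices for p in prototypes):
        raise ValueError("prototypes must have corresponding vertices")

    morphed = []
    for v in range(n_vertices):
        # Weighted average of the 3D position at this corresponding vertex.
        vertex = tuple(
            sum(c * p[v][axis] for c, p in zip(weights, prototypes))
            for axis in range(3)
        )
        morphed.append(vertex)
    return morphed


# Toy example: two 2-vertex "models" (made-up coordinates).
proto_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
proto_b = [(0.0, 2.0, 0.0), (1.0, 2.0, 4.0)]
quarter_morph = morph([proto_a, proto_b], [0.25, 0.75])
```

The check on the weights enforces the constraint $c_1 + \ldots + c_n = 1$; without it the result would be a scaled rather than an averaged model.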

$c_1, \ldots, c_n$) within the space spanned by the prototypes. That is, the prototypes define the physical dimensions of the space and a weight vector defines a point (i.e., object) within it. The similarity between any two objects can then be defined as the Euclidean distance between their weight vectors on the specific dimensions that are represented. Using such morphing techniques, behavioral experiments have shown that similarity in the parametric space maps to perceived similarity and is important for behavior in both the shape and motion domains (e.g., Cutzu & Edelman, 1996; Giese et al., 2008; Jastorff et al., 2006; Lawson & Bülthoff, 2008; Schultz et al., 2008; Troje, 2002). Thus, the linear combination framework allows researchers to create perceptually meaningful spaces, that is, observers are sensitive to systematic variations in this type of space.
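As a small illustration of the distance measure described above, the following sketch computes the Euclidean distance between two weight vectors. The function name `morph_distance` and the example weights are hypothetical; `math.dist` requires Python 3.8 or later.

```python
import math


def morph_distance(weights_a, weights_b):
    """Euclidean distance between two morphs' weight vectors (hypothetical helper)."""
    if len(weights_a) != len(weights_b):
        raise ValueError("weight vectors must have the same number of prototypes")
    return math.dist(weights_a, weights_b)


# Two morphs defined over the same three prototypes (made-up weights).
d = morph_distance((0.6, 0.4, 0.0), (0.2, 0.8, 0.0))
```

Under this definition, two morphs that differ only along one prototype pair are separated by a distance that grows linearly with the difference in their morph levels.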

$C$ is the combined difference estimate between two shapes and motions, $S$ is the shape difference estimate, $M$ is the motion difference estimate, and $w$ is the relative weighting of shape and motion differences. The objects are considered different if the sum is greater than some decision threshold $\theta$. The relative weight, $w$, ranges from 0 to 1: As observers rely more and more on the shape cue, $w^2$ approaches 1; conversely, as they rely progressively more on the motion cue, $w^2$ approaches 0; and if they use both cues equally, $w^2 = 0.5$. For convenience, we also define $w_s = w$ (the relative weight assigned to shape differences) and $w_m = \sqrt{1 - w^2}$, so that $w_s^2 + w_m^2 = 1$.

$\theta$ that determines the amount of difference required for the observer to give a “different” response (i.e., the observer's decision threshold), $\sigma_m$ and $\sigma_s$ that determine the reliability of the motion and shape signals (i.e., the variance or noise in the estimates $M$ and $S$), respectively, and $w$ that determines the relative weight assigned to the motion and shape signals. Thus, our full model has four parameters ($\theta$, $\sigma_m$, $\sigma_s$, and $w$). The “Deriving a model of shape and motion discrimination” section in Appendix A provides a derivation of our model (Equations A4 to A9).
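To make the four-parameter model concrete, here is a minimal Monte Carlo sketch of a model observer. It rests on assumptions that should be read as such: the estimates $M$ and $S$ are taken to be normally distributed around the physical differences $m$ and $s$ with standard deviations $\sigma_m$ and $\sigma_s$, and the combined estimate follows $C^2 = w_s^2 S^2 + w_m^2 M^2$ with $w_s = w$ and $w_s^2 + w_m^2 = 1$, as stated in the appendix. The function name, the trial count, and the parameter values are illustrative only, and the fits reported in the paper are analytic rather than simulated.

```python
import math
import random


def p_different_mc(m, s, theta, sigma_m, sigma_s, w, trials=20000, seed=0):
    """Estimate P("different") for physical differences (m, s) by simulation.

    Assumes M ~ N(m, sigma_m), S ~ N(s, sigma_s), and a "different" response
    whenever C = sqrt(w_s^2 * S^2 + w_m^2 * M^2) exceeds theta.
    """
    rng = random.Random(seed)
    w_s = w
    w_m = math.sqrt(1.0 - w * w)  # follows from w_s^2 + w_m^2 = 1
    n_different = 0
    for _ in range(trials):
        motion_estimate = rng.gauss(m, sigma_m)
        shape_estimate = rng.gauss(s, sigma_s)
        combined = math.sqrt((w_s * shape_estimate) ** 2
                             + (w_m * motion_estimate) ** 2)
        if combined > theta:
            n_different += 1
    return n_different / trials


# Illustrative values only: a motion-only observer (w = 0) judging a 20% motion difference.
p = p_different_mc(m=20.0, s=0.0, theta=20.0, sigma_m=18.0, sigma_s=8.0, w=0.0)
```

Setting `w=0.0` or `w=1.0` reduces the observer to a single cue, which is the situation the motion-only and shape-only instructions aim for.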

*and* motion of different multipart objects (Experiment 2).

*same–different* discrimination task. We expected that if participants were sensitive to the parametric motion space spanned by the three prototype motions we used, their ability to discriminate morphs within this space would be monotonically related to the *motion distance* between two prototypes (Giese et al., 2008). Furthermore, we designed the experiment to reduce reliance on image changes *per se* by changing the viewpoint at which each of the two objects was presented.

*bending, twisting,* and *stretching* (recall the snake). To relate these motions to the linear combination framework, we refer to them as the “prototype” non-rigid motions (see below). These different motions can be considered to be global *deformation fields* that smoothly warp the 3D position of all the vertices on the surface of a 3D model (Barr, 1984; Watt & Watt, 1992). By smooth deformations, we mean that there are no sharp changes or discontinuities to the resulting 3D geometry. These deformations are illustrated in Figure 1. For example, a single- or multipart object (Figure 1, left and right panels, top row) can be bent, twisted, or stretched (Figure 1, left and right panels, middle row). In Experiment 1, we used the single-part object illustrated in the left panel, and in Experiment 2, we used the multipart object shown in the right panel; however, their creation and morphing were identical so we describe the stimulus creation for both experiments together. Prototype motions can be combined and the resulting deformation field mapped to the object. Thus, the technique of using deformation fields to morph motions on any shape offers tremendous freedom because it is not necessary to find either corresponding points on the surfaces of two different 3D shapes or corresponding points in time.

*temporal profiles* are illustrated in Figure 2. The five parameters are: bend angle (0° [straight] to 180° [fully bent]), bend direction (arbitrary range in degrees, 0° to 270° used in the present study), twist angle (−90° to +90°), twist bias (arbitrary range in degrees, 7° to 15° used in the present study), and stretch amount (−1 to 1, in arbitrary units). The bend direction and twist bias affect the initial direction of bending or twisting relative to some arbitrary starting position (0°), respectively. For the stretch amount, positive values stretch the shape longer (see Figure 1) and negative values compress the shape shorter (not illustrated).

*per se* to perform the task. The location (left or right of fixation) and viewpoint of the two selected videos were randomly determined on each trial.

*F*(1,19) = 461.1, $\eta_p^2$ = 0.96), which suggests that participants' responses were increasing as a function of motion difference. There was also a main effect of morph pair (*F*(2,38) = 16.1, $\eta_p^2$ = 0.45). Lastly, there was a significant interaction between the two factors (*F*(10,190) = 3.65, $\eta_p^2$ = 0.16). Post-hoc comparison using Tukey's Honestly Significant Difference (with the within-subjects error term from the significant interaction) showed that the twisting–stretching pair was different from both the bending–twisting and bending–stretching pairs (*p*s < 0.05). However, qualitatively, the curves are the same.

$w_s = 0$ (i.e., all the weight is assigned to motion differences), making the value of $\sigma_s$ immaterial. We could then fit the two remaining parameters ($\theta$ and $\sigma_m$) with Equation A9 using maximum likelihood fitting. The “Maximum likelihood fitting” section in Appendix A describes the maximum likelihood procedure we used. We assessed the relative fit for each participant's data across the different morph pairs using the root mean square error (RMSE) between the actual data and the predicted data (from the participant's individual fitted parameters). The RMSE was similar across the three morph pairs (bending–twisting: *M* = 7.6%, *SE* = 0.6%; bending–stretching: *M* = 6.5%, *SE* = 0.8%; and twisting–stretching: *M* = 7.6%, *SE* = 0.6%). We then averaged the parameters across participants to compute a population psychometric function for each morph pair. These functions are illustrated in Figure 4 (see Figure A2 in Appendix A for fits for each participant).

*SEM* = 1.8%). This threshold for motion discrimination was similar to the threshold obtained in our previous study using shape morphs of multipart objects (Schultz et al., 2008). In that study, we found that the 75% shape threshold was 41.3% (*SEM* = 2.0%; *N* = 15).

*shape only, motion only,* and *shape* + *motion*).

*same* if and only if both stimuli had the same shape and same motion (disregarding changes in viewpoint).

*x*-axis represents the task-relevant cue. There was a significant linear effect of the task-relevant cue (shape only: *F*(1,19) = 405.4, $\eta_p^2$ = 0.96; motion only: *F*(1,19) = 92.2, $\eta_p^2$ = 0.83). From the point of view of interference from the task-irrelevant cue, for the motion-only group there was a small but significant linear effect of the task-irrelevant shape (*F*(1,19) = 5.0, $\eta_p^2$ = 0.21). Otherwise, there was generally no effect of the task-irrelevant cue or interactions with it for either group. In other words, participants were generally able to make their discriminations based on only the shape or motion cue and could ignore the task-irrelevant cue, although participants in the motion-only group were slightly affected by shape differences.

*x*-axis; otherwise, the data are identical in the two panels. In contrast to the previous results, there were significant and large effects of shape and motion. For the main effects of shape and of motion, the linear effects were significant (shape: *F*(1,19) = 117.1, $\eta_p^2$ = 0.86; motion: *F*(1,19) = 31.4, $\eta_p^2$ = 0.62). There was also a significant interaction between shape and motion (*F*(16,304) = 8.32, $\eta_p^2$ = 0.31). In Figure 5c, it can be seen that when shapes were similar, the motion cues facilitated responses, but when shapes were very dissimilar at the largest shape difference (40%), the motion cues had little effect. By comparison, it is easy to see in Figure 5d that shape still has an effect on discrimination even at the largest motion difference tested (40%).

$P_{\text{diff}}$ = 50%. In the area below or to the left of this contour, stimuli are more likely to be judged “same”; in the area above or to its right, they are more likely to be judged “different.” The dashed contours mark $P_{\text{diff}}$ = 25% and $P_{\text{diff}}$ = 75%, and the dotted contours mark $P_{\text{diff}}$ = 10% and $P_{\text{diff}}$ = 90%.

$\sigma_m$ and $\sigma_s$ are independent of the task, since they reflect a constant internal encoding noise. In Figures 6a–6c, we arbitrarily set $\sigma_m$ = 18% and $\sigma_s$ = 8%. We further assumed that the decision boundary $\theta$ remained constant (at 20%). Differences in the model observer's performance between tasks therefore solely reflect changes in the weights it assigns to the different cues. In the shape-only and motion-only tasks, participants were instructed to completely discount the irrelevant cue. We therefore assumed that our idealized model observer achieves this perfectly, that is, the model observer sets $w_s = 0.0$ and $w_m = 1.0$ on the motion-only task (Figure 6a) and sets $w_s = 1.0$ and $w_m = 0.0$ on the shape-only task (Figure 6b). In each case, the iso-probability contours are straight lines parallel to the task-irrelevant axis. This is the signature of an observer who weights one cue to the total exclusion of the other.

$\sigma_m > \sigma_s$), so the model gives more weight to the more reliable shape cue. The resulting iso-probability contours are shown in Figure 6c. They are ellipses centered on zero difference, with the semi-major axis of the ellipse parallel to the less reliable dimension (here motion). This is the signature of an observer who is using both cues but weighting one more than the other. The signature of an observer who is weighting both cues equally would be circular iso-probability contours (not shown).
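The reliability-based weighting used for the shape + motion simulation can be sketched as follows. The sketch assumes that each weight is proportional to the inverse of that cue's noise standard deviation and that the weights are normalized so that $w_s^2 + w_m^2 = 1$; the function name is hypothetical, and the values 18% and 8% are the arbitrary noise levels quoted for Figures 6a–6c.

```python
import math


def reliability_weights(sigma_m, sigma_s):
    """Return (w_s, w_m) with w proportional to 1/sigma and w_s^2 + w_m^2 = 1.

    The normalization is an assumption taken from the constraint stated in
    the appendix; it is not copied from the authors' code.
    """
    inv_m = 1.0 / sigma_m
    inv_s = 1.0 / sigma_s
    norm = math.hypot(inv_m, inv_s)
    return inv_s / norm, inv_m / norm


w_s, w_m = reliability_weights(sigma_m=18.0, sigma_s=8.0)

# Semi-axes of the elliptical decision boundary C = theta in (M, S) space:
# the boundary crosses the motion axis at theta / w_m and the shape axis
# at theta / w_s.
theta = 20.0
motion_semi_axis = theta / w_m
shape_semi_axis = theta / w_s
```

With this normalization the ratio of the two semi-axes equals $\sigma_m / \sigma_s$, so the ellipse is elongated along the noisier (here motion) dimension, in line with the description of Figure 6c.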

*x*-axis can be considered the 75% motion threshold (holding shape constant) and the intersection with the *y*-axis can be considered the 75% shape threshold (holding motion constant). For example, when both motion and shape cues are task-relevant, the model observer's 75% shape threshold is about half of its motion threshold (∼25% vs. ∼60%). That is, the model is better at discriminating shape than motion (not surprisingly, since motion is the less reliable cue by design).

$w_m$ and $w_s$ are inversely proportional to $\sigma_m$ and $\sigma_s$ section of Appendix A). We therefore fitted Equation A6 to each participant's data to estimate $\theta$, $\sigma_m$, and $\sigma_s$ and then averaged these parameters across participants in each group. As in Experiment 1 (see Psychometric functions section), the RMSE was similar for the three tasks (motion only: *M* = 11.2%, *SE* = 0.9%; shape only: *M* = 10.3%, *SE* = 0.6%; and shape + motion: *M* = 10.2%, *SE* = 0.6%). The parameter means were used to generate the iso-probability contours for each group shown in Figures 6d–6f.

$w_s$ and $w_m$ for each participant from his or her individually fitted parameters. Table 1 provides the means and standard error of the means (*SEM*) for these relative weights. In the shape + motion group, participants were able to use both shape and motion cues to discriminate pairs of objects (compare Figures 6c and 6f). However, they weighted the shape cue more than the motion cue to perform the task (paired-sample *t*(19) = 3.07, *p* = 0.006). This finding suggests that participants could detect differences in shape more reliably than differences in motion because the weight given to each cue reflects its reliability (Landy et al., 1995). This is consistent with our expectation of a stronger weighting/more reliable estimation of shape cues (e.g., Spetch et al., 2006; Vuong & Tarr, 2006).

| | Motion only | Shape only | Shape + motion |
|---|---|---|---|
| $w_s$ | 0.34 (0.08) | 0.99 (0.004) | 0.80 (0.04) |
| $w_m$ | 0.84 (0.07) | 0.13 (0.02) | 0.50 (0.06) |

*t*-value approached infinity as *p* < $10^{-18}$; motion only: *t*(19) = 3.77, *p* = 0.001). However, participants in the motion-only group had relatively more difficulty basing their decisions solely on motion differences and relied also on shape differences to some extent, whereas those in the shape-only group were relatively better at ignoring the task-irrelevant cue. The difference between these two groups in the weight assigned to the task-irrelevant cue was significant (2-sample *t*(19) = 2.93, *p* = 0.009). Consistent with this, Figure 7 shows the 75% iso-probability contours individually for all 20 participants in each task. Each thin colored contour (blue or red) represents a single participant; the thick black contour represents the group-averaged contour. The red (horizontal) contours represent those participants who assigned a very large relative weight to shape cues (which we chose to be $w_s > 0.9$). Not surprisingly, all participants in the shape-only group weighted shape cues relatively more than motion cues. However, as evident in Figure 7, six of the 20 participants in the shape + motion group assigned a relatively large weight to shape cues and three of the 20 participants in the motion-only group also assigned a relatively large weight to shape cues. These individuals have likely set their own relative weights, despite the explicit task instructions (perhaps due to the difficulty of the task).

*x*- and *y*-axes in Figures 6d–6f). We found that the shape thresholds were similar for the shape-only (27.2%) and shape + motion groups (24.6%). In contrast, the motion threshold for the motion-only group (32.4%) was almost 2.4 times smaller than the motion threshold for the shape + motion group (76.6%). Lastly, we found that the *shape* threshold for the shape-only group was similar to the *motion* threshold for the motion-only group (27.2% versus 32.4%, respectively). This finding suggests that participants were equally sensitive to linear changes along our shape or motion morph continuum, even though motion cues were generally less reliable and weighted less than shape cues when both shape and motion were task-relevant. Furthermore, it suggests that the viewpoint changes we used to avoid image matching did not drastically bias shape or motion processing.

*per se*. If this were the case, we would also expect to see an increase in the shape discrimination threshold for the shape + motion group relative to participants in the shape-only group, and there was none. In fact, our model can explain this increase in the motion threshold for the shape + motion group; exactly the same effect is seen in the simulated example plotted in Figures 6a and 6c. In these simulations, the noise affecting each cue was held constant in the motion-only and shape + motion tasks, and only $w$ varied. When only the motion cue was task-relevant, the model observer assigned all the weight to that cue (i.e., $w_m = 1.0$). However, when both shape and motion cues were task-relevant, the model observer weighted each cue in inverse proportion to its noise (which is larger for motion than shape information for these simulations); therefore, much less weight was given to the less reliable motion cue. In this case, small motion differences are, in effect, *further* reduced because of the small weight assigned to the motion cue (i.e., $w_m < 1.0$). Consequently, the model observer's sensitivity to motion differences decreased in the shape + motion task relative to the motion-only task. Thus, the model successfully captures this aspect of participants' discrimination performance.

*per se* may not influence motion discrimination and, second, helps further validate our parameter-based morphing technique for dynamic objects. Thus, overall, using a parametric manipulation of both shape and motion differences allowed us to model and visualize the relative contribution of shape and motion cues under different task demands.

*per se* but how non-rigid motion (articulations and deformations) can be used generally in the service of object recognition.

$x \sim N(\mu, \sigma)$ means that $x$ is a random deviate drawn from a normal (Gaussian) distribution with mean $\mu$ and standard deviation $\sigma$, the motion (difference) signal, $m$, and shape (difference) signal, $s$, estimated by the observer on any given trial are

$s = 0$ (i.e., the shape signal is zero as both objects on a given trial always had the same shape); in Experiment 2, both $m$ and $s$ were generally non-zero.

$C < \theta$, where $w$ is in the range [0, 1]. The parameter $w$ is the relative weight assigned to the motion and shape signals. It is included because observers may explicitly choose to weight one cue relatively more than the other (e.g., because of task instructions) independently of how well they can estimate the motion or shape signals. This model therefore formally captures the different discrimination tasks that were implemented in Experiments 1 and 2 (motion only, shape only, and shape + motion tasks). For notational convenience, we define the relative weight assigned to the shape cue as $w_s = w$ and the relative weight assigned to the motion cue as $w_m = \sqrt{1 - w^2}$, so that $C^2 = w_s^2 S^2 + w_m^2 M^2$ and $w_s^2 + w_m^2 = 1$.

$M$, $S$). This boundary is dependent on the decision threshold ($\theta$) and relative weighting of the two cues ($w$). The thin blue and green colored contours illustrate the distribution of ($M$, $S$) for two different sample stimulus differences ($m$, $s$). These signal distributions are dependent on the noise in the motion and shape estimation ($\sigma_m$ and $\sigma_s$). The decision boundary and stimulus distributions extend into the negative range because we model a difference signal in the motion or shape dimension. We therefore based our fits on the sum of two cumulative Gaussian distributions. The use of two cumulative Gaussians allows us to ignore the sign of the motion/shape difference, e.g., if stimulus 1 was a 35% motion morph, then morphs of 25% and 45% for stimulus 2 would both be counted as a 10% motion difference.

$M$, $S$) under the null hypothesis that there is no difference in the stimulus ($m$, $s$). The blue contours show the distribution of ($M$, $S$) for a given non-zero pair of stimulus differences ($m$, $s$), marked by the blue dot. The aspect ratio of these contours depends on the relative standard deviations along the two axes, specifically, $\sigma_m / \sigma_s$. To maximize performance on a same–different task, we assume that the decision contour should also have an aspect ratio of $\sigma_m / \sigma_s$. We consider this case below. By definition, our model observer will judge the stimuli to be the same when ($M$, $S$) falls within the red ellipse.

$m$, $s$) is the volume of the stimulus distribution that falls within the red ellipse:

$r$ and $\alpha$, defined by

$\theta$, $\sigma_m$, $\sigma_s$, and $w$ to 25 data points (5 levels of shape differences × 5 levels of motion differences). We were reluctant to do this as the data did not allow all parameters to be well constrained; there is a trade-off between changes in relative weighting and changes in noise. We therefore limited our fits to the special case in which $w$ is inversely proportional to stimulus noise. Because $w$ provides a relative weighting of motion and shape cues, we can set the (relative) weight assigned to each cue to be proportional to the inverse of its variance (Landy et al., 1995) so that the aspect ratio of the red decision boundary in Figure A1 matches the stimulus distributions shown in blue or green. That is, we set

$w_s^2 + w_m^2 = 1$. In this case, we can do the integration over $\alpha$ analytically. Equation A4 now becomes

$A^2 =$

$B^2 =$

$I_0$ is the modified Bessel function of the first kind. We used this function to fit the shape + motion data with three free parameters. We evaluated the integral numerically using the Matlab function QUAD, with the Matlab function BESSELI to evaluate the Bessel function.

$m$, $s$) along which the observer has a fixed probability of making a “same” judgment for a given set of parameters $\theta$, $\sigma_m$, and $\sigma_s$, are constant values of $B$. That is, the contours are ellipses in ($m$, $s$) space whose semi-radii are $B\sigma_m$ and $B\sigma_s$ and where the value of $B$ depends on the value of $P_{\text{same}}$ along the contour. For example, to find the contour where the observer is equally likely to respond “same” or “different,” we solve Equation A6 for $B$ with $P_{\text{same}} = 0.5$. We used the Matlab function FZERO to solve this numerically.
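The numerical steps described here (integrating a density that involves $I_0$, then root-finding for the contour) can be sketched in plain Python. Everything specific in the sketch is an assumption rather than a transcription of Equation A6: it takes $M \sim N(m, \sigma_m)$ and $S \sim N(s, \sigma_s)$ with weights inversely proportional to the noise, which makes the normalized radius Rice-distributed, and it defines `a = theta * sqrt(1/sigma_m^2 + 1/sigma_s^2)` and `b = sqrt((m/sigma_m)^2 + (s/sigma_s)^2)` as one reconstruction consistent with the semi-radii $B\sigma_m$ and $B\sigma_s$ mentioned above. Simpson's rule and bisection stand in for QUAD and FZERO, and a power series stands in for BESSELI.

```python
import math


def bessel_i0(x, max_terms=200):
    """Modified Bessel function of the first kind, order 0 (power series)."""
    q = (x * x) / 4.0
    total, term = 1.0, 1.0
    for k in range(1, max_terms):
        term *= q / (k * k)
        total += term
        if term < 1e-16 * total:
            break
    return total


def p_same(m, s, theta, sigma_m, sigma_s, steps=200):
    """P("same") under the assumed model, by Simpson integration over radius r."""
    a = theta * math.sqrt(1.0 / sigma_m ** 2 + 1.0 / sigma_s ** 2)
    b = math.hypot(m / sigma_m, s / sigma_s)

    def density(r):
        exponent = (r * r + b * b) / 2.0
        if exponent > 700.0:  # the product is negligible; avoids huge I0 arguments
            return 0.0
        return r * math.exp(-exponent) * bessel_i0(r * b)

    h = a / steps  # steps must be even for Simpson's rule
    total = density(0.0) + density(a)
    for i in range(1, steps):
        total += (4.0 if i % 2 else 2.0) * density(i * h)
    return total * h / 3.0


def threshold_on_axis(target, theta, sigma_m, sigma_s, axis="motion", upper=500.0):
    """Bisection for the difference along one axis at which P("same") = target."""
    lo, hi = 0.0, upper
    for _ in range(40):
        mid = (lo + hi) / 2.0
        m, s = (mid, 0.0) if axis == "motion" else (0.0, mid)
        if p_same(m, s, theta, sigma_m, sigma_s) > target:
            lo = mid  # still too likely to say "same": move outward
        else:
            hi = mid
    return (lo + hi) / 2.0


# Illustrative parameters only (the simulation values quoted for Figures 6a-6c).
motion_50 = threshold_on_axis(0.5, 20.0, 18.0, 8.0, axis="motion")
shape_50 = threshold_on_axis(0.5, 20.0, 18.0, 8.0, axis="shape")
```

Because `p_same` depends on the stimulus only through `b`, the two axis intercepts differ by exactly the factor $\sigma_m / \sigma_s$, which is what makes the contours ellipses with that aspect ratio.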

$P_{\text{same}}$ analytically when all the weight is assigned to one cue. For example, if $w_s = 0.0$ and $w_m = 1.0$, the probability of a “same” judgment becomes

$\sigma_s$ is immaterial:

$m$, $s$) is $P_{\text{same}}$, then the probability of observing $k$ “same” responses out of $n$ trials is

$m_j$, $s_j$), $n_j$ is the total number of trials performed with those stimulus values, and $k_j$ is the number of trials on which the observer judged “same.”

$P_{\text{same}}^{j}$ is the probability of getting a “same” response given the stimulus values ($m_j$, $s_j$) used on that trial and the particular fit parameters being tested. The fit parameters were adjusted using Matlab's FMINSEARCH function until the quantity $X$ was maximized.
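A maximum likelihood fit of the two-parameter, single-cue case can be sketched as follows. Several things here are assumptions: the quantity being maximized is taken to be the binomial log-likelihood summed over stimulus conditions (with the constant binomial coefficient dropped), the single-cue probability of a “same” response is written as a difference of two cumulative Gaussians derived from $|M| < \theta$ with $M \sim N(d, \sigma)$, and a coarse grid search stands in for FMINSEARCH. All function names and the synthetic data are hypothetical.

```python
import math


def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def p_same_one_cue(d, theta, sigma):
    """P(|estimate| < theta) when the estimate ~ N(d, sigma)."""
    return phi((theta - d) / sigma) - phi((-theta - d) / sigma)


def log_likelihood(data, theta, sigma):
    """Binomial log-likelihood; data is a list of (difference, n_trials, k_same)."""
    total = 0.0
    for d, n, k in data:
        p = p_same_one_cue(d, theta, sigma)
        p = min(max(p, 1e-9), 1.0 - 1e-9)  # guard against log(0)
        total += k * math.log(p) + (n - k) * math.log(1.0 - p)
    return total


def fit_grid(data, thetas, sigmas):
    """Return the (theta, sigma) pair on the grid with the highest likelihood."""
    best = max((log_likelihood(data, t, s), t, s) for t in thetas for s in sigmas)
    return best[1], best[2]


# Synthetic data generated from known parameters (theta = 20, sigma = 15).
synthetic = [
    (d, 10000, round(10000 * p_same_one_cue(d, 20, 15)))
    for d in (0, 10, 20, 30, 40, 50)
]
theta_hat, sigma_hat = fit_grid(synthetic, range(10, 31), range(5, 26))
```

A simplex or gradient-based optimizer would replace the grid in practice; the grid is used here only to keep the sketch dependency-free.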

$w_m$ and $w_s$ are inversely proportional to $\sigma_m$ and $\sigma_s$

$\sigma_m$ and $\sigma_s$ (albeit from different groups of participants in our study). However, it seems unlikely that an observer would generally encode shape and motion cues with differing reliability across the three tasks because the stimuli are identical. We therefore assumed that the reliability of the motion and shape estimates was the same across tasks, but the weights given to each cue changed as a result of the task instruction. This is what we modeled in the simulations in Figures 6a–6c. The values of $\theta$, $\sigma_m$, and $\sigma_s$ are kept constant in these three panels, and the different predicted performance is achieved by altering the relative weights $w_m$ and $w_s$. However, despite its conceptual limitations, Equation A6 is adequate to describe our data. To check this, we fitted the shape-only data with a 2-parameter model that assumes that the observer uses only shape differences for the task (Equation A9). In this case, $w_s = 1.0$ and $w_m = 0.0$. The value of $\sigma_m$ is then immaterial, leaving only two parameters to fit: $\theta$ and $\sigma_s$. We also fitted the two parameters, $\theta$ and $\sigma_m$, to the motion-only data in which $w_m = 1.0$ and $w_s = 0.0$. In each case, although the fits were slightly lower quality, the values of $\sigma_m$ and $\sigma_s$ were similar to those obtained with the 3-parameter model. This confirms that, in the shape-only task, the value of $w_m$, which is under-constrained by the data, does not greatly affect $\sigma_s$, which is much more constrained by the data. Similarly, in the motion-only task, the precise value of $w_s$ does not greatly affect $\sigma_m$. Therefore, in the main text, we report the results of fitting the same 3-parameter model to all three data sets, and then we estimated $w_m$ and $w_s$ based on $\sigma_m$ and $\sigma_s$.