The human visual system is extremely sensitive to animate motion patterns. We quickly and efficiently detect another living being in a visual scene, and we can recognize many aspects of biological, psychological, and social significance. Human motion, for instance, contains a wealth of information about the actions, intentions, emotions, and personality traits of a person. What our visual system seems to solve so effortlessly is still a riddle in vision research and an unsolved problem in computer vision. Little is known about exactly how biologically and psychologically relevant information is encoded in visual motion patterns. This study aims to provide a general framework that can be used to address this question. The approach is based on transforming biological motion data into a representation that subsequently allows for analysis using linear statistics and pattern recognition. To demonstrate the potential of this framework, we construct a sex classifier and compare its performance with the performance of human observers that classify the same stimuli.
Some 30 years ago,
Gunnar Johansson (1973,
1976) introduced to experimental psychology a visual stimulus display designed to separate biological motion information from other sources of information that are normally intermingled with motion information. Johansson attached small point lights to the main joints of a person’s body and filmed the scene so that only the lights were visible in front of an otherwise homogeneously dark background. Using these displays, he demonstrated the compelling power of perceptual organization from biological motion of just a few light points.
A large number of studies have since used Johansson’s point-light displays. It has been demonstrated that biological motion perception goes far beyond the ability to recognize a set of moving dots as a human walker. Point-light displays contain enough information to recognize other actions as well (
Dittrich, 1993), to determine the gender of a person (
Barclay, Cutting, & Kozlowski, 1978;
Hill & Johnston, 2001;
Kozlowski & Cutting, 1977;
Mather & Murdoch, 1994;
Runeson, 1994), to recognize emotions (
Dittrich, Troscianko, Lea, & Morgan, 1996;
Pollick, Paterson, Bruderlin, & Sanford, 2001), to identify individual persons (
Cutting & Kozlowski, 1977;
Hill & Pollick, 2000), and even one’s own walking pattern (
Beardsworth & Buckner, 1981). However, whereas many studies exist that demonstrate the capability of the human visual system to detect, recognize, and interpret biological motion, there have been virtually no attempts to solve the question of how information about the moving person is encoded in the motion patterns. Only for gender recognition are there a few investigations addressing the nature of the informational content mediating this ability. In this study, we will also use gender classification of walking patterns as an example. However, the proposed framework can be generalized to solve other pattern classification problems based on biological motion.
One way to approach the question of where diagnostic information is hidden in a sensory stimulus is through psychophysical experiments. In such studies, the stimulus is manipulated along different dimensions in order to measure the effect of such manipulations on recognition performance. The first study on gender recognition from biological motion was conducted by
Kozlowski and Cutting (1977). They demonstrated that observers are able to classify point-light walkers shown in sagittal view with a performance of 63% correct recognition. Additionally, they introduced a number of manipulations: increased or reduced arm swing amplitudes, unnaturally fast or slow walking speeds, and occlusion of either the lower or the upper part of the body. All manipulations considerably reduced recognition performance. With unnatural arm swings, performance dropped almost to chance level. Showing only the lower body impaired recognition to a larger extent than showing only the upper body. None of the manipulations shifted perception in a defined direction, making the percept either more male or more female. Only for the speed manipulation did there seem to be a trend to perceive fast walkers as more female, which, however, did not reach statistical significance.
Barclay et al. (1978) conducted a similar study investigating the influence of four different parameters. The initial experiment focused on the influence of exposure duration. The results show that two complete gait cycles are required to determine gender from biological motion. Shorter exposure times result in reduced performance. In a second experiment, speed was altered, but rather than recording different walking speeds from the model walkers as in
Kozlowski and Cutting’s (1977) study, they used a single recording of a walker at his most comfortable walking speed and presented this stimulus at different playback speeds. This manipulation had a strong effect: gender recognition was almost at chance level. The third manipulation consisted of blurring the discrete dots of the point-light walker to such an extent that the walker appeared as a single blob that changed shape during walking. This, too, reduced gender recognition performance to chance level. Finally, the authors tested gender recognition with walkers presented upside-down. Interestingly, in this case, recognition performance dropped significantly below chance. An inverted female walker tended to be perceived as a man, and an inverted male walker tended to be perceived as a woman. Whereas all other manipulations only resulted in a general decrease in recognition performance, inversion of a point-light walker clearly induced defined shifts in perceived gender.
Barclay et al. (1978) proposed that their finding was due to the fact that the ratio of shoulder width to pelvis width differs between men and women. Men tend to have wider shoulders than hips, whereas this ratio is reversed in women. If, upon inversion, the walker’s shoulders are seen as if they were hips and the hips are seen as if they were shoulders, then observers’ responses would reverse with respect to the true gender of the walker. Given this scenario, the question remains how shoulder and hip width could be measured. Because the walker was presented in side view, neither shoulder nor hip width could be determined directly from the stimulus. However, due to a torsional twist of the upper body, both the shoulders and the hips perform elliptical motions in the sagittal plane. The amplitude of those ellipses depends on the widths of the shoulders and pelvis and, therefore, may have provided a diagnostic cue.
If the extent of movement at the shoulder and the hip is an important cue for gender recognition, artificial walkers that differ only in those attributes should be classified accordingly.
Cutting (1978a,
1978b,
1978c) developed a generative model of human gait and showed that this is indeed the case. The isolated cue apparently provided diagnostic information about a walker’s gender.
However, biological motion contains more information that can be used for gender classification. In principle, biological motion provides two sources of information: one is motion-mediated structural information, and the second is truly dynamic information. In contrast to a static frame of a point-light walker, motion reveals the articulation of the body. Setting a point-light walker into motion immediately uncovers which segments are rigid, where the joints are located, and, therefore, the lengths of the connecting segments. The resulting information is structural, static information about the geometry of the body. Motion is only needed as a medium to obtain this information and could be replaced by other cues. A static view of a point-light walker in which the connections are explicitly drawn (a stick figure), combined with information to disambiguate the 2D projection (e.g., using stereo displays), would, in principle, provide the same information.
In addition to motion-mediated structural information, biological motion also contains truly dynamic information. The amplitude and velocity of the arm swing or the torsion of the trunk are simple examples of information that is clearly different from structural information. It should be noted, however, that although they represent two different sources of information, structural and dynamic information might not be independent. The amplitude of the elliptical motion of the shoulders and hips as a function of the respective widths, as discussed above, provides an illustrative example of this fact.
The role of motion-mediated structural information and dynamic cues for gender recognition from biological motion was explicitly addressed in a series of experiments conducted by
Mather and Murdoch (1994). The static cue they concentrated on was the ratio of the width of the hip and the width of the shoulder. The dynamic cue that was manipulated differed from the one used by
Cutting (1978b). Whereas Cutting emphasized differences in motion of hips and shoulders in the sagittal plane, Mather and Murdoch focused on differences in lateral body sway. Men show a larger extent of lateral sway of the upper body than women do (
Murray, Kory, & Sepic, 1970).
Mather and Murdoch (1994) generated stimuli that showed artificial point-light walkers with well-defined structural measures (shoulder and hip width) and well-defined dynamic cues (lateral sway of shoulder and hip). The walkers were shown from different viewing angles and subjects had to indicate the perceived gender. By setting structural and dynamic cues into conflict, the authors showed that the dynamic cue clearly dominated the structural cue.
In summary, the different studies on gender recognition from biological motion show that information about a walker’s gender is not a matter of a single feature.
Barclay et al. (1978), as well as
Cutting (1978a), identified the elliptical motion of the shoulders and hips in the sagittal plane as an important cue to gender.
Mather and Murdoch (1994) focused on the extent of lateral body sway.
Kozlowski and Cutting (1977) showed that seeing only parts of the body can provide enough information about a walker’s gender to yield classification performance above chance. Gender recognition appears to be a complex process with a holistic character that takes into consideration hints and cues that are distributed over the whole display and that are carried both by motion-mediated structural information and by pure dynamics. Other studies employing different tasks confirm the holistic nature of biological motion perception (
Bertenthal & Pinto, 1994;
Lappe & Beintema, 2002).
Most of the studies summarized above investigated particular stimulus properties that were suspected of carrying information for gender discrimination. The role of such properties for gender recognition was then scrutinized by means of psychophysical experiments. In this study, we chose a different approach to the question of how information is encoded in biological motion patterns: we treat the problem as a pattern-recognition problem. With no a priori assumptions about possible candidate cues, we attempted to construct a linear classifier that can discriminate male from female walking patterns. We can then, in turn, scrutinize the classifier to determine which cues it uses. The cues may be simple features or complex holistic cues that are described in terms of correlation patterns between different parts and motions of the body. Any attribute, or combination of attributes, that changes when moving along an axis perpendicular to the separation plane defining the classifier is diagnostic for gender classification. Attributes that change while moving within the separation plane do not contribute any information to the gender classification problem.
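The geometry of such a classifier can be made concrete with a short sketch. The Fisher linear discriminant below is an illustrative stand-in for a generic linear classifier, trained on synthetic three-dimensional “walker features” rather than real motion data; the hyperplane normal w points along the diagnostic direction, while directions within the separation plane carry no gender information.

```python
import numpy as np

def fisher_discriminant(X_a, X_b):
    """Fit a linear classifier; the normal w of the separating
    hyperplane points along the direction of diagnostic change."""
    mu_a, mu_b = X_a.mean(axis=0), X_b.mean(axis=0)
    # Pooled within-class covariance
    Sw = np.cov(X_a, rowvar=False) + np.cov(X_b, rowvar=False)
    w = np.linalg.solve(Sw, mu_a - mu_b)   # hyperplane normal
    b = -w @ (mu_a + mu_b) / 2             # threshold at the midpoint
    return w, b

# Synthetic 3-D "walker features": only the first dimension is diagnostic
rng = np.random.default_rng(0)
males = rng.normal([1.0, 0.0, 0.0], 0.3, size=(50, 3))
females = rng.normal([-1.0, 0.0, 0.0], 0.3, size=(50, 3))
w, b = fisher_discriminant(males, females)

# Classify by the sign of the projection onto the normal
predict = lambda x: "male" if x @ w + b > 0 else "female"
```

Scrutinizing the trained classifier then amounts to inspecting w: components with large magnitude mark the attributes that change along the diagnostic axis.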
A prerequisite for constructing a linear classifier for gender discrimination, or for other stimulus features of human motion, is a data structure within which linear operations are meaningfully applicable. The problem is similar to attempts to construct linear models of classes of images. In the domain of object recognition and human face recognition, such representations have been termed “linear object classes” (
Vetter, 1998;
Vetter & Poggio, 1997) or “morphable models” (
Giese & Poggio, 2000;
Jones & Poggio, 1999;
Shelton, 2000). The latter term expresses the fact that the linear transition from one item to another represents a well-defined smooth metamorphosis between the items. Another term that has been used for the same class of models in the context of human face recognition is “correspondence-based representations” (
Troje & Vetter, 1998;
Vetter & Troje, 1997). This term focuses on morphable models’ reliance on establishing correspondence between features across the data set, resulting in a separation of the overall information into range-specific information on the one hand and domain-specific information on the other (
Ramsay & Silverman, 1997).
Linear techniques for describing human motion data have been employed in a number of studies, both in computer vision and in animation. Some of these techniques focus on recognition of actions and blending between actions. Others concentrate on the recognition and generation of emotion and other stylistic features within a set of instances of an action. In the context of this study, we define an action as a set of motion instances that are structurally similar. Extrapolating
Alexander’s (1989) definition of a gait, we define an
action as a pattern of motion characteristics described by quantities of which one or more change discontinuously at transitions to other actions. Instances of the same action can be smoothly transformed into each other, with all transitions being valid representations of the particular action. The definition implies structural similarity between instances of the same action and, therefore, a means to define correspondence in space and time between two or more instances in a canonical and unambiguous way. Systematic differences between motion instances of an action are referred to as
styles. Styles can correspond to emotions, personality traits, or biological features, such as age or gender. According to the above definitions, the stylistic state space of an action is expected to be continuous and therefore defines smooth transitions between all instances of an action. Warping between actions, in contrast, requires the definition of additional constraints in order to achieve unambiguous correspondence.
Most of the existing systems for recognition, classification, synthesis, and editing of biological motion are based on data representations with a continuous smooth behaviour. A number of different techniques have been used to achieve this behaviour.
Brand and Hertzmann’s (2000) “style machines” are based on a hidden Markov model, that is, a probabilistic finite-state machine consisting of a set of discrete states, state-to-state transition probabilities, and state-to-signal emission probabilities (see also
Wilson & Bobick, 1995).
Rose, Bodenheimer, and Cohen (1998) presented a model using radial basis functions and low-order polynomials that provides both blending between actions and interpolation within stylistic state spaces. A number of models are based on frequency-domain manipulations. Fourier techniques (
Davis, Bobick, & Richards, 2000;
Davis, 2001;
Unuma, Anjyo, & Takeuchi, 1995;
Unuma & Takeuchi, 1993) are suitable for periodic motions, such as locomotion patterns. Multiresolution filtering (
Bruderlin & Williams, 1995) applies to a wider spectrum of movements but is restricted to modifying and editing existing motions, rather than creating new motions through interpolation between existing ones. If the latter is required, multiresolution filtering has to be combined with time-warping techniques (
Witkin & Popovic, 1995). Time warps are required to align corresponding signal features in time. Depending on the complexity of the action, time warps are parameterized in terms of simple uniform scaling and translation (e.g.,
Wiley & Hahn, 1997;
Yacoob & Black, 1997), by using nonlinear models, such as B-splines (
Ramsay & Li, 1989;
Ramsay & Silverman, 1997), or by fitting nonparametric models by means of dynamic programming (
Bruderlin & Williams, 1995;
Giese & Poggio, 1999, 2000).
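For periodic gait data, the simplest of these time warps, uniform scaling, amounts to resampling each sequence to a common number of frames so that corresponding signal features line up. The sketch below illustrates this with a hypothetical one-dimensional trajectory; real motion data would have one column per marker coordinate.

```python
import numpy as np

def uniform_time_warp(trajectory, n_out):
    """Uniform-scaling time warp: resample a (frames x dims)
    trajectory to n_out frames via linear interpolation."""
    n_in = trajectory.shape[0]
    t_in = np.linspace(0.0, 1.0, n_in)    # normalized time of input
    t_out = np.linspace(0.0, 1.0, n_out)  # normalized time of output
    return np.column_stack([np.interp(t_out, t_in, trajectory[:, d])
                            for d in range(trajectory.shape[1])])

# The same gait cycle recorded at two different frame counts
fast = np.column_stack([np.sin(np.linspace(0, 2 * np.pi, 60))])
slow = np.column_stack([np.sin(np.linspace(0, 2 * np.pi, 120))])
aligned = uniform_time_warp(slow, 60)  # now comparable frame by frame
```

After this alignment, frame-by-frame linear operations (averaging, interpolation between walkers) become meaningful; nonperiodic actions would instead require one of the nonlinear warps cited above.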
The dimensionality of the resulting linear spaces does not necessarily reflect the number of degrees of freedom within the set of represented data. Some of the above-cited techniques therefore use principal components analysis (PCA) to reduce the dimensionality to a degree that stands in a reasonable relation to the size of the available data set. PCA can be used on different levels. For instance,
Yacoob and Black (1997) apply PCA to a set of “atomic activities,” which are registered in time and then represented by concatenating all measurements (joint angles) of all frames of the sequence.
Ormoneit, Sidenbladh, Black, and Hastie (2000) use a similar approach (see also
Bobick, 1997;
Ju, Black, & Yacoob, 1996;
Li, Dettmer, & Shah, 1997).
Rosales and Sclaroff (2000) apply PCA to a set of postures, each posture being represented only by measurements of a single frame.
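A posture-level PCA of this kind can be sketched in a few lines. The toy dimensions below (15 markers, hence 45 coordinates per frame) and the synthetic rank-2 walking data are illustrative assumptions, not the representation used by any of the cited systems.

```python
import numpy as np

def posture_pca(postures, n_components=4):
    """PCA on a (frames x coordinates) matrix, treating each
    single-frame posture as one data point."""
    mean = postures.mean(axis=0)
    centered = postures - mean
    # SVD yields the principal axes without forming a covariance matrix
    U, S, Vt = np.linalg.svd(centered, full_matrices=False)
    components = Vt[:n_components]       # principal "eigenpostures"
    scores = centered @ components.T     # per-frame coefficients
    return mean, components, scores

# Toy data: 100 frames of a 15-marker (45-D) walker lying in a 2-D subspace
rng = np.random.default_rng(1)
t = np.linspace(0, 4 * np.pi, 100)
basis = rng.normal(size=(2, 45))
postures = np.outer(np.sin(t), basis[0]) + np.outer(np.cos(t), basis[1])
mean, comps, scores = posture_pca(postures, n_components=2)

# Two components suffice to reconstruct the 45-D sequence
recon = mean + scores @ comps
```

Each frame is thus summarized by a handful of coefficients, and the whole sequence becomes a low-dimensional trajectory through posture space.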
Linear motion models have been applied to a number of different problems, such as motion editing (
Brand & Hertzmann, 2000;
Bruderlin & Williams, 1995;
Gleicher, 1998;
Guo & Roberge, 1996;
Wiley & Hahn, 1997), retargeting motion from one character to another (
Gleicher, 1998), tracking a human figure from video data (
Ju et al., 1996;
Ormoneit et al., 2000;
Rosales & Sclaroff, 2000), recognizing activities (
Yacoob & Black, 1997), speech (
Li et al., 1997) or gait patterns (
Giese & Poggio, 2000). Giese and Poggio’s model, which is in many respects similar to ours, is able not only to discriminate between different gaits (running and walking), but also to discriminate limping from walking. Whereas running and walking have to be considered two different actions according to the above definition, limping and walking are two styles of the same action. Apart from this work,
Davis’ (2001) work on visual categorization of child and adult walking styles is the only one we are aware of that applies linear motion modelling to the recognition of stylistic aspects within an action.
Although linear motion models have become common within the animation and computer vision communities, only a few studies have used such models in psychological research on motion perception. An exception is the work by Hill, Pollick, and colleagues (
Hill & Pollick, 2000;
Pollick, Fidopiastis, & Braden, 2001). Both studies show that extrapolations in linear motion spaces are perceived as caricatured instances that are recognized even better than the original sequences. The results imply that the topology of perceptual spaces used for biological motion recognition is similar to the one implicit in artificial linear motion spaces that are based on a distinction between range-specific information on the one hand and domain-specific information on the other hand.
Our approach to linearize human walking data employs many of the techniques summarized above. Starting with motion capture data from a number of human subjects, we first reduce the dimensionality of each subject’s set of postures using PCA in a way similar to that described by
Rosales and Sclaroff (2000). This results in a low-dimensional space spanned by the first few eigenpostures. As postures change during walking, the corresponding coefficients change sinusoidally. The temporal behaviour of the sequence is well described by simple sine functions, and the decomposition becomes very similar to previous work on Fourier decomposition of walking data (
Unuma et al., 1995). The eigenposture approach, however, is more general because it is not tied to the frequency domain and can thus be used for nonperiodic motions as well. The only difference is that the time warping, which reduces to simple uniform scaling in the case of our walking data, would then have to be parameterized using a more complex model.
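The sinusoidal behaviour of the eigenposture coefficients can be captured by an ordinary least-squares fit on a sine–cosine basis at the fundamental gait frequency. The snippet below is a minimal sketch using a synthetic coefficient time series; the frequency and amplitude are arbitrary illustrative values, not measurements.

```python
import numpy as np

def fit_sinusoid(signal, freq, t):
    """Least-squares fit of a*sin(2*pi*f*t) + b*cos(2*pi*f*t) + c
    to a single eigenposture-coefficient time series."""
    A = np.column_stack([np.sin(2 * np.pi * freq * t),
                         np.cos(2 * np.pi * freq * t),
                         np.ones_like(t)])
    coeffs, *_ = np.linalg.lstsq(A, signal, rcond=None)
    return coeffs, A @ coeffs  # (a, b, c) and the fitted curve

# A noiseless synthetic coefficient oscillating at 1 Hz with a phase shift
t = np.linspace(0.0, 2.0, 200, endpoint=False)
signal = 0.8 * np.sin(2 * np.pi * 1.0 * t + 0.3)
coeffs, fitted = fit_sinusoid(signal, 1.0, t)
```

The fitted amplitudes and phases, together with the eigenpostures themselves, then give a compact linear description of a walker on which classifiers can operate.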
Based on the outlined linearization of biological motion data, we are primarily interested in recognizing and characterizing stylistic features within an action. The action we are using is human walking. The stylistic variations we are investigating are the differences relating to the walker’s gender. The aim of this study is twofold. First, we want to quantitatively characterize the differences in walking style between men and women. We test the success of our approach in terms of a linear classifier operating on the proposed linear representation of a set of human walking data. Second, we compare the performance of the linear classifier with the performance of human observers in a gender classification task. By depriving both the linear classifier and our human observers of parts of the information contained in the walking patterns, we want to find out which aspects of the stimulus are diagnostic and relevant for solving the gender classification task.