We present a three-mode expressive-feature model for recognizing gender (female, male) from point-light displays of walking people. Prototype female and male walkers are initially decomposed into a subspace of their three-mode components (posture, time, and gender). We then apply a weight factor to each point-light trajectory in the basis representation to enable adaptive, context-based gender estimations. The weight values are automatically learned from labeled training data. We present experiments using physical (actual) and perceived (from perceptual experiments) gender labels to train and test the system. Results with 40 walkers demonstrate greater than 90% recognition for both physically and perceptually labeled training examples. The approach is more flexible than standard squared-error gender estimation and can adapt to different matching contexts.

*is* female/male). An alternate context for the system could be to recognize the “perceived” gender of the walker (*appears* female/male). For example, a female walker could be consistently perceived by several observers to have a male-like gait pattern (appearance vs. truth). Both of these contexts may have particular applied relevance. For automatic visual surveillance, recognizing the physical gender is of most concern, whereas a model of the perceived gender is most important for computer animation tools to give the best appearance of gender. Models of perceived gender are also important for studying how humans discriminate gender. As different expressive weights may be required for different recognition contexts (physical or perceived gender), the approach is designed to automatically learn the weight values for a specific context of labeled training data.

**Z** (see Figure 1a), with the rows in each frontal plane/matrix **Z**_{k} composed of the point-light trajectories (segmented and normalized to a fixed length) for a particular style index *k*. The matrix data for each variation *k* could alternatively be rasterized into a column vector and placed into an ordinary two-mode matrix (each column a style example), but this simply ignores the underlying three-mode nature of the data (posture, time, and style).

**Z** (see Figure 1a). For gender recognition, we consider the style dimension as a binary gender mode of FEMALE or MALE. We begin by reducing the cube **Z** to a prototype data cube using two walking-sequence matrices that represent the average female and average male walking styles. Each prototype is constructed by averaging multiple walking examples of each gender class. To ensure proper alignment when averaging, one walk cycle of each example is extracted (at the same walking phase), height-normalized, and time-normalized to a specific duration *N*. Further details of this preprocessing stage are presented later.

With *M* point-light trajectories per person and *N* frames in the sequence (time-normalized walk cycle), each prototype is represented as a trajectory matrix of size *M* × *N*. We then subtract the prototype mean from the two gender prototype matrices and place them into the first and second (last) frontal planes of the prototype cube (see Figure 1b). The dimensionality of the prototype cube is therefore *M* × *N* × 2.
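
The construction of the prototype cube can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used in the experiments: the function name `build_prototype_cube`, the synthetic input data, and the choice of the prototype mean as the average of the two gender prototypes are all assumptions made for the example.

```python
import numpy as np

def build_prototype_cube(female_walks, male_walks):
    """Average each gender class and stack the mean-subtracted prototypes.

    female_walks, male_walks: arrays of shape (num_examples, M, N), where each
    example is an already aligned and normalized M x N trajectory matrix.
    """
    female_proto = np.mean(female_walks, axis=0)        # average female walker, M x N
    male_proto = np.mean(male_walks, axis=0)            # average male walker, M x N
    # Assumption: the prototype mean is the average of the two prototypes.
    proto_mean = 0.5 * (female_proto + male_proto)
    cube = np.stack([female_proto - proto_mean,         # first frontal plane
                     male_proto - proto_mean], axis=2)  # second (last) frontal plane
    return cube, proto_mean

# Synthetic stand-in data: 20 walkers per class, M = 26 trajectories, N = 50 frames.
rng = np.random.default_rng(0)
M, N = 26, 50
female = rng.normal(size=(20, M, N))
male = rng.normal(size=(20, M, N)) + 0.5
cube, proto_mean = build_prototype_cube(female, male)
print(cube.shape)
```

Under this assumed definition of the mean, the two frontal planes are exact negatives of each other, which is why a single gender axis is enough to separate them.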

**P**, **T**, and **G** that span the column (posture), row (time), and slice (gender) dimensions of the cube (see Figure 1c). The core **C** is a matrix that represents the complex relationships of the components in **P**, **T**, and **G** for reconstructing the prototype cube. The desired column and row spaces can be found using SVD. We outline the technique in .

The posture basis **P** is able to represent any body posture (of point-lights) at any particular time for either gender prototype (i.e., column basis for the prototype cube). The time basis **T** represents any temporal trajectory (of any point-light) for either gender prototype (i.e., row basis for the prototype cube). Lastly, the gender basis **G** represents the gender-related changes between the two prototypes for any posture at any particular time (i.e., slice-line basis for the prototype cube).

The core **C** can then be solved by simply re-arranging Equation 1. Note that **C** need not be diagonal, as is required in two-mode PCA/SVD. Related methods for solving this three-mode factorization can be found in Kroonenberg and Leeuw (1980) and Vasilescu and Terzopoulos (2002).

*i*, *j* correspond to the elements in the respective posture and time dimensions (1 ≤ *i* ≤ *M* and 1 ≤ *j* ≤ *N*).

**Y** (already mean-subtracted with the model) can be estimated by finding the value of ĝ that minimizes the sum-of-squared-error (SSE) reconstruction. Setting the derivative of this error to zero and re-arranging the equation, the resulting gender parameter ĝ is given by the normalized projection of **Y** onto the basis. The final gender can be assigned by examining the sign of ĝ, choosing FEMALE if it is negative and MALE if positive (i.e., selecting the nearest gender prototype).
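
The SSE estimate described above reduces to a one-line least-squares projection. The sketch below is illustrative only: `B` is a hypothetical *M* × *N* matrix standing in for the gender-related pattern reconstructed from the three-mode basis, its sign convention (male positive) is assumed, and the function names are not taken from the paper.

```python
import numpy as np

def sse_gender_parameter(Y, B):
    """Return the g that minimizes sum((Y - g * B) ** 2).

    Setting the derivative with respect to g to zero gives the normalized
    projection <Y, B> / <B, B>.
    """
    return float(np.sum(Y * B) / np.sum(B * B))

def gender_label(g):
    """Negative values map to FEMALE, positive values to MALE.

    A value of exactly zero falls to FEMALE here, an arbitrary tie-break.
    """
    return "MALE" if g > 0 else "FEMALE"

rng = np.random.default_rng(1)
B = rng.normal(size=(26, 50))   # hypothetical M x N gender pattern
Y = -0.8 * B                    # synthetic mean-subtracted walker on the female side
g_hat = sse_gender_parameter(Y, B)
print(round(g_hat, 3), gender_label(g_hat))
```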

*MN* × 2 matrix (each column is a rasterized gender prototype), performing a standard two-mode PCA, and estimating the gender parameter for a new walker by computing and thresholding its projection coefficient. The three-mode formulation (Equation 10), however, enables us to easily embed tunable weight factors (on trajectories) to influence the estimation of the gender parameter.

*M* point-light trajectories. The new expressive gender parameter estimation applies one weight to the contribution of each trajectory. As the denominator in Equation 13 is a constant for a given set of factors, we fold this term into the final “expressive weights” in Equation 14. If each expressive weight in Equation 14 is set to the same uniform value, the resulting gender parameter estimation reverts to the previous SSE method (Equation 10). However, with non-uniform values for the expressive weights, the approach is capable of producing other non-SSE gender estimations according to a specific recognition context.
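
A small numerical sketch may help show how per-trajectory weights generalize the SSE estimate. The form used here, a weighted sum of row-wise projections, is an assumption consistent with the description above rather than a transcription of Equation 14. The matrix `B` and the function name are hypothetical.

```python
import numpy as np

def expressive_gender_parameter(Y, B, weights):
    """Weighted sum of the per-trajectory projections of Y onto B."""
    per_trajectory = np.sum(Y * B, axis=1)   # one projection value per point-light trajectory
    return float(np.dot(weights, per_trajectory))

rng = np.random.default_rng(2)
M, N = 26, 50
B = rng.normal(size=(M, N))   # hypothetical gender pattern from the three-mode basis
Y = rng.normal(size=(M, N))   # synthetic mean-subtracted walker

# Uniform weights equal to 1 / <B, B> fold the SSE denominator into the weights.
uniform = np.full(M, 1.0 / np.sum(B * B))
g_uniform = expressive_gender_parameter(Y, B, uniform)
g_sse = float(np.sum(Y * B) / np.sum(B * B))

# Non-uniform weights: here only the first trajectory contributes.
selective = np.zeros(M)
selective[0] = 1.0
g_selective = expressive_gender_parameter(Y, B, selective)
print(g_uniform, g_sse, g_selective)
```

With uniform weights the two estimates coincide, while any other weight vector changes which trajectories drive the sign of the result.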

*K* different training examples. To solve for the expressive weights in Equation 16, we employ a fast iterative gradient descent algorithm (Burden & Faires, 1993), given in Equation 17, with the gradients computed over the *K* training examples. The learning rate *η* is re-computed at each iteration (via interpolation of the error function; Burden & Faires, 1993) to yield the best incremental update.
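
The learning step can be illustrated with a short steepest-descent sketch. Several assumptions are made here and should not be read as the paper's exact procedure: the matching error is taken to be a squared difference between the estimated gender parameter and a target of −1 (FEMALE) or +1 (MALE), standing in for Equation 16. The step size is found with a closed-form exact line search, which replaces the interpolation-based rule and is valid only because this assumed error is quadratic. The matrix `A` of per-trajectory projections is synthetic.

```python
import numpy as np

def learn_expressive_weights(A, targets, num_iters=1500):
    """Steepest descent on E(w) = sum_k (targets[k] - A[k] @ w) ** 2.

    A: K x M matrix, row k holds the per-trajectory projections of example k.
    targets: length-K vector of desired gender values (-1 FEMALE, +1 MALE).
    """
    w = np.zeros(A.shape[1])
    for _ in range(num_iters):
        residual = targets - A @ w
        grad = -2.0 * A.T @ residual          # gradient of E summed over the K examples
        direction = A @ grad
        denom = direction @ direction
        if denom < 1e-12:                     # gradient has vanished, nothing left to do
            break
        eta = (grad @ grad) / (2.0 * denom)   # exact minimizer of E along -grad
        w = w - eta * grad
    return w

rng = np.random.default_rng(3)
K, M = 39, 26
A = rng.normal(size=(K, M))                   # synthetic projections
targets = np.where(rng.normal(size=K) > 0, 1.0, -1.0)
w = learn_expressive_weights(A, targets)
error_before = float(np.sum(targets ** 2))    # error at the zero starting weights
error_after = float(np.sum((targets - A @ w) ** 2))
print(error_before, error_after)
```

Re-computing the step size at every iteration, as in this sketch, avoids hand-tuning a fixed learning rate and guarantees that the assumed error never increases from one iteration to the next.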

*Z*-axis). The center-of-rotation (root location) for each person was selected as the average center position between the hips throughout the walking sequence. The angle of rotation was computed such that the average root orientation was facing directly forward.

*N* = 50 frames to avoid under-sampling (longest cycle sequence of the walkers was 44 frames at 30 FPS).

*x*(*t*) by distributing the error throughout the trajectory, where *t* denotes the frame number (1 ≤ *t* ≤ *N*). The approach distributes the discontinuity using small shifts throughout the trajectory to align the starting and ending positions without the loss of high-frequency information. This simple, yet effective, method produces seamless walking cycles without noticeable visual distortion (other Fourier-based techniques could also be employed).
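
One simple way to distribute an end-point discontinuity is to subtract a linear ramp that grows from zero at the first frame to the full gap at the last frame. The sketch below uses that scheme as an assumption, since the exact correction formula is not recoverable here. The function name `close_cycle` and the test trajectory are hypothetical.

```python
import numpy as np

def close_cycle(x):
    """Shift each frame slightly so the last frame matches the first.

    x: trajectory values with time along the last axis (shape (N,) or (M, N)).
    The gap x[N-1] - x[0] is removed in N - 1 equal increments, so every
    frame-to-frame difference changes by the same small constant.
    """
    x = np.asarray(x, dtype=float)
    n_frames = x.shape[-1]
    ramp = np.arange(n_frames) / (n_frames - 1)   # 0 at the first frame, 1 at the last
    gap = x[..., -1:] - x[..., :1]                # start/end mismatch per trajectory
    return x - gap * ramp

# Synthetic one-cycle trajectory with a small drift that breaks periodicity.
t = np.arange(50)
x = np.sin(2.0 * np.pi * t / 49.0) + 0.002 * t
closed = close_cycle(x)
print(x[-1] - x[0], closed[-1] - closed[0])
```

Because the correction is a straight line, it leaves the frame-to-frame detail of the motion essentially untouched, which matches the stated goal of preserving high-frequency information.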

*N* = 50 frames) appear abnormally slow if rendered at 30 FPS, we used a slightly faster rendering speed of 36 FPS determined from the longest natural cycle time of the walkers (1.4 s). Each sequence was looped continuously while presented to the observer. The height of each point-light walker was scaled to 70% of the screen resolution height (1280 × 1024 resolution, with 20-in. diagonal viewable monitor). The root location of each walker was randomly positioned within a small circle at the center of the screen (with radius 10% of the screen resolution height). These display parameters were used to prevent any explicit position or size comparison between the walkers. The point-light display was generated using C++ and OpenGL with anti-aliasing. Each observer was seated approximately 60 cm from the monitor, which corresponded to a visual angle of approximately 20 deg for the height of the point-light figure.

*t*(39) = 5.52, *p* < .001). Previous experiments employing a frontal view of walkers (as in this experiment) reported rates of 64% (Hirashima, 1999) and 76% (Troje, 2002).

*s* and hip *h* in the first image of each walker sequence. We note that there could, however, be more discriminative structural information in later frames (though it should not change drastically at the frontal view). The average shoulder-hip ratio *s/h* was 1.71 ± .26 for females and 1.92 ± .14 for males, and the two were significantly different (two-tailed *t* test: *t*(38) = 3.14, *p* < .01). The differences in our shoulder-hip ratios with previous measurements (Cutting et al., 1978) are likely due to the inward placement of the front and back hip markers on the body (not at the maximal hip width). The two hip point-lights (left, right) were created by averaging the back and front hip markers on each side. Therefore the calculated hip width would be shorter than the actual hip width (thus increasing the shoulder-hip ratio).

*r* = .34) suggests that this factor alone does not account for the perceptual gender choices. We also computed the center-of-moment for the walkers, using *C*_{m} = *s*/(*s* + *h*). Even though significantly different for females and males (two-tailed *t* test: *t*(38) = 3.04, *p* < .01), its correlation with the gender consistency values was also low (*r* = .34).

*t*(39) = 2.62, *p* < .05). Many of the walkers were ambiguous to label given one static frame, yet walkers #2, #32, and #37 were correctly recognized at 87%, 87%, and 93%, respectively.

*t* test: *t*(78) = 2.62, *p* < .05), with the static recognition rate almost 10% lower than achieved with the dynamic displays. We also calculated the absolute value of the difference between the static and dynamic consistency values (see Figure 5b). A large difference magnitude close to 2 for a walker indicates a strong gender inconsistency between the static and dynamic cases, and a value close to 0 indicates that the two stimuli provided a similar (strong or weak) gender consistency. Interestingly, several walkers (e.g., #7, #10, #15, and #29) had fairly strong gender conflicts between the static and dynamic cases. Overall, the dynamic stimuli appear to give more gender-related information than the single frame case.

*is* female/male) and the perceived label (*appears* female/male).

**P** and **T** basis sets (posture and time) from 50% to 95% (in 5% increments). For example, an “85% modal fit” means that we accumulate the top basis vectors in the posture basis **P** until 85% of the variance in the data is captured. We apply the same criterion for the basis **T**. The gender basis **G** remains fixed to .
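
The modal-fit criterion amounts to a cumulative-variance threshold on the singular values of a basis. The following sketch assumes that the variance captured by each basis vector is proportional to its squared singular value. The function name and the example singular values are hypothetical.

```python
import numpy as np

def num_components_for_fit(singular_values, modal_fit):
    """Smallest number of leading basis vectors capturing `modal_fit` of the variance.

    singular_values: singular values in descending order.
    modal_fit: target fraction in (0, 1], for example 0.85 for an 85% modal fit.
    """
    variance = np.asarray(singular_values, dtype=float) ** 2
    cumulative = np.cumsum(variance) / np.sum(variance)
    # Index of the first cumulative fraction that reaches the target.
    count = int(np.searchsorted(cumulative, modal_fit)) + 1
    return min(count, len(variance))          # guard against round-off at modal_fit = 1

# Example: variances 9, 4, 1 give cumulative fractions of about 0.64, 0.93, 1.0.
example = [3.0, 2.0, 1.0]
for fit in (0.50, 0.85, 0.95):
    print(fit, num_components_for_fit(example, fit))
```

Applying the same function to the singular values of the posture and time flattenings would give the number of columns to retain in **P** and **T** at a chosen modal fit.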

**P** and **T**, we constructed 40 different models, each using 39 training examples by leaving one different example out of the set. For each model (39 examples at a particular modal fit), we created the gender prototypes, computed the three-mode PCA for the prototypes, and ran the learning algorithm to compute the expressive weights (examples labeled with their true gender). We empirically selected a limit of 1,500 iterations for the gradient descent learning algorithm as it provided satisfactory convergence of the expressive weights for our data set (in both recognition contexts). The training error for the model was computed by examining the sign of the computed gender parameter value for each of its 39 labeled training examples (−: FEMALE, +: MALE). The testing (validation) error for the model was similarly computed using only the single left-out example.

**P** and **T** were computed at the selected modal fit (75%), and were of dimension 26 × 3 and 50 × 3, respectively (the core **C** was therefore of size 3 × 3). The resulting three-mode PCA captured 98% of the overall data variance in the two gender prototypes. The expressive weights for this model were generated by averaging the 40 sets of expressive weights computed at the selected cross-validation modal fit (75%). We show the average expressive weights ±1 *SD* in Figure 7. Some weights were zero, signifying that they were not relevant to the gender assignments. The larger magnitude weights appear to have a significant deviation across the 40 leave-one-out sets, showing the impact of the singular left-out examples. However, as previously mentioned, it is difficult to assign any high-level interpretation to the larger magnitude weights. The mapping of the 26 weights to the point-lights is shown in Figure 8.

*generalized* set of expressive weights (averaged from the 40 cross-validation models at the selected modal fit). Both of these factors enabled the expressive model to achieve a smaller recognition error than with SSE.

**P** and **T** (at 80% modal fit) were of dimension 26 × 4 and 50 × 4, respectively (the core **C** was therefore of size 4 × 4). The resulting three-mode PCA captured 98% of the overall data variance in the two gender prototypes. The expressive weights were generated by averaging the 40 sets of expressive weights computed at the 80% cross-validation fit. We show the average weights ±1 *SD* in Figure 11. As before, some weights are zero, and we also see a larger variation in the higher magnitude weights. As the training data for the physical and perceptual labels are in fact different, we expect the resulting weights to also be different. There is an unexplained asymmetry in the arms, though this may be mostly due to the variation in the cross-validation weights. However, we do not yet have a high-level interpretation of the cause for the weight differences between the two contexts.

*r* = .89 (SSE correlation was *r* = .69).

*K* training examples and their assigned perceptual genders, we slightly alter the previous matching error function (Equation 16) by using their consistency magnitudes *ω*_{k} to bias the minimization procedure to those examples having more reliable matches across the observers. The corresponding perceptual gradient is then used, as before, in the gradient descent procedure (Equation 17) to determine the appropriate expressive weights for the perceptually-labeled walkers.

*is* female/male). Perceptual labeling assigns genders resulting from a perceptual classification task to attain the observed gender (*appears* female/male). The labeled training data are used in a gradient descent learning algorithm to solve for the expressive weight values needed to bias the model estimation of gender to the desired training values. Instead of matching a new walker to several examples for recognition, our expressive model is used to directly compute a gender value/label for the walker.

*m* × *n* matrix **A** into **A** = **U**Σ**V**^{T}, where **U** and **V** are orthonormal matrices and Σ is diagonal with *r* singular values *σ*_{1},…, *σ*_{r}. The columns of **U** correspond to a column space of **A**, where any column of **A** can be formed by a linear combination of the columns in **U**. Similarly, the rows of **V**^{T} correspond to a row space of **A**, where any row in **A** can be constructed by a linear combination of the rows in **V**^{T}.

**P**, **T**, and **G** that span the column (posture), row (time), and slice (gender) dimensions of the cube (see Figure 1c). The desired basis sets can be found with SVD using three different 2D matrix-flattening arrangements of the cube, built from the frontal-plane matrices, their transposes, and their rasterized column vectors (concatenation of point-light trajectories for each gender into a single column vector), where [**X** | **Y**] is a matrix with the columns of **X** followed by the columns of **Y**. Note that no two of the three basis sets can be produced within a single two-mode (matrix) factorization of the cube.
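
The three flattenings and the resulting bases can be sketched with NumPy's SVD. This is an illustrative reading of the procedure under stated assumptions, not the original implementation: the flattenings are taken to be the side-by-side frontal planes, the side-by-side transposed planes, and the stacked rasterized planes. The core is computed by projecting the gender-combined planes onto **P** and **T**. The function names and the rank-3 synthetic cube are hypothetical.

```python
import numpy as np

def three_mode_bases(cube, rank_p, rank_t):
    """Posture, time, and gender bases from three flattenings of an M x N x K cube."""
    K = cube.shape[2]
    posture_flat = np.hstack([cube[:, :, k] for k in range(K)])         # M x (N*K)
    time_flat = np.hstack([cube[:, :, k].T for k in range(K)])          # N x (M*K)
    gender_flat = np.vstack([cube[:, :, k].ravel() for k in range(K)])  # K x (M*N)
    P = np.linalg.svd(posture_flat, full_matrices=False)[0][:, :rank_p]
    T = np.linalg.svd(time_flat, full_matrices=False)[0][:, :rank_t]
    # Two mean-subtracted prototypes differ only in sign, so one gender vector suffices.
    # The sign of this vector returned by the SVD is arbitrary.
    G = np.linalg.svd(gender_flat, full_matrices=False)[0][:, :1]
    return P, T, G

def core_matrix(cube, P, T, G):
    """Project the gender-weighted sum of frontal planes onto P and T."""
    combined = np.tensordot(cube, G[:, 0], axes=([2], [0]))             # M x N
    return P.T @ combined @ T

# Synthetic prototype cube: planes -D and +D, with D of rank 3.
rng = np.random.default_rng(4)
D = rng.normal(size=(26, 3)) @ rng.normal(size=(3, 50))
cube = np.stack([-D, D], axis=2)

P, T, G = three_mode_bases(cube, rank_p=3, rank_t=3)
C = core_matrix(cube, P, T, G)
recon = np.einsum("ia,ab,jb,k->ijk", P, C, T, G[:, 0])                  # rebuild the cube
print(P.shape, T.shape, G.shape, C.shape, np.allclose(recon, cube))
```

In this sketch the core is a full (non-diagonal) matrix, and the cube is recovered exactly because the retained ranks match the rank of the synthetic data. With real prototypes, the ranks would instead be chosen by a variance criterion and the reconstruction would be approximate.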