**According to a long-standing hypothesis in motor control, complex body motion is organized in terms of movement primitives, massively reducing the dimensionality of the underlying control problems. For body movements, this low-dimensional organization has been convincingly demonstrated by the learning of low-dimensional representations from kinematic and EMG data. In contrast, the effective dimensionality of dynamic facial expressions is unknown, and dominant analysis approaches have been based on heuristically defined facial “action units,” which reflect the contributions of individual face muscles. We determined the effective dimensionality of dynamic facial expressions by learning a low-dimensional model from 11 facial expressions. We found a remarkably low dimensionality: only two movement primitives were sufficient to simulate these dynamic expressions with high accuracy. This low dimensionality was confirmed statistically, by Bayesian model comparison of models with different numbers of primitives, and by a psychophysical experiment demonstrating that expressions simulated with only two primitives are indistinguishable from natural ones. In addition, we found statistically optimal integration of the emotion information specified by these primitives in visual perception. Taken together, our results indicate that facial expressions might be controlled by a very small number of independent control units, permitting a very low-dimensional parametrization of the associated facial expressions.**

In the first step of the method, for each point in time *t* a vector of morphing weights **W**(*t*) was identified, which specifies how much the individual AUs contribute to the approximation of the presented expressions. The second step of the method synthesizes dynamic facial expressions by linearly combining 3D scans of the AU peak frames with the same vector **W**(*t*), to create photo-realistic animations. In the following, we explain these two steps briefly (see Curio et al., 2006; Curio et al., 2008; de la Rosa, Giese, Bülthoff, & Curio, 2013, for more details). We did not model any spatial rotation or translation of the head, and all expressions were displayed with the same orientation of the head in space.

For each time point *t*, the error between the kinematic vector of the facial expression and the kinematic vector resulting from the superposition of the vectors representing the (previously recorded) AUs was minimized over the weighting coefficients *W*_{i}(*t*) (see Figure 1, left side). In the following, we denote by *M*_{E}(*t*) the kinematic data associated with the facial expression, by *M*_{N} the vector of the static neutral reference face, and by *M*_{AU,i}(*t*) the kinematic vector associated with the *i*th Action Unit. The weights were estimated by solving the nonnegative least-squares problem

**W**(*t*) = arg min_{*W*_{i}(*t*) ≥ 0} ‖ *M*_{E}(*t*) − *M*_{N} − Σ_{i=1}^{N} *W*_{i}(*t*) [*M*_{AU,i}(*t*) − *M*_{N}] ‖²,  (Equation 1)

where *N* indicates the total number of AUs (in our case *N* = 17). The computed optimal weight vectors are denoted by **W**(*t*). The nonnegativity constraint is justified by the fact that AUs, as equivalents of muscle activations, should always be nonnegative. The time courses of the AU activation coefficients were obtained by solving this optimization problem separately for each time step *t*, instead of solving the optimization over all time steps simultaneously. Given the linearity of the problem, both approaches provide equivalent solutions. Figure 2 shows examples of these time courses for two different emotional expressions, disgust and happiness.
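As a concrete illustration, the per-frame estimation of nonnegative AU weights can be sketched with SciPy's nonnegative least-squares solver. This is a simplified sketch, not the authors' code; the function name and array shapes are assumptions:

```python
import numpy as np
from scipy.optimize import nnls

def estimate_au_weights(M_E, M_N, M_AU):
    # M_E: (T, D) expression kinematics per frame
    # M_N: (D,)   static neutral reference face
    # M_AU: (N, T, D) kinematics of the N recorded Action Units
    N, T, D = M_AU.shape
    W = np.zeros((N, T))
    for t in range(T):
        A = (M_AU[:, t, :] - M_N).T   # (D, N) basis of AU displacements
        b = M_E[t] - M_N              # (D,) expression displacement
        W[:, t], _ = nnls(A, b)       # nonnegative least squares per frame
    return W
```

Solving each time step separately, as in the text, keeps every subproblem a small NNLS fit.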

The weight vectors **W**(*t*) were used to generate photorealistic dynamic facial expressions. This was achieved by superimposing, at each time instant *t*, the 3D scans of the shapes of the single AUs, modulated by the morphing weights *W*_{i}(*t*) (see Figure 1, right). The shapes of the single action units were recorded from a single actor who executed the action units individually. As a consequence, all facial expressions generated for this study were associated with the same identity. Denoting by *S*_{E}(*t*) the 3D shape of the reconstructed face (parametrized by 3D polygons), by *S*_{N} the shape associated with a neutral reference face, and by *S*_{AU,i} the 3D scan of the shape corresponding to the peak of the *i*th AU, expressions were computed using the equation

*S*_{E}(*t*) = *S*_{N} + Σ_{i=1}^{N} *W*_{i}(*t*) [*S*_{AU,i} − *S*_{N}].  (Equation 2)
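This linear shape-mixing step can be sketched as follows; a minimal sketch assuming vertex arrays of shape (V, 3), not the authors' rendering pipeline:

```python
import numpy as np

def synthesize_shape(W_t, S_N, S_AU):
    # W_t: (N,)   AU morphing weights at one time step
    # S_N: (V, 3) neutral 3D scan vertices
    # S_AU: (N, V, 3) 3D scans of the AU peak shapes
    # Returns S_N plus the weighted sum of AU displacements from neutral.
    return S_N + np.tensordot(W_t, S_AU - S_N, axes=1)
```

With all weights zero the neutral face is returned; a one-hot weight vector reproduces the corresponding AU peak scan exactly.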

To characterize the temporal structure of the AU activation profiles *W*_{i}(*t*) over time, we applied unsupervised learning techniques.

The AU activation time courses of each trial were resampled to *T* = 100 time steps. The dimensionality reduction techniques were applied to the data matrix **X**, whose rows consist of the identified AU temporal profiles **W**(*t*) associated with all facial expressions produced by a professional actor. Each profile **W**(*t*) was sampled at *T* points in time, resulting in a size of **X** of 561 rows (17 AUs × 11 expressions × 3 repetitions) by 100 time samples. The facial expressions considered were facial scrunch, mouth opening, agreement, confusion, disagreement, disgust, fear, happiness, surprise, thinking for problem solving, and thinking to remember. The first method that we applied for dimensionality reduction was Nonnegative Matrix Factorization (NMF), as developed by D. D. Lee and Seung (1999). The second method, FADA (Fourier-based Anechoic Demixing Algorithm), was developed by us (Chiovetto & Giese, 2013), inspired by previous work by Omlor and Giese (2007a, 2007b). Unlike NMF, which is based on an instantaneous mixing model, the FADA algorithm is based on an anechoic mixture equation (Equation 3), as used in acoustics for the modeling of acoustic mixtures in reverberation-free rooms (Bofill, 2003; Emile & Comon, 1998; Yilmaz & Rickard, 2004).

This model assumes that *N*_{s} observable acoustic signals *x*_{i}(*t*) (*i* = 1, 2, ..., *N*_{s}) are caused by the superposition of *P* acoustic source functions (signals) *s*_{j}(*t*), where time-shifted versions of these source functions are linearly superposed with the mixing weights *a*_{ij}. The time shifts are specified by the time delays *τ*_{ij}, which in the acoustical model are determined by the traveling times of the signals. The model has the following mathematical form:

*x*_{i}(*t*) = Σ_{j=1}^{P} *a*_{ij} *s*_{j}(*t* − *τ*_{ij}).  (Equation 3)

If *τ*_{ij} = 0 for all pairs (*i*, *j*), Equation 3 coincides with the classical instantaneous mixing model underlying NMF, except for the positivity constraints. Like the anechoic algorithm of Omlor and Giese (2007a, 2007b), FADA is based on Equation 3, but it includes additional smoothness priors for the source functions. The introduction of such priors is justified by the observation that biological data usually have limited bandwidth, and by the fact that such priors substantially improve the robustness of the estimation method. A detailed description of the FADA algorithm can be found in the Supplementary material.
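A minimal sketch of the anechoic mixing model of Equation 3, with integer sample delays (the function name and the zero-padding convention are assumptions, not the authors' implementation):

```python
import numpy as np

def anechoic_mix(S, a, tau):
    # S: (P, T) source functions; a: (Ns, P) mixing weights;
    # tau: (Ns, P) nonnegative integer delays in samples.
    # x_i(t) = sum_j a[i, j] * s_j(t - tau[i, j])  (zero-padded shifts)
    Ns, P = a.shape
    T = S.shape[1]
    X = np.zeros((Ns, T))
    for i in range(Ns):
        for j in range(P):
            d = int(tau[i, j])
            shifted = np.zeros(T)
            shifted[d:] = S[j, :T - d]
            X[i] += a[i, j] * shifted
    return X
```

Setting all entries of `tau` to zero reduces this to the instantaneous mixing model underlying NMF.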

For the identification of the primitives, the algorithms were applied to the matrix of temporal derivatives *d***X**/*dt*, the rows of which are the derivatives of the AU profiles in the matrix **X**. Once the components were identified, the original data were approximated by combining the integrals of the identified components. The constant values to be added to the components after their integration were identified using an optimization procedure minimizing the error between actual and reconstructed data. An additional constraint was imposed on the constant parameters in order to ensure the nonnegativity, after summation, of the corresponding integrated source functions.
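The derivative-domain reconstruction described above can be sketched for a single component as follows; the trapezoidal integration and the closed-form offset fit with clipping are illustrative assumptions, not the authors' exact procedure:

```python
import numpy as np

def integrate_with_offset(dy, target, dt=1.0):
    # Trapezoidal integration of a derivative component, then a constant
    # offset c minimizing ||target - (y + c)||^2 (optimum: mean residual),
    # clipped so that y + c stays nonnegative.
    y = np.concatenate(([0.0], np.cumsum(0.5 * (dy[1:] + dy[:-1]) * dt)))
    c = float(np.mean(target - y))
    c = max(c, -float(y.min()))   # enforce nonnegativity after the shift
    return y + c
```

The clipping step may trade off reconstruction error against the nonnegativity constraint, mirroring the constrained optimization described in the text.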

Facial animations were generated from the weight vectors **W**(*t*) obtained either from the motion capture data through the optimization procedure described by Equation 1, or reconstructed by mixing the temporal sources *s*_{j} identified by the FADA algorithm based on the anechoic mixture (Equation 3). For the psychophysical experiments, only 3 of the 11 initial expressions were rendered, namely fear, disgust, and surprise. To test the invariance of the shapes of the identified sources *s*_{j} across expressions, we used a leave-one-out approach. More specifically, we compared the sources identified from the data associated with each single emotion with the ones identified from all other expressions. To compute the similarity between two sets of sources, we followed an iterative procedure. For each pair of sources in the two groups we first computed, for each temporal delay between the sources, the similarity index *S*, quantified as the dot product between the two components normalized with respect to their norms. The index *S* thus represents the cosine of the angle between the vectors defined by the two components: when the index is equal to 1, the components are proportional to each other, while *S* = 0 implies that they are orthogonal. We then removed from the data the pair of sources with the highest *S* value and repeated this procedure until only one pair of sources was left. We found that, on average, the similarity between the groups of sources was *S* = 0.96 ± 0.02, indicating a very high level of invariance of the shapes of the sources across expressions.
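The iterative source-matching procedure can be sketched as follows; a simplified version that assumes circular shifts over discrete delays (the function names are hypothetical):

```python
import numpy as np

def delay_similarity(u, v):
    # Max over circular delays of the normalized dot product
    # (cosine of the angle between the two components).
    norms = np.linalg.norm(u) * np.linalg.norm(v)
    return max(float(np.dot(u, np.roll(v, d))) / norms for d in range(len(v)))

def greedy_match(set1, set2):
    # Repeatedly pair and remove the two sources with the highest S.
    set1, set2 = list(set1), list(set2)
    scores = []
    while set1 and set2:
        s, i, j = max((delay_similarity(u, v), i, j)
                      for i, u in enumerate(set1) for j, v in enumerate(set2))
        scores.append(s)
        set1.pop(i); set2.pop(j)
    return scores
```

Searching over delays makes the similarity index insensitive to the time shifts that the anechoic model explicitly allows.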

To determine the minimum number *P* of source functions required for the generation of photorealistic emotional expressions based on Equation 3, we designed a Turing test. Participants sat in front of a screen and were presented with a series of visual stimuli. Each stimulus consisted of two rendered facial animations, appearing side by side on the screen and showing one of three emotional expressions (disgust, fear, and pleasant surprise). One was rendered using the original kinematic data collected during the motion capture recordings. The other was generated based on Equation 3 using either *P* = 1, *P* = 2, or *P* = 3 source functions. Examples of the visual stimuli used for the Turing test can be found in the Supplementary material. At each presentation, the left/right positions of the two animations were chosen randomly, and the two animations ran simultaneously three times. After the presentation of the stimulus, participants were asked to choose which of the two animations was more natural. The aim of this experiment was to identify the minimum number *P* at which original and synthetic expressions become indistinguishable. Twelve subjects participated in the experiment. Each participant was presented with a total of 108 stimuli (3 emotions × 3 levels of model complexity × 12 repetitions). Once all the data were collected, we computed, for each model complexity *P* and each emotion, the probability that participants could discriminate between the original and the synthesized expression, as the ratio of the number of correct answers to the total number of stimulus presentations. A discrimination probability equal to 0.5 (chance level) indicates that participants could not identify which of the two stimuli was the original emotional expression.

Once the minimum number *P* was determined, we generated another set of visual stimuli by varying the contribution of each of the *P* source functions *s*_{j} to the synthesis of **W**(*t*). To this end, we introduced *P* additional morphing parameters *γ*_{j} ∈ [0, 0.33, 0.66, 1] in Equation 3, so that the weights were synthesized as

*W*_{i}(*t*) = Σ_{j=1}^{P} *γ*_{j} *a*_{ij} *s*_{j}(*t* − *τ*_{ij}).

Since *P* = 2 is sufficient (see Results), we generated 48 visual stimuli in total (4² morphing levels × 3 emotions). Each participant was presented with these stimuli, one at a time, and was asked to indicate, by pressing one of three buttons, which emotion the stimulus corresponded to, and to rate the intensity of the stimulus by choosing an integer *R* between 0 and 6. Each stimulus was presented 15 times. The aim of this experiment was to test the extent to which each source function *s*_{j} contributed to the perception of emotional content. The probability with which participants could correctly discriminate the emotion associated with the presented facial expression was computed, for each morphing level, as the ratio of the number of correct answers to the total number of times the stimulus was presented. To avoid biases in the data due to subjective differences among the participants in the judgment of the emotional content, the ratings of each participant were normalized to [0, 1] by applying the formula

*R*_{norm} = (*R* − *R*_{min}) / (*R*_{max} − *R*_{min}),

where *R*_{min} and *R*_{max} indicate, respectively, the minimal and maximal ratings given by the participant. We then tested the effects of the morphing parameters *γ*_{1} and *γ*_{2} (values: 0, 0.33, 0.66, and 1) on the classification and rating performance of the participants. All statistical tests were implemented using SPSS (V.22; SPSS, Inc., Chicago, IL). For all tests the significance level was set to 5%.

Here **X** is the observable data, Θ_{M} is a vector of model parameters for a model indexed by *M*, where *M* is a tuple (model type, model order) in which the model type is either a smooth anechoic mixture determined with the FADA algorithm (Chiovetto & Giese, 2013) or a synchronous (i.e., undelayed) mixture computed with NMF, and **H** is the Hessian matrix of second derivatives with respect to the parameters Θ_{M}. Given the evidence for each model *M*, we select the *M* (i.e., the model type and the model order) which maximizes the model evidence, since we have no a priori preference for any *M*.

We modeled the perceived emotion strength by the linear cue fusion model

*R*(*C*_{1}, *C*_{2}) = *β*_{0} + *β*_{1}*C*_{1} + *β*_{2}*C*_{2},  (Equation 8)

where *C*_{1} and *C*_{2} are the amplitude-scaled versions of the movement primitives, which can be obtained by multiplying the emotion-specific weights **W**_{j,i} for primitive *i* with a constant (positive) factor, the *β*_{i} are scalar coefficients, and *R* is the perceived emotion strength associated with a specific facial expression. More specifically, it can be shown that *β*_{1} = *β*_{2} = 1, and therefore that the predicted rating for *C*_{1}, *C*_{2} > 0 can be computed by summing up the strength ratings measured when one of the cues is zero, minus a bias term *β*_{0}. Given the existence of evident nonlinear saturation effects at high morphing levels (*γ*_{1}, *γ*_{2}; see Figure 4), we also tested a cue fusion model with an output nonlinearity (Equation 9), applied to the rating *R*(*C*_{1}, *C*_{2}) predicted by the linear Bayesian cue fusion model (Equation 8). Combining Equation 9 with Equation 8 yields the predicted rating with output nonlinearity (Equation 10).
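Fitting a linear cue fusion model of this kind amounts to ordinary least squares on the two cues plus a bias; a minimal sketch (the function name and data layout are assumptions):

```python
import numpy as np

def fit_linear_fusion(C1, C2, R):
    # Least-squares estimate of (b0, b1, b2) in R ≈ b0 + b1*C1 + b2*C2.
    # C1, C2, R: 1D arrays of per-stimulus cue strengths and ratings.
    A = np.column_stack([np.ones_like(C1), C1, C2])
    beta, *_ = np.linalg.lstsq(A, R, rcond=None)
    return beta
```

Under the Bayesian fusion hypothesis described in the text, the fitted slopes should come out close to b1 = b2 = 1 after appropriate amplitude scaling of the cues.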

A 3 × 3 repeated-measures ANOVA with emotion and model complexity as factors revealed a significant main effect of model complexity on the percentage of correct classifications, *F*(1.31, 1.44) = 10.88, *p* = 0.003, *η*_{p} = 0.48. A one-sample Wilcoxon signed-rank test (see the results in Table 1) indicated that the participants could never discriminate above chance level between the actual recorded expressions and the ones obtained using the data predicted by the model when the displayed emotions were disgust and fear. In the case of surprise, in contrast, they could discriminate above chance when either one or two sources were used in the model. The Turing test based on the sources identified using the FADA algorithm was run on 12 other participants (results are shown in Figure 4B). Also in this case, the data were analyzed using a 3 × 3 repeated-measures ANOVA with emotion and model complexity as factors, which revealed a significant main effect of model complexity on the percentage of correct classifications, *F*(2, 22) = 12.64, *p* < 0.001, *η*_{p} = 0.54. A one-sample Wilcoxon signed-rank test (see Table 1) indicated that the participants could discriminate between the recorded expressions and the ones obtained using the data predicted by the model only when a single source was used to render the facial expressions (Figure 4A). In all these cases, the observed probability was statistically significantly higher than chance level.

Participants were able to recognize the presented emotions with high accuracy as long as at least one of the morphing parameters was larger than zero (*γ*_{1} > 0 or *γ*_{2} > 0). Results for the expressions generated using the instantaneous mixture of the NMF sources are shown in Figure 5A. For each expression, a 4 × 4 repeated-measures ANOVA with *γ*_{1} and *γ*_{2} (taking the values 0, 0.33, 0.66, and 1) as factors revealed significant main effects of both factors on the percentage of correct classifications. In addition, these main effects were always qualified by an interaction between the factors (see Table 2). Similarly to the instantaneous case, participants were able to recognize the presented emotion with very high precision, as long as one of the morphing parameters was different from 0, also when the rendered expressions were based on AU coefficients resulting from the anechoic mixture of the source functions identified using the FADA algorithm (Figure 5B). Also in this case, a 4 × 4 repeated-measures ANOVA with *γ*_{1} and *γ*_{2} as factors revealed, for each expression, significant main effects of both factors on the percentage of correct classifications, and the interaction effect was always significant (see Table 2).

For the expressions based on the instantaneous mixture of the NMF sources, a 4 × 4 repeated-measures ANOVA with *γ*_{1} and *γ*_{2} as factors revealed, for each expression, significant main effects of both *γ*_{1} and *γ*_{2} on the ratings. The interaction effect was significant only for disgust and surprise (see Table 3). Concerning the expressions based on the anechoic mixture of the sources identified with the FADA algorithm, the results of the rating experiments (Figure 6B) showed that participants increased their ratings approximately linearly with the values of the morphing parameters. For each expression, a 4 × 4 repeated-measures ANOVA with *γ*_{1} and *γ*_{2} as factors revealed significant main effects of both factors on the ratings. The analysis also revealed that these main effects were in most cases qualified by an interaction between the factors (see Table 3).

We modeled *J* = 17 time courses of AU activities, and we analyzed a total of 21 trials. We tested both a FADA and a synchronous model with frequencies up to *f*_{N}, where *f*_{N} is the Nyquist frequency of the data; we took the same *f*_{0} for the wave kernel of the LAP. The synchronous model was not regularized for smooth sources. The parameters *ν* and *S* of the gamma prior on the weights and the *λ* of the exponential distribution on the delays are estimated from the data after source extraction. See the Supplementary material for more detailed information. To put our results into the context of well-known model selection schemes, we repeated our analysis with the Akaike Information Criterion (AIC; Akaike, 1974) and the Bayesian Information Criterion (BIC; Schwarz, 1978). All three criteria prefer the anechoic model in every trial, as shown in Figure 7, middle. However, as depicted in Figure 7, right, BIC and AIC pick models with a larger number of sources than LAP, which yields an average of *I* = 1.95 ± 0.50 sources (*SEM* = 0.11). Therefore, the only criterion that is consistent with the perceptual Turing-test results described above is LAP: Human observers reach chance discrimination level at two sources. A scree plot (Figure 7, left) shows a similar result for the FADA algorithm (two sources are best), but provides no clear answer for the synchronous NMF model.
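For reference, AIC- and BIC-based order selection can be sketched as follows, given per-order maximized log-likelihoods and parameter counts (the input values below are hypothetical, not the study's data):

```python
import numpy as np

def aic(loglik, k):
    # Akaike Information Criterion: 2k - 2 ln L
    return 2.0 * k - 2.0 * loglik

def bic(loglik, k, n):
    # Bayesian Information Criterion: k ln(n) - 2 ln L
    return k * np.log(n) - 2.0 * loglik

def select_order(logliks, n_params, n_obs):
    # Return the model order (1-based) minimizing each criterion.
    a = [aic(l, k) for l, k in zip(logliks, n_params)]
    b = [bic(l, k, n_obs) for l, k in zip(logliks, n_params)]
    return 1 + int(np.argmin(a)), 1 + int(np.argmin(b))
```

Because BIC penalizes parameters by ln(n) rather than 2, the two criteria can disagree on the preferred order, which is why they need not agree with the Laplace-approximated evidence either.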

Model performance was quantified by the variance accounted for (VAF) of the predicted ratings at each morphing level (*γ*_{1}, *γ*_{2}), averaged across all levels. For each (*γ*_{1}, *γ*_{2}), we trained the models on the data from all other morphing levels. Furthermore, to establish an upper bound on the performance of any model that predicts mean ratings, we computed the mean rating for each subject and emotion. Results are shown in Figure 8. Each plot depicts the average predicted VAFs for one participant, averaged across all emotions. Error bars indicate standard deviations. The horizontal line in each plot indicates the average upper bound. All models are on average below the bound; the negative VAF in one participant is a consequence of the cross-validation procedure: the data used for learning the model parameters are disjoint from the data used for evaluation. The linear Bayesian model tends to perform worse than the general linear model. The performance difference is significant (*p* < 0.01, Wilcoxon signed-rank test) in four participants, indicated by a star in the figure. The nonlinear model shows no significant difference from the general linear model in any of the participants. Furthermore, its VAF predictions are close to the maximally expectable ones (horizontal lines) in most participants. We can therefore conclude that a Bayesian cue fusion model with output nonlinearity is a good description of the computational process that integrates facial movement primitives for emotion recognition.
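The VAF measure itself is simple to state; a minimal sketch (the function name is an assumption), which also makes clear why cross-validated VAF can go negative:

```python
import numpy as np

def vaf(y, y_hat):
    # Variance accounted for: 1 - Var(residual) / Var(data).
    # Negative values occur when predictions on held-out data are
    # worse than simply predicting the mean of the data.
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return 1.0 - np.var(y - y_hat) / np.var(y)
```

A perfect prediction gives VAF = 1; predicting the data mean gives VAF = 0; anything worse than the mean is negative.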

The sources identified by the two model types were highly similar in shape (*S* = 0.90 ± 0.03). The observed differences between the two model types must thus be due to the presence of time delays in the anechoic mixture model, i.e., the type of invariance that is assumed by the model architecture.

*IEEE Transactions on Automatic Control*, 19 (6), 716–723, https://doi.org/10.1109/TAC.1974.1100705.

*Behavioral and Brain Sciences*, 33 (6), 434–435.

*Journal of Multimedia*, 1 (6), 22–35.

*IEEE Transactions on Neural Networks*, 13 (6), 1450–1464, https://doi.org/10.1109/TNN.2002.804287.

*Journal of Vision*, 14 (10): 838, https://doi.org/10.1167/14.10.838. [Abstract]

*The Journal of Neuroscience*, 29 (1), 191–205.

*Pattern recognition and machine learning*. Secaucus, NJ: Springer-Verlag New York, Inc.

*Brain Research Reviews*, 57 (1), 125–133.

*Neurocomputing*, 55 (3–4), 627–641. Available from http://www.sciencedirect.com/science/article/pii/S0925231202006318 (Evolving Solution with Neural Networks), http://doi.org/10.1016/S0925-2312(02)00631-8.

*3rd International Conference on Affective Computing and Intelligent Interaction and Workshops (ACII)*, (pp. 1–7).

*Proceedings of the National Academy of Sciences, USA*, 106 (46), 19563–19568.

*Frontiers in Computational Neuroscience*, 7, 11, https://doi.org/10.3389/fncom.2013.00011.

*Neuroscience*, 170 (4), 1223–1238. Available from http://www.sciencedirect.com/science/article/pii/S030645221000984X (reviewed), https://doi.org/10.1016/j.neuroscience.2010.07.006.

*arXiv preprint arXiv:1603.06879*.

*PLoS One*, 8 (11), https://doi.org/10.1371/journal.pone.0079555.

*Data fusion for sensory information processing systems*(Vol. 105). Berlin: Springer Science & Business Media.

*Proceedings of the 3rd symposium on applied perception in graphics and visualization*(pp. 77–84). Boston, MA.

*International conference on cognitive systems*(pp. 1–6). New York, NY: ACM.

*Dynamic faces: Insights from experiments and computation*(pp. 47–65). Cambridge, MA: MIT Press.

*The Journal of Neuroscience*, 26 (30), 7791–7810.

*Journal of Vision*, 13 (1): 23, 1–15, https://doi.org/10.1167/13.1.23. [PubMed] [Article]

*Journal of Vision*, 16 (8): 14, 1–20, https://doi.org/10.1167/16.8.14. [PubMed] [Article]

*Vision Research*, 100, 78–87.

*Science*, 334 (6058), 997–999.

*The facial action coding system: A technique for the measurement of facial action*. Palo Alto, CA: Consulting Psychologists Press.

*Emotion in the human face: Guidelines for research and an integration of findings*. New York, NY: Pergamon Press.

*Journal of Personality and Social Psychology*, 53 (4), 712.

*Signal Processing*, 68 (1), 93–100.

*Frontiers in Computational Neuroscience*, 7, 185. Available from http://www.frontiersin.org/computational_neuroscience/10.3389/fncom.2013.00185/abstract, https://doi.org/10.3389/fncom.2013.00185.

*Nature*, 415 (6870), 429–433.

*Proceedings of the 29th annual conference on computer graphics and interactive techniques*(pp. 388–398). New York, NY: ACM. Available from http://doi.acm.org/10.1145/566570.566594

*Current Opinion in Neurobiology*, 15 (6), 660–666.

*Emfacs-7: Emotional facial action coding system*. Unpublished manuscript, University of California at San Francisco, San Francisco, CA, 2 (36), 1.

*The Quarterly Journal of Experimental Psychology*, 68 (9), 1832–1843.

*Current Biology*, 11 (11), 880–885.

*The Journal of Physiology*, 556 (1), 267–282.

*Proceedings of the National Academy of Sciences, USA*, 109 (19), 7241–7244, https://doi.org/10.1073/pnas.1200155109.

*Proceedings of the 2002 ACM siggraph/eurographics symposium on computer animation*(pp. 55–63).

*Proceedings of SPIE*(Vol. 4309, pp. 16–25).

*Gait & Posture*, 26 (2), 256–262.

*Vision Research*, 43 (18), 1921–1936.

*PLoS One*, 2 (9), e943.

*Computer graphics international*2001 (pp. 38–46). Washington, DC, USA: IEEE Computer Society. Available from http://dl.acm.org/citation.cfm?id=647781.735231.

*Psychophysiology*, 30 (3), 261–273.

*Savants Étranges*, 6, 621–656.

*Review of personality and social psychology: Emotion*(Vol. 13. pp. 25–59). Newbury Park, CA: Sage.

*Nature*, 401 (6755), 788–791.

*Proceedings of the 22nd annual conference on computer graphics and interactive techniques*(pp. 55–62).

*Proceedings of the IEEE International Conference on Automatic Face & Gesture Recognition and Workshops*. (pp. 897–902).

*IEEE Proceedings on Vision, Image & Signal Processing*, 152, 491–500.

*Journal of WSCG*, 23 (2), 139–146.

*Behavioral and Brain Sciences*, 33, 417–433, https://doi.org/10.1017/S0140525X10000865.

*Advances in neural information processing systems 19*(pp. 1049–1056). Cambridge, MA: MIT Press.

*Neurocomputing*, 70 (10–12), 1938–1942.

*Trends in Cognitive Sciences*, 6 (6), 261–266.

*Proceedings of the ACM annual conference*(Vol. 1, pp. 451–457).

*Computer facial animation*. Wellesley, MA: CRC Press.

*NeuroImage*, 102, 407–415.

*Journal of Vision*, 9 (6): 15, 1–32, https://doi.org/10.1167/9.6.15. [PubMed] [Article]

*Journal of Personality and Social Psychology*, 39 (6), 1161–1178.

*The Journal of Neuroscience*, 18 (23), 10105–10115.

*The Journal of Neuroscience*, 22 (4), 1426–1435.

*Annals of Statistics*, 6 (2), 461–464.

*Proceedings of the 2006 ACM siggraph/eurographics symposium on computer animation*(pp. 261–270).

*Computer Animation and Virtual Worlds*, 1 (2), 73–80.

*Perception*, 31 (1), 113–132.

*Journal of Neurophysiology*, 93 (1), 609–613.

*Journal of Neurophysiology*, 98 (4), 2144–2156.

*2010 IEEE computer society conference on computer vision and pattern recognition-workshops*(pp. 42–47).

*IEEE Transactions on Signal Processing*, 52, 1830–1847.