Humans recognize basic facial expressions effortlessly. Yet, despite a considerable amount of research, this task remains elusive for computer vision systems. Here, we compared the behavior of one of the best computer models of facial expression recognition (Z. Hammal, L. Couvreur, A. Caplier, & M. Rombaut, 2007) with the behavior of human observers during the M. Smith, G. Cottrell, F. Gosselin, and P. G. Schyns (2005) facial expression recognition task, performed on stimuli randomly sampled using Gaussian apertures. The model—which we had to modify significantly in order to give it the ability to deal with partially occluded stimuli—classifies the six basic facial expressions (Happiness, Fear, Sadness, Surprise, Anger, and Disgust) plus Neutral from static images, based on the permanent facial feature deformations and the Transferable Belief Model (TBM). Three simulations demonstrated the suitability of the TBM-based model for dealing with partially occluded facial parts and revealed the differences between the facial information used by humans and by the model. This opens promising perspectives for the future development of the model.

*Happiness, Surprise, Disgust, Anger, Sadness,* and *Fear* that are similarly expressed across different backgrounds and cultures (Cohn, 2006; Ekman, 1999; Izard, 1971, 1994). Facial expressions result from the precisely choreographed deformation of facial features, which are often described using the 46 Action Units (AUs; Ekman & Friesen, 1978).

*t* − 1 is added to the characteristic features vector at time *t*. Contrary to the FACS-based methods described above, the classification results for the AUs obtained by the dynamic Bayesian network are combined using a rules table defined by the authors (2005) to associate to each AU, or combination of AUs, only one of the six basic facial expressions.

*Bubbles*, a psychophysical procedure that prunes stimuli in the complex spaces characteristic of visual categorization, in order to reveal the information that effectively determines a given behavioral response in a recognition task (Gosselin & Schyns, 2001).

The *Bubbles* technique was applied to determine the information underlying the recognition of the six basic facial expressions plus Neutral. The stimuli were produced by randomly sampling 70 facial expression images from the California Facial Expressions database^{1} at five scales using scale-adjusted Gaussian filters (see Figure 1 and Smith et al., 2005 for details).

*Bubbles* experiment described above. We had to significantly modify the model proposed by Hammal et al. (2007) for the classification of stimuli displaying the six basic facial expressions plus *Neutral*, to give it the ability to deal with sparse stimuli like the ones encountered in a *Bubbles* experiment as well as in real life (Zeng, Pantic, Roisman, & Huang, 2009).

*Bubbles* experiment. Finally, we compare the behaviors of the model and humans in three simulations and draw conclusions regarding future implementations of the model.

*D*_{1} to *D*_{5} (Figure 3), extracted from the characteristic points corresponding to the contours of the permanent facial features. Each distance is normalized with respect to the distance between the centers of the two irises in the analyzed face. This makes the analysis independent of the variability of face dimensions and of the position of the face with respect to the camera. In addition to distance normalization, only the deformations with respect to the Neutral expression are considered.
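As a concrete illustration, the normalization just described can be sketched in a few lines of Python. This is a minimal sketch under our own naming: the coordinates, the `normalized_deformation` helper, and the Neutral value are hypothetical, not the authors' implementation.

```python
import math

def normalized_deformation(p_a, p_b, iris_left, iris_right, neutral_value):
    """Characteristic distance between two facial points, normalized by
    the inter-iris distance and expressed relative to its Neutral value."""
    raw = math.dist(p_a, p_b)                    # raw pixel distance
    iris_gap = math.dist(iris_left, iris_right)  # normalization factor
    return raw / iris_gap - neutral_value        # deformation w.r.t. Neutral

# Hypothetical coordinates: an eyebrow corner, an eye corner, and the
# two iris centers, with an assumed Neutral value of 0.5.
deformation = normalized_deformation((120, 80), (118, 110),
                                     (100, 100), (160, 100), 0.5)
```

Dividing by the inter-iris distance is what makes the measure invariant to face size and camera distance; subtracting the Neutral value keeps only the deformation.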

*D*_{i} (see Hammal et al., 2007 for more details). It allows the conversion of each numerical value into a belief in five symbolic states reflecting the magnitude of the deformation:

*S*_{i} if the current distance is roughly equal to its corresponding value in the Neutral expression; *C*_{i}^{+} vs. *C*_{i}^{−} if the current distance is significantly higher vs. lower than its corresponding value in the Neutral expression; and *S*_{i} or *C*_{i}^{+}, noted *S*_{i} ∪ *C*_{i}^{+}, vs. *S*_{i} or *C*_{i}^{−}, noted *S*_{i} ∪ *C*_{i}^{−} (the sign ∪ means logical or), if the current distance is neither sufficiently higher vs. lower to be in *C*_{i}^{+} vs. *C*_{i}^{−}, nor sufficiently stable to be in *S*_{i}.

*D*_{2} (distance between the eye corner and the eyebrow corner) for several video sequences going from Neutral to the Surprise expression and back to Neutral, obtained from different individuals. We observe similar evolutions of the characteristic distance associated with the same facial expression. The characteristic distance *D*_{2} always increases in the case of Surprise because people raise their eyebrows. Thus, *D*_{2} evolves from the equal state (*S*_{2}) to the significantly higher state (*C*_{2}^{+}) via an undetermined region (*S*_{2} ∪ *C*_{2}^{+}) corresponding to a doubt between the two considered states.

*D*_{i} values to symbolic states is carried out using the function depicted in Figure 5. The threshold values defining the transition from one state to another, {*a*_{i}, *b*_{i}, *c*_{i}, *d*_{i}, *e*_{i}, *f*_{i}, *g*_{i}, *h*_{i}}, have been derived through a statistical analysis of the Hammal–Caplier database (2003)^{2} for each characteristic distance.

*D*_{i}, the minimum threshold *a*_{i} is averaged across the minimum values of the characteristic distance *D*_{i} for all the facial expressions and all the subjects. Similarly, the maximum threshold *h*_{i} is obtained by averaging the maximum values of the characteristic distance *D*_{i} for all the facial expressions and all the subjects. The middle thresholds *d*_{i} and *e*_{i} are defined as the mean of the minimum and maximum, respectively, of the characteristic distances *D*_{i} on Neutral facial images for all the subjects (Hammal et al., 2007).

*b*_{i} is computed as the threshold *a*_{i} of the distance *D*_{i} assigned to the lower state *C*_{i}^{−}, augmented by the median of the minimum values of the distance *D*_{i} over all the image sequences and for all the subjects. Likewise, the intermediate threshold *g*_{i} is computed as the threshold *h*_{i} of the distance *D*_{i} assigned to the higher state *C*_{i}^{+}, reduced by the median of the maximum values over all the image sequences and for all the subjects. The thresholds *f*_{i} and *c*_{i} are obtained similarly (Hammal et al., 2007).
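The exact analytic form of the Figure 5 function is not reproduced here; the sketch below assumes one plausible shape consistent with the description above: plateaus for *C*^{−}, *S*, and *C*^{+} separated by linear transition bands in which the mass passes through the doubt states. The function and state names are ours, and the thresholds are assumed strictly increasing.

```python
def distance_bba(x, a, b, c, d, e, f, g, h):
    """Map a numerical distance value x to masses over the five symbolic
    states {C-, S u C-, S, S u C+, C+}; masses always sum to 1.
    Assumes a < b < ... < h (a and h bound the outer plateaus)."""
    def ramp(x, lo, hi):                 # linear rise from 0 at lo to 1 at hi
        return min(1.0, max(0.0, (x - lo) / (hi - lo)))
    bba = {"C-": 0.0, "SuC-": 0.0, "S": 0.0, "SuC+": 0.0, "C+": 0.0}
    if x <= b:                           # plateau: significantly lower
        bba["C-"] = 1.0
    elif x <= c:                         # doubt band between C- and S
        t = ramp(x, b, c); bba["C-"] = 1 - t; bba["SuC-"] = t
    elif x <= d:
        t = ramp(x, c, d); bba["SuC-"] = 1 - t; bba["S"] = t
    elif x <= e:                         # plateau: roughly equal to Neutral
        bba["S"] = 1.0
    elif x <= f:
        t = ramp(x, e, f); bba["S"] = 1 - t; bba["SuC+"] = t
    elif x <= g:                         # doubt band between S and C+
        t = ramp(x, f, g); bba["SuC+"] = 1 - t; bba["C+"] = t
    else:                                # plateau: significantly higher
        bba["C+"] = 1.0
    return bba
```

With hypothetical thresholds 0 … 7, a value in the middle of a transition band splits its mass between a pure state and the adjacent doubt state, which is exactly the behavior described for *D*_{2} during Surprise.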

*D*_{2} is in the *C*^{+} state), the upper eyelids are open (*D*_{1} is in the *C*^{+} state), and the mouth is open (*D*_{3} is in the *C*^{−} state and *D*_{4} is in the *C*^{+} state).

| | *D*_{1} | *D*_{2} | *D*_{3} | *D*_{4} | *D*_{5} |
|---|---|---|---|---|---|
| Happiness (*E*_{1}) | *C*^{−} | *S* ∪ *C*^{−} | *C*^{+} | *C*^{+} | *C*^{−} |
| Surprise (*E*_{2}) | *C*^{+} | *C*^{+} | *C*^{−} | *C*^{+} | *C*^{+} |
| Disgust (*E*_{3}) | *C*^{−} | *C*^{−} | *S* ∪ *C*^{+} | *C*^{+} | *S* |
| Anger (*E*_{4}) | *C*^{−} | *C*^{−} | *S* | *S* ∪ *C*^{−} | *S* |
| Sadness (*E*_{5}) | *C*^{−} | *C*^{+} | *S* | *C*^{+} | *S* |
| Fear (*E*_{6}) | *C*^{+} | *S* ∪ *C*^{+} | *S* ∪ *C*^{−} | *S* ∪ *C*^{+} | *S* ∪ *C*^{+} |
| Neutral (*E*_{7}) | *S* | *S* | *S* | *S* | *S* |
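Table 1 can be read as a lookup from observed distance states to candidate expressions. The sketch below transcribes it into a Python dictionary; the `compatible_expressions` helper and the treatment of unobserved distances as unconstrained are our own illustrative choices, not the authors' implementation.

```python
# Table 1 transcribed as sets of admissible elementary states per distance.
RULES = {
    "Happiness": [{"C-"}, {"S", "C-"}, {"C+"}, {"C+"}, {"C-"}],
    "Surprise":  [{"C+"}, {"C+"}, {"C-"}, {"C+"}, {"C+"}],
    "Disgust":   [{"C-"}, {"C-"}, {"S", "C+"}, {"C+"}, {"S"}],
    "Anger":     [{"C-"}, {"C-"}, {"S"}, {"S", "C-"}, {"S"}],
    "Sadness":   [{"C-"}, {"C+"}, {"S"}, {"C+"}, {"S"}],
    "Fear":      [{"C+"}, {"S", "C+"}, {"S", "C-"}, {"S", "C+"}, {"S", "C+"}],
    "Neutral":   [{"S"}, {"S"}, {"S"}, {"S"}, {"S"}],
}

def compatible_expressions(observed):
    """Expressions whose Table 1 row admits every observed elementary state.
    `observed` maps a distance index (0-4) to an elementary state and may be
    partial: distances hidden by the mask constrain nothing."""
    return [expr for expr, row in RULES.items()
            if all(state in row[i] for i, state in observed.items())]

# E.g., raised eyebrows alone (D2 significantly higher) leave several
# candidates: Surprise, Sadness, and Fear all remain possible.
compatible_expressions({1: "C+"})
```

This partial-observation behavior is what lets sparse *Bubbles* stimuli still narrow the candidate set without forcing a single-expression decision.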

*H*_{1}, …, *H*_{N}} of *N* exclusive and exhaustive hypotheses characterizing some situations. This means that the solution to the problem is unique and is one of the hypotheses of Ω.

*m*^{Ω}(*A*) to every proposition *A* of the power set 2^{Ω} = {{*H*_{1}}, {*H*_{2}}, …, {*H*_{N}}, {*H*_{1}, *H*_{2}}, …, Ω}. In the current application, the independent sensors correspond to the different characteristic distances and the hypotheses *H*_{i} correspond to the six basic facial expressions plus *Neutral*. The first step in the classification process then is to perform an intermediate modeling between the numerical values of the characteristic distances *D*_{i} and the required expressions. More precisely, the Basic Belief Assignment related to the characteristic distance states is defined (see Equation 1 below). Then, using the rules (see Table 1) between the symbolic states and the facial expressions, the BBAs of the facial expressions according to each characteristic distance are deduced. Finally, the combination of the BBAs of all the distance states (and then of the corresponding expressions) leads to the definition of the BBAs of the facial expressions using all the available information (see Fusion process section).

*m*_{Di}^{Ω_{Di}} of each characteristic distance state *D*_{i} is defined as follows: Ω_{Di} = {*C*_{i}^{+}, *C*_{i}^{−}, *S*_{i}} is the set of states; the power set 2^{Ω_{Di}} = {{*C*_{i}^{+}}, {*C*_{i}^{−}}, {*S*_{i}}, {*S*_{i}, *C*_{i}^{+}}, {*S*_{i}, *C*_{i}^{−}}, {*S*_{i}, *C*_{i}^{+}, *C*_{i}^{−}}} is the frame of discernment (the set of possible propositions and subsets of propositions); and {*S*_{i}, *C*_{i}^{+}} vs. {*S*_{i}, *C*_{i}^{−}} is the doubt (or hesitation) state between the state *C*_{i}^{+} vs. *C*_{i}^{−} and the state *S*_{i}. The piece of evidence of each state of *D*_{i} is obtained by the function depicted in Figure 5.

*m*(*A*) is the belief in the proposition *A* ∈ 2^{Ω_{Di}}, without favoring any of the propositions of *A* in the case of a doubt proposition. This is the main difference when compared with the Bayesian model, which implies equiprobability of the propositions of *A*. *A* is called a focal element whenever the belief in *A* is nonnull (i.e., *m*(*A*) > 0). Total ignorance is represented by *m*(Ω_{Di}) = 1. To simplify, the proposition {*C*_{i}^{+}} is noted *C*^{+} and the subset of propositions {*S*_{i}, *C*_{i}^{+}} is noted *S* ∪ *C*^{+} (i.e., *S* or *C*^{+}, which corresponds to the doubt state between *S* and *C*^{+}).

*Bubbles* experiment and, more generally, in real life. Thus, instead of using all the characteristic distances (Hammal et al., 2007), only those revealed by the Gaussian apertures are used. The TBM is well suited for this: it facilitates the integration of a priori knowledge, and it can deal with uncertain and imprecise data, which is the case with *Bubbles* stimuli. Moreover, it is able to model the doubt between several facial expressions in the recognition process. This property is important considering that "binary" or "pure" facial expressions are rarely perceived (people usually display mixtures of facial expressions; Young et al., 1997). Also, the proposed method allows Unknown expressions, which correspond to all facial deformations that cannot be categorized into one of the predefined facial expressions.

*Bubbles* mask (i.e., the collection of Gaussian apertures that sample a face on a particular trial), the intersection between the *Bubbles* mask and the contours of the facial features is performed in two steps. First, the segmentation of the permanent facial features is made manually on the original frame (i.e., before the application of the *Bubbles* mask; see Data extraction section). The characteristic points corresponding to each contour are manually detected. Figure 6c shows an example of the corresponding contours. However, it should be noted that even human experts do not obtain perfect segmentation results, and a weak dispersion of the detected points appears, which leads to (sometimes large) imperfections in the corresponding contours. Most importantly, however, the characteristic distances are measured based only on the characteristic points and not on the corresponding contours. Thus, the small dispersion errors of the characteristic points do not significantly affect the classification process. This claim is based on the results of a quantitative evaluation against a ground truth corresponding to the manual detection of the characteristic points by human experts (see Hammal et al., 2006).

Second, the intersection with the *Bubbles* mask is performed, revealing a subset of the contours of the permanent facial features and thus of the corresponding characteristic points (see Figures 6c and 6d). The appearance intensity of the contours and of the characteristic points varies according to the size, the position, and the number of the Gaussian apertures (see The Bubbles experiment of Smith et al. (2005) section and Figure 6d). However, as reported below, only the characteristic points are used for the computation of the characteristic distances. The characteristic points whose pixel intensities are different from 0 are identified (red crosses in Figure 6e). Finally, all distances computed from contour points different from 0 are identified and taken into account in the classification process (see Figure 6e).

*D*_{i} is considered as the mean of its corresponding left- and right-side values, *D*_{i} = (*D*_{i1} + *D*_{i2})/2, where *D*_{i1} and *D*_{i2} correspond, respectively, to the left and right sides of the characteristic distance *D*_{i}, except for *D*_{3} and *D*_{4}, which concern the mouth (see Figure 6e).

*α*, 0 ≤ *α* ≤ 1, which allows computing the new piece of evidence, noted ^{α}*m* (see Equation 3 and Smets, 2000), for each proposition *A* according to its current piece of evidence *m* and the discounting rate *α* as

^{α}*m*(*A*) = *α* · *m*(*A*) for *A* ≠ Ω_{Di}, and ^{α}*m*(Ω_{Di}) = 1 − *α* + *α* · *m*(Ω_{Di}),

for every proposition *A*. If the distance is fully reliable (*α* = 1), then *m* is left unchanged (i.e., ^{α}*m*(*A*) = *m*(*A*)). If the distance is not reliable at all (*α* = 0), *m* is transformed into the vacuous BBA (i.e., ^{α}*m*(*A*) = 0 for every *A* ≠ Ω_{Di} and ^{α}*m*(Ω_{Di}) = 1).
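A minimal sketch of this discounting operation, with BBAs represented as dictionaries from frozensets of states to masses (the representation is ours, not the authors'):

```python
def discount(bba, alpha, frame):
    """Discounting (Smets, 2000): scale the mass of every proposition other
    than the whole frame by the reliability alpha, and transfer the removed
    mass to the frame (total ignorance)."""
    out = {A: alpha * mass for A, mass in bba.items() if A != frame}
    out[frame] = 1.0 - alpha + alpha * bba.get(frame, 0.0)
    return out

# Hypothetical single-distance BBA over its three states.
frame = frozenset({"S", "C+", "C-"})
bba = {frozenset({"C+"}): 0.7, frozenset({"S", "C+"}): 0.3}
half = discount(bba, 0.5, frame)   # a half-reliable distance: half the
                                   # mass migrates to total ignorance
```

With `alpha = 1` the BBA is returned unchanged, and with `alpha = 0` all the mass ends up on the frame, matching the two limit cases described above.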

*Bubbles* experiment, the revealed facial parts used for the classification process appear with different intensities. This can be understood as differences in the reliability of the corresponding distances.

*Discounting* was used to weight the contribution of each characteristic distance *D*_{i} according to its intensity, represented by *inten*(*D*_{i}). This leads to five *discounting* parameters *α*_{i} (1 ≤ *i* ≤ 5), one for each characteristic distance *D*_{i}.

*α*_{i} can be computed by learning (Elouadi, Mellouli, & Smets, 2004) or by optimizing a criterion (Mercier, Denoeux, & Masson, 2006) when the reliability of the sensors is uncertain or unknown. In the current work, the reliability of the sensors (the characteristic distances) is known and is equal to their appearance intensity after the application of the Bubbles mask. The corresponding reliability parameters *α*_{i} are thus equal to *inten*(*D*_{i}).

*D*_{i} is computed by measuring the distance between two points relative to their distance in the neutral state. As reported above, each distance is considered only if the intensities of its two associated points are both different from 0. Its intensity is then taken as the mean of the intensities of its associated points. For example, *α*_{1}, the discounting parameter of *D*_{1}, was computed as *α*_{1} = (inten(*P*1) + inten(*P*2))/2, where inten(*P*1) and inten(*P*2) correspond, respectively, to the intensities of pixels P1 and P2, which are different from 0 (see Figure 6c).

*α*_{i} is tested, allowing us to reach our goal of analyzing the response of the system to the inhibition or the discounting of the required information.

*D*_{i} ≠ 0) were fully reliable (i.e., *α*_{i} = 1 for 1 ≤ *i* ≤ 5; see the second simulation in the Simulations section).

*α*_{i} of all the characteristic distances used were set, the corresponding BBAs were redefined according to Equation 3.

*D*_{i} states.

^{Ω}, where Ω = {*Happiness* (*E*_{1}), *Surprise* (*E*_{2}), *Disgust* (*E*_{3}), *Fear* (*E*_{4}), *Anger* (*E*_{5}), *Sadness* (*E*_{6}), *Neutral* (*E*_{7})} is the set of expressions.

*m*_{Di}^{Ω} is derived for each characteristic distance *D*_{i}. In order to combine all this information, a fusion process of the BBAs *m*_{Di}^{Ω} of all the states of the characteristic distances is performed using the conjunctive rule of combination, noted ⊕ (see Equation 5; Denoeux, 2008; Smets, 2000; Equation 6 shows the mathematical definition of the corresponding symbol), and results in *m*^{Ω}, the BBA of the corresponding expressions.

*D*_{i} and *D*_{j} with two BBAs *m*_{Di}^{Ω} and *m*_{Dj}^{Ω} derived on the same frame of discernment, the joint BBA *m*_{Dij} is given using the conjunctive combination (orthogonal sum) as

*m*_{Dij}(*A*) = (*m*_{Di} ⊕ *m*_{Dj})(*A*) = ∑_{*B* ∩ *C* = *A*} *m*_{Di}(*B*) · *m*_{Dj}(*C*),

where *A*, *B*, and *C* denote propositions and the sign ∩ in *B* ∩ *C* denotes the conjunction (intersection) between the propositions *B* and *C*. This leads to propositions with a lower number of elements and with more accurate pieces of evidence.
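The conjunctive rule can be sketched directly from this definition, again with propositions as frozensets (an illustrative implementation with hypothetical masses, not the authors' code):

```python
from itertools import product

def conjunctive_combine(m1, m2):
    """Conjunctive rule of combination (orthogonal sum, unnormalized):
    m12(A) = sum of m1(B) * m2(C) over all pairs with B intersect C = A.
    Propositions are frozensets; mass landing on the empty set is conflict."""
    m12 = {}
    for (B, mB), (C, mC) in product(m1.items(), m2.items()):
        A = B & C                      # intersection of the two propositions
        m12[A] = m12.get(A, 0.0) + mB * mC
    return m12

# Hypothetical BBAs over expressions induced by two distance states.
omega = frozenset({"Surprise", "Fear", "Happiness"})
m_d1 = {frozenset({"Surprise", "Fear"}): 0.8, omega: 0.2}
m_d2 = {frozenset({"Surprise"}): 0.6, omega: 0.4}
joint = conjunctive_combine(m_d1, m_d2)   # belief concentrates on Surprise
```

As the text notes, each combination step intersects propositions, so belief migrates toward smaller, more specific sets of expressions.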

*E*_{e} and their possible combinations. Making a decision is associated with a risk except if the result is certain (*m*(*E*_{e}) = 1). As this is not always the case (more than one expression can be recognized at once), several decision criteria can be used (Denoeux, 2008; Smets, 2000).

*BetP* (see Equation 7 and Smets, 2005), which only deals with singleton expressions:

*BetP*(*C*) = ∑_{*A* ⊆ Ω, *C* ∈ *A*} *m*(*A*) / (*Card*(*A*) · (1 − *m*(*ϕ*))),

where *BetP*(*C*) corresponds to the pignistic probability of each one of the hypotheses *C* of *A*, *ϕ* corresponds to the conflict between the sensors, and *Card*(*A*) corresponds to the number of elements (hypotheses) of *A*.
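A sketch of the pignistic transformation under the same dictionary representation (illustrative only; the mass on the empty frozenset plays the role of the conflict *ϕ*):

```python
def pignistic(m):
    """Pignistic transformation: the mass of each focal proposition is
    shared equally among its singleton hypotheses, after conditioning
    away the conflict mass assigned to the empty set."""
    conflict = m.get(frozenset(), 0.0)
    betp = {}
    for A, mass in m.items():
        if not A:                       # skip the conflict itself
            continue
        share = mass / (len(A) * (1.0 - conflict))
        for hypothesis in A:
            betp[hypothesis] = betp.get(hypothesis, 0.0) + share
    return betp

# Hypothetical combined BBA over expressions, with 0.1 of conflict.
m = {frozenset({"Anger"}): 0.5,
     frozenset({"Anger", "Disgust"}): 0.4,
     frozenset(): 0.1}
```

Here *BetP*(Anger) = (0.5 + 0.4/2)/0.9 ≈ 0.78, so the decision criterion would select Anger.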

*Bubbles* mask are identified. Figure 7 presents an example of the information displayed during the analysis of the Anger expression. In this example, all the characteristic distances are identified and used, but, as explained above, this is not always the case. The interface is divided into five regions: in the upper left region, the frame to be analyzed; in the upper middle region, the BBAs of the expressions (in this case only the Anger expression appears, with a piece of evidence equal to 1); in the upper right region, the decision result based on the pignistic probability, with its value; in the lower left region, the states of the characteristic distances and their pieces of evidence; and in the lower right region, the corresponding facial feature deformations.

*P* > 0.01). The classification rate for Happiness (45%) is lower than in the first simulation and lower than that of humans. As with the performance obtained using all characteristic distances, the worst classification rate was obtained with Sadness (28%).

*D*_{4}, the only one available for this particular combination of facial expression image and Gaussian apertures. Here, the characteristic distances required to distinguish between these expressions are inhibited. Third, the inhibition reduced the Ignorance rates for Disgust, Fear, and Sadness. These results mean that some characteristic distances are necessary for the recognition of some expressions, while others increase the doubt, so that their inhibition improves recognition.

*P* > 0.05).

*R*_{Happiness} = 0.02, *R*_{Disgust} = 0.035, *R*_{Surprise} = 0.04, *R*_{Anger} = −0.02, *R*_{Fear} = 0.003, and *R*_{Sadness} = −0.02. Based on these results, it is clear that even if the classification rates of the model and of the human classifiers are comparable, they do not behave the same way on a trial-by-trial basis. The difference must pertain to the information used for the recognition. The next section assesses this possibility.

*E*_{e} and the independent variables correspond to the five characteristic distances *D*_{i}. For example, Happiness (*E*_{1}), on a given trial *t*, could be defined as *E*_{1} = *d*_{1}*x*_{1t} + … + *d*_{5}*x*_{5t}, where *x*_{1t}, …, *x*_{5t} correspond to the appearance intensities of the characteristic distances *D*_{1}, …, *D*_{5}.

*E*_{1}, then based on all the available data, we obtain a system of *n* such equations, where *n* corresponds to the number of times *E*_{1} is presented and recognized, and *x*_{ni} corresponds to the appearance intensity (see Discounting section) of the characteristic distance *D*_{i} during the recognition of the expression *E*_{1} at time *n*.

*d*_{Ee} corresponds to the coefficients of the characteristic distances, reflecting their importance for the recognition of each facial expression *E*_{e}, 1 ≤ *e* ≤ 6.
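The regression step can be sketched with ordinary least squares; the trial intensities below are hypothetical placeholders purely to show the shape of the computation, not data from the experiment:

```python
import numpy as np

# Hypothetical appearance intensities of D1..D5 on trials where E1
# (Happiness) was presented and recognized; placeholders, not real data.
X = np.array([[0.9, 0.1, 0.8, 0.7, 0.2],
              [0.7, 0.0, 0.9, 0.8, 0.1],
              [0.8, 0.2, 0.7, 0.9, 0.3]])
y = np.ones(len(X))          # each row ended in a correct E1 response

# Least-squares estimate of the coefficients d_E1, one per distance,
# reflecting the importance of each distance for recognizing E1.
d_E1, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Large positive coefficients flag the distances whose visibility best predicts correct recognition of the expression, which is how the per-expression importance profiles in Figure 13 can be obtained.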

*R*^{2} are measured and reported in Figure 13.^{3} Except for *Sadness*, the values of *R*^{2} are positive and very high, which reflects a good fit of the data and thus a high confidence in the coefficients obtained.

*Anger*, there is an excellent correspondence between the most important characteristic distances for the proposed model and the facial cues used by the ideal observer (or model) of Smith et al. This model uses all the information available to perform the task optimally. These results allow the conclusion that the characteristic distances used summarize the most important information necessary for the classification of the facial expressions in the CAFE database, and that the rules we used (i.e., Table 1) reflect ideal, but not human, information usage. However, the visual cues used by human observers are different from those used by the Smith et al. model observer and the model proposed here. In some cases, human observers show a partial use of the optimal information available for the classification of facial expressions (Smith et al., 2005). For example, humans use the mouth but not the eyes for *Happiness*, and they use the eyes but not the mouth for *Fear*. In other cases, humans use information that is not optimal: for example, the nasolabial furrow in the case of *Disgust* and the wrinkles on the forehead in the case of *Sadness*. Given that humans easily outperform machines at recognizing facial expressions in everyday situations, it appears likely that their alleged "suboptimalities" in fact reflect robust everyday facial expression statistics not present in the CAFE face image set. Thus, it seems promising for a future implementation of our model to use these "suboptimal" features for facial feature classification (e.g., the nasolabial furrow in the case of *Disgust*) and to take into account their relative importance in the classification process.

^{2}The Hammal–Caplier database is composed of 19 subjects who displayed 4 expressions (Smile, Surprise, Disgust, and Neutral). Eleven subjects were used for training and 8 subjects for testing. Each video recording starts in the neutral state, reaches the apex of the expression, and goes back to the neutral state. The sequences were acquired in 5-second segments at 25 images/second.