Understanding how humans analyze facial expressions of emotion is key in a large number of scientific disciplines—from cognition to evolution to computing. An important question in the journey to understanding the perception of emotions is to determine how these expressions are perceived at different image resolutions or distances. In the present work, we have addressed this question.
The results reported above uncovered the recognition rates for six of the most commonly seen emotional expressions (i.e., happy, sad, angry, disgust, fear, surprise) and neutral as seen at five distinct resolutions. We have also studied the confusion tables, which indicate which emotions are mistaken for others and how often. We have seen that two of the emotions (happy and surprise) are easily recognized and rarely mistaken for others. Two other emotions (sadness and anger) are less well recognized and show strong asymmetric confusions with other emotions. Sadness is most often mistaken for neutral, and anger for sadness and disgust. Yet, neutral is almost never confused for sadness, and sadness is extremely rarely mistaken for anger. The last two emotions (fear and disgust) were poorly recognized by our subjects. Nonetheless, their confusion patterns are consistent. Disgust is very often mistaken for anger; in fact, it is sometimes classified as anger more often than in its own category. Fear is commonly mistaken for surprise and, to a lesser degree, disgust, at high and mid resolutions (i.e., 1 to 1/4). At small resolutions (i.e., 1/8), fear is also taken to be happiness or sadness.
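To make the notion of asymmetric confusions concrete, the brief sketch below shows one way such asymmetries could be quantified from a confusion table: row-normalize the counts and compare P(B | A) with P(A | B) for each pair of categories. The counts in the example are invented for illustration only and are not the values reported in Table 1.

```python
import numpy as np

# Hypothetical confusion counts (rows = true emotion, columns = response).
# These numbers are illustrative only; see Table 1 for the actual data.
emotions = ["happy", "sad", "angry", "disgust", "fear", "surprise", "neutral"]
counts = np.array([
    [95,  0,  0,  1,  0,  2,  2],   # happy
    [ 1, 60,  2,  5,  2,  0, 30],   # sad
    [ 0, 12, 55, 25,  3,  1,  4],   # angry
    [ 2,  4, 20, 45, 10,  5, 14],   # disgust
    [ 3,  6,  2,  8, 40, 38,  3],   # fear
    [ 1,  0,  0,  1,  4, 92,  2],   # surprise
    [ 2,  3,  1,  1,  0,  1, 92],   # neutral
])

# Row-normalize to obtain P(response | true emotion).
p = counts / counts.sum(axis=1, keepdims=True)

# Asymmetry: how much more often A is taken for B than B is taken for A.
for i, a in enumerate(emotions):
    for j, b in enumerate(emotions):
        if i < j and abs(p[i, j] - p[j, i]) > 0.10:  # clearly asymmetric pairs
            print(f"{a} -> {b}: {p[i, j]:.2f}  vs  {b} -> {a}: {p[j, i]:.2f}")
```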
The results summarized above suggest three groups of facial expressions of emotion. The first group (happy and surprise) is formed by expressions that are readily classified at any resolution. This could indicate that the production and perception systems of these facial expressions of emotion coevolved to maximize transmission of information (Fridlund,
1991; Schmidt & Cohn,
2001). The second group (angry and sad) is well recognized at high resolutions only. However, given their reduced recognition rates even at the highest resolution, the mechanisms of production and recognition of these expressions may not have coevolved. Rather, perception may have followed production, since recognition of these emotions at proximal distances could prove beneficial for the survival of either the sender or the receiver. The third group (fear and disgust) includes expressions that are poorly recognized at any distance. One hypothesis (Susskind et al.,
2008) is that these expressions serve as a sensory enhancement and blocking mechanism. Under this view, without the cooperation of a sender willing to modify her expression, the visual system has had the hardest task: defining a computational space that can recognize these expressions from a variety of distances. As in the first group, the emotions in this third group are recognized similarly at all distances, except when the percept is no longer distinguishable at resolution 1/16.
An alternative explanation for the existence of these three groups could be given by the priors assigned to each emotion. For example, university students and staff generally feel safe and happy. As a consequence, expressions such as happy could be expected, whereas fear may not be.
Perhaps more intriguing are the asymmetric patterns in the confusion tables. Why should fear be consistently mistaken for surprise but not vice versa? One hypothesis comes from studies of letter recognition (Appelman & Mayzner,
1982; James & Ashby,
1982). Under this model, people may add unseen features to the percept but will only rarely delete those present in the image. For instance, the letter F is more often confused for an E than an E is for an F. The argument is that E can be obtained from F by adding a nonexistent feature, whereas perceiving F from an E would require eliminating a feature. Arguably, the strongest evidence against this model comes from the perception of neutral in sad faces, which would require eliminating all image features indicating otherwise.
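A minimal way to make this feature-addition account explicit is to represent each percept as a set of features and allow a confusion only when the features of the presented item are contained in those of the response item, so that features may be added but never deleted. The stroke-based letter encoding below is an assumption made purely for illustration.

```python
# Stroke-based encoding of E and F: an assumed feature set for illustration.
letters = {
    "F": {"vertical", "top_bar", "middle_bar"},
    "E": {"vertical", "top_bar", "middle_bar", "bottom_bar"},
}

def addition_model_allows(features, presented, response):
    """Under the addition model, a presented item may be confused for a
    response item only if every feature in the image is kept, i.e., the
    response adds features but never deletes them."""
    return features[presented] <= features[response]

print(addition_model_allows(letters, "F", "E"))  # True: F -> E adds one bar
print(addition_model_allows(letters, "E", "F"))  # False: E -> F deletes a bar
```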
However, to properly consider the above model, it would be necessary to know the features (dimensions) of the computational space of these emotions. One possibility is that we decode the movement of the muscles of the face, i.e., the AUs correspond to the dimensions of the computational space (Kohler et al.,
2004; Tian, Kanade, & Cohn,
2001). For example, surprise generally involves AUs 1 + 2 + 5 + 26 or 27. Fear usually activates AUs 1 + 2 + 5 + 25 + 26 or 27 and may also include AUs 4 and 20. Note that the AUs in surprise are a subset of those of fear. Hence, according to the model under consideration, it is expected that surprise will be mistaken for fear but not the other way around. Yet, surprise is not confused for fear, whereas fear is mistaken for surprise quite often; for this to happen, active AUs such as 4, 20, or 25 would have to be deleted from the percept, which the model under consideration does not allow. A more probable explanation is that the image features extracted to classify facial expressions of emotion do not code AUs. Further support for this latter point is given by the rest of the mistakes identified in
Table 1. Sadness is confused for disgust, even though they do not share
any common AU. Disgust and anger only share AUs that are not required to display the emotion. In addition, for anger to be mistaken for sadness, several active AUs would have to be deleted.
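The same containment test can be applied to the AU codes discussed above. The sketch below encodes the prototypical AU sets given in the text (collapsing "26 or 27" to AU 26, an assumption made only to keep the example simple) and shows that the addition model predicts the surprise-for-fear confusion, which is the opposite of the asymmetry we observed.

```python
# Prototypical AU sets as listed in the text; "26 or 27" is collapsed to
# AU 26 purely to keep this illustration simple.
aus = {
    "surprise": {1, 2, 5, 26},
    "fear":     {1, 2, 5, 25, 26},  # AUs 4 and 20 may also be active
}

def addition_model_allows(features, presented, response):
    # A confusion is allowed only if no active AU has to be deleted.
    return features[presented] <= features[response]

# The addition model predicts that surprise can be taken for fear...
print(addition_model_allows(aus, "surprise", "fear"))   # True
# ...but not fear for surprise (AU 25 would have to be deleted), which is
# the reverse of the asymmetry found in the data.
print(addition_model_allows(aus, "fear", "surprise"))   # False
```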
We have also considered the subtraction model (Appelman & Mayzner,
1982; Geyer & DeWald,
1973), where E is most likely confused for F because it is easier to delete a few features than to add them. This model is consistent with the confusion of fear for surprise but is inconsistent with all other misclassifications and asymmetries. The results summarized in the last two paragraphs are consistent with previous reports of emotion perception in the absence of any active AU (Hess, Adams, Grammer, & Kleck,
2009; Neth & Martinez,
2009; Zebrowitz, Kikuchi, & Fellous,
2007). In some instances, features seem to be added while others are omitted even as distance changes (Laprevote et al.,
2010).
It could also be expected that expressions involving larger deformation are easier to identify (Martinez,
2003). The largest shape displacement belongs to surprise, which is consistent with this expression being easily identified at any resolution. The recognition of surprise in images of 15 × 10 pixels is actually better than that of fear and disgust in the full-resolution images (240 × 160 pixels). Happiness also has a large deformation and is readily classified. However, fear and disgust include deformations that are as large as (or larger than) those of happiness. Yet, these are the two expressions that are recognized most poorly.
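One simple way to operationalize shape deformation is as the mean Euclidean displacement of corresponding landmarks between a neutral and an expressive face; the sketch below illustrates this measure together with the image sizes corresponding to the five resolutions used here. The landmark coordinates are placeholders chosen for illustration, not measurements from our stimuli.

```python
import numpy as np

def shape_displacement(neutral: np.ndarray, expressive: np.ndarray) -> float:
    """Mean Euclidean displacement between corresponding landmarks
    (n_landmarks x 2 arrays of x, y coordinates in pixels)."""
    return float(np.linalg.norm(expressive - neutral, axis=1).mean())

# Placeholder landmarks for illustration only (e.g., brow, brow, mouth points).
neutral  = np.array([[100.0, 60.0], [140.0, 60.0], [120.0, 130.0]])
surprise = np.array([[100.0, 48.0], [140.0, 48.0], [120.0, 150.0]])  # large change
sadness  = np.array([[101.0, 62.0], [139.0, 62.0], [120.0, 133.0]])  # small change

print(shape_displacement(neutral, surprise))  # larger value
print(shape_displacement(neutral, sadness))   # smaller value

# Resolutions used in the study: 1, 1/2, 1/4, 1/8, 1/16 of 240 x 160 pixels.
for scale in [1, 2, 4, 8, 16]:
    print(f"1/{scale}: {240 // scale} x {160 // scale} pixels")
```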
Another possibility is that only a small subset of AUs is diagnostic. Happy is the only expression with AU 12, which pulls the corners of the lips up, and this can make it readily recognizable. Happiness plays a fundamental role in human societies (Russell,
2003). One hypothesis is that it had to evolve a clearly distinct expression. Some AUs in surprise also seem to be highly diagnostic, making it easy to confuse fear (which may have evolved to enhance sensory input) for surprise. In contrast, sadness activates AU 4 (which lowers the inner corners of the brows) and disgust activates AU 9 (which wrinkles the nose). These two AUs are commonly confused for one another (Ekman & Friesen,
1978), suggesting that they are not very diagnostic.
Differences in the use of diagnostic features seem to be further suggested by our results for women versus men. Women are generally significantly better at correctly identifying emotions and make fewer misclassifications. Other studies suggest that women are also more expressive than men (Kring & Gordon,
1998). Understanding gender differences is important not only to define the underlying model of face processing but also in a variety of social studies (Feingold,
1994).
Before further studies can properly address these important questions, we need a better understanding of the features defining the computational model of facial expressions of emotion. The above discussion strongly suggests that faces are not AU-coded, meaning that the dimensions of the cognitive space are unlikely to be highly correlated with AUs. Neth and Martinez (
2010) have shown that shape contributes significantly to the perception of sadness and anger in faces and that these shape cues are only loosely correlated with AUs. Similarly, Lundqvist, Esteves, and Öhman (
1999) found that eyebrows are generally the most useful feature for detecting threatening faces, followed by the mouth and the eyes. The results reported above suggest that this order would be different for each emotion class.