Abstract
The face is equipped with a large number of independent muscles that generate observable facial movements such as nose wrinkling or smiling. These individual movements are called Action Units (AUs) in the Facial Action Coding System (FACS). FACS identifies about 40 AUs, each of which has a variable amplitude. Here, we developed a two-stage deep neural network (DNN) that accurately categorizes the underlying AUs from a sequence of image frames. We first trained a 10-layer ResNet AU decoder on 800,000 independent facial images of randomly activated AU combinations. Each image was generated by a 3D animation system that rendered a combination of up to 4 randomly selected AUs (from 42 possible AUs) with variable amplitude. The training outputs were the predicted amplitudes (range 0 to 1) of all 42 AUs. We then trained an LSTM (Long Short-Term Memory) network to aggregate the AU amplitudes predicted by the ResNet across multiple images (i.e., a sequence of 30 frames). The final output is a binary AU vector indicating which AUs were activated over the sequence. We tested our DNN on a dataset of 720 dynamic facial expression models, where each face model consists of 30 sequential image frames and the corresponding activated AUs. The d' of each AU's prediction shows that the predictions of most AUs are reliable. Moreover, by comparing the representational dissimilarity matrices (RDMs) computed over pairs of AU vectors, we observe that the output similarity pattern matches that of the input. To our knowledge, the proposed two-stage DNN is the first network to treat AU prediction as a decoding problem; it predicts not only which AUs are activated but also the amplitude of activation for multiple AUs. This work will further help in decoding the relationship between emotions and action units.
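The abstract does not include an implementation. As an illustration only, the following is a minimal PyTorch sketch of the two-stage architecture it describes, assuming a small ResNet-style frame decoder with a sigmoid output for the 42 per-frame AU amplitudes and an LSTM that aggregates the 30 per-frame amplitude vectors into a binary AU activation vector. All module names, layer sizes, and hyperparameters are hypothetical and not the authors' exact model.

```python
# Hypothetical sketch of the two-stage AU decoder described in the abstract.
# Stage 1: a small ResNet-style CNN maps one frame to 42 AU amplitudes in [0, 1].
# Stage 2: an LSTM aggregates 30 per-frame amplitude vectors into a binary AU vector.
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """Basic residual block with two 3x3 convolutions and an identity skip."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(x + self.conv2(self.relu(self.conv1(x))))


class AUFrameDecoder(nn.Module):
    """Stage 1: predicts 42 AU amplitudes from a single face image."""
    def __init__(self, num_aus=42):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(3, 32, 7, stride=2, padding=3), nn.ReLU())
        self.blocks = nn.Sequential(*[ResidualBlock(32) for _ in range(4)])
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, num_aus), nn.Sigmoid()
        )

    def forward(self, frame):
        return self.head(self.blocks(self.stem(frame)))  # (batch, 42) amplitudes


class AUSequenceAggregator(nn.Module):
    """Stage 2: aggregates per-frame amplitudes over a sequence into binary AU activations."""
    def __init__(self, num_aus=42, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(num_aus, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_aus)

    def forward(self, amplitude_sequence):
        # amplitude_sequence: (batch, 30, 42) per-frame AU amplitudes from stage 1
        _, (h_last, _) = self.lstm(amplitude_sequence)
        logits = self.classifier(h_last[-1])
        return (torch.sigmoid(logits) > 0.5).float()  # binary AU vector per sequence


# Usage sketch: decode each of the 30 frames, then aggregate the whole sequence.
frames = torch.rand(2, 30, 3, 112, 112)          # (batch, frames, channels, H, W)
decoder, aggregator = AUFrameDecoder(), AUSequenceAggregator()
per_frame = torch.stack([decoder(frames[:, t]) for t in range(frames.size(1))], dim=1)
binary_aus = aggregator(per_frame)               # (batch, 42) activated-AU indicator
```

In a sketch like this, the hard 0.5 threshold would apply only at inference; during training one would instead supervise the pre-threshold logits (e.g., with a binary cross-entropy loss) and supervise the stage-1 amplitudes with a regression loss against the known rendered AU amplitudes.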
Meeting abstract presented at VSS 2018