Abstract
Algorithms based on deep convolutional neural networks (DCNNs) have made impressive gains on the problem of recognizing faces across changes in appearance, illumination, and viewpoint. These networks are trained on a very large number of face identities and ultimately develop a highly compact representation of each face at the network's top level. It is generally assumed that these representations capture aspects of facial identity that are invariant across pose, illumination, expression, and appearance. We analyzed the top-level feature space produced by two state-of-the-art DCNNs trained for face identification with >494,000 images of 10,575 individuals (Chen et al., 2016; Sankaranarayanan et al., 2016). In one set of experiments, we trained classifiers to predict image-based properties of faces using the networks' top-level feature descriptions as input. Classifiers determined face yaw to within 9.5 degrees and face pitch (frontal versus offset) at 67% correct. Top-level features also predicted whether the input came from a photograph or a video frame with 87% accuracy. In a second experiment, we compared top-level feature codes of different views of the same identities to develop an index of feature invariance. Surprisingly, invariant coding proved to be a characteristic of individual identities rather than of individual features: some identities were encoded invariantly across images, whereas others were not. In a third analysis, we used t-distributed stochastic neighbor embedding (t-SNE) to visualize the top-level DCNN feature space for the Janus CS3 dataset (cf. Klare et al., 2015), which contains over 69,000 images of 1,894 distinct identities. This visualization indicated that image-quality information is retained in the top-level DCNN features, with poor-quality images clustering at the center of the space. The representation of photometric details of face images in top-level DCNN features echoes findings of object category-orthogonal information in macaque IT cortex (Hong et al., 2016), reinforcing the claim that coarse codes can effectively represent complex stimulus sets.
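As a rough illustration of the first analysis, the sketch below trains a regressor on top-level feature vectors to predict yaw. The feature matrix, its dimensionality, and the choice of a linear model are all assumptions made for illustration; the abstract does not specify the classifier used.

```python
# Sketch of the first analysis: predict an image property (yaw) from
# top-level DCNN features. Random data stands in for real features.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_images, n_features = 1000, 320                      # illustrative sizes only
features = rng.normal(size=(n_images, n_features))    # stand-in top-level features
yaw_degrees = rng.uniform(-90, 90, size=n_images)     # stand-in ground-truth yaw

X_train, X_test, y_train, y_test = train_test_split(
    features, yaw_degrees, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)
# The abstract reports yaw recovered to within ~9.5 degrees on real features.
print(f"Mean absolute yaw error: {np.mean(np.abs(pred - y_test)):.1f} degrees")
```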
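The per-identity invariance comparison in the second experiment could be indexed, for example, as the mean pairwise cosine similarity among the feature vectors of one identity's images. The cosine-similarity measure and the helper `invariance_index` below are hypothetical choices; the abstract does not state the exact index used.

```python
# Sketch of the second analysis: an invariance index per identity, computed
# here (as an assumption) from pairwise cosine similarities among top-level
# feature vectors of different images of that identity.
import numpy as np

def invariance_index(feature_vectors: np.ndarray) -> float:
    """Mean pairwise cosine similarity across images of one identity."""
    normed = feature_vectors / np.linalg.norm(feature_vectors, axis=1, keepdims=True)
    sims = normed @ normed.T                       # all pairwise cosine similarities
    upper = sims[np.triu_indices(len(sims), k=1)]  # drop self-similarities
    return float(upper.mean())

rng = np.random.default_rng(1)
views = rng.normal(size=(12, 320))  # 12 hypothetical views of one identity
print(f"Invariance index: {invariance_index(views):.3f}")
```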
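For the third analysis, a minimal t-SNE sketch using scikit-learn is shown below, with random vectors standing in for the CS3 feature vectors, which are not reproduced here.

```python
# Sketch of the third analysis: embed top-level DCNN features in 2-D with
# t-SNE to inspect the global layout of the space.
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
features = rng.normal(size=(500, 320))  # stand-in for CS3 top-level features
embedded = TSNE(n_components=2, random_state=0).fit_transform(features)

plt.scatter(embedded[:, 0], embedded[:, 1], s=4)
plt.title("t-SNE of top-level DCNN features (illustrative)")
plt.show()
```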
Meeting abstract presented at VSS 2017