Abstract
A deep convolutional neural network was trained to classify person categories from digital photographs. We studied the specific tuning properties of nodes in this network at different hierarchical levels and explored the emergent properties when the network was connected as a recurrent system to optimize its visual input with respect to various high-level criteria. The network design was borrowed from the GoogLeNet architecture and trained from scratch on 584 diverse person categories (e.g., economist, violinist, grandmother, child) from the ImageNet dataset. With the trained model we used gradient descent to generate preferred stimulus images for each convolutional node class in the network. At the first convolutional layer, these preferred images were simple grating patterns, monochrome or color, varying in orientation and spatial frequency and indistinguishable from the types of preferred images for the same network architecture trained on non-human categories. Nodes in the next two convolutional layers showed increasingly complex pattern specificity, without any recognizable person-specific features. Person-specific nodes were apparent in all subsequent convolutional layers. In the first layers where person-specific features arise, they appear to be generic face detectors. In subsequent layers, detailed eye, nose and mouth detectors begin to appear, as well as nodes selective for more restrictive person categories (selective for age and gender, for example). At still higher levels, some nodes become selective for complete human bodies. The larger receptive fields of these higher-level nodes provide the spatial scope for increasing degrees of context specificity: in particular, there are nodes which prefer a main human figure to one or more smaller figures in the background. These modeling results emphasize the importance of both facial features and figure ensembles in recognition of person categories.
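The idea of generating a preferred stimulus by gradient-based optimization of the input can be illustrated with a minimal sketch. It is not the network or code used in this work: it assumes a single hypothetical linear node whose weights form an oriented Gabor patch, starts from a blank image rather than noise, and adds an L2 penalty (`decay`) to keep the image bounded. All function names and parameter values are illustrative assumptions. Under these assumptions the optimized image converges to a scaled copy of the filter, i.e. a grating, which is consistent with the first-layer result described above.

```python
import math

def gabor_filter(size=9, wavelength=4.0, theta=0.0, sigma=2.5):
    """Hypothetical first-layer filter: an oriented Gabor patch (size x size)."""
    half = size // 2
    rows = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Coordinate along the grating's direction of modulation.
            xr = x * math.cos(theta) + y * math.sin(theta)
            envelope = math.exp(-(x * x + y * y) / (2 * sigma ** 2))
            row.append(envelope * math.cos(2 * math.pi * xr / wavelength))
        rows.append(row)
    return rows

def activation(weights, image):
    """Linear response of one node to an image patch of the same size."""
    return sum(w * p for wr, ir in zip(weights, image) for w, p in zip(wr, ir))

def preferred_stimulus(weights, steps=200, lr=0.1, decay=0.5):
    """Gradient ascent on: activation(weights, image) - decay * ||image||^2."""
    n = len(weights)
    image = [[0.0] * n for _ in range(n)]  # assumption: start from a blank image
    for _ in range(steps):
        for i in range(n):
            for j in range(n):
                # d/d(image[i][j]) of the objective for a linear node.
                grad = weights[i][j] - 2 * decay * image[i][j]
                image[i][j] += lr * grad
    return image

weights = gabor_filter()
image = preferred_stimulus(weights)
```

For a deep nonlinear network the gradient would instead be obtained by backpropagation through the layers up to the chosen node, and the preferred images would no longer be simple copies of a filter.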
Meeting abstract presented at VSS 2016