Yufei Wang, Garrison Cottrell; Recognizing Urban Tribes with pre-trained Convolutional Neural Networks. Journal of Vision 2015;15(12):1171. doi: 10.1167/15.12.1171.
© 2017 Association for Research in Vision and Ophthalmology.
In the past few years, Convolutional Neural Networks (CNNs) have proven especially powerful in image categorization. However, the analysis of the social features of images of groups of people has not attracted much research. Analysis of social groups is difficult because group categories are semantically ambiguous and have high intra-class variance. We investigate the generalization ability of pre-trained CNN features on social group recognition. We propose a CNN-feature-based architecture for social group recognition and test it on an 11-category "urban tribes" dataset. Our model takes in both images of individuals and global scene images, and features are extracted by a fine-tuned pre-trained CNN. The pre-trained CNN architecture we use is the one developed by Krizhevsky et al. (2012), which was pre-trained on the 1000-class ImageNet dataset. In our recognition scheme, patches representing individuals are first extracted from the images; these individual patches and the original complete scene are then processed by the CNN, and the activations of the two penultimate fully connected layers are extracted as features. SVM classifiers then predict class probabilities from the individual and scene features, and these probabilities are combined to produce the final recognition result. Our method reaches 71.23% on the urban tribes dataset, a substantial improvement over the previous state of the art of 46%. We further investigate why features extracted from a pre-trained CNN are useful for the urban tribe recognition task. For an input image, there is a correlation between the probability of it belonging to ImageNet classes and the probability of it belonging to urban tribes classes. Moreover, the degree of correlation is related to the recognition rate of the different urban tribes classes. This may indicate that the “generic” features extracted by pre-trained CNNs are not so generic. However, the actual relationship between the two types of categories remains unclear in most cases.
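The final step of the recognition scheme, combining per-individual and scene-level class probabilities, can be illustrated with a minimal sketch. The abstract does not state how the two sets of SVM probabilities are combined, so the rule here (average the per-person probabilities, then take a weighted sum with the scene probabilities) is an assumption, as are the function names and the `scene_weight` parameter. The probability vectors are taken as given inputs; in the described pipeline they would come from SVM classifiers applied to CNN features.

```python
from typing import List, Sequence


def mean_probs(person_probs: Sequence[Sequence[float]]) -> List[float]:
    """Average per-person class probabilities into one group-level vector."""
    n_people = len(person_probs)
    n_classes = len(person_probs[0])
    return [sum(p[c] for p in person_probs) / n_people for c in range(n_classes)]


def fuse(
    person_probs: Sequence[Sequence[float]],
    scene_probs: Sequence[float],
    scene_weight: float = 0.5,  # hypothetical mixing weight, not from the abstract
) -> List[float]:
    """Weighted sum of individual-level and scene-level class probabilities.

    Assumed fusion rule: the abstract only says the two predictions
    "are combined", without specifying the method.
    """
    if not person_probs:
        # No individual patches were extracted: fall back to the scene alone.
        return list(scene_probs)
    individual = mean_probs(person_probs)
    return [
        (1.0 - scene_weight) * i + scene_weight * s
        for i, s in zip(individual, scene_probs)
    ]


def predict(fused: Sequence[float]) -> int:
    """Return the index of the most probable class."""
    return max(range(len(fused)), key=lambda c: fused[c])


# Toy example with 3 classes instead of 11, and two detected individuals.
people = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1]]
scene = [0.1, 0.8, 0.1]
fused = fuse(people, scene)
label = predict(fused)
```

With equal weighting, a confident scene prediction can override disagreement among the individual patches; the weight would normally be tuned on held-out data.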
Meeting abstract presented at VSS 2015