Abstract
Humans can detect, recognize, and classify a range of actions easily and quickly. Despite enormous research efforts, it remains unknown which spatial-temporal features should be encoded and what the statistics of these features are in natural human actions. In this work, we proposed natural action structures, i.e., multi-size, multi-scale, spatial-temporal concatenations of features, as the basic encoding units of natural human actions. We compiled these structures in three steps. First, we sampled a large number of sequences of circular patches at multiple spatial and temporal scales. The spatial and temporal scales were coupled such that sequences at finer spatial scales had shorter durations. Second, we performed independent component analysis on the patch sequences and grouped the resulting independent components into clusters using the k-means method. Third, we compiled a large set of natural action structures, each corresponding to a unique combination of clusters across all the spatial and temporal scales. We examined the statistics of these natural action structures and selected a set of highly informative structures for action recognition. To evaluate the utility of these natural action structures, we used them as inputs to two widely used pattern recognition methods, Latent Dirichlet Allocation and Support Vector Machine, to classify a range of human actions in the popular KTH and Weizmann datasets. We found that natural action structures obtained in this way achieved significantly better recognition performance than simple spatial-temporal features, and that this performance was better than or comparable to that of the best current models. We thus concluded that natural action structures can be used as the basic encoding units of human actions and activities and may hold the key to understanding the human ability to recognize actions.
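The final compilation step, in which each structure corresponds to a unique combination of clusters across scales, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the independent component analysis and k-means steps have already been run, so that each scale has a list of cluster centroids, and that each sampled location is assigned to its nearest centroid at every scale. The function names (`nearest_cluster`, `structure_id`, `structure_counts`) and the toy data are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest_cluster(coeffs: Vector, centroids: List[Vector]) -> int:
    """Return the index of the centroid closest to coeffs (squared Euclidean distance)."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(centroids)), key=lambda k: sq_dist(coeffs, centroids[k]))


def structure_id(
    coeffs_per_scale: List[Vector],
    centroids_per_scale: List[List[Vector]],
) -> Tuple[int, ...]:
    """Combine the cluster index found at every scale into one tuple.

    Assumption: the tuple of per-scale cluster indices identifies one structure.
    """
    return tuple(
        nearest_cluster(coeffs, centroids)
        for coeffs, centroids in zip(coeffs_per_scale, centroids_per_scale)
    )


def structure_counts(
    samples: List[List[Vector]],
    centroids_per_scale: List[List[Vector]],
) -> Counter:
    """Count how often each combination of clusters occurs across the samples."""
    return Counter(structure_id(sample, centroids_per_scale) for sample in samples)


# Toy data (hypothetical): two scales, 2-D coefficient vectors.
centroids_per_scale = [
    [(0.0, 0.0), (1.0, 1.0)],              # scale 0: two clusters
    [(0.0, 1.0), (1.0, 0.0), (2.0, 2.0)],  # scale 1: three clusters
]
samples = [
    [(0.1, -0.1), (0.9, 0.1)],
    [(0.9, 1.2), (1.8, 2.1)],
    [(0.2, 0.1), (1.1, -0.2)],
]

counts = structure_counts(samples, centroids_per_scale)
for structure, n in counts.most_common():
    print(structure, n)
```

Counts of this kind are one simple statistic that could feed a later selection of informative structures. The abstract does not specify the selection criterion, so none is sketched here.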
Meeting abstract presented at VSS 2012