Abstract
In recent work we introduced the idea of Category-Consistent Features (CCFs): commonality-based generative features that occur both frequently and consistently across the exemplars of an object category (Yu, Maxfield, & Zelinsky, 2016, Psychological Science). Attention can be guided preattentively to even complex object categories, but computationally explicit theories of categorical guidance are still in their infancy. Here we show that limited automatic object detection and localization can be obtained by identifying the CCFs of a category using a convolutional neural network (CNN; VGG-19, Simonyan & Zisserman, 2015, ICLR) and then integrating their collective activations over proto-objects. Given a set of exemplar images for an object category and a global pool of visual features from a pre-trained CNN, the CCF filters (CNN-CCFs) for this category are selected as the filters that are strongly and consistently activated by the category's exemplars. The collective responses of the CNN-CCFs are then pooled into proto-objects (merged visual fragments following a superpixel segmentation), and these are integrated into larger objects by computing the geodesic distances between the segmented proto-objects based on their boundary strengths. The intensity-based geodesic distances between proto-object boundaries act as a spatially focused attentional "hand" that binds the proto-objects into a stable object, similar to the account from Coherence Theory (Rensink, 2000, Visual Cognition). Using this two-step approach we achieved, in a fully unsupervised way, state-of-the-art object class localization accuracies on the VOC2007, VOC2012, and Object Discovery datasets, with CorLoc scores of 41.2%, 47.5%, and 85.7%, respectively. Using a held-out set of ImageNet categories for testing, we also showed that our method is able to localize unknown categories that were not used for pre-training the CNN, again achieving state-of-the-art accuracy (70.7%).
Our results suggest that the locations of objects, both learned and novel, can be computed (without supervision) from CCFs integrated across proto-objects.
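The filter-selection step can be illustrated with a minimal sketch. The abstract states only that CNN-CCFs are the filters "strongly and consistently activated" by a category's exemplars; the scoring rule below (a mean-response floor followed by ranking on mean divided by standard deviation) is an assumption made for illustration, as are the function name `select_ccf_filters` and the parameters `min_mean` and `top_k`. The input is assumed to be one pooled response per filter per exemplar image.

```python
from statistics import fmean, pstdev


def select_ccf_filters(activations, min_mean=0.5, top_k=2, eps=1e-8):
    """Pick filters that respond strongly and consistently across exemplars.

    activations: list of rows, one row per exemplar image, one pooled
                 response per filter (hypothetical input format).
    min_mean:    assumed floor on the mean response ("strongly activated").
    top_k:       assumed number of filters to keep.
    """
    n_filters = len(activations[0])
    scored = []
    for f in range(n_filters):
        responses = [row[f] for row in activations]
        mean = fmean(responses)
        if mean < min_mean:
            continue  # too weak on average to count as strongly activated
        # High mean with low spread across exemplars means "consistent".
        snr = mean / (pstdev(responses) + eps)
        scored.append((snr, f))
    scored.sort(key=lambda pair: -pair[0])
    return [f for _, f in scored[:top_k]]


# Toy data: 3 exemplars x 4 filters.
demo_activations = [
    [0.9, 0.0, 0.1, 2.0],
    [1.0, 3.0, 0.1, 2.5],
    [1.1, 0.0, 0.1, 1.5],
]
selected = select_ccf_filters(demo_activations)
```

In the toy data, filter 1 is strong but fires for only one exemplar and filter 2 is consistent but weak, so neither would qualify as a CCF under this assumed rule; filters 0 and 3 are kept.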
Meeting abstract presented at VSS 2017
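For readers unfamiliar with the CorLoc scores reported above, the following is a minimal sketch of how such a score is commonly computed. The abstract does not define the metric; the sketch assumes the usual convention that an image counts as correctly localized when the predicted box overlaps some ground-truth box with intersection-over-union above 0.5. The function names and the `(x_min, y_min, x_max, y_max)` box format are illustrative choices.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def corloc(predictions, ground_truths, threshold=0.5):
    """Fraction of images whose predicted box matches any ground-truth box.

    predictions:   one predicted box per image.
    ground_truths: one list of ground-truth boxes per image.
    threshold:     assumed overlap criterion (IoU must exceed this value).
    """
    hits = sum(
        any(iou(pred, gt) > threshold for gt in gts)
        for pred, gts in zip(predictions, ground_truths)
    )
    return hits / len(predictions)


# Two toy images: the first prediction overlaps well, the second misses.
demo_preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
demo_gts = [[(0, 0, 10, 8)], [(20, 20, 30, 30)]]
demo_score = corloc(demo_preds, demo_gts)
```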