September 2017
Volume 17, Issue 10
Open Access
Vision Sciences Society Annual Meeting Abstract  |   August 2017
Object detection and localization for free from category-consistent CNN features.
Author Affiliations
  • Hieu Le
    Computer Science Department, Stony Brook University
  • Chen-Ping Yu
    Psychology Department, Harvard University
  • Dimitris Samaras
    Computer Science Department, Stony Brook University
  • Gregory Zelinsky
    Computer Science Department, Stony Brook University
    Psychology Department, Stony Brook University
Journal of Vision August 2017, Vol. 17, 1248.

Hieu Le, Chen-Ping Yu, Dimitris Samaras, Gregory Zelinsky; Object detection and localization for free from category-consistent CNN features. Journal of Vision 2017;17(10):1248.

      © ARVO (1962-2015); The Authors (2016-present)


In recent work we introduced the idea of Category-Consistent Features (CCFs): commonality-based generative features that occur both frequently and consistently across the exemplars of an object category (Yu, Maxfield, & Zelinsky, 2016, Psychological Science). Attention can be guided preattentively to even complex object categories, but computationally explicit theories of categorical guidance are still in their infancy. Here we show that limited automatic object detection and localization can be obtained by identifying the CCFs of a category using a convolutional neural network (CNN; VGG-19, Simonyan & Zisserman, 2015, ICLR) and then integrating their collective activations over proto-objects.

Given a set of exemplar images for an object category and a global pool of visual features from a pre-trained CNN, the CCF filters (CNN-CCFs) for that category are selected as the filters that are strongly and consistently activated by the category's exemplars. The collective responses of the CNN-CCFs are then pooled into proto-objects (merged visual fragments following a superpixel segmentation), and these are integrated into larger objects by computing the geodesic distances between the segmented proto-objects based on their boundary strengths. The intensity-based geodesic distances between proto-object boundaries act as a spatially focused attentional "hand" that binds the proto-objects into a stable object, similar to the account from Coherence Theory (Rensink, 2000, Visual Cognition).

Using this two-step approach we achieved, in a fully unsupervised way, state-of-the-art object-class localization accuracies on the VOC2007, VOC2012, and Object Discovery datasets, with CorLoc scores of 41.2%, 47.5%, and 85.7%, respectively. Using a held-out set of ImageNet categories for testing, we also showed that our method can localize unknown categories that were not used for pre-training the CNN, again achieving state-of-the-art accuracy (70.7%).
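The CCF selection step, choosing filters that respond both strongly and consistently across a category's exemplars, can be sketched as follows. This is a minimal illustration, not the paper's exact procedure: the mean-to-spread scoring rule, the pooled-activation input shape, and the function name are all assumptions made here for clarity.

```python
import numpy as np

def select_ccf_filters(activations, top_k=5):
    """Pick candidate CCF filters: those with high mean activation
    (strong) and low spread (consistent) across category exemplars.

    activations: (n_exemplars, n_filters) array of per-exemplar,
    per-filter pooled CNN responses (a hypothetical input format).
    """
    mean = activations.mean(axis=0)
    std = activations.std(axis=0)
    # Illustrative consistency score: strength relative to spread.
    score = mean / (std + 1e-8)
    return np.argsort(score)[::-1][:top_k]

rng = np.random.default_rng(0)
# Simulated pooled activations: 20 exemplars x 100 filters, with
# filters 0-4 planted as strong and consistent on purpose.
acts = rng.random((20, 100))
acts[:, :5] = 5.0 + 0.01 * rng.random((20, 5))
ccf = select_ccf_filters(acts, top_k=5)
print(sorted(ccf.tolist()))  # → [0, 1, 2, 3, 4]
```

The planted filters score orders of magnitude higher than the noisy ones, so the selection recovers exactly the filters that fire strongly and consistently, which is the intuition behind CCFs.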
Our results suggest that the locations of objects, both learned and novel, can be computed (without supervision) from CCFs integrated across proto-objects.
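The geodesic binding step above, where proto-objects separated only by weak boundaries are geodesically close and so get merged into one object, can be illustrated with a small Dijkstra-based sketch on a toy boundary-strength map. The cost model (stepping onto a pixel costs its boundary strength) and all names here are illustrative assumptions, not the paper's implementation.

```python
import heapq
import numpy as np

def geodesic_distance(boundary, src, dst):
    """Intensity-based geodesic distance on a boundary-strength map:
    the cheapest 4-connected path cost between two pixels, where
    stepping onto a pixel costs its boundary strength. Crossing a
    strong boundary is expensive, so proto-objects separated only by
    weak boundaries end up geodesically close."""
    h, w = boundary.shape
    dist = np.full((h, w), np.inf)
    dist[src] = 0.0
    pq = [(0.0, src)]
    while pq:
        d, (r, c) = heapq.heappop(pq)
        if (r, c) == dst:
            return d
        if d > dist[r, c]:
            continue  # stale queue entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w:
                nd = d + boundary[nr, nc]
                if nd < dist[nr, nc]:
                    dist[nr, nc] = nd
                    heapq.heappush(pq, (nd, (nr, nc)))
    return dist[dst]

# Toy boundary map: a strong vertical edge splits the image in two.
b = np.zeros((5, 6))
b[:, 3] = 10.0  # strong boundary in column 3
same_side = geodesic_distance(b, (2, 0), (2, 2))  # no edge crossed
across = geodesic_distance(b, (2, 0), (2, 5))     # must cross edge
print(same_side, across)  # → 0.0 10.0
```

Pixels on the same side of the strong boundary are geodesically at distance 0, while reaching the other side costs the full boundary strength; thresholding such distances would bind same-side proto-objects into one stable object while keeping the two sides apart.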

Meeting abstract presented at VSS 2017

