Abstract
What are the visual feature representations of common object categories, and how are these representations used in goal-directed behavior? Our previous work on this topic (VSS 2015) introduced the idea of category-consistent features (CCFs): maximally informative features that appear both frequently and with low variability across category exemplars. Here we extend this idea by using a deep (6-layer) convolutional neural network (CNN) to learn CCFs. The four convolutional layers corresponded to visual processing in the ventral stream (V1, V2, V4, and IT), with the number of filters (neurons) at each layer reflecting relative estimates of neural volume in these brain areas. The CNN was trained on 48,000 images of closely cropped objects, divided evenly into 48 subordinate-level categories. Filter responses (firing rates) for 16 basic-level and 4 superordinate-level categories were also obtained from this trained network (68 categories in total), with the filters showing the highest and least variable responses to a category's exemplars taken as that category's CCFs. We tested the CNN-CCF model against data from 26 subjects searching for targets (cued by name) from the same 68 superordinate-, basic-, and subordinate-level categories. Category exemplars used for model training and target exemplars appearing in the search displays were disjoint sets. A categorical search task was used to explore model predictions of both attention guidance (the time between search display onset and first fixation of the target) and object verification/detection (the time between first fixating the target and the present/absent target judgment). Using the CNN-CCF model, we were able to predict not only differences in mean guidance and verification behavior across hierarchical levels, but also mean guidance and verification times across individual categories. We conclude that the features used to represent common object categories, and the behaviorally meaningful similarity relationships between categories (effects of hierarchical level), can be well approximated by features learned by CNNs reflecting ventral stream visual processing.
Meeting abstract presented at VSS 2016
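To illustrate the CCF selection criterion described in the abstract (filters that respond both strongly and with low variability across a category's exemplars), the sketch below scores each filter by the ratio of its mean response to its response variability and keeps the top-scoring filters. This is a minimal, assumed simplification for illustration only; the function name, the signal-to-noise style score, and the choice of k are hypothetical and are not claimed to be the authors' exact procedure.

```python
import numpy as np

def select_ccfs(responses, k=50):
    """Pick candidate category-consistent features (CCFs) from filter responses.

    responses : array of shape (n_exemplars, n_filters); each entry is a
        filter's pooled activation to one exemplar of a single category.
    k : number of filters to keep as CCFs (hypothetical cutoff).

    Returns indices of the k filters whose responses are both high on average
    ("frequent") and low in variability ("consistent") across exemplars.
    """
    mean_resp = responses.mean(axis=0)        # high mean response across exemplars
    std_resp = responses.std(axis=0) + 1e-8   # low variability across exemplars
    snr = mean_resp / std_resp                # favor strong, stable responders
    return np.argsort(snr)[::-1][:k]          # indices of the top-k filters

# Hypothetical usage: responses of 256 filters to 1,000 training exemplars
# of one category at one convolutional layer.
rng = np.random.default_rng(0)
responses = rng.gamma(shape=2.0, scale=1.0, size=(1000, 256))
ccf_idx = select_ccfs(responses, k=50)
print(ccf_idx[:10])
```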