Abstract
Internal representations of deep neural networks (DNNs) have been shown to correlate with neural markers of object, face and scene recognition in the human brain. DNNs are typically trained to optimize recognition accuracy on large-scale datasets and therefore learn task-specific internal representations. The brain, however, develops high-level representations for categories spanning multiple recognition tasks (e.g., objects, scenes, actions). DNNs might align more closely with neural representations in the brain if they were trained on a similarly diverse set of categories. Here we investigate whether DNNs can learn internal units that capture a common representational space for objects, places, faces and actions. To train such networks, we combine categories from existing datasets such as ImageNet, Places and Moments in Time, and additionally introduce a novel people (faces in context) dataset. The resulting aggregated “Seed” dataset comprises more than 1200 visual concepts spanning objects, places, actions, and people’s emotions, gender and age. After training state-of-the-art DNNs on this dataset, we analyzed the representations learned by their internal units. Training on the Seed dataset yields higher interpretability than training a DNN on a single task (e.g., object classification). The diversity of the dataset enables the network to learn not only object- and scene-specific units but also units selective for facial expressions and actions. This gives rise to a representational space in the DNNs that may be more closely aligned with the neural representations learned by the human brain. Together, the new Seed dataset and DNN models may establish a benchmark for computational neuroscience experiments dedicated to exploring the learning, computation and representation of higher-level visual concepts.
Acknowledgement: This research was funded by NSF grant 1532591 in Neural and Cognitive Systems and by the Vannevar Bush Faculty Fellowship program through ONR grant N00014-16-1-3116.