Abstract
Our ability as humans to interact with the world around us relies on inferring what actions objects afford, often referred to as affordances. The neural mechanisms of object-action associations are realized in the visuomotor pathway, where information about both visual properties and actions is integrated into common representations. However, explicating these mechanisms is particularly challenging in the case of affordances because there is no one-to-one mapping between visual features and inferred actions. To better understand the nature of affordances, we trained a deep Convolutional Neural Network (CNN) to predict affordances from images and to learn the underlying features, or dimensionality, of affordances. We view this representational analysis as a first step toward a more formal account of how humans perceive and interact with the environment. To create an affordance dataset, we labeled each of over 500 object categories with up to 5 actions drawn from a pool of 56 action categories. Because each action label was object-based (e.g., "kick" for a ball and "drink" for water), these labels generalize to "image-to-affordance" mappings for large-scale image datasets such as ImageNet (Russakovsky et al., 2015). Using these datasets, we then trained a CNN (VGG-19; Simonyan & Zisserman, 2014) to predict affordances for images in ImageNet. Using a network with pre-trained weights, we were able to predict affordances with an accuracy of 87%. In contrast, a network trained from scratch achieved an accuracy of 24% (where chance is 5%). To quantify the interpretability of hidden units at intermediate layers within both networks, we applied network dissection (Bau et al., 2017) to identify the features critical for classifying affordances. Such features form an underlying compositional structure for the general representation of affordances, which can then be tested against human neural data.
Meeting abstract presented at VSS 2018
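The abstract does not report implementation details, so the following is only a minimal sketch of how the affordance-prediction setup might look in PyTorch/torchvision: a VGG-19 whose ImageNet classifier head is replaced with a 56-way affordance head, fine-tuned either from pre-trained weights or from scratch. The multi-label loss, the `train_step` function, and the `images`/`affordance_targets` names are illustrative assumptions, not part of the original work.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_AFFORDANCES = 56  # pool of action categories described in the abstract

# Pre-trained condition: VGG-19 initialized with ImageNet weights.
# Pass weights=None instead to approximate the "trained from scratch" condition.
model = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)

# Replace the final 1000-way ImageNet classifier with an affordance head.
model.classifier[6] = nn.Linear(model.classifier[6].in_features, NUM_AFFORDANCES)

# Assumed multi-label formulation: each object category carries up to 5 actions,
# so targets are 56-dimensional binary vectors derived from the object labels.
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

def train_step(images, affordance_targets):
    """One optimization step on a batch of images and their affordance labels."""
    optimizer.zero_grad()
    logits = model(images)
    loss = criterion(logits, affordance_targets.float())
    loss.backward()
    optimizer.step()
    return loss.item()
```

A single-label cross-entropy formulation would also be consistent with the reported chance level; the sketch above simply makes the "object label to action labels" mapping explicit.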