September 2018
Volume 18, Issue 10
Open Access
Vision Sciences Society Annual Meeting Abstract
Learning intermediate features of affordances with a convolutional neural network
Author Affiliations
  • Aria Wang
    Center for the Neural Basis of Cognition (CNBC), Carnegie Mellon University
  • Michael Tarr
    Center for the Neural Basis of Cognition (CNBC), Carnegie Mellon University
    Psychology Department, Carnegie Mellon University
Journal of Vision September 2018, Vol.18, 1267. doi:10.1167/18.10.1267

      Aria Wang, Michael Tarr; Learning intermediate features of affordances with a convolutional neural network. Journal of Vision 2018;18(10):1267. doi: 10.1167/18.10.1267.

      © ARVO (1962-2015); The Authors (2016-present)

Abstract

The ability of humans to interact with the world around us relies on our ability to infer what actions objects afford, commonly referred to as affordances. The neural mechanisms of object-action associations are realized in the visuomotor pathway, where information about both visual properties and actions is integrated into common representations. However, explicating these mechanisms is particularly challenging in the case of affordances because there is no one-to-one mapping between visual features and inferred actions. To better understand the nature of affordances, we trained a deep Convolutional Neural Network (CNN) to predict affordances from images and to learn the underlying features, or dimensionality, of affordances. We view this representational analysis as the first step toward a more formal account of how humans perceive and interact with the environment. To create an affordance dataset, we labeled each of over 500 object categories with up to 5 actions drawn from a pool of 56 action categories. Because each action label is object-based (e.g., "kick" for a ball and "drink" for water), these labels generalize to "image to affordance" mappings for large-scale image datasets such as ImageNet (Russakovsky et al., 2015). Using these datasets, we then trained a CNN (VGG-19; Simonyan & Zisserman, 2014) to predict affordances for images in ImageNet. Using a network with pre-trained weights, we were able to predict affordances with an accuracy of 87%. In contrast, a network trained from scratch achieved an accuracy of 24% (where chance is 5%). To quantify the interpretability of hidden units at intermediate layers within both networks, we applied network dissection (Bau et al., 2017) to identify the features critical for classifying affordances. Such features form an underlying compositional structure for the general representation of affordances, which can then be tested against human neural data.
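The object-based labeling scheme described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the action names, object categories, and the 5-action pool below are hypothetical placeholders standing in for the study's 56 action categories and 500+ object categories. The key idea it demonstrates is that because affordance labels attach to object categories, every image of a given category inherits the same multi-hot affordance target vector.

```python
import numpy as np

# Illustrative fragment of an action pool (the study used 56 action categories).
ACTIONS = ["kick", "drink", "throw", "sit_on", "grasp"]
ACTION_INDEX = {action: i for i, action in enumerate(ACTIONS)}

# Hypothetical object -> affordance labels: each object category gets
# up to 5 actions, as described in the abstract.
object_affordances = {
    "soccer_ball": ["kick", "throw", "grasp"],
    "water": ["drink"],
    "chair": ["sit_on", "grasp"],
}

def to_multi_hot(actions, n_actions=len(ACTIONS)):
    """Encode an object category's action labels as a multi-hot target vector."""
    target = np.zeros(n_actions, dtype=np.float32)
    for action in actions:
        target[ACTION_INDEX[action]] = 1.0
    return target

# Every ImageNet image of this object category inherits the same target,
# turning category-level labels into image-to-affordance training pairs.
y = to_multi_hot(object_affordances["soccer_ball"])
```

A network such as VGG-19 can then be trained against these multi-hot targets with a per-action sigmoid output layer, which is a standard setup for multi-label prediction.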

Meeting abstract presented at VSS 2018
