Abstract
Deep convolutional neural networks, when trained on difficult supervised tasks such as object classification on large datasets, have shown the remarkable property of learning general representations that are useful for many other tasks and predictive of responses in the ventral visual stream. These successes have motivated the search for training tasks that can generate such a visual backbone in developmentally realistic ways. Yet large gaps remain to be filled.
- Lack of supervision: training must happen without large amounts of manually labeled data.
- Interaction and agency: the developmental behavioral literature intertwines visual development with interaction with the world, attributing rich behaviors to this interactive training.
We built a simulated environment based on a game engine that provides 3D visual stimuli and allows an agent to interact with different types of objects. In this environment, an artificial agent gathers experience, and on this experience we train a world model on self-supervised problems involving future prediction, ego-motion prediction, and force prediction. We simultaneously train a second model to predict the future loss of this world model. We then execute an action policy determined by this loss-prediction model, which learns to focus its attention on what is interesting (interaction with an object) without the notion of an object being explicitly encoded in the input. The policy adopts behaviors that bring an object into view, approach it, and keep it in view. This yields data collection that is more adversarial to the world model, allowing it to reach performance that a model trained on data gathered by a random policy does not reach. The resulting backbone generalizes to object localization through the training of a simple readout model, without the backbone ever having been exposed to ground-truth labels for this problem. This represents a first step towards ecologically realistic training of a vision system through an interactive, embodied process.
Meeting abstract presented at VSS 2018
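To make the training loop concrete, the following is a minimal sketch of the curiosity-driven scheme the abstract describes: a world model trained on a self-supervised prediction problem, a second model trained to predict the world model's future loss, and a policy that selects actions maximizing that predicted loss. The toy environment, network shapes, and names (ToyEnv, world_model, loss_model) are illustrative assumptions standing in for the paper's 3D game-engine setup, not the authors' implementation.

```python
# Minimal sketch of curiosity-driven world-model training (illustrative only;
# the real system uses a 3D game engine and visual observations).
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM, N_ACTIONS = 16, 4, 8  # toy sizes, chosen arbitrarily

class ToyEnv:
    """Hypothetical stand-in for the 3D environment: states are feature vectors."""
    def __init__(self):
        self.state = torch.randn(OBS_DIM)

    def step(self, action):
        # Toy dynamics: the next observation depends on state and action.
        self.state = torch.tanh(self.state + 0.1 * action.sum() * torch.randn(OBS_DIM))
        return self.state

# World model: predicts the next observation from (observation, action).
world_model = nn.Sequential(
    nn.Linear(OBS_DIM + ACT_DIM, 64), nn.ReLU(), nn.Linear(64, OBS_DIM))
# Loss model: predicts the world model's loss on the upcoming transition.
loss_model = nn.Sequential(
    nn.Linear(OBS_DIM + ACT_DIM, 64), nn.ReLU(), nn.Linear(64, 1))

wm_opt = torch.optim.Adam(world_model.parameters(), lr=1e-3)
lm_opt = torch.optim.Adam(loss_model.parameters(), lr=1e-3)

env = ToyEnv()
obs = env.state
for step in range(1000):
    # Policy: among sampled candidate actions, take the one whose predicted
    # world-model loss is highest, i.e. seek out experience the world model
    # currently handles worst ("adversarial" data collection).
    candidates = torch.randn(N_ACTIONS, ACT_DIM)
    with torch.no_grad():
        inputs = torch.cat([obs.expand(N_ACTIONS, -1), candidates], dim=1)
        action = candidates[loss_model(inputs).argmax()]

    next_obs = env.step(action)

    # Train the world model on the transition it just experienced.
    pred = world_model(torch.cat([obs, action]))
    wm_loss = nn.functional.mse_loss(pred, next_obs)
    wm_opt.zero_grad()
    wm_loss.backward()
    wm_opt.step()

    # Train the loss model to predict the world model's realized loss.
    lm_pred = loss_model(torch.cat([obs, action]))
    lm_loss = nn.functional.mse_loss(lm_pred.squeeze(), wm_loss.detach())
    lm_opt.zero_grad()
    lm_loss.backward()
    lm_opt.step()

    obs = next_obs.detach()
```

In this reading, object-directed behaviors emerge because transitions involving objects are the hardest for the world model to predict, so the loss-predicting policy steers the agent toward them without any explicit object labels.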