Abstract
Recent advances in reinforcement learning demonstrate that navigation and prediction of novel views do not require the agent to have a 3D model of the scene. Here, we examine a reinforcement learning method that rewards an agent for arriving at a target image but does not generate a 3D 'map'. We compare this to a biologically motivated alternative that also avoids 3D reconstruction: a hand-crafted representation based on relative visual directions (RVD) which, by design, has a high degree of geometric consistency. We test the ability of both types of representation to support geometric tasks such as interpolating between learned locations. In both cases, interpolation is possible if two stored feature vectors in the network, each associated with a given location, are averaged and the mean vector is decoded to recover a mean location. Performance is much more variable for the reinforcement learning model than for the RVD model (about seven times greater standard deviation). We show the same result for interpolation of camera orientation. A t-SNE projection of the stored vectors into 2D for each type of representation illustrates why the two models should perform differently on these tasks. In the RVD model, the t-SNE projection shows a regular pattern reflecting the geometric layout of the learned locations in space, whereas, for the reinforcement learning model, the clustering of stored vectors reflects other factors such as the agent's goals during training. Our comparison of these two models demonstrates that it is advantageous to include information about the persistence of features as the camera translates (e.g. distant features persist). It is likely that representations of this sort, storing high-dimensional state vectors instead of 3D coordinates, will be increasingly important in the search for robust models of human spatial perception and navigation.
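The interpolation test described in the abstract can be summarised as a single averaging-and-decoding step. The notation below is our own shorthand rather than notation taken from the paper: let $\mathbf{z}_A, \mathbf{z}_B \in \mathbb{R}^n$ denote the stored feature vectors associated with two learned locations $\mathbf{x}_A$ and $\mathbf{x}_B$, and let $D$ denote the decoder that maps a stored vector to a location estimate. Then

$$\bar{\mathbf{z}} = \tfrac{1}{2}\left(\mathbf{z}_A + \mathbf{z}_B\right), \qquad \hat{\mathbf{x}} = D(\bar{\mathbf{z}}),$$

and a natural measure of interpolation error is the distance $\lVert \hat{\mathbf{x}} - \tfrac{1}{2}(\mathbf{x}_A + \mathbf{x}_B) \rVert$ between the decoded location and the true midpoint; on this reading, the variability result quoted above refers to the spread of this error across pairs of learned locations.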