Abstract
A critical question in visual processing is the degree to which egocentric and allocentric reference frames are utilized during target localization. For example, Li et al. (2017) tested their contributions using a cue-conflict task in macaque monkeys, in which the monkeys were presented with a target and an allocentric landmark. The landmark was then masked and either shifted or left in place. In the shift condition, the monkeys' final gaze position was significantly shifted towards the virtually shifted location of the target in allocentric coordinates. In the current work we attempted to model these results using a convolutional network (ConvNet) with a spatial transformer module. The model takes as input a binary image containing a target at a particular spatial location together with an allocentric landmark represented as the intersection of a vertical and a horizontal line. It outputs a vector anchored at the (0,0) position of the image matrix, corresponding to the position on the array where the target is estimated to lie. The network achieves this through multilayer processing that begins by estimating and applying an affine transformation accounting for differences between the target and landmark coordinates, followed by convolution and regression for target localization. The affine transformation is learned by the spatial transformer, which applies the inverse transformation to the image and feeds the output to the convolutional and regression layers (Jaderberg et al., 2015). The model's outputs agree with the findings of Li et al. (2017): as the landmark is shifted away from the target, the network's choice is also shifted away from the target position. Future work will aim to increase the robustness of target localization with respect to multiple allocentric landmarks and to modify the model's architecture to include hand-crafted components to increase precision.
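A minimal sketch of the described architecture in PyTorch is shown below for illustration only; it is not the authors' implementation, and the image size, layer widths, and all identifiers (e.g., TargetLocalizer, img_size) are assumptions. It shows the pipeline named in the abstract: a spatial transformer (Jaderberg et al., 2015) that estimates and applies an affine transformation to the input image, followed by convolutional layers and a regression head that outputs the estimated (x, y) target position.

```python
# Hedged sketch: spatial transformer + ConvNet regressor for target localization.
# All hyperparameters and names are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TargetLocalizer(nn.Module):
    def __init__(self, img_size=64):
        super().__init__()
        # Localization network: predicts the 2x3 affine parameters from the image.
        self.loc_net = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=7), nn.MaxPool2d(2), nn.ReLU(),
            nn.Conv2d(8, 10, kernel_size=5), nn.MaxPool2d(2), nn.ReLU(),
        )
        loc_feat = 10 * ((((img_size - 6) // 2) - 4) // 2) ** 2
        self.loc_fc = nn.Sequential(
            nn.Linear(loc_feat, 32), nn.ReLU(), nn.Linear(32, 6)
        )
        # Initialize to the identity transform so training starts from "no shift".
        self.loc_fc[-1].weight.data.zero_()
        self.loc_fc[-1].bias.data.copy_(
            torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float)
        )
        # Convolutional feature extractor and regression head for the (x, y) output.
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, padding=2), nn.MaxPool2d(2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, padding=2), nn.MaxPool2d(2), nn.ReLU(),
        )
        feat_dim = 32 * (img_size // 4) ** 2
        self.regressor = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 2)
        )

    def spatial_transform(self, x):
        # Estimate affine parameters and warp the image (inverse mapping via grid_sample).
        theta = self.loc_fc(self.loc_net(x).flatten(1)).view(-1, 2, 3)
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)

    def forward(self, x):
        x = self.spatial_transform(x)
        x = self.features(x).flatten(1)
        return self.regressor(x)  # predicted (x, y) target position


if __name__ == "__main__":
    model = TargetLocalizer(img_size=64)
    img = torch.zeros(1, 1, 64, 64)
    img[0, 0, 20, 40] = 1.0      # target pixel
    img[0, 0, :, 32] = 1.0       # vertical landmark line
    img[0, 0, 32, :] = 1.0       # horizontal landmark line
    print(model(img).shape)      # torch.Size([1, 2])
```

Under these assumptions, shifting the landmark lines in the input image would shift the warp estimated by the spatial transformer, and hence the regressed target position, in the direction of the landmark shift.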
Meeting abstract presented at VSS 2018