Abstract
The most recent variations of convolutional neural networks (ConvNets) have managed to match and surpass human performance on classification of objects in images (Russakovsky et al., 2015; He et al., 2015). An open question remains whether humans and ConvNets process visual information in a similar fashion. It is known that humans perform object recognition best under certain conditions: e.g. when the object is shown in the canonical view (often the three-quarter view) and when the object is presented on a homogeneous background. In the current study we compare human and computer model performance under these different conditions. Using an object classification task, we manipulate the relationship between object and background by presenting 3D models of objects A) in isolation, B) with a congruent background, and C) with an incongruent background. In addition, we manipulate viewpoint by presenting these 3D models of objects from different angles. We compare the performance of 40 human subjects with that of ConvNets of different depth and complexity. Preliminary results indicate an important, implicit function of depth in ConvNets in segregating the object from the scene. Overall, comparing the performance of humans and computer models on these more specific, detailed tasks will give a more fine-grained view of the similarity between the two and could link more cognitive descriptions of behavior to ConvNets.
Meeting abstract presented at VSS 2017