Abstract
Identifying the visual features driving object recognition remains an experimental challenge. Classification images and related methods exploit correlations between noise perturbations and behavioral responses across thousands of stimulus repetitions to identify image locations that strongly influence observers' decisions. These methods are powerful but inefficient, making them ill-suited for a large-scale exploration of the visual features underlying object recognition. Here, we describe ClickMe.ai, a web-based experiment for large-scale collection of feature-importance maps for object images. ClickMe.ai pairs human participants with computer partners to recognize images from a large dataset. In each round of gameplay, participants use the mouse to reveal image locations to their computer partner and are awarded points based on how quickly the partner recognizes the target object, incentivizing them to select the features most diagnostic for visual recognition. We aggregated data over several months of gameplay, yielding nearly half a million feature-importance maps that are consistent across players. We validated the diagnosticity of the visual features revealed by ClickMe.ai with a rapid categorization experiment in which images were systematically masked to vary the proportion of visible features during object recognition. This demonstrated that the features identified by ClickMe.ai are sufficient for object recognition and more informative than those found to be salient. Finally, we found that the image regions identified by ClickMe.ai are distinct from those used by a deep convolutional network (DCN), a leading machine vision architecture that is approaching human recognition accuracy. We further describe a method for cueing a DCN to the image regions identified by ClickMe.ai while it is trained to discriminate between natural object categories. DCNs trained in this way learned object representations that were significantly more similar to humans' and better predicted human decisions during visual psychophysics tasks.
Meeting abstract presented at VSS 2018
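The cueing idea mentioned in the abstract can be illustrated with an auxiliary training loss that pulls a network's attended image regions toward human feature-importance maps. The sketch below is not the authors' implementation; it assumes a PyTorch ResNet, uses a gradient-based saliency map as a stand-in for the model's attended regions, and introduces hypothetical names such as `clickme_maps` and `lambda_cue`.

```python
# Minimal sketch: classification loss plus a term aligning the model's
# gradient-based saliency with a ClickMe-style human importance map.
# All names and the weighting lambda_cue are illustrative assumptions.
import torch
import torch.nn.functional as F
import torchvision

model = torchvision.models.resnet18(num_classes=1000)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
lambda_cue = 1.0  # assumed weight on the human-cueing term


def unit_norm(maps):
    """Flatten each map and scale to unit norm so the alignment term is scale-free."""
    flat = maps.flatten(1)
    return flat / (flat.norm(dim=1, keepdim=True) + 1e-8)


def training_step(images, labels, clickme_maps):
    """images: (N,3,H,W); labels: (N,); clickme_maps: (N,H,W) human importance maps."""
    images = images.clone().requires_grad_(True)
    logits = model(images)
    cls_loss = F.cross_entropy(logits, labels)

    # Gradient-based saliency of the target-class score w.r.t. the input,
    # kept differentiable (create_graph=True) so it can be trained to
    # match the human map.
    score = logits.gather(1, labels.unsqueeze(1)).sum()
    grads, = torch.autograd.grad(score, images, create_graph=True)
    saliency = grads.abs().mean(dim=1)  # (N, H, W)

    # Cueing term: 1 - cosine similarity between model saliency and human map.
    cue_loss = (1 - (unit_norm(saliency) * unit_norm(clickme_maps)).sum(dim=1)).mean()

    loss = cls_loss + lambda_cue * cue_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this sketch, `lambda_cue` trades off classification accuracy against agreement with the human maps; the published method may instead supervise an explicit attention module rather than input-gradient saliency.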