Abstract
Within the last decade, Artificial Neural Networks (ANNs) have emerged as powerful computer vision systems that match or exceed human performance on some benchmark tasks such as image classification. But whether current ANNs are suitable computational models of the human visual system remains an open question: while ANNs have proven capable of predicting neural activations in primate visual cortex, psychophysical experiments show behavioral differences between ANNs and human subjects as quantified by error consistency. Error consistency is typically measured by briefly presenting natural or corrupted images to human subjects and asking them to perform an n-way classification task under time pressure. But for how long should stimuli ideally be presented to guarantee a fair comparison with ANNs? Here we investigate the role of presentation time and find that it strongly affects error consistency. We systematically vary presentation times from 8.3 ms to >1000 ms, each followed by a noise mask, and measure human performance and reaction times on natural, lowpass-filtered and noisy images. Our experiment constitutes a fine-grained analysis of human image classification under both image corruption and time pressure, showing that even drastically time-constrained humans who are exposed to the stimuli for only a single frame, i.e. 8.3 ms, can still solve our 8-way classification task with success rates above chance. Importantly, the shift and slope of the psychometric function relating recognition accuracy to presentation time depend on the type of corruption. In addition, we find that error consistency also depends systematically on presentation time. Together, our findings have two implications: first, they raise the question of how presentation time should be chosen in human-machine comparisons; second, the differential benefit of longer presentation times across image corruptions is consistent with the notion that recurrent processing plays a role in human object recognition, at least for images that are difficult to recognise.
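For reference, a minimal sketch of the two quantities named above, assuming the standard formulations from the psychophysics literature (the exact definitions used in this work may differ): error consistency as Cohen's kappa over trial-by-trial error overlap between two observers, and a sigmoidal psychometric function with guess and lapse rates relating accuracy to presentation time.

% Error consistency between observers i and j (assumed trial-by-trial kappa definition):
% c_obs : observed fraction of trials on which both observers are simultaneously correct or wrong
% c_exp : overlap expected by chance given the observers' individual accuracies p_i and p_j
\begin{align}
  c_{\mathrm{exp}} &= p_i\, p_j + (1 - p_i)(1 - p_j) \\
  \kappa_{ij}      &= \frac{c_{\mathrm{obs}} - c_{\mathrm{exp}}}{1 - c_{\mathrm{exp}}}
\end{align}

% Psychometric function relating accuracy to presentation time t (assumed generic form):
% gamma = chance level (1/8 for an 8-way task), lambda = lapse rate,
% F = sigmoid parameterised by shift alpha and slope beta
\begin{equation}
  \psi(t) = \gamma + (1 - \gamma - \lambda)\, F(t;\, \alpha, \beta)
\end{equation}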