Abstract
The primate visual system is highly selective while remaining robust to large transformations of the image. This combination of invariance and specificity is achieved quite rapidly, as evidenced by the short latency of selective spike responses in higher levels of visual cortex and by psychophysical performance in rapid visual presentation and backward masking tasks. The speed of processing in the visual system has led to the notion that feedforward processing can account for several aspects of visual object recognition. Indeed, a purely feedforward computational model is very successful in explaining physiological responses in several areas of visual cortex, from primary visual cortex through inferior temporal (IT) cortex, as well as several psychophysical observations in which humans must rapidly identify or categorize objects. Here we quantitatively explore the limits of feedforward processing by examining how well a purely feedforward model can perform identification and categorization tasks in images containing multiple objects, and in other scenarios where we parametrically vary the amount of clutter in the image. We first show that the performance of the model in response to isolated objects matches the electrophysiological properties of IT neurons in terms of accuracy as well as robustness to changes in object scale and position. We then show that performance degrades with an increasing number of objects or with increasing degrees of clutter in natural scenes. These results point to the limits of feedforward processing in visual object recognition and emphasize the role of attention in processing and interpreting complex natural scenes.
Ophthalmology Foundation Children's Hospital Boston, American Epilepsy Foundation, McGovern Institute