In this study, I attempted to combine the main psychophysical findings on visual crowding, attention, and central capacity limitations with a hierarchical neural network model. I suggest that similar principles of pooling and selection, applied at various levels of visual processing, can explain different psychophysical phenomena: visual crowding and central capacity limitations. An important factor seems to be the level at which spatial attention is applied. Despite the shared principles, the exact computations may vary across levels. For example, different pooling rules (averaging, max, correlation) may dominate at different levels of processing.
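The three pooling rules can be sketched numerically. This is a minimal illustration, not part of the model itself; the function names, the toy receptive-field responses, and the stored template are assumptions chosen only to make the contrast between rules visible.

```python
import numpy as np

def average_pool(responses):
    """Averaging: all responses inside a receptive field are pooled to their mean."""
    return np.mean(responses, axis=-1)

def max_pool(responses):
    """Max: only the strongest response inside the receptive field survives."""
    return np.max(responses, axis=-1)

def correlation_pool(responses, template):
    """Correlation: the pooled output reflects how well the response pattern
    matches a stored template, rather than the overall response strength."""
    return np.corrcoef(responses, template)[0, 1]

rf = np.array([0.2, 0.9, 0.4, 0.7])        # toy responses inside one receptive field
template = np.array([0.1, 1.0, 0.3, 0.8])  # hypothetical stored template

print(average_pool(rf))                # ≈ 0.55
print(max_pool(rf))                    # 0.9
print(correlation_pool(rf, template))  # close to 1: a pattern match, not strength
```

Note how the three rules discard different information: averaging loses which item was strong, max loses everything but the peak, and correlation is insensitive to overall gain but sensitive to relative pattern.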
Recent crowding studies have reported many results that contradict simple pooling over fixed receptive fields. Various explanations, ranging from a qualitative grouping account to relatively complex computational models, have been proposed. I believe that the notion of saliency could be useful here. There is ample evidence for similarity-dependent lateral inhibition at different levels of biological visual systems, and similar computations have been used in machine vision as well (e.g., Jarrett, Kavukcuoglu, Ranzato, & LeCun, 2009). Several simple cases from crowding experiments apparently fit this model well. In more complex cases, saliency can be computed at several levels of visual processing, and candidate objects at different levels may compete for access to further processing.
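The core intuition of similarity-dependent lateral inhibition can be sketched as follows. This is only an illustrative toy implementation, assuming one particular similarity function (a negative exponential of feature distance); each item's response is suppressed in proportion to its mean similarity to the other items, so a locally unique item retains a high response and "pops out".

```python
import numpy as np

def saliency(features):
    """Similarity-dependent lateral inhibition (toy sketch):
    each item is inhibited in proportion to its mean feature similarity
    to the other items, so a locally unique item stays highly salient."""
    features = np.asarray(features, dtype=float)
    n = len(features)
    sal = np.empty(n)
    for i in range(n):
        others = np.delete(features, i, axis=0)
        # Assumed similarity function: negative exponential of feature distance.
        sim = np.exp(-np.linalg.norm(features[i] - others, axis=-1))
        sal[i] = 1.0 - sim.mean()  # stronger inhibition from similar neighbours
    return sal

# A 45-degree target among 0-degree flankers (features here are orientations).
orientations = np.array([[0.0], [0.0], [45.0], [0.0], [0.0]])
s = saliency(orientations)
print(np.argmax(s))  # 2: the dissimilar item is the most salient
```

With identical flankers the same computation yields uniformly low saliency, which is consistent with the idea that a target that resembles its flankers fails to win the competition for further processing.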
The present study suggests how classic ideas of attention and feature integration could be related to modern neural network models of vision. I propose that the “simple” features to be combined should be somewhat more complex than traditionally assumed, and that the “spotlight” of attention has a minimum radius of about half the eccentricity. These amendments do not contradict most earlier results, because the complex models reduce to their simplified versions when traditional simple stimuli are used.