We can use Korte's (1923) specification of crowding to analyze the neural mechanisms involved. Consider first the perceived masking in crowding, as depicted in Figure 2b. This reduction in visibility of the elementary features can be understood in terms of the known connections between cells in the sequence of retinotopic maps through which visual information passes after arriving at V1. It is well established that receptive field sizes increase by successive factors of about 2 from V1 to V2 to V3 to V3A/V4 (Zeki, 1978). If the letters are set to match the scale of the V1 receptive fields, flanking letters at a distance of only one letter spacing should have no effect in V1 in terms of the classical receptive fields. They will, however, fall within the receptive fields of V2 and V3 keyed to the location of the test letter. The other relevant property of the retinotopic maps is that they exhibit extensive recurrent inhibition between maps. Thus, activation of the V2/V3 maps by the flanking letters will suppress the neurons responding to the target letter in V1. The suppression is unlikely to be complete, but it should be sufficient to account for the reduced visibility noted by Korte. The scale of the crowding effect, which Pelli et al. (2004) give as approximately 0.4E at any peripheral eccentricity E, requires the inhibitory fields to be located in V3, based on the quantification of its size scale. We take the operation of the inhibition to be a form of “contrast contrast”, the reduction in perceived contrast and elevation of detection threshold of a region when it is surrounded by high-contrast texture, relative to its perception with a uniform surround (Cannon & Fullenkamp, 1991, 1993; Chubb, Sperling, & Solomon, 1989; Ejima & Takahashi, 1985; Ellemberg, Wilkinson, Wilson, & Arsenault, 1998; McDonald & Tadmor, 2006; Snowden & Hammett, 1998; Xing & Heeger, 2000, 2001). Such results can be explained by a quantitative spatial normalization model of the dual masking and sensitivity modulation of the visibility of a central target by flanking elements (Chen & Tyler, 2001, 2002).
As mentioned in the Introduction section, previous analyses of the mechanism of crowding (Intriligator & Cavanagh, 2001; Pelli et al., 2004; Põder, 2006) do not provide a complete account of the recognition process. Specifically:
Attention appears to speed processing of objects that are part of the same surface regardless of their absolute spatial location… Given the range of views concerning the mechanisms of attention, it is hard to describe the region over which attention operates in terms compatible with all the alternatives—space- or object-based selection, resource allocation, or filters. For simplicity, we use the term “selection” to describe the operation of attention and “region of selection” to describe the area over which it operates but we do not favor one model over another (Intriligator & Cavanagh, 2001, p. 173).
Despite progress in vision research, we still can only barely begin to answer a simple question like, “How do I recognize the letter A?” …This (nonlinear) assembly process is called “feature integration” (or “binding”). Feature integration may internally represent the combined features as an object, but we will not address that here (Pelli et al., 2004, p. 1137).
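The receptive-field scaling argument made earlier in this section (sizes roughly doubling from V1 through V3A/V4, and a crowding extent of about 0.4E) can be illustrated with a short numerical sketch. The factor of 2 and the 0.4 fraction come from the text. The V1 receptive field size is not given there, so the value used below (`v1_fraction`) is a hypothetical placeholder chosen only to make the arithmetic easy to follow, and the function names are likewise illustrative.

```python
# Illustrative sketch only. The doubling factor and the 0.4*E crowding extent
# are taken from the text; the V1 receptive field size is NOT given there and
# is a hypothetical placeholder.

AREAS = ["V1", "V2", "V3", "V3A/V4"]  # processing order given in the text
SCALE_STEP = 2.0                      # "successive factors of about 2"
CROWDING_FRACTION = 0.4               # crowding extent of about 0.4 * E


def receptive_field_sizes(v1_size_deg):
    """Return a receptive field size per area, doubling at each stage after V1."""
    return {area: v1_size_deg * SCALE_STEP ** i for i, area in enumerate(AREAS)}


def first_area_spanning(extent_deg, v1_size_deg):
    """Return the earliest area whose field size is at least extent_deg, else None."""
    for area, size in receptive_field_sizes(v1_size_deg).items():
        if size >= extent_deg:
            return area
    return None


if __name__ == "__main__":
    eccentricity_deg = 10.0
    v1_fraction = 0.1  # HYPOTHETICAL: V1 field size as a fraction of eccentricity
    extent = CROWDING_FRACTION * eccentricity_deg
    v1_size = v1_fraction * eccentricity_deg
    print(receptive_field_sizes(v1_size))
    print(first_area_spanning(extent, v1_size))
```

With these placeholder values, the first area whose field size reaches the crowding extent is V3. The outcome depends directly on the assumed V1 size: doubling `v1_fraction` to 0.2 moves the answer to V2, so the sketch shows only the form of the argument and not a measured result.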