In crowding, perception of a target deteriorates when neighboring elements are added. Crowding is ubiquitous in natural vision since stimuli are rarely presented in isolation (except, perhaps, during psychophysics experiments). Classically, crowding was explained by local interactions that pool information from the target and the flankers and thereby lose target information. However, such simple pooling models do not provide a comprehensive framework for understanding visual crowding, since more recent research shows that the entire visual scene, and not only nearby elements, determines crowding (
Bornet, Doerig, Herzog, Francis, & Van der Burg, 2021;
Doerig et al., 2019;
Herzog, Sayim, Chicherov, & Manassi, 2015;
Herzog & Manassi, 2015;
Manassi, Sayim, & Herzog, 2012;
Manassi, Hermens, Francis, & Herzog, 2015;
Manassi, Lonchampt, Clarke, & Herzog, 2016). A clear-cut example is uncrowding, where the addition of elements to an already crowded target improves performance. For example, in
Figure 1, vernier offset discrimination strongly deteriorates when flanking lines are presented next to the vernier (
Figures 1A vs.
1B) (i.e., crowding). When these lines are replaced by longer lines (
Figures 1B vs.
1C), crowding diminishes, even though there is more flanker “signal.” Crowding further diminishes when more long flanking lines are presented, approaching performance in the unflanked, vernier-only condition (
Herzog et al., 2015;
Moore & Zheng, 2024).
Consequently, uncrowding challenges models in which flankers deteriorate performance because of locally added noise, increased lateral inhibition, or similar mechanisms. To account for these results, we proposed a two-stage model in which, first, elements in a scene are grouped into separate objects and textures. Second, interference occurs with respect to the perceived grouping: Crowding occurs among elements that are grouped together rather than through, for example, pooling over small, inflexible regions defined by the receptive field size of neurons.
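The contrast between fixed-window pooling and grouping-based interference can be illustrated with a toy sketch. It is not the model used in the studies cited here: the `Element` class, the pooling window, and the rule that lines of similar length form one group are all hypothetical simplifications chosen only to reproduce the qualitative pattern of Figures 1A to 1C.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    x: float                 # horizontal position (arbitrary units)
    length: float            # line length (arbitrary units)
    is_target: bool = False  # True for the vernier


def pooling_interference(elements, window=2.0):
    """Simple pooling: count flankers within a fixed window around the target."""
    target = next(e for e in elements if e.is_target)
    return sum(
        1 for e in elements
        if not e.is_target and abs(e.x - target.x) <= window
    )


def group_elements(elements, tol=0.25):
    """Stage 1 (assumed toy rule): lines of similar length form one group."""
    groups = []
    for e in elements:
        for g in groups:
            if abs(g[0].length - e.length) <= tol:
                g.append(e)
                break
        else:
            groups.append([e])
    return groups


def grouping_interference(elements):
    """Stage 2: only elements grouped with the target interfere with it."""
    for g in group_elements(elements):
        if any(e.is_target for e in g):
            return len(g) - 1
    raise ValueError("display contains no target")


# Hypothetical displays loosely mirroring Figures 1A, 1B, and 1C.
vernier_only = [Element(0, 1, True)]
short_flankers = [Element(-1, 1), Element(0, 1, True), Element(1, 1)]
long_flankers = [Element(-1, 2), Element(0, 1, True), Element(1, 2)]

if __name__ == "__main__":
    for name, display in [("vernier only", vernier_only),
                          ("short flankers", short_flankers),
                          ("long flankers", long_flankers)]:
        print(name, pooling_interference(display), grouping_interference(display))
```

In this sketch, the pooling count is identical for short and long flankers because both fall inside the window, whereas the grouping count drops to zero for long flankers because they no longer group with the vernier under the assumed rule. A realistic grouping stage would of course depend on much more than line length.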
Importantly, our hypothesis of grouping as a central mechanism for crowding is a perceptual hypothesis and, in this respect, is almost trivial or tautological: If the target is perceived to group with the flankers, there is crowding. If the target is perceived as separate, there is no crowding. Indeed, there are significant correlations between crowding and subjective ratings of how much the target stands out from the flankers (
Malania, Herzog, & Westheimer, 2007;
Manassi et al., 2012;
Saarela, Sayim, Westheimer, & Herzog, 2009;
Wolford & Chambers, 1983). The subjective perception of the stimulus configuration, therefore, is critical (
Herzog et al., 2015) and must be assessed independently, in addition to measuring performance with accuracy measures.
Perceptual grouping of elements is strongly influenced by three-dimensional (3D) information such as occlusion. In a recent study,
Moore and Zheng (2024) aimed to investigate whether mid-level mediation can induce ungrouping of the flankers from a vernier target, thus provoking uncrowding. They presented three separate short vertical gratings stacked vertically, with the middle one containing the vernier target (see
Figure 1D). As expected, crowding was strong because the vernier grouped with the elements of the central grating. Next, they added Pacmans to induce occlusion cues. They suggested that this change created the percept of a long grating behind horizontal bars (see
Figure 1E), which should lead to uncrowding (as in
Figure 1C). Under the same assumptions, perceiving the lines as three separate gratings with no induced horizontal bar (as in
Figure 1D) would lead to crowding.
Moore and Zheng (2024) found evidence only for crowding. They suggest, in line with low-level explanations of crowding, that the strong crowding comes from the line terminators at the grating endings and that the long lines in
Figure 1C simply “mask” the low-level terminators by continuing the lines. They suggest that similar “hidden” low-level cues may be in operation in the many other demonstrations where grouping was proposed to be key and, hence, question the grouping account in general.
Based on their findings,
Moore and Zheng (2024) conclude that midlevel mediation, such as 3D cues, does not significantly contribute to visual crowding in their experiment. While their title, “Limited Midlevel Mediation of Visual Crowding: Surface Completion Fails to Support Uncrowding,” suggests a broader critique of midlevel mediation, their conclusions in the text are explicitly confined to the specific uncrowding effect they tested. Moreover,
Moore and Zheng (2024) highlight several instances where the LAMINART model, an existing (mid-level) model of visual perception, failed to predict human performance in crowding tasks and conclude that LAMINART’s account, including its use of recurrent architecture and global influences, is insufficient to explain crowding in complex displays. While no model is perfect, we demonstrate that LAMINART can explain their experimental results based on connections formed between grouped elements without a need to consider occlusions.
The main conclusion of
Moore and Zheng (2024) depends on the assumption that the Pacmans induce occlusion cues that lead to the percept of a long grating occluded by two horizontal bars. We agree that the stimuli induce those 3D occlusion cues when viewed foveally for long durations and without a specific task. Here, however, we show empirically that their stimuli, when presented in the periphery, do not seem to generate this percept. It is more likely that the occluders effectively function as additional distractors, particularly because they appear between the fixation point and the gratings.