Crowding is a prominent phenomenon in peripheral vision where nearby objects impede one's ability to identify a target of interest. The precise mechanism of crowding is not known. We used ideal observer analysis and a noise-masking paradigm to identify the functional mechanism of crowding. We tested letter identification in the periphery with and without flanking letters and found that crowding increases equivalent input noise and decreases sampling efficiency. Crowding effectively causes the signal from the target to be noisier and at the same time reduces the visual system's ability to make use of a noisy signal. After practicing identification of flanked letters without noise in the periphery for 6 days, subjects' performance in identifying flanked letters improved (reduction of crowding). Across subjects, the improvement was attributable to either a decrease in crowding-induced equivalent input noise or an increase in sampling efficiency, but seldom both. This pattern of results is consistent with a simple model whereby learning reduces crowding by adjusting the spatial extent of a perceptual window used to gather relevant input features. Following learning, subjects with inappropriately large windows reduced their window sizes, while subjects with inappropriately small windows increased their window sizes. The improvement in equivalent input noise and sampling efficiency persisted for at least 6 months.

*N*_{eq}, and (2) a “device” that reduces the available signal-to-noise ratio by effectively down-sampling the noisy input by a factor of *η* before giving it to the ideal observer for identification (Tjan et al., 1995). These two limiting factors, equivalent input noise (*N*_{eq}) and sampling efficiency (*η*), are macroscopic descriptors of a visual system since they encompass many possible mechanistic realizations.

One way to interpret *N*_{eq} is to see it as an aggregated quantity of the stochastic noises internal to the visual system that are independent of the target signal for a given task. Noise may exist at different stages of visual processing and it can also be caused by other stimuli in a visual scene. *N*_{eq} represents a fundamental limitation in the precision of measurements at different levels of abstraction.

One interpretation of *η* is that it represents the proportion of relevant information in the form of independent statistical samples that the visual system is able to use when making a perceptual decision. The mechanistic instantiation of *η* can be either deterministic or stochastic. Deterministic causes for *η* < 1 include computational steps that do not use an accurate or complete specification of the signals to be identified (imprecise template) or that consider a variety of possible signals that do not actually appear in the task (invariance, uncertainty). Stochastic causes are forms of additive internal noise with a power spectral density proportional to the signal energy of the input. This type of noise is often called “multiplicative” noise.

Under the ideal observer model, the threshold contrast energy (*E*) of the target to reach a given accuracy criterion is linearly related to the power spectral density (*N*) of the external noise with a non-positive intercept at *E* = 0:

$$E = m\,(N + N_{eq}). \tag{1}$$

The slope of the *E* vs. *N* (EvN) function is inversely proportional to sampling efficiency (*η*) and the negative of its horizontal intercept represents the amount of the equivalent input noise (*N*_{eq}) (Legge et al., 1987; Tjan et al., 1995). Specifically,

$$\eta = \frac{m_{ideal}}{m}, \tag{2}$$

where *m*_{ideal} is the EvN slope of the ideal observer. This property of the ideal observer model presents a simple method for estimating *η* and *N*_{eq}: a target is masked with white noise, and the contrast threshold for identifying the target at a given accuracy criterion is measured at various levels of the masking noise. The absolute value of *N*_{eq} and a relative value of *η* are obtained by fitting a straight line to the EvN data set. The value of *η* is the ratio of the EvN slope of the true ideal observer^{1} to that of the modeled human observer (Equation 2). We do not need to explicitly compute the EvN slope of the ideal observer to assess the effects of crowding because the ideal observer is not affected by crowding (Appendix 3); nevertheless, we provide the ideal observer slope in Appendix 3 for future reference.
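The estimation procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' analysis code: it assumes the linear form *E* = *m*(*N* + *N*_{eq}) implied by the description, uses an ordinary least-squares line fit, and runs on synthetic, noise-free thresholds. The function names `fit_evn` and `sampling_efficiency`, the intermediate noise levels, and the values of `m_true`, `n_eq_true`, and `m_ideal` are all hypothetical.

```python
import numpy as np

def fit_evn(noise_psd, threshold_energy):
    """Fit E = m * (N + N_eq) with an ordinary least-squares line.

    Returns (m, N_eq). N_eq is the negative of the horizontal intercept,
    i.e. the vertical intercept divided by the slope.
    """
    N = np.asarray(noise_psd, dtype=float)
    E = np.asarray(threshold_energy, dtype=float)
    slope, intercept = np.polyfit(N, E, 1)  # E = slope * N + intercept
    return slope, intercept / slope

def sampling_efficiency(m_human, m_ideal):
    """Sampling efficiency as the ratio of ideal to human EvN slopes."""
    return m_ideal / m_human

# Synthetic example (hypothetical values, in deg^2 for N and N_eq).
N = np.array([0.0, 60e-6, 120e-6, 241e-6])
m_true, n_eq_true = 500.0, 30e-6
E = m_true * (N + n_eq_true)

m_hat, n_eq_hat = fit_evn(N, E)
eta = sampling_efficiency(m_hat, m_ideal=8.0)
print(m_hat, n_eq_hat, eta)
```

With real threshold data the fit would normally be weighted by measurement error; the unweighted fit here only illustrates how the slope and intercept map onto *η* and *N*_{eq}.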

The external noise used to estimate *η* and *N*_{eq} masked only the target and not the flankers that closely surrounded it. Moreover, the flankers were presented at a fixed contrast, independent of target contrast. If a component of crowding is that erroneous spatial pooling causes random features from the flanker positions to be neurally superimposed on the target, then the flankers will behave like an additional source of noise, leading to an increase in *N*_{eq}. If flanker features interact with target features more selectively (e.g., a horizontal flanker feature suppresses the detection of a vertical target feature) or preferentially (e.g., a more reliably detected flanker feature is mistaken as a target feature), or if the visual system attempts to minimize crowding by being more stringent in its selection of target features, the equivalent number of target features utilized will be reduced when flankers are present, leading to a reduction in *η*. Changes in *η* and *N*_{eq} due to the presence of flankers as compared to the target-alone condition thus reveal different functional components of crowding.

*η* and *N*_{eq}, being macroscopic descriptors, do not uniquely correspond to a specific neural mechanism. The general notion of erroneous feature integration, for example, is not a precisely defined mechanism. An indiscriminate integration of flanker features at a later processing stage could yield a large *N*_{eq}, while a bias toward using features that are more reliably detected, whether appropriately from the target or inappropriately from the flankers, would lead to a decrease in *η*. As always, the most reasonable mechanistic interpretation depends on the overall pattern of the empirical findings and parsimony of the interpretation.

We are less interested in the absolute values of *η* and *N*_{eq}, although both are available (Appendices 1 and 3). Instead, we are more interested in how a subject's efficiency and equivalent noise are affected in the presence of crowding. Therefore, we will express a subject's efficiency and equivalent input noise in a target-flanked condition relative to those in a target-only (unflanked) condition. Specifically, we define the *efficiency ratio* (*η*_{r}) as the ratio of the sampling efficiency between the flanked and unflanked conditions:

$$\eta_{r} = \frac{\eta_{flanked}}{\eta_{unflanked}} = \frac{m_{unflanked}}{m_{flanked}},$$

where *m* is the slope of the EvN line of Equation 1. We also define the *equivalent noise difference* (Δ*N*_{eq}) as the difference between the flanked and unflanked conditions:

$$\Delta N_{eq} = N_{eq,\,flanked} - N_{eq,\,unflanked}.$$

*η*_{r} is the fraction of the quantity of target features used by the visual system in crowding relative to those used without crowding. Likewise, Δ*N*_{eq} can be thought of as the amount of the random flanker features, in units of noise, which are masking or mistaken for target features. Any changes in *η*_{r} and Δ*N*_{eq} after practice will inform us of the mechanistic nature of the reduction in crowding following learning.
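As a worked illustration of these two relative measures, the sketch below computes them from EvN fit parameters. It assumes that the efficiency ratio reduces to the ratio of unflanked to flanked slopes (because the ideal observer slope is the same in both conditions) and that the noise difference is taken as flanked minus unflanked. The function names are hypothetical; the input numbers are subject AL's pre-test point estimates as listed in Table A1, used here only as an example, so the result need not match the bootstrapped values reported in the text.

```python
def efficiency_ratio(m_unflanked, m_flanked):
    """eta_flanked / eta_unflanked.

    Since eta = m_ideal / m and m_ideal is shared across conditions,
    the ratio simplifies to m_unflanked / m_flanked.
    """
    return m_unflanked / m_flanked

def equivalent_noise_difference(neq_flanked, neq_unflanked):
    """Flanked minus unflanked equivalent input noise (deg^2)."""
    return neq_flanked - neq_unflanked

# Subject AL, pre-test (slopes are unitless; N_eq in deg^2).
eta_r = efficiency_ratio(m_unflanked=452.0, m_flanked=1756.0)
delta_neq = equivalent_noise_difference(neq_flanked=27.5e-5,
                                        neq_unflanked=3.21e-5)
print(round(eta_r, 4), delta_neq)
```

An efficiency ratio below 1 and a positive noise difference both indicate a cost of crowding under these conventions.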

Subject | Letter size in x-height |
---|---|
AL | 1.59° |
BW | 1.17° |
CT | 1.70° |
LM | 1.07° |
MB | 1.26° |
SL | 1.92° |

(*k* + 1) of any condition was tested only after all the conditions had been tested at least *k* times.

^{−6} deg^{2}. The mean luminance of the noise fields and the background luminance of the display were approximately 20 cd/m^{2}. Figure 1 depicts a sample of the stimuli in all eight experimental conditions of the pre- and post-tests, along with the condition used for training.

The threshold contrast energy (*E*) in units of deg^{2} is linearly related to the square of the measured threshold Weber contrast. The proportionality constant (Appendix 2) is the contrast energy, averaged over the 26 stimulus letters, when each letter is rendered with pixel luminance twice that of the background (a Weber contrast of 1.0).

A straight line (Equation 1) was fitted to the threshold contrast energy (*E*) vs. power spectral density of the masking noise (*N*) by minimizing the squared residuals defined in log(*E*) and scaled by the empirically determined standard error of log(*E*). This is because the measurement error of *E* generally increases with *E*, with a variance proportional to *E*^{2}. Bootstrapping was used to estimate the median and the 95% confidence intervals for quantities of interest.
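The fitting approach described above might look roughly like the following. This is a sketch under several assumptions that the text does not specify: the model is taken to be *E* = *m*(*N* + *N*_{eq}), the search over *N*_{eq} is a simple grid search, and the bootstrap is parametric (each log threshold is perturbed by Gaussian noise with its standard error). The function names, the grid, the standard errors, and the synthetic data are all hypothetical.

```python
import numpy as np

def fit_evn_log(N, E, se_logE, neq_grid):
    """Weighted least squares in log(E) for E = m * (N + N_eq).

    For each candidate N_eq the best log(m) is a weighted mean, so the
    fit reduces to a one-dimensional search over the N_eq grid.
    """
    N, E = np.asarray(N, float), np.asarray(E, float)
    neq_grid = np.asarray(neq_grid, float)
    se = np.broadcast_to(np.asarray(se_logE, float), E.shape)
    w = 1.0 / se ** 2
    logE = np.log(E)
    log_sum = np.log(N[None, :] + neq_grid[:, None])   # (n_grid, n_points)
    log_m = np.sum(w * (logE - log_sum), axis=1) / np.sum(w)
    resid = logE - log_m[:, None] - log_sum
    cost = np.sum(w * resid ** 2, axis=1)
    best = int(np.argmin(cost))
    return float(np.exp(log_m[best])), float(neq_grid[best])

def bootstrap_fit(N, E, se_logE, neq_grid, n_boot=200, seed=0):
    """Parametric bootstrap: median and 95% interval of (m, N_eq)."""
    rng = np.random.default_rng(seed)
    E = np.asarray(E, float)
    se = np.broadcast_to(np.asarray(se_logE, float), E.shape)
    fits = np.empty((n_boot, 2))
    for b in range(n_boot):
        E_star = np.exp(np.log(E) + rng.normal(0.0, se))
        fits[b] = fit_evn_log(N, E_star, se, neq_grid)
    median = np.median(fits, axis=0)
    lo, hi = np.percentile(fits, [2.5, 97.5], axis=0)
    return median, lo, hi

# Synthetic, noise-free thresholds (hypothetical values).
N = np.array([0.0, 60e-6, 120e-6, 241e-6])
E = 500.0 * (N + 30e-6)
se = np.full(N.shape, 0.05)
grid = np.linspace(1e-6, 100e-6, 100)   # candidate N_eq values in deg^2

m_hat, neq_hat = fit_evn_log(N, E, se, grid)
median, lo, hi = bootstrap_fit(N, E, se, grid)
print(m_hat, neq_hat, median, lo, hi)
```

Working in log(*E*) makes the residuals roughly homoscedastic when the error variance grows with *E*^{2}, which is the motivation given in the text; a continuous optimizer could replace the grid search.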

*F*(1,5) = 211.7, *p* = 0.00003; post-test: *F*(1,5) = 86.4, *p* = 0.0002; follow-up: *F*(1,4) = 259.1, *p* = 0.00009). The key numerical values extracted from Figure 3 are given in Table A1 in Appendix 1. Training was effective in improving the accuracy of identifying letters in a flanked condition at full contrast without noise. The average accuracy during the training sessions improved from 50% on Day 1 to 61% on Day 6 (Table A2, *F*(1,5) = 37.6, *p* = 0.002), replicating a finding in Chung (2007).

*η*_{r} was 0.22 before training and 0.36 after training), and the equivalent noise difference was never close to zero (mean Δ*N*_{eq} was 211 × 10^{−6} deg^{2} before training and 133 × 10^{−6} deg^{2} after training; in comparison, the strongest external noise used in the experiments was 241 × 10^{−6} deg^{2} in power spectral density). The reduction in efficiency and the increase in equivalent input noise due to crowding are consistent with the findings of Nandy and Tjan (2007) in that fewer appropriate features and more inappropriate features, respectively, are utilized in crowding.

*F*(1,5) = 6.46, *p* = 0.052) and no significant reduction in the equivalent noise difference (a mean of 211 × 10^{−6} deg^{2} pre-test vs. 133 × 10^{−6} deg^{2} post-test, *F*(1,5) = 1.36, *p* = 0.30). Such group analyses are misleading, however, because each individual subject did have a significant reduction in crowding as shown in Figures 3 and 4. Table 2 summarizes the improvements in the efficiency ratio and equivalent noise difference for individual subjects. With the exception of subjects AL and CT, who improved in both efficiency and equivalent input noise, a majority of the subjects (four out of six) improved in only one of the two quantities. Furthermore, if we perform a median split on the data, we find that all three subjects with equivalent noise differences above the median improved by reducing their equivalent input noise, while all three subjects with efficiency ratios below the median improved by increasing their efficiency. The data suggest that these two forms of improvements can be mutually exclusive. We postulate that the practice-induced improvement may affect only a single perceptual factor. We shall return to this point in the Discussion section.

Subject | Improvement in *η*_{r} | Improvement in Δ*N*_{eq} |
---|---|---|
AL | ✓ | ✓ |
BW | × | ✓ |
CT | ✓ | ✓ |
LM | ✓ | × |
MB | ✓ | × |
SL | × | ✓ |

Unflanked

Test | Subject | *m* (unitless) | *m* 95% CI lower bound | *m* 95% CI upper bound | *N*_{eq} (× 10^{−5} deg^{2}) | *N*_{eq} 95% CI lower bound (× 10^{−5} deg^{2}) | *N*_{eq} 95% CI upper bound (× 10^{−5} deg^{2}) |
---|---|---|---|---|---|---|---|
Pre-test | AL | 452 | 401 | 511 | 3.21 | 2.67 | 3.74 |
Pre-test | BW | 555 | 489 | 844 | 1.89 | −1.24 | 2.05 |
Pre-test | CT | 647 | 582 | 720 | 2.06 | 1.95 | 2.54 |
Pre-test | LM | 469 | 421 | 519 | 3.22 | 2.70 | 3.32 |
Pre-test | MB | 432 | 383 | 510 | 3.11 | 2.50 | 3.90 |
Pre-test | SL | 655 | 561 | 1115 | 2.30 | −1.46 | 2.48 |
Post-test | AL | 359 | 319 | 402 | 3.71 | 3.25 | 4.66 |
Post-test | BW | 514 | 458 | 804 | 2.18 | −1.41 | 2.45 |
Post-test | CT | 477 | 426 | 537 | 2.33 | 2.17 | 2.87 |
Post-test | LM | 404 | 353 | 462 | 4.11 | 3.71 | 5.59 |
Post-test | MB | 310 | 280 | 344 | 2.83 | 2.32 | 2.90 |
Post-test | SL | 394 | 351 | 444 | 3.22 | 2.74 | 3.76 |
Follow-up test | AL | 326 | 285 | 373 | 4.51 | 3.82 | 4.71 |
Follow-up test | BW | NA | NA | NA | NA | NA | NA |
Follow-up test | CT | 488 | 431 | 551 | 2.39 | 2.05 | 2.38 |
Follow-up test | LM | 526 | 457 | 601 | 2.59 | 2.10 | 3.06 |
Follow-up test | MB | 288 | 256 | 328 | 3.91 | 3.27 | 4.46 |
Follow-up test | SL | 593 | 524 | 867 | 1.40 | −1.04 | 1.59 |

Flanked

Test | Subject | *m* (unitless) | *m* 95% CI lower bound | *m* 95% CI upper bound | *N*_{eq} (× 10^{−5} deg^{2}) | *N*_{eq} 95% CI lower bound (× 10^{−5} deg^{2}) | *N*_{eq} 95% CI upper bound (× 10^{−5} deg^{2}) |
---|---|---|---|---|---|---|---|
Pre-test | AL | 1756 | 725 | 2585 | 27.5 | 15.4 | 73.0 |
Pre-test | BW | 2523 | 1312 | 3988 | 35.9 | 20.6 | 79.7 |
Pre-test | CT | 6596 | 4154 | 9021 | 10.1 | 7.38 | 18.1 |
Pre-test | LM | 2051 | 1645 | 2512 | 10.9 | 8.57 | 18.8 |
Pre-test | MB | 2612 | 2133 | 3115 | 6.59 | 4.28 | 8.44 |
Pre-test | SL | 2019 | 997 | 3109 | 39.0 | 25.7 | 99.4 |
Post-test | AL | 660 | 530 | 801 | 14.7 | 12.3 | 18.5 |
Post-test | BW | 3658 | 3097 | 4227 | 4.44 | 3.68 | 5.65 |
Post-test | CT | 1461 | 1196 | 1785 | 4.49 | 2.86 | 7.08 |
Post-test | LM | 1269 | 786 | 1709 | 20.2 | 15.8 | 40.8 |
Post-test | MB | 769 | 648 | 910 | 7.08 | 5.32 | 8.28 |
Post-test | SL | 871 | 686 | 1070 | 10.7 | 7.63 | 14.3 |
Follow-up test | AL | 998 | 826 | 1174 | 7.13 | 5.19 | 9.27 |
Follow-up test | BW | NA | NA | NA | NA | NA | NA |
Follow-up test | CT | 2247 | 1836 | 3768 | 2.17 | −1.23 | 2.61 |
Follow-up test | LM | 1731 | 1334 | 2147 | 9.72 | 8.01 | 14.8 |
Follow-up test | MB | 775 | 640 | 927 | 9.13 | 7.11 | 11.0 |
Follow-up test | SL | 1865 | 1587 | 2166 | 3.12 | 2.90 | 4.37 |

Subject | Average accuracy during training, Day 1 | Average accuracy during training, Day 6 | (Day 6) − (Day 1) |
---|---|---|---|
AL | 0.488 | 0.563 | 0.075 |
BW | 0.462 | 0.548 | 0.086 |
CT | 0.474 | 0.630 | 0.156 |
LM | 0.579 | 0.634 | 0.055 |
MB | 0.514 | 0.672 | 0.158 |
SL | 0.508 | 0.639 | 0.131 |

*Contrast energy* of a stimulus is defined as the sum of the squared pixel contrast over the signal region of the stimulus multiplied by the area of a stimulus pixel. The “signal region” for the current experiment is defined to be the same as the rectangular region masked by the external noise. The threshold contrast energy (*E*) reported in the current study is the contrast energy at threshold contrast (*c*) averaged over the 26 letter stimuli. Specifically,

$$E = c^{2}\,\frac{\Delta x\,\Delta y}{26}\sum_{j=1}^{26}\sum_{i \in S} t_{j,i}^{2},$$

where Δ*x* = Δ*y* = 0.0777° is the width and height of a stimulus pixel, *S* is the signal region, and *t*_{j,i} is the contrast of pixel *i* of letter *j* when the letter is presented at a contrast of 1.0. For the letter stimuli used in the current experiment, the scaling constant between *E* and *c*^{2} varies with letter size and was numerically determined for each subject (in units of deg^{2}): 3.3441 (AL), 2.1874 (BW), 4.3254 (CT), 1.9462 (LM), 2.3135 (MB), and 5.2404 (SL).
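The definition above translates directly into a short computation. The sketch below is illustrative only: it assumes the templates are stored as arrays of pixel contrast at a letter contrast of 1.0, uses two tiny made-up templates instead of the 26 letters, and picks an arbitrary threshold contrast. The function names and the toy values are hypothetical; the pixel size and AL's scaling constant come from the text.

```python
import numpy as np

PIXEL_DEG = 0.0777  # width and height of a stimulus pixel, in degrees

def threshold_contrast_energy(c, templates, pixel_deg=PIXEL_DEG):
    """Contrast energy at contrast c, averaged over letter templates.

    templates: array of shape (n_letters, height, width) holding the
    pixel contrast of each letter when rendered at a contrast of 1.0.
    """
    templates = np.asarray(templates, dtype=float)
    per_letter = np.sum(templates ** 2, axis=(1, 2)) * pixel_deg ** 2
    return c ** 2 * per_letter.mean()

def energy_from_scaling_constant(c, k):
    """E = k * c^2, with k the per-subject scaling constant in deg^2."""
    return k * c ** 2

# Two toy 4x4 "letters" (hypothetical) with 8 and 6 unit-contrast pixels.
toy_templates = np.zeros((2, 4, 4))
toy_templates[0, :2, :] = 1.0
toy_templates[1, 1:3, 1:4] = 1.0
E_toy = threshold_contrast_energy(0.5, toy_templates)

# Subject AL's scaling constant with a hypothetical threshold contrast of 0.1.
E_AL = energy_from_scaling_constant(0.1, 3.3441)
print(E_toy, E_AL)
```

Once the scaling constant is known for a letter size, converting a measured threshold contrast to energy needs only the second function.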

*Noise power spectral density* (*N*) for the white noise used in the experiments (pixel-wise contrast noise of independent and identically distributed (iid) Gaussian with zero mean) is equal to the variance of a noise pixel divided by the 2-sided bandwidth of the noise; the 2-sided bandwidth of the noise is equal to the reciprocal of the area of a stimulus pixel. That is,

$$N = c_{noise}^{2}\,\Delta x\,\Delta y,$$

where *c*_{noise} is the rms contrast of the noise.
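As a quick numerical check of this definition, the variance of a noise pixel multiplied by the pixel area gives the power spectral density. In the sketch below, the function name and the rms contrast of 0.2 are hypothetical choices for illustration; only the pixel size of 0.0777° comes from the text.

```python
def noise_psd(c_noise, pixel_deg=0.0777):
    """Power spectral density (deg^2) of iid pixel noise.

    Variance (c_noise^2) divided by the 2-sided bandwidth, where the
    bandwidth is the reciprocal of the pixel area.
    """
    pixel_area = pixel_deg ** 2
    return c_noise ** 2 * pixel_area

N_example = noise_psd(0.2)  # hypothetical rms noise contrast of 0.2
print(N_example)
```

Under these assumptions an rms contrast of 0.2 yields roughly 241 × 10^{−6} deg^{2}, the same order as the strongest external noise reported for the experiments.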

Given a stimulus image *I*, the statistically optimal decision rule is to make the response that “center letter is *r*” for the most probable *r*:

$$\hat{r} = \arg\max_{r} P(r \mid I) = \arg\max_{r} P(r \mid I_{target}),$$

where *I*_{target} is the region of the image where the target letter is presented. Whether the target letter is flanked is irrelevant because the ideal observer does not have spatial uncertainty and the target letter and flankers do not overlap spatially. Applying Bayes' rule, ignoring scaling factors that do not depend on *r*, and knowing that (1) each letter is equally likely to be the target, and (2) the masking contrast noise comprises an independent and identically distributed Gaussian on each pixel with a mean of 0 and a standard deviation of *c*_{noise}, we have

$$\hat{r} = \arg\max_{r} \exp\!\left(-\frac{\lVert I_{target} - c\,T_{r} \rVert^{2}}{2\,c_{noise}^{2}}\right) = \arg\min_{r} \lVert I_{target} - c\,T_{r} \rVert^{2},$$

where *c* is the test contrast of the target letter, and *T*_{r} is the template of letter *r* at a contrast of 1.0.
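With equally likely letters and iid Gaussian pixel noise, maximizing the posterior is equivalent to choosing the template closest to the stimulus in squared Euclidean distance. The sketch below shows that rule on made-up data. It is not the efficient implementation cited in the text: the function name is hypothetical, the templates are random binary patterns rather than letters, and the contrast and noise level are arbitrary.

```python
import numpy as np

def ideal_observer_response(image_target, templates, c):
    """Index r minimizing ||I_target - c * T_r||^2.

    This is the maximum a posteriori choice when all templates are
    equally likely and the noise is iid zero-mean Gaussian per pixel.
    """
    diffs = image_target[None, ...] - c * templates
    dist2 = np.sum(diffs ** 2, axis=tuple(range(1, diffs.ndim)))
    return int(np.argmin(dist2))

rng = np.random.default_rng(0)
# 26 hypothetical 8x8 binary templates standing in for letter images.
templates = rng.choice([0.0, 1.0], size=(26, 8, 8))
c, c_noise = 1.0, 0.1
true_r = 7
stimulus = c * templates[true_r] + rng.normal(0.0, c_noise, size=(8, 8))

response = ideal_observer_response(stimulus, templates, c)
print(response)
```

Repeating such trials over many noise samples and adjusting `c` until accuracy reaches the criterion is one way to obtain an ideal observer threshold by simulation.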

We determined the contrast *c* that led to a letter identification accuracy of 50% with *c*_{noise} set at a convenient value of 1.0 (luminance is unbounded in numerical simulation, hence no issue of noise clipping). Using an efficient implementation of the ideal observer described in Tjan and Legge (1998),^{2} we ran 20 simulations, each consisting of 7800 trials to test each letter 300 times with different noise samples per simulation. Corresponding to the letter stimuli used with each subject in the experiment, the ideal observer slopes (*m*_{ideal}) were found to be (± *SE*): 8.126 ± 0.025 (AL), 8.336 ± 0.029 (BW), 8.192 ± 0.028 (CT), 8.488 ± 0.030 (LM), 8.103 ± 0.031 (MB), and 8.201 ± 0.025 (SL). The ratio *m*_{ideal}/*m* is the sampling efficiency for a human observer with an EvN slope of *m*. The subjects' EvN slopes from all the test conditions are provided in Table A1. (The same *m*_{ideal} for a subject applies to all conditions for that subject.)

^{1}We distinguish between an ideal observer, which is the statistically optimal observer for the given stimuli and task, and an ideal observer model, which is a model of a human observer based on an ideal observer with respect to the stimuli, task, and the explicitly stated limiting factors, such as internal noise and down-sampling.

^{2}The last line of Equation A3 in Tjan and Legge (1998) should read: −2*a*^{2}*XT* − 2*aσNT* + *a*^{2}*TT*. The error was typographical and did not affect their implementation.