Objects in natural scenes are spatially broadband; in contrast, feature detectors in the early stages of visual processing are narrowly tuned in spatial frequency. Earlier studies of feature integration using gratings suggested that integration across spatial frequencies is suboptimal. Here we re-examined this conclusion using a letter identification task at the fovea and at 10 deg in the lower visual field. We found that integration across narrow-band (1-octave) spatial frequency components of letter stimuli is optimal in the fovea. Surprisingly, this optimality is preserved in the periphery, even though feature integration is known to be deficient in the periphery from studies of other form-vision tasks such as crowding. A model that is otherwise a white-noise ideal observer except for a limited spatial resolution defined by the human contrast sensitivity function and using internal templates slightly wider in bandwidth than the stimuli is able to account for the human data. Our findings suggest that deficiency in feature integration found in peripheral vision is not across spatial frequencies.

*selection* along the dimension of spatial frequency (for the purpose of letter identification) is optimal in both the fovea and the periphery.

*integration* across spatial frequencies efficient in the fovea and in the periphery for the identification of an isolated letter? “Feature” refers to any aspect of a stimulus that carries information relevant to a given task. A feature is therefore task dependent: a feature that is useful for detecting a target may not be useful for discriminating it from a distracter. Letters, like most visual forms, comprise a broad range of spatial-frequency components that bear shape information. Efficient integration across spatial frequencies is thus a prerequisite for efficient form vision.

$f/2$ and $2f$). We also measured the threshold for identifying letters composed by adding these two frequency components together.

$f_1$ and $f_2$, respectively. We can form a composite stimulus by simply adding the two narrow-band components (Figure 1C).

Let $c_f$ be the contrast threshold for identifying a narrow-band letter at center frequency $f$ and $CS_f = 1/c_f$ be the corresponding contrast sensitivity. If there is no overlap in the frequency domain between the components with center frequencies $f_1$ and $f_2$ (i.e., the components are orthogonal, with a zero dot product), then for an ideal observer limited by an additive Gaussian equivalent input noise, the contrast sensitivity for the composite can be predicted from those for the components (Appendix C):

*a posteriori* ideal observer but differs from it by making use of only a constant fraction of the signal-to-noise ratio of the stimulus (Legge, Kersten, & Burgess, 1987; Peli, 1990; Tjan, Braje, Legge, & Kersten, 1995; see also Appendix C). If the component stimuli are handled independently by the visual system and if the information across frequency channels is optimally combined, then Φ will be equal to 1. If, on the other hand, the information across the frequency channels is not optimally combined, Φ will be less than 1.
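For orientation, the relation that the text calls Equation 1 and an integration index matching the description of Φ can be sketched as below. Both expressions are reconstructions from the orthogonality argument and from the stated behavior of Φ; their exact published form is an assumption.

```latex
\begin{align*}
CS_{f_1+f_2}^{2} &= CS_{f_1}^{2} + CS_{f_2}^{2}, \\
\Phi &= \frac{CS_{f_1+f_2}^{2}}{CS_{f_1}^{2} + CS_{f_2}^{2}} .
\end{align*}
```

Under this form, Φ = 1 when the measured composite sensitivity equals the quadratic sum of the component sensitivities, and Φ < 1 when it falls short of that prediction.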

*a posteriori* observer with additive Gaussian equivalent input noise. The derivation of the base case for Equation 1 relies on the linear proportionality between signal contrast and the Euclidean distance between pairs of alternatives in the internal decision space. For a 2-way discrimination task, this means that we are assuming that the psychometric function of $d'$ vs. contrast is linear, which has been shown to be the case for a contrast discrimination task when the target contrast is above detection threshold (“supra-threshold”) (Legge et al., 1987). The details of this derivation are provided in Appendix C. Appendix C also shows that this base case can be generalized to other types of observer models for which the psychometric function is not linear. In particular, we are able to show that Equation 1 either holds analytically or is a good approximation for observers with contrast-dependent noise (“multiplicative” noise) and, in supra-threshold conditions, for observers with a nonlinear transducer.

$f_1$ and $f_2$, of the component stimuli is important. If there is a large difference in contrast sensitivity between the components, then it follows from Equation 1 that the sensitivity for the composite under optimal integration will be very similar to the sensitivity for the more sensitive component, making it difficult to distinguish between optimal and sub-optimal integration. We chose to address this issue by first measuring an observer's spatial tuning function (contrast sensitivity vs. stimulus center frequency) for letter identification. Previous work (Chung et al., 2002) has shown that such a tuning function is roughly symmetric about the peak tuning frequency when frequency is expressed in log units. Given the tuning function, we chose two components of roughly equal sensitivity by selecting their center frequencies as plus and minus one octave from the peak tuning frequency:
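The reasoning in this paragraph can be illustrated numerically. The sketch below assumes that Equation 1 has the quadratic-summation form (composite sensitivity squared equals the sum of the squared component sensitivities), which is an inference from the orthogonality argument. The sensitivity values 10.0 and 1.0 are hypothetical, and the function names are illustrative only.

```python
import math

def component_frequencies(f_peak):
    """Center frequencies one octave below and one octave above the peak."""
    return f_peak / 2.0, f_peak * 2.0

def predicted_composite_cs(cs1, cs2):
    """Composite sensitivity under quadratic summation (assumed form of Equation 1)."""
    return math.hypot(cs1, cs2)

# 2.54 is the foveal peak tuning frequency listed for subject ASN in Table 2.
f_low, f_high = component_frequencies(2.54)

# Equal component sensitivities: optimal integration yields a sqrt(2) gain.
gain_equal = predicted_composite_cs(10.0, 10.0) / 10.0
# Very unequal sensitivities: the composite is dominated by the better component,
# so optimal and sub-optimal integration become hard to tell apart.
gain_unequal = predicted_composite_cs(10.0, 1.0) / 10.0
print(f_low, f_high, round(gain_equal, 3), round(gain_unequal, 3))
```

With equal sensitivities the predicted gain is about 41%, whereas a tenfold sensitivity difference leaves a gain of well under 1%, which is why components of roughly equal sensitivity were chosen.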

$n$th repeat of a particular center-frequency block occurred only after all frequencies had been tested for at least $n - 1$ blocks. This was done to distribute the blocks of each condition evenly throughout the experiment and so prevent practice effects from confounding experimental manipulations.

$x$% to a filtered letter that was derived from an unfiltered letter of $x$% Weber contrast and a filter with a modulation transfer function of unit peak amplitude.

*peripheral* letter acuity in both foveal and peripheral viewing conditions. The letter size (the height of a lowercase “x”) used for each of our subjects is shown in Table 1. The letter images were each spatially filtered by a set of raised cosine filters (Alexander, Xie, & Derlacki, 1994; Chung et al., 2002; Chung, Levi, & Legge, 2001; Peli, 1990). Each filter has a bandwidth (full-width at half-height) of 1 octave and is radially symmetric in the log-frequency domain. The transfer function of the filter at radial frequency $f_r$ is given by

where $f_{\mathrm{ctr}}$ is the spatial frequency corresponding to the peak amplitude of the filter (center frequency) and $f_{\mathrm{cut}}$ is the frequency at which the amplitude of the filter drops to zero (cutoff frequency). Figure 2 depicts examples of filtered images for the letter “a”.
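A raised cosine filter of this kind can be sketched as follows. The exact transfer function is not reproduced in the text, so the formula in the docstring is an assumption chosen to satisfy the stated properties: unit gain at the center frequency, zero gain at the cutoff, and symmetry in log frequency. The function name is illustrative only.

```python
import numpy as np

def raised_cosine_log_filter(f_r, f_ctr, f_cut):
    """Amplitude of a raised-cosine filter that is symmetric in log frequency.

    Assumed form: 0.5 * (1 + cos(pi * log(f_r / f_ctr) / log(f_cut / f_ctr)))
    inside the pass band, and 0 outside it.
    """
    f_r = np.atleast_1d(np.asarray(f_r, dtype=float))
    gain = np.zeros_like(f_r)
    positive = f_r > 0                        # DC receives zero gain
    x = np.log(f_r[positive] / f_ctr) / np.log(f_cut / f_ctr)
    inside = np.abs(x) < 1.0
    gain[positive] = np.where(inside, 0.5 * (1.0 + np.cos(np.pi * x)), 0.0)
    return gain

f_ctr = 2.5               # cycles/letter, one of the tested center frequencies
f_cut = 2.0 * f_ctr       # a cutoff one octave above center gives a 1-octave FWHH
probe = np.array([f_ctr / 2, f_ctr / 2 ** 0.5, f_ctr, f_ctr * 2 ** 0.5, f_cut])
amplitudes = raised_cosine_log_filter(probe, f_ctr, f_cut)
print(np.round(amplitudes, 3))
```

With the cutoff placed one octave above the center, the gain falls to one half at half an octave on either side of the center, so the full width at half height is 1 octave. To filter an image, the same function would be evaluated on the radial frequency of each Fourier coefficient.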

| Subject | Letter size |
|---|---|
| ASN | 0.96° |
| BW | 1.15° |
| PLB | 1.11° |
| JS | 1.18° |

*e*) of the Gaussian envelope of the gratings was set to 4.7 deg. A QUEST procedure was used to adjust the contrast of the gratings to achieve an accuracy level of 79%.

$f_{\mathrm{ctr}}$) were logarithmically spaced at 1.25, 1.77, 2.5, 3.54, 5.0, and 7.07 cycles/letter. Figure 2 shows an example of the filtered stimuli. For each trial, the nominal contrast of the filtered letters was adjusted using a QUEST procedure to achieve a performance threshold of 50% correct (chance was 1/26 or 3.8%). There were 240 trials per center frequency, broken into 4 blocks of 60 trials each.
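The listed center frequencies are consistent with half-octave steps starting at 1.25 cycles/letter. The short check below reproduces them along with the chance level and trial count. The half-octave spacing is inferred from the numbers and is not stated explicitly in the text.

```python
# Six center frequencies in half-octave steps from 1.25 cycles/letter.
centers = [1.25 * 2 ** (k / 2) for k in range(6)]
rounded = [round(f, 2) for f in centers]

chance_percent = 100.0 / 26        # 26-alternative letter identification
trials_per_frequency = 4 * 60      # 4 blocks of 60 trials

print(rounded, round(chance_percent, 1), trials_per_frequency)
```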

$r^2 > 0.97$). The equation of the parabolic function is given by

where $CS_f$ is the contrast sensitivity at frequency $f$, $A$ is the peak sensitivity, $f_{\mathrm{peak}}$ is the peak tuning frequency, and $\sigma$ is the bandwidth of the function in octaves.
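One way to fit such a function is to regress log sensitivity on a quadratic in log frequency. The sketch below is a minimal illustration under assumptions: the parametrization (log10 sensitivity against log2 frequency, with bandwidth reported as full width at half height) may differ from the definition of $\sigma$ used in the paper, and the data are synthetic values made up for the check.

```python
import numpy as np

def fit_log_parabola(freqs, sensitivities):
    """Fit log10(CS) as a quadratic in log2(f); return (A, f_peak, fwhh_octaves)."""
    x = np.log2(np.asarray(freqs, dtype=float))
    y = np.log10(np.asarray(sensitivities, dtype=float))
    a, b, c = np.polyfit(x, y, 2)             # y = a*x**2 + b*x + c, with a < 0
    x_peak = -b / (2.0 * a)
    peak_cs = 10.0 ** (c - b * b / (4.0 * a))
    # Half height: log10(CS) drops by log10(2) relative to the peak.
    half_width = np.sqrt(np.log10(2.0) / -a)
    return peak_cs, 2.0 ** x_peak, 2.0 * half_width

# Synthetic tuning function with made-up parameters (not data from the study).
freqs = np.array([1.25, 1.77, 2.5, 3.54, 5.0, 7.07])
true_A, true_peak, true_fwhh = 40.0, 2.5, 2.0
a_true = -np.log10(2.0) / (true_fwhh / 2.0) ** 2
cs = true_A * 10.0 ** (a_true * (np.log2(freqs) - np.log2(true_peak)) ** 2)

A, f_peak, fwhh = fit_log_parabola(freqs, cs)
print(round(A, 3), round(f_peak, 3), round(fwhh, 3))
```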

*x*-height). For the same letter size, the average peak tuning frequency in the periphery was lower than that in the fovea ($t(6) = 3.7258$, $p < 0.01$). The average tuning bandwidth was about 2.05 and 1.69 octaves in the fovea and periphery, respectively (Table 2). The tuning bandwidths in the periphery were slightly but significantly narrower than those in the fovea ($t(6) = 7.3294$, $p < 0.01$). As will be shown with the ideal-observer analysis, an LTF with a narrower bandwidth is indicative of the observer using perceptual templates of *broader* bandwidths. These results are generally consistent with those obtained in Chung et al. (2002). By using the same letter size in both the foveal and the peripheral conditions, the current study is more sensitive to differences in spatial tuning properties between foveal and peripheral vision.

| Subject | $f_{\mathrm{peak}}$, fovea | $f_{\mathrm{peak}}$, 10° | Bandwidth (octaves), fovea | Bandwidth (octaves), 10° |
|---|---|---|---|---|
| ASN | 2.54 | 2.07 | 1.96 | 1.67 |
| BW | 2.99 | 2.23 | 2.09 | 1.72 |
| PLB | 2.46 | 2.29 | 2.14 | 1.62 |
| JS | 2.82 | 2.25 | 2.00 | 1.74 |
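The reported test statistics can be checked against the values in Table 2. The sketch below assumes a two-sample t test with pooled variance. That choice is an inference from the 6 degrees of freedom reported for four subjects per condition, not something stated in the text.

```python
import math
from statistics import mean, variance

def pooled_t(sample_a, sample_b):
    """Two-sample t statistic with pooled variance; returns (t, degrees of freedom)."""
    n_a, n_b = len(sample_a), len(sample_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)) / df
    se = math.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
    return (mean(sample_a) - mean(sample_b)) / se, df

# Values copied from Table 2 (subjects ASN, BW, PLB, JS).
peak_fovea, peak_periphery = [2.54, 2.99, 2.46, 2.82], [2.07, 2.23, 2.29, 2.25]
bw_fovea, bw_periphery = [1.96, 2.09, 2.14, 2.00], [1.67, 1.72, 1.62, 1.74]

t_peak, df_peak = pooled_t(peak_fovea, peak_periphery)
t_bw, df_bw = pooled_t(bw_fovea, bw_periphery)
print(round(t_peak, 4), df_peak, round(t_bw, 4), df_bw)
```

Under this assumption the computed statistics agree with the reported $t(6)$ values for peak frequency and bandwidth, and the mean bandwidths round to the 2.05 and 1.69 octaves quoted in the text.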

$f_{\mathrm{peak}}$; (b) the high frequency component by setting the center frequency to twice $f_{\mathrm{peak}}$; and (c) the composite stimulus by summing the two components at the same contrast ratio as the components would have in an unfiltered letter (i.e., a contrast ratio of 1 in units of nominal contrast). See Figure 3 for an example of the stimuli for this experiment. As in Experiment 1, we measured the nominal contrast thresholds of these three stimulus categories by using an adaptive QUEST procedure to attain a performance level of 50% accuracy. The integration index was then calculated according to Equation 2.
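The index calculation can be sketched from the three measured thresholds. Equation 2 is not reproduced in the text, so the form used here (squared composite sensitivity divided by the sum of squared component sensitivities) is an assumption consistent with an index of 1 under optimal integration. The threshold values are hypothetical.

```python
def integration_index(c_low, c_high, c_composite):
    """Integration index from threshold nominal contrasts (assumed form of Equation 2)."""
    cs_low, cs_high, cs_comp = 1.0 / c_low, 1.0 / c_high, 1.0 / c_composite
    return cs_comp ** 2 / (cs_low ** 2 + cs_high ** 2)

# Hypothetical thresholds (nominal contrast as a fraction), not measured data.
c_low, c_high = 0.10, 0.10
optimal_composite = 0.10 / 2 ** 0.5     # threshold predicted by quadratic summation
phi_optimal = integration_index(c_low, c_high, optimal_composite)
phi_no_gain = integration_index(c_low, c_high, 0.10)   # composite no better than a component
print(round(phi_optimal, 3), round(phi_no_gain, 3))
```

With equal component thresholds, a composite threshold that is lower by a factor of the square root of 2 gives an index of 1, while a composite threshold equal to the component thresholds gives an index of 0.5.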

^{1}

$r^2 = 0.97$). The internal noise of the model was adjusted such that at the 50% correct threshold criterion, the peak amplitude of the model's most sensitive tuning function (obtained with the 1-octave templates that matched the bandwidth of the stimuli) is the same as the peak amplitude of the subject's tuning function. Figure 8 compares the bandwidths and peak tuning frequencies of the tuning functions for our family of ideal-observer models to those of the human observers.

^{2} Ideal-observer models that use internal templates wider than 3 octaves produce tuning functions that are significantly narrower than the human curves (Figure 8A) and also lead to integration indices that are significantly greater than 1.0. The latter arises because the templates used for the component stimuli are no longer orthogonal. The integration indices for these nonorthogonal channels increase monotonically with increasing degree of overlap.

^{3}

*f*) = $k \times \mathrm{CSF}(f) \times \mathrm{LSF}(f)$. This simple model argues against the need to posit an active channel selection mechanism that scales with the size of the stimulus. This model corresponds to our ideal-observer model with 1-octave templates. Our simulation results show that while the Chung et al. model is able to predict human performance for cross-spatial-frequency integration, a better fit to the human letter tuning functions can be achieved with 2-octave templates at the same center frequency as the test stimulus. With or without modification to the template bandwidth, the view expressed in Chung et al. (2002) against an active scale-dependent channel selection process remains unchanged. Our results also support the general claim of Chung et al. that the peripheral visual system, like the foveal system, uses the appropriate set of spatial-frequency features from the input when performing a letter identification task.

where $A$ is the peak sensitivity of the CSF, $f$ is the spatial frequency, $CS_f$ is the contrast sensitivity at frequency $f$, $f_{\mathrm{peak}}$ is the spatial frequency at peak sensitivity, and $\sigma$ is the bandwidth of one limb of the function in octaves. Equations A1 and A2 provide accurate interpolation of the CSF within the relevant range of spatial frequencies needed for simulating our ideal-observer model.

$S$, and the ideal observer (a maximum *a posteriori* (MAP) classifier). A more detailed description of the decision rule for such a classifier can be found elsewhere (Tjan et al., 1995). Here we restate the derivations that are specific to our current application.

Let $G(f)$ denote the transfer function of the CSF (Appendix A) and $N_{\mathrm{int}}$ be the noise source, with each noise pixel being normally distributed with a mean of zero and standard deviation $\sigma$. Then the resulting input that is fed into the MAP classifier is given by

where $F\{\cdot\}$ and $F^{-1}\{\cdot\}$ represent forward and inverse Fourier transform operations and $S$ is the filtered letter stimulus.
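The front end described here can be sketched with a discrete Fourier transform. The display equation is not reproduced in the text, so the form used below (filter the stimulus by the transfer function in the frequency domain, then add the internal noise) is an assumption based on the surrounding definitions. The all-pass transfer function and the random stand-in image are placeholders, not the CSF or stimuli of the study.

```python
import numpy as np

def csf_limited_input(stimulus, transfer_fn, noise_sd, rng):
    """Filter an image by a radially symmetric transfer function, then add white noise."""
    rows, cols = stimulus.shape
    fy = np.fft.fftfreq(rows)[:, None]        # cycles/pixel along rows
    fx = np.fft.fftfreq(cols)[None, :]        # cycles/pixel along columns
    radial = np.hypot(fx, fy)                 # radial frequency of each coefficient
    filtered = np.fft.ifft2(np.fft.fft2(stimulus) * transfer_fn(radial)).real
    return filtered + rng.normal(0.0, noise_sd, size=stimulus.shape)

rng = np.random.default_rng(0)
stimulus = rng.standard_normal((32, 32))      # stand-in for a filtered letter image

def all_pass(radial):
    """Placeholder transfer function with unit gain at every frequency."""
    return np.ones_like(radial)

noiseless = csf_limited_input(stimulus, all_pass, noise_sd=0.0, rng=rng)
noisy = csf_limited_input(stimulus, all_pass, noise_sd=0.5, rng=rng)
```

A real simulation would replace `all_pass` with the interpolated CSF and convert cycles/pixel to the frequency units in which the CSF is expressed.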

$T$) that is the most probable given the input $I$:

where $c$ is the internal template contrast, which is not necessarily the same as the stimulus contrast. Since the prior probability $P(cT_j)$ is constant within the range of feasible contrasts (all letters are equally likely), and $P(I)$ does not depend on $T_j$, we can further simplify the posterior probability as follows:

$T_{j\,\mathrm{MAP}}$ where

**Claim 1**: *Let* $c_f$ *be the contrast threshold in nominal contrast units for identifying a narrow-band letter at center frequency* $f$*. If there is no overlap in the frequency domain between the components with center frequencies* $f_1$ *and* $f_2$ *(i.e., the components are orthogonal, with a zero dot product), then for an ideal observer limited by an invariant (stimulus-independent) additive white input noise (at the front-end), the contrast sensitivity (reciprocal of the threshold contrast) for the composite can be predicted from those for the components:*

*a posteriori* classifier) with white additive input noise makes decisions by maximizing the posterior probability given by Equation B3. The posterior is computed in terms of the squared Euclidean distance between the input image ($I$) and the letter templates at the test contrast ($cT_j$), normalized by the noise variance, which is a constant for invariant additive noise. As a result, the pairwise Euclidean distances between the letters at the test contrast jointly determine the average accuracy (Tjan et al., 1995, Appendix B). Hence, without loss of generality, we can confine our derivation to the Euclidean distance between two randomly chosen letters. At a given criterion accuracy, the Euclidean distance between the generic letter pair is a constant.
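The decision rule described in this paragraph reduces, for equal priors and white Gaussian noise, to picking the template nearest to the input. The sketch below illustrates that reduction only. It assumes the template contrast is known and equal to the test contrast, and it uses random arrays as stand-ins for letter templates.

```python
import numpy as np

def map_classify(image, templates, contrast):
    """Return the index of the template closest to the image at the given contrast.

    With equal priors and white Gaussian noise, maximizing the posterior is
    equivalent to minimizing the squared Euclidean distance ||I - c*T_j||^2.
    """
    diffs = image[None, ...] - contrast * templates
    sq_dist = np.sum(diffs ** 2, axis=tuple(range(1, diffs.ndim)))
    return int(np.argmin(sq_dist))

rng = np.random.default_rng(1)
templates = rng.standard_normal((26, 16, 16))    # random stand-ins for 26 letter templates
contrast, noise_sd = 0.5, 0.05
target = 7
image = contrast * templates[target] + rng.normal(0.0, noise_sd, size=(16, 16))
decision = map_classify(image, templates, contrast)
print(decision)
```

Because the noise variance is the same for every template, it scales all distances equally and drops out of the comparison, which is why only the pairwise distances between templates matter for accuracy.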

Let $A$ and $B$ be the templates of two letters such that at contrast $c$, the signals corresponding to these letters are $cA$ and $cB$, and their Euclidean distance is $c\|A - B\|$. For a composite at contrast $c$, the nominal contrast of its components is also $c$ by definition. This definition of contrast conveniently reflects the equality $cA_{f_1+f_2} = c(A_{f_1} + A_{f_2}) = cA_{f_1} + cA_{f_2}$, where $c$ is the nominal contrast of $A_{f_1}$, $A_{f_2}$, and $A_{f_1+f_2}$. Now consider the Euclidean distance between two composite letters at nominal contrast $c$:

$A_{f_1} \cdot A_{f_2}$ or $A_{f_1} \cdot B_{f_2}$) are zero since the components are orthogonal. Equivalently:

where $c_{f_1}$, $c_{f_2}$, and $c_{f_1+f_2}$ are the threshold nominal contrasts for the components $f_1$ and $f_2$ and the composite $f_1 + f_2$, respectively. Expressing the composite of Equation C4 in terms of its components using Equation C3, we have:
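The chain of steps leading to Equation C1 can be reconstructed from the definitions above. The following is a sketch: the lines are left unnumbered because their correspondence to Equations C2 through C5 is an assumption, and $D$ (the criterion distance shared by all conditions at threshold) is a symbol introduced here for illustration.

```latex
\begin{align*}
\| cA_{f_1+f_2} - cB_{f_1+f_2} \|^2
  &= c^2 \| (A_{f_1} - B_{f_1}) + (A_{f_2} - B_{f_2}) \|^2 \\
  &= c^2 \| A_{f_1} - B_{f_1} \|^2 + c^2 \| A_{f_2} - B_{f_2} \|^2
  && \text{(cross terms vanish by orthogonality)} \\[4pt]
D &= c_{f_1} \| A_{f_1} - B_{f_1} \|
   = c_{f_2} \| A_{f_2} - B_{f_2} \|
   = c_{f_1+f_2} \| A_{f_1+f_2} - B_{f_1+f_2} \|
  && \text{(equal distance at threshold)} \\[4pt]
\frac{D^2}{c_{f_1+f_2}^2}
  &= \frac{D^2}{c_{f_1}^2} + \frac{D^2}{c_{f_2}^2}
  \;\Longrightarrow\;
  CS_{f_1+f_2}^2 = CS_{f_1}^2 + CS_{f_2}^2 .
\end{align*}
```

The last line follows by substituting $\|A_{f} - B_{f}\| = D / c_{f}$ for each condition into the first identity evaluated without the common factor $c^2$, and then using $CS_f = 1/c_f$.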

**Corollary 1.1**: *Equation C1 holds if the additive input noise of the ideal observer is a multivariate Gaussian and not necessarily uncorrelated or white.*

Let $P$ be the “pre-whitening” matrix. We can follow the derivation of Equation C1 in Appendix C by replacing $A$ and $B$ by $PA$ and $PB$, respectively.

**Corollary 1.2**: *Equation C1 holds if a linear filter is placed at the front-end of an ideal observer (as in the CSF-limited ideal observer) before the input noise.*

If $F$ is a linear filter, we can derive Equation C1 by replacing $A$ and $B$ by $FA$ and $FB$, respectively, in Appendix C.

**Corollary 1.3**: *Equation C1 holds for an observer that is otherwise a Gaussian-noise ideal observer but with sampling efficiencies* $\eta_1$ *for* $f_1$ *and* $\eta_2$ *for* $f_2$*.*

$\eta$, then the effective Euclidean distance between a generic pair of letters is scaled by $\eta$. In the derivation of Appendix C, this scaling can be implemented by replacing $\|A_{f_1} - B_{f_1}\|$ by $\eta_1\|A_{f_1} - B_{f_1}\|$ and $\|A_{f_2} - B_{f_2}\|$ by $\eta_2\|A_{f_2} - B_{f_2}\|$, starting at Equation C5.

**Corollary 1.4**: *Equation C1 holds for an observer with a contrast nonlinear transducer situated after the input noise.*

**Claim 2**: *Equation C1 holds if a stimulus-dependent Gaussian noise source is added after the invariant input noise, such that the variance of this second noise source is proportional to the sum of the contrast energy of the signal and the variance of the invariant input noise. Such a noise source is often called a “multiplicative” noise. The invariant noise is commonly referred to as an “additive” noise. The additive and multiplicative noises are stochastically independent.*

Let $m$ be the proportionality constant that relates the variance of the multiplicative noise to the sum of the contrast energy and the variance of the additive noise. Let $a$ be the image area in units of pixels, $X$ be an arbitrary letter, and [.] denote expected value. The equality of effective Euclidean distances at threshold nominal contrast can be expressed as:

$m)\sigma^{2}$ and rearrange terms, we have:

$f_1$ and $f_2$ components are orthogonal to each other,

**Corollary 2.1**: *Equation C1 holds under supra-threshold conditions for an observer whose front-end consists of an additive “peripheral” noise, followed by a pixel-wise logarithmic compressive nonlinearity, followed by a second additive “central” noise, and is otherwise ideal.*

$d'$ in the case of a 2-way discrimination) between a pair of alternatives.

**Corollary 2.2**: *For an observer with an expansive nonlinearity in between the peripheral and central noise, the threshold contrast sensitivities to the components and the composite stimuli approach Equation C1 in supra-threshold conditions as stimulus contrast increases.*

$f_{\mathrm{low}} = 1.63$ and $f_{\mathrm{high}} = 6.51$ cpd in the fovea and $f_{\mathrm{low}} = 1.3$ and $f_{\mathrm{high}} = 5.2$ cpd in the periphery. The compound gratings were formed by combining (in sine phase) the corresponding component gratings according to the ratio of their detection thresholds, estimated from the subject's CSF. For example, the detection threshold in the periphery of ASN was 0.86% in Weber contrast for $f_{\mathrm{low}}$ and 1.92% for $f_{\mathrm{high}}$; as a result, the contrast ratio of the components in the composite was 0.86 ($f_{\mathrm{low}}$) to 1.92 ($f_{\mathrm{high}}$). Similar in principle to the definition used for letters, we define the nominal contrast of a component or the composite to be the corresponding Weber contrast of the $f_{\mathrm{low}}$ component in the composite.

^{1} We chose to analyze our results using the ideal-observer model of Chung et al. (2002) because, excluding the front-end CSF filter, the model is the white-noise ideal observer formulation for our octave-bandwidth letter stimuli. We note that the ideal-observer tuning function of Chung et al. (2002, Figure 10) is band-pass while that of Solomon and Pelli (1994, Figures 4c and 4d) is low-pass, even though both groups used a white-noise ideal observer. This discrepancy in the ideal-observer tuning functions is superficial since the two groups measured the tuning function in different units. Following the standard engineering conventions, Solomon and Pelli measured amplitude gain per unit *linear* bandwidth per unit radius. In contrast, staying native to the stimulus units, Chung et al. measured “letter sensitivity” (equivalent to amplitude gain) in units of per *octave* bandwidth per $2\pi$ radii. Chung et al. used octave units because they tested their human observers with octave-wide filtered letters. For Chung et al., the linear bandwidth decreases in proportion to decreasing spatial frequency. As a result, letter sensitivity also decreases with frequency, reaching zero at DC (since the linear bandwidth of a one-octave-wide stimulus at DC is zero). We can derive the exact formula equating the letter sensitivity $H(f)$ of Chung et al. to the amplitude gain $G(f)$ of Solomon and Pelli. Let

$f$) be the fraction of signal energy utilized by the ideal observer for the letter-identification task over all orientations and for all radial spatial frequencies less than $f$. The *gain* of Solomon & Pelli is equivalent to

*letter sensitivity* of Chung et al. is

$G(f)$ is low-pass. By Equation N3, the sensitivity of Chung et al., $H(f)$, should be bandpass, which is the case. Furthermore, Equation N3 suggests that the low-frequency falloff of $H(f)$ should approach a log–log slope of 1.0 at low radial frequencies, which is also the case (Chung et al., 2002, Figure 10).

^{2} BW's foveal LTF is peculiar since it straddles the 4-octave model LTF at the low frequencies and the 1-octave model LTF at the high frequencies. Our model used a single template bandwidth and underestimated the peak tuning frequency by 0.2 octave, an error that is smaller than but comparable to those observed in the foveal condition in Chung et al. (2002).

^{3} We observe that the integration index of the model with 2-octave-wide templates is very slightly but consistently lower than that of the observer model with templates matched to the signal (1 octave wide). Recall that a template for the composite condition is equal to the sum of the templates for the corresponding components. A component template that is 2 octaves wide (full width at half height) can “see” very little of the other signal component in the composite condition because the 1-octave signal components are two octaves apart. Thus, as far as the narrow-band signals are concerned, these 2-octave templates appear independent. However, the component templates see identical noise in the frequency range where they overlap and are not independent with respect to noise. As a result, the signal-to-noise ratio seen by a composite template is lower than the sum of the signal-to-noise ratios seen by the component templates, hence a slight reduction in integration index. The 2-octave templates are not special. If the 1-octave signal components were three octaves apart, then we would observe a slightly decreased integration index for models with template bandwidths between one and three octaves.