Psychometric functions were measured in a yes/no detection task, as a function of the amplitude of a horizontal 4-cpd raised-cosine windowed sinewave (wavelet) target in cosine phase (see Figure 1A). On each trial, a natural background was randomly sampled (without replacement) from a large set of natural backgrounds having two levels of amplitude-spectrum similarity \(S_A\) and five levels of image similarity \(S_{I|A}\). On half the trials the target was added to the center of the background. To obtain the backgrounds, we started with a large set of calibrated, high-resolution (4284 × 2844), 14-bit-per-color natural images that were converted to grayscale and cropped into background patches the size of the target. These background patches were then sorted into a three-dimensional histogram having 10 levels of luminance, 10 levels of RMS contrast, and 10 levels of similarity between the amplitude spectrum of the background and target (for more details see Sebastian et al., 2017 and https://natural-scenes.cps.utexas.edu). In the present experiment, all the backgrounds were sampled without replacement from bins having the same level of luminance and contrast, and two different levels of amplitude-spectrum similarity. For each level of amplitude-spectrum similarity the backgrounds were sorted into five bins (quintiles) of image similarity.
Amplitude-spectrum and image similarity are defined on the mean-subtracted target and background (i.e., the target and background are each set to have a mean of zero). The amplitude-spectrum similarity between the target and background was defined to be the cosine similarity between the amplitude spectra of the mean-subtracted target and background (the dot product of the two spectra divided by the product of their Euclidean norms):
\begin{eqnarray}S_A = \frac{\mathbf{A}_t}{\|\mathbf{A}_t\|} \cdot \frac{\mathbf{A}_B}{\|\mathbf{A}_B\|} \quad (1)\end{eqnarray}
The image similarity between target and background was defined to be the cosine similarity between the mean-subtracted target and background:
\begin{eqnarray}S_{I|A} = \frac{\mathbf{I}_t}{\|\mathbf{I}_t\|} \cdot \frac{\mathbf{I}_B}{\|\mathbf{I}_B\|} \quad (2)\end{eqnarray}
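The two similarity measures can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the amplitude spectrum is assumed here to be the magnitude of the unwindowed two-dimensional FFT of the mean-subtracted patch, a detail the text does not specify.

```python
import numpy as np

def cosine_similarity(x, y):
    """Dot product of x and y divided by the product of their Euclidean norms."""
    x = np.ravel(x).astype(float)
    y = np.ravel(y).astype(float)
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def amplitude_spectrum(image):
    """FFT magnitude of the mean-subtracted image (assumed: no windowing)."""
    image = np.asarray(image, dtype=float)
    return np.abs(np.fft.fft2(image - image.mean()))

def amplitude_spectrum_similarity(target, background):
    """S_A: cosine similarity between the two amplitude spectra."""
    return cosine_similarity(amplitude_spectrum(target),
                             amplitude_spectrum(background))

def image_similarity(target, background):
    """S_I|A: cosine similarity between the mean-subtracted images."""
    target = np.asarray(target, dtype=float)
    background = np.asarray(background, dtype=float)
    return cosine_similarity(target - target.mean(),
                             background - background.mean())
```

Because amplitude spectra are non-negative, the amplitude-spectrum similarity computed this way lies between 0 and 1, whereas the image similarity can be negative when the background is in opposite phase to the target.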
We use “|A” in the subscript to emphasize that this cosine similarity was computed conditional on the value of \(S_A\), even though the formula (Equation 2) does not directly depend on \(S_A\). The two levels of amplitude-spectrum similarity were 0.18 (low similarity) and 0.38 (high similarity). These two levels correspond to the second and ninth bins of the 10 bins in Sebastian et al. (2017). They were picked to be near the ends of the range in natural images, yet to have a large number of image patches. The five levels of image similarity were defined to be the midpoints of the quintiles of image similarity within each of the two amplitude-spectrum similarity bins (−0.15, −0.06, 0.00, 0.06, 0.15 for the high amplitude-spectrum similarity bin, and −0.05, −0.02, 0.00, 0.02, 0.05 for the low amplitude-spectrum similarity bin).
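The quintile sorting described above might be implemented along the following lines. This is a sketch under stated assumptions: `quintile_bins` is a hypothetical helper, and the "midpoint" of each quintile is taken here to be the 10th, 30th, 50th, 70th, and 90th percentile of the similarity values, which is only one possible reading of the text.

```python
import numpy as np

def quintile_bins(similarities):
    """Sort values into five equal-count bins.

    Returns integer bin labels (0..4) for each value and one
    representative similarity per bin.
    """
    s = np.asarray(similarities, dtype=float)
    # Interior bin edges at the 20th, 40th, 60th, and 80th percentiles.
    edges = np.quantile(s, [0.2, 0.4, 0.6, 0.8])
    labels = np.searchsorted(edges, s, side="right")
    # Assumed definition of the quintile midpoints (rank-based centers).
    centers = np.quantile(s, [0.1, 0.3, 0.5, 0.7, 0.9])
    return labels, centers
```

In use, the function would be applied separately to the image-similarity values of the patches in each of the two amplitude-spectrum similarity bins.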
At the beginning of each trial, the central fixation cue was displayed for 750 ms and was then extinguished for 250 ms. Next, the stimulus was displayed for 250 ms, followed by a response interval of one second. Feedback was given on each trial. The displayed background patches had a diameter of 516 pixels (4.3°) and, hence, included the context region surrounding the smaller 96-pixel diameter background patch (0.8°) used to determine the luminance, contrast and similarities in the target region. The stimuli were displayed at 120 pixels/deg, on a Sony GDM-FW900 CRT Monitor with a total background size of 19.2° × 12°. The luminance of the screen outside the background patch was set to the mean luminance of the background patches, which was always 50 cd/m². The target was present on half of the trials. When present, the target appeared in the center of the background. The amplitude \(a\) of the target was defined to be the square root of the sum of the squared pixel values (the square root of the target energy). For plotting convenience, we divided the actual RMS amplitude by 97.8; thus an amplitude of 1 in the plots corresponds to 97.8.
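To make the amplitude definition concrete, the sketch below builds a 96-pixel target at 120 pixels/deg and scales it to a requested amplitude. Several details are assumptions rather than facts from the text: the exact form of the raised-cosine window, the reading of "horizontal" as horizontal bars (modulation along the vertical axis), and the function names `make_target` and `target_amplitude`.

```python
import numpy as np

PIXELS_PER_DEG = 120      # display resolution stated in the text
TARGET_DIAMETER_PX = 96   # 0.8 deg target region stated in the text
FREQ_CPD = 4.0            # target spatial frequency stated in the text
PLOT_SCALE = 97.8         # divisor applied to amplitudes for plotting

def make_target(amplitude=1.0):
    """Windowed cosine-phase grating whose root energy equals `amplitude`."""
    half = TARGET_DIAMETER_PX / 2.0
    # Pixel-center coordinates in degrees, symmetric about zero.
    coords = (np.arange(TARGET_DIAMETER_PX) - half + 0.5) / PIXELS_PER_DEG
    x, y = np.meshgrid(coords, coords)
    r = np.hypot(x, y)
    radius = half / PIXELS_PER_DEG
    # Assumed window: 0.5 * (1 + cos(pi * r / R)) inside radius R, else 0.
    window = np.where(r <= radius, 0.5 * (1.0 + np.cos(np.pi * r / radius)), 0.0)
    # Assumed orientation: horizontal bars, so the carrier varies along y.
    carrier = np.cos(2.0 * np.pi * FREQ_CPD * y)
    shape = window * carrier
    return amplitude * shape / np.sqrt(np.sum(shape ** 2))

def target_amplitude(target):
    """Square root of the sum of squared pixel values (root target energy)."""
    return float(np.sqrt(np.sum(np.square(target))))
```

Under these assumptions, `target_amplitude(make_target(97.8)) / PLOT_SCALE` evaluates to 1 up to floating-point error, which is the plotted amplitude corresponding to an actual amplitude of 97.8.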
For presentation, the 14-bit gray-scale images were clipped to the upper ninety-ninth percentile gray level, gamma-compressed based on the measured gamma of the monitor, and then quantized to the range of 0–255 gray levels. The linear relationship between the desired and displayed luminance was verified with a photodiode following the calibration procedure.
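The presentation steps (clip, gamma-compress, quantize) can be illustrated as follows. This is only a sketch: the gamma value of 2.2 is a placeholder because the measured gamma is not reported here, the display is assumed to follow a simple power law, and the percentile is assumed to be computed per image. `prepare_for_display` is a hypothetical name.

```python
import numpy as np

def prepare_for_display(image, gamma=2.2, percentile=99.0):
    """Clip, gamma-compress, and quantize a linear image to 8-bit gray levels."""
    img = np.asarray(image, dtype=float)
    # Clip at the chosen upper percentile of the gray levels.
    ceiling = np.percentile(img, percentile)
    img = np.clip(img, 0.0, ceiling) / ceiling   # normalized to [0, 1]
    # Inverse of the assumed display power law, so displayed luminance
    # is approximately linear in the original pixel values.
    encoded = img ** (1.0 / gamma)
    # Quantize to the 0-255 range.
    return np.round(encoded * 255.0).astype(np.uint8)
```

Raising the quantized output (divided by 255) to the power `gamma` recovers the clipped linear image up to quantization error, which is the property a photodiode check of linearity would confirm on a real display.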
Figure 2 shows examples of the stimuli from the two levels of amplitude spectrum similarity and the two extreme levels of image similarity (first and fifth quintiles).
Psychometric functions were measured on three human observers who all had normal or corrected-to-normal spatial vision. Each observer's head was stabilized with a chin and head rest. For each amplitude-spectrum similarity there were 2000 trials spread over four sessions (10 target amplitudes × 50 trials × 4 sessions = 2000 trials, or 400 trials per image-similarity quintile). Target amplitude and amplitude-spectrum similarity were blocked, whereas image similarity was unblocked. For each target amplitude within a quintile, the numbers of hits and correct rejections were converted to a value of discriminability \(d^{\prime}\) and criterion \(\gamma\), using the standard formulas from signal-detection theory. The values of \(d^{\prime}\) were then converted to a value of \(PC_{\max}\) (the percent correct if the criterion were placed optimally) using the standard formula \(PC_{\max} = \Phi(d^{\prime}/2)\). Finally, thresholds were estimated by maximum-likelihood fitting the values of \(PC_{\max}\) with a generalized cumulative Gaussian function:
\begin{eqnarray}PC_{\max}\left(a \mid \alpha, \beta\right) = \Phi\left[\frac{1}{2}\left(\frac{a}{\alpha}\right)^{\beta}\right] \quad (3)\end{eqnarray}
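We define threshold to be the target's RMS amplitude \(a\) at which \(d^{\prime} = 1\). Comparing the fitted function with \(PC_{\max} = \Phi(d^{\prime}/2)\) gives \(d^{\prime} = (a/\alpha)^{\beta}\), so \(d^{\prime} = 1\) when \(a = \alpha\), and the estimated threshold is simply the estimated value of \(\alpha\).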
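The analysis chain from response counts to threshold can be sketched with SciPy. The code below is illustrative rather than the authors' implementation, and it makes several labeled assumptions: the criterion is measured from the mean of the target-absent distribution (other conventions exist), extreme rates are clipped by half a trial to keep z-scores finite, and the likelihood treats each \(PC_{\max}\) value as a binomial proportion weighted by a trial count. All function names are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def dprime_and_criterion(hits, n_present, correct_rejections, n_absent):
    """Equal-variance Gaussian signal-detection estimates."""
    # Assumed correction: keep rates away from 0 and 1.
    p_hit = np.clip(hits / n_present, 0.5 / n_present, 1 - 0.5 / n_present)
    p_cr = np.clip(correct_rejections / n_absent,
                   0.5 / n_absent, 1 - 0.5 / n_absent)
    d_prime = norm.ppf(p_hit) + norm.ppf(p_cr)  # equals z(hit) - z(false alarm)
    criterion = norm.ppf(p_cr)                  # assumed convention for gamma
    return d_prime, criterion

def pc_max(d_prime):
    """Proportion correct with an optimally placed criterion."""
    return norm.cdf(d_prime / 2.0)

def model_pc(a, alpha, beta):
    """Generalized cumulative Gaussian: Phi[0.5 * (a / alpha) ** beta]."""
    return norm.cdf(0.5 * (np.asarray(a, dtype=float) / alpha) ** beta)

def fit_threshold(amplitudes, pc_values, n_trials):
    """Maximum-likelihood estimates of alpha (threshold) and beta."""
    a = np.asarray(amplitudes, dtype=float)
    pc = np.asarray(pc_values, dtype=float)
    n = np.asarray(n_trials, dtype=float)

    def neg_log_likelihood(log_params):
        alpha, beta = np.exp(log_params)  # log scale keeps both positive
        p = np.clip(model_pc(a, alpha, beta), 1e-9, 1 - 1e-9)
        return -np.sum(n * (pc * np.log(p) + (1 - pc) * np.log(1 - p)))

    start = np.log([np.median(a), 1.0])
    result = minimize(neg_log_likelihood, start, method="Nelder-Mead")
    alpha, beta = np.exp(result.x)
    return alpha, beta
```

With the design described above, each target amplitude within a quintile would contribute one \(PC_{\max}\) value, and the returned `alpha` is the amplitude at which the fitted \(d^{\prime}\) equals 1.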