Detection of target objects in the surrounding environment is a common visual task. There is a vast psychophysical and modeling literature concerning the detection of targets in artificial and natural backgrounds. Most studies involve detection of additive targets or of some form of image distortion. Although much has been learned from these studies, the targets that most often occur under natural conditions are neither additive nor distorting; rather, they are opaque targets that occlude the backgrounds behind them. Here, we describe our efforts to measure and model detection of occluding targets in natural backgrounds. To systematically vary the properties of the backgrounds, we used the constrained sampling approach of Sebastian, Abrams, and Geisler (2017). Specifically, millions of calibrated gray-scale natural-image patches were sorted into a 3D histogram along the dimensions of luminance, contrast, and phase-invariant similarity to the target. Eccentricity psychometric functions (accuracy as a function of retinal eccentricity) were measured for four different occluding targets and 15 different combinations of background luminance, contrast, and similarity, with a different randomly sampled background on each trial. The complex pattern of results was consistent across the three subjects, and was largely explained by a principled model observer (with only a single efficiency parameter) that combines three image cues (pattern, silhouette, and edge) and four well-known properties of the human visual system (optical blur, blurring and downsampling by the ganglion cells, divisive normalization, intrinsic position uncertainty). The model also explains the thresholds for additive foveal targets in natural backgrounds reported in Sebastian et al. (2017).

Specifically, the natural backgrounds were sorted along the dimensions of luminance (*L*), contrast (*C*), and the spatial similarity (defined later) to the target (*S*). Natural target objects often have sharp boundaries and contain one or a few dominant orientations; therefore, we measured detection performance in natural backgrounds for four targets — vertical edge, horizontal edge, bowtie (oriented to have only horizontal and vertical edges), and spot (center-surround) — all having the same mean luminance (see Figure 1).

Eccentricity psychometric functions were measured along each background dimension (*L*, *C*, and *S*), while the other two dimensions were held at their median values. An eccentricity psychometric function describes accuracy (hits and false alarms) as a function of the retinal eccentricity of the target. We chose to measure eccentricity psychometric functions for two reasons. First, for occluding targets it is often not possible to measure thresholds by varying target luminance or contrast, because performance is always well above chance. Second, when looking for a target under natural conditions, its detectability varies only when the fixation location varies. In other words, eccentricity psychometric functions are an appropriate measure of detectability under natural conditions.

Each background patch was characterized by three statistics: the mean luminance *L* of the patch, the root-mean-squared (RMS) contrast *C* of the patch, and the cosine similarity *S* between the amplitude spectrum of the patch and that of the target. The cosine similarity is the dot product of the normalized amplitude spectra of the patch and target. This measure captures similarity in orientation and spatial frequency, and is independent of the phase spectra of the patch and target. Because the similarity measure depends on the specific target, similarity was computed separately for each of the four targets. The mathematical definitions of the patch statistics are given in the Appendix.
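The similarity computation can be sketched in a few lines. This is a minimal version under stated assumptions: the mean (DC) term is removed before computing the amplitude spectra, and the raised-cosine window that the paper applies to the target (see Appendix) is omitted; the function name is ours.

```python
import numpy as np

def cosine_similarity(patch, target):
    """Similarity S: dot product of the unit-normalized Fourier amplitude
    spectra of patch and target.  Depends on orientation and spatial-frequency
    content but not on phase.  Removing the mean (DC) first is our assumption;
    the paper's exact definition (with a raised-cosine window on the target)
    is given in the Appendix."""
    a_patch = np.abs(np.fft.fft2(patch - patch.mean()))
    a_target = np.abs(np.fft.fft2(target - target.mean()))
    a_patch /= np.linalg.norm(a_patch)
    a_target /= np.linalg.norm(a_target)
    return float(np.sum(a_patch * a_target))
```

Because amplitude spectra are nonnegative and normalized to unit length, *S* always lies between 0 and 1, and a patch is maximally similar to itself.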

On target-present trials, the target was placed in the center of the background patch, with its mean luminance fixed at 17.8 cd/m² and its contrast at 33% RMS. On target-absent trials, no change was made to the background. The images were then gamma-compressed based on the measured gamma of the monitor (Sony GDM-FW900) and quantized to 8-bit precision (maximum gray level = 97 cd/m²). The images were presented at a display resolution of 60 pixels per degree (full display size 1920 × 1200 pixels).

where *e* is the eccentricity, *d*′_f is the detectability in the center of the fovea, β is a steepness parameter, γ is a bias parameter, *e*_2 is the eccentricity at which detectability reaches half max (\(d^{\prime} = {{{{d^{\prime}}_f}}/ 2}\)), and Φ( · ) is the standard normal integral function. An eccentricity threshold was defined for each bin as the eccentricity corresponding to a detectability of 1.0 (bias-corrected accuracy of 69%). The parameters of each fitted psychometric function were obtained by maximizing likelihood (see Appendix). Standard errors of the thresholds were computed by bootstrap resampling.
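The core of the psychometric-function form and the threshold definition can be written down directly. The sketch below (function names are ours) implements the detectability-vs-eccentricity equation and solves it for the eccentricity threshold at *d*′ = 1; the bias parameter γ and the likelihood fitting are omitted.

```python
import math

def dprime(e, dprime_f, beta, e2):
    """d'(e) = d'_f * e2^beta / (e^beta + e2^beta): detectability as a
    function of retinal eccentricity e.  d'(0) = d'_f; d'(e2) = d'_f / 2."""
    return dprime_f * e2**beta / (e**beta + e2**beta)

def accuracy(d):
    """Bias-corrected proportion correct for detectability d: Phi(d / 2)."""
    return 0.5 * (1.0 + math.erf(d / 2.0 / math.sqrt(2.0)))

def eccentricity_threshold(dprime_f, beta, e2, criterion=1.0):
    """Eccentricity where d'(e) falls to `criterion` (criterion = 1.0 is the
    ~69%-correct threshold used here); requires d'_f > criterion."""
    return e2 * (dprime_f / criterion - 1.0) ** (1.0 / beta)
```

Setting d′(e) equal to the criterion and solving gives the closed form above, so no numerical search is needed; note that Φ(0.5) ≈ 0.69, which is where the 69% figure comes from.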

*LCS* space. The symbols in Figure 4A show example eccentricity psychometric functions (bias corrected) measured for the three observers for the four targets. The curves show the fits of Equations 1 and 2. As can be seen, the psychometric functions are similar across observers. The black symbols in Figures 4B–D show the average eccentricity thresholds for the three observers along the dimensions of luminance, contrast, and similarity, respectively. The blue symbols (with different saturation) show the thresholds for the individual observers. Note that low thresholds correspond to conditions where the target is less visible in the periphery (low sensitivity).

Along the luminance dimension, eccentricity thresholds were lowest in the third and fourth (21.5 cd/m²) luminance bins; both bins have a background luminance near the mean luminance of the target (17.8 cd/m²). Along the contrast dimension, eccentricity thresholds decreased monotonically with background RMS contrast: they were highest at low contrast and saturated to a minimum by the highest contrast bin tested (0.81 RMS). The dominant effect of similarity was a decrease in eccentricity threshold with increasing similarity.

Model-observer responses were computed for each *LCS* bin tested in the behavioral experiment, and for each of six retinal eccentricities (see below). The image size was the same as in the behavioral experiment (4 degrees). A duplicate of each image was created and a target was placed in its center. The cropping procedure was identical to that used to create the experimental stimuli.

*d*′) varies continuously as a function of retinal location. However, it is not practical to compute the model observer's detectability for every retinal eccentricity; therefore, we computed detectability at the six retinal eccentricities where ganglion-cell sampling decreased by successive factors of two. We then fit the model observer's detectabilities with the same smooth function used to fit the human observers’ detectabilities (see Equations 1 and 2): \(d^{\prime}( e ) = {d^{\prime}_f}{{e_2^\beta } / {( {{e^\beta } + e_2^\beta } )}}\). From these smooth functions (which fit very well) we were then able to compute the predicted detectabilities at all eccentricities, as well as the predicted eccentricity thresholds.

We first blur the stimulus with the optical point-spread function, *h*_o(**x**), of the human eye when the pupil has a diameter of 4 mm (Watson & Yellott, 2012; Watson, 2013). We call this blurred image the level-0 image:

\({I_0}({\bf{x}}) = {h_o}({\bf{x}}) * I({\bf{x}})\)

where **x** = (*x*, *y*) is a pixel location, and \(*\) represents the operation of convolution. (Note that here bold letters represent vectors.)
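As a sketch of the level-0 computation, the code below convolves an image with a Gaussian approximation to the optical point-spread function via the FFT. The Gaussian PSF is our simplifying assumption (the model itself uses the 4-mm-pupil PSF derived from Watson, 2013), and the function name is ours.

```python
import numpy as np

def level0_image(image, sigma_o):
    """Level-0 image I0(x) = (h_o * I)(x): the stimulus convolved with the
    eye's optical point-spread function.  Here h_o is approximated by an
    isotropic Gaussian with standard deviation sigma_o pixels (an assumption;
    the model uses the 4-mm-pupil PSF from Watson, 2013).  The convolution is
    done in the frequency domain (circular boundary handling)."""
    fy = np.fft.fftfreq(image.shape[0])
    fx = np.fft.fftfreq(image.shape[1])
    gy, gx = np.meshgrid(fy, fx, indexing="ij")
    # Fourier transform of a unit-area spatial Gaussian with std sigma_o:
    H = np.exp(-2.0 * np.pi**2 * sigma_o**2 * (gx**2 + gy**2))
    return np.real(np.fft.ifft2(np.fft.fft2(image) * H))
```

Because the kernel has unit area (H = 1 at zero frequency), the blur preserves mean luminance, which matters when the local luminance statistic *L* is computed later.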

Each level *r* of the pyramid is blurred and down-sampled by a factor of \({2^{r - 1}}\); thus, the same blurring and down-sampling is also applied to the pattern template *t*(**x**) and the silhouette template *s*(**x**) for each target (see the next two subsections).

Each target *T*(**x**) is the sum of its mean luminance *l* and a pattern of luminance modulation about that mean, *t*(**x**): \(T({\bf{x}}) = t({\bf{x}}) + l\), where *t*(**x**) sums to zero (see Appendix). At each pyramid level, the pattern template is scaled to unit length, ‖**t**_r‖ = 1. Figure 5 (left) illustrates the unscaled pattern template for the vertical-edge target, before blurring and down-sampling.
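A zero-sum, unit-norm pattern template for an idealized vertical-edge target might be constructed as follows. This is a schematic sketch (the actual target definitions are given in the Appendix), with names of our own choosing.

```python
import numpy as np

def vertical_edge_pattern_template(size=21, rho=10):
    """Schematic pattern template t(x) for a vertical-edge target: +1 in the
    left half of a disk of radius rho, -1 in the right half, 0 outside and on
    the midline.  t(x) sums to zero; the matching template is then scaled to
    unit length (||t|| = 1)."""
    c = size // 2
    y, x = np.mgrid[:size, :size] - c
    inside = x**2 + y**2 <= rho**2
    t = np.where(inside, np.sign(-x), 0).astype(float)
    t /= np.linalg.norm(t)
    return t
```

The left-right antisymmetry guarantees the zero-sum property, and the final division enforces the unit-length constraint required for the template response.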

where 1_T(**x**) (an indicator function) is 1.0 in the target region and 0.0 elsewhere, and 1_T′(**x**) is 1.0 over an expanded target region that preserves the shape of the target as closely as possible but contains exactly twice as many pixels; thus *s*(**x**) sums to 0.0. In the present case, the silhouette template has a circular center-minus-surround structure, where the center, 1_T(**x**), has the diameter of the target and the surround, 1_T′(**x**), has a diameter that is \(\sqrt 2 \) larger. The silhouette-template response to the stimulus at pyramid level *r* is obtained by taking the dot product of the blurred and down-sampled silhouette template with the blurred and down-sampled image on that trial, with the template scaled to unit length, ‖**s**_r‖ = 1. Figure 5 (middle) illustrates the unscaled silhouette template for all four targets, before blurring and down-sampling.
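The center-minus-surround construction can be sketched as follows. Because discrete disks do not contain exactly twice as many pixels, this sketch (names ours) chooses the surround weight numerically so that *s*(**x**) sums to zero; the final unit-length normalization makes the overall scale irrelevant anyway.

```python
import numpy as np

def silhouette_template(size=31, rho=10):
    """Unscaled silhouette template s(x): a center-minus-surround indicator
    built from 1_T (disk of radius rho) and 1_T' (disk with sqrt(2)-larger
    diameter, about twice the pixel count), weighted so that s(x) sums to 0.
    The template is then scaled to unit length (||s|| = 1)."""
    c = size // 2
    y, x = np.mgrid[:size, :size] - c
    r2 = x**2 + y**2
    center = (r2 <= rho**2).astype(float)        # 1_T
    surround = (r2 <= 2 * rho**2).astype(float)  # 1_T' (sqrt(2) larger radius)
    s = center - (center.sum() / surround.sum()) * surround  # forces sum = 0
    s /= np.linalg.norm(s)
    return s
```

The zero-sum property makes the template insensitive to uniform background luminance, so its response isolates the luminance step at the target's silhouette.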

where \(\nabla {{\bf{I}}_r}({\bf{x}})\) is the luminance gradient of the level-*r* image, and \({\bf{n}}_r^ \bot \,({\bf{x}} )\) is the unit vector perpendicular to the boundary. The boundary pixel locations are defined in the Appendix. The gradients were computed using a pair of orthogonally oriented derivative-of-Gaussian filters with a standard deviation σ_r matched to the center size of ganglion-cell receptive fields at the given level of the pyramid. Derivative-of-Gaussian filters are steerable (Freeman & Adelson, 1991), whereby the gradient in any direction can be determined from the outputs of the pair of orthogonal filters. Thus, the gradient at each boundary pixel **x** was determined from the responses of the pair of filters, where σ_o is the standard deviation of the Gaussian approximation to the optical point-spread function (recall that the standard deviation of the blur kernel at each pyramid level is 1). We defined the unit vector perpendicular to the boundary to be the unit gradient vector calculated for a uniform target on a uniform background. The local luminance and RMS contrast were computed under a Gaussian envelope having the same standard deviation (σ_r) used to compute the gradients. When the target is present, the gradient tends to be normal to the boundary at location **x**, and hence the magnitudes of the dot products in Equation 11 tend to be larger. Figure 5 (right) shows an example of the boundary pixels and a unit normal vector.
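A minimal implementation of the derivative-of-Gaussian gradient pair (our own separable-convolution sketch, without the ganglion-cell σ_r matching):

```python
import numpy as np

def dog_gradient(image, sigma):
    """Gradient from a pair of orthogonal derivative-of-Gaussian filters.
    The pair is steerable (Freeman & Adelson, 1991): the derivative along any
    unit vector n is the dot product (gx, gy) . n, so the edge cue's boundary
    responses need only these two maps."""
    half = int(np.ceil(3 * sigma))
    xs = np.arange(-half, half + 1, dtype=float)
    g = np.exp(-xs**2 / (2 * sigma**2))
    g /= g.sum()
    dg = -xs / sigma**2 * g  # derivative of the 1-D Gaussian

    def sep(img, row_k, col_k):
        # Separable convolution: row_k along rows (x), col_k along columns (y).
        out = np.apply_along_axis(lambda r: np.convolve(r, row_k, mode="same"), 1, img)
        return np.apply_along_axis(lambda c: np.convolve(c, col_k, mode="same"), 0, out)

    gx = sep(image, dg, g)   # horizontal derivative, vertical smoothing
    gy = sep(image, g, dg)   # horizontal smoothing, vertical derivative
    return gx, gy
```

For a vertical luminance edge, gx is large and gy is essentially zero, so the dot product with a horizontal unit normal recovers the full gradient magnitude, which is the behavior the edge cue relies on.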

where *d*′_0 is the detectability without uncertainty and *u* is the uncertainty constant. This formula is an approximation that holds quite accurately for all the cases we have simulated so far. Assuming that the uncertainty standard deviation is proportional to ganglion-cell spacing, we used the measurements of errors in target localization as a function of retinal location from Michel and Geisler (2011) to estimate that the uncertainty standard deviation in the center of the fovea is approximately 5 arc minutes (10 times the foveal ganglion-cell spacing). We then assumed that the standard deviation was the same number of ganglion cells at each level of the pyramid (i.e., 10 times the ganglion-cell spacing corresponding to that level of the pyramid). Finally, we measured simulated psychometric functions (for detection of additive targets in 1/f noise) at each pyramid level and fit them with Equation 13 to estimate the value of the uncertainty constant. The estimated values of *u* for levels 1 through 6 are 3.64, 6.96, 12.86, 25.11, 39.63, and 50.01.

where *R* represents either the pattern-template response *R*_P (Equation 8), the silhouette-template response *R*_S (Equation 10), or the edge-energy response *R*_E (Equation 11). However, we note that for the edge-energy response the normalization is only by local luminance and contrast. The optimal decision variable is the log likelihood ratio of the normalized response, given the response means and standard deviations estimated from thousands of target-present and target-absent trials in the given background bin. The responses are approximately Gaussian distributed, but with different means and standard deviations for target present and target absent. Thus, on both target-present and target-absent trials, the decision variable has (approximately) a generalized chi-squared distribution. To determine the detectability without intrinsic position uncertainty, *d*′_0, we integrate the generalized chi-squared distributions on those sides of the optimal decision criterion (bound) corresponding to errors to obtain the error rate *p*_e, which is then converted to detectability using the standard formula *d*′_0 = 2Φ^(−1)(1 − *p*_e), where Φ^(−1) is the inverse of the standard normal integral function. (Code for integrating generalized chi-squared distributions is available at https://github.com/abhranildas/classify.) Finally, we include the effect of intrinsic position uncertainty at the eccentricity corresponding to the given level of the pyramid to obtain the predicted detectability *d*′_r.
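The conversion from the ideal observer's error rate to detectability is a one-liner. The sketch below implements *d*′_0 = 2Φ^(−1)(1 − *p*_e) with a stdlib-only inverse-normal (bisection on `math.erf`), leaving the generalized chi-squared integration itself to the linked toolbox; function names are ours.

```python
import math

def phi_inv(p):
    """Inverse standard normal CDF, by bisection on erf (stdlib only)."""
    lo, hi = -10.0, 10.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if 0.5 * (1.0 + math.erf(mid / math.sqrt(2.0))) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def dprime_from_error_rate(p_e):
    """d'_0 = 2 * Phi^{-1}(1 - p_e): converts the error rate obtained by
    integrating the generalized chi-squared distributions past the optimal
    criterion into a detectability index."""
    return 2.0 * phi_inv(1.0 - p_e)
```

At chance performance (*p*_e = 0.5) this gives *d*′_0 = 0, and the mapping grows steeply as the error rate shrinks, which is why small error-rate differences in easy conditions correspond to large detectability differences.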

The model observer's detectability without position uncertainty, *d*′_0, is shown as a function of the three background dimensions (rows), at three different eccentricities (columns), for each cue (colors). For all three cues and eccentricities, detectability decreases monotonically with background contrast and similarity. However, as a function of background luminance, detectability for the edge and silhouette cues decreases to a minimum at approximately the luminance of the target and then increases. When the background luminance is near the target luminance, the pattern cue tends to provide the most useful information; when the background luminance is very different from the target luminance, the edge and silhouette cues tend to provide the most information.

where **u**_p, **u**_a, **Σ**_p, and **Σ**_a are the mean vectors and covariance matrices of the target-present and target-absent response distributions. This decision variable, for both target present and target absent, also has (approximately) a generalized chi-squared distribution, and hence detectabilities can be computed in the same way as for the single cues. The drawback of this approach is that the covariance matrices must be estimated for all conditions. In addition, one might wonder whether such a representation and computation is biologically plausible. However, we note that it is not implausible that the visual system has at least some implicit knowledge of the approximate correlations of the three cues at different retinal locations.
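The optimal combination rule is the log likelihood ratio under two Gaussians with different means and covariances, which is quadratic in the response vector (hence generalized chi-squared). A sketch, with names of our own choosing:

```python
import numpy as np

def llr_decision_variable(r, u_p, u_a, S_p, S_a):
    """Log likelihood ratio (target present vs. absent) for a vector of cue
    responses r under Gaussian assumptions with unequal covariances.  The
    quadratic terms make this a generalized chi-squared variable under either
    hypothesis, so detectability can be computed as for the single cues."""
    dp = r - u_p
    da = r - u_a
    quad = 0.5 * (da @ np.linalg.solve(S_a, da) - dp @ np.linalg.solve(S_p, dp))
    logdet = 0.5 * (np.log(np.linalg.det(S_a)) - np.log(np.linalg.det(S_p)))
    return quad + logdet
```

When the covariances happen to be equal, the quadratic terms cancel and the rule reduces to a linear weighting of the cues; the quadratic form is what captures knowledge of the cue correlations.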

*d*′_0. The black symbols and curves in Figures 8A–C show the predicted thresholds for independent cue combination (Equation 15, η = 0.73), and the blue symbols (and error bars) show the thresholds (and confidence intervals) of each subject in each condition (from Figures 4B–D).

(*R*² = 0.59). The efficiency parameter was estimated separately to obtain the best fit for each version of the model.

*L* × *C* × *S*. Detectability goes down as the standard deviation of the cue responses on target-absent trials increases.

*a*_t ∝ *L* × *C* × *S*, and that this behavior was accurately predicted with only the pattern-template cue together with divisive normalization by the product of background luminance, contrast, and similarity. This result is consistent with the present study because in the Sebastian et al. experiment the silhouette and edge cues provide no information (and hence could be down-weighted in the model observer), and because intrinsic position uncertainty has a minimal effect in the fovea. In addition, it is easy to show, using Equation 13, that position uncertainty affects only overall efficiency, not the prediction of separable multidimensional Weber's law (see Appendix). In short, the Sebastian et al. model for additive targets is a special case of the current model observer for occluding targets.

*Journal of Vision*, 14(8):22, 1–38.

*Visual Neuroscience*, 7, 531–546.

*Journal of Vision*, 10(2):23, 1–15.

*Journal of Vision*, 9(10):1, 1–19.

*Nature*, 226, 177–178.

*Journal of Vision*, 14(12):22, 1–22.

*The Handbook of Medical Image Perception and Techniques, Second Edition*. Cambridge, UK: Cambridge University Press.

*Science*, 214, 93–94.

*Perception and Psychophysics*, 39, 87–95.

*Nature Reviews Neuroscience*, 13, 51–62.

*Journal of Comparative Neurology*, 300(1), 5–25.

*Journal of Comparative Neurology*, 292(4), 497–523.

*Journal of Vision*, 18(10), 549.

*Vision Research*, 47, 2901–2911.

*Journal of Neuroscience*, 27(6), 1266–1270.

*Annual Review of Neuroscience*, 30, 1–30.

*Nature Neuroscience*, 14(9), 1195–1201.

*IEEE Transactions on Pattern Analysis and Machine Intelligence*, 13(9), 891–906.

*Annual Review of Psychology*, 59, 167–192.

*Vision Research*, 51, 771–781.

*Journal of Vision*, 18(2):1, 1–10.

*Signal Detection Theory and Psychophysics*. New York, NY: Wiley & Sons.

*Computational Models of Visual Perception* (pp. 119–133). Cambridge, MA: The MIT Press.

*Visual Neuroscience*, 9, 191–197.

*Annual Review of Psychology*, 49, 503–535.

*Zweite Mittlg. S. B. Preuss. Akad. Wiss.*, p. 641.

*Journal of the Optical Society of America*, 70, 1458–1471.

*Vision Research*, 48(5), 635–654.

*Vision: A Computational Investigation into the Human Representation and Processing of Visual Information*. New York, NY: W. H. Freeman and Company.

*Journal of Vision*, 11(1):18, 1–18.

*Journal of General Physiology*, 34, 463–474.

*Signal Processing: Image Communication*, 17(10), 807–823.

*Journal of the Optical Society of America A*, 2, 1508–1532.

*Spatial Vision*, 10, 437–442.

*Nature Neuroscience*, 11, 1129–1135.

*Transactions of the IRE Professional Group on Information Theory*, 4, 171–212.

*Vision Research*, 37, 3225–3235.

*Proceedings of the National Academy of Sciences*, 114(28), E5731–E5740.

*Proceedings of the National Academy of Sciences*, https://www.pnas.org/content/117/47/29363.

*Animal Camouflage: Mechanisms and Function*. Cambridge, UK: Cambridge University Press.

*Journal of the Optical Society of America*, 62, 1221–1232.

*Perception & Psychophysics*, 29, 521–534.

*Journal of Vision*, 12(7):6, 1–19.

*Journal of Vision*, 13(6):18, 1–11.

*Journal of Vision*, 14(7):14, 1–17.

*Journal of Vision*, 12(10):12, 1–16.

*Nature*, 341, 643–646.

*Trends in Cognitive Sciences*, 15(4), 160–168.

*Vision Research*, 23(9), 873–882.

*Visual Neuroscience*, 26, 93–108.

where **x** = (*x*, *y*), ρ defines the radius of the target region and was set to 10 pixels (at 60 pixels/deg), and δ defines the radius of the interior circular region of the target. The radius of the interior region of the spot was seven pixels. All the target patterns satisfy the property that \(\sum\limits_{\bf{x}} {t({\bf{x}})} = 0\).

Each target is obtained by adding the mean luminance *l* to the pattern: *T*(**x**) = *t*(**x**) + *l*.

The similarity statistic is the cosine similarity between the Fourier amplitude spectrum of the target, *A*_T(*u*, *v*), and the Fourier amplitude spectrum of the patch, *A*_I(*u*, *v*), where *u* and *v* are the horizontal and vertical spatial frequencies. In other words, *S* is the dot product of the amplitude spectra, represented as vectors normalized to a length of 1.0. To prevent artifacts in the Fourier transform, the amplitude spectrum of the target was obtained by first windowing the target with a circular aperture having a raised-cosine ramp width at the edge of two pixels:

Let the center of the target region at level 1 be (*x*_1, *y*_1) and the radius (in pixels) of the target region be ρ_1; then the center and the radius at level *r* are given by \(( {{x_r},{y_r}} ) = ( {{{{x_1}} / {{2^{r - 1}}}},{{{y_1}} / {{2^{r - 1}}}}} )\), and \({\rho _r} = {{{\rho _1}} / {{2^{r - 1}}}}\). For each image pixel location (*x*_i, *y*_i), the direction of the pixel from the center is θ_i = atan2(*y*_i − *y*_r, *x*_i − *x*_r), and the real-valued location on the boundary (*x*, *y*) is given by *x* = ρ_r cos θ_i + *x*_r and *y* = ρ_r sin θ_i + *y*_r. An image pixel location is defined to be a boundary location if \(\sqrt {{{( {y - {y_i}} )}^2} + {{( {x - {x_i}} )}^2}} < 0.5\).
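The boundary-pixel rule translates directly to code (a sketch; the function name and array conventions are our own):

```python
import numpy as np

def boundary_pixels(shape, center, rho):
    """Boundary-pixel mask for a circular target of radius rho centered at
    (x_r, y_r).  A pixel (x_i, y_i) is a boundary location if the real-valued
    point on the circle in its direction,
        x = rho*cos(theta_i) + x_r,  y = rho*sin(theta_i) + y_r,
    with theta_i = atan2(y_i - y_r, x_i - x_r), lies within 0.5 pixels."""
    xr, yr = center
    ys, xs = np.mgrid[:shape[0], :shape[1]]
    theta = np.arctan2(ys - yr, xs - xr)
    bx = rho * np.cos(theta) + xr
    by = rho * np.sin(theta) + yr
    return np.hypot(by - ys, bx - xs) < 0.5
```

Because the circle point in a pixel's direction is also the nearest circle point, the rule selects exactly the pixels whose distance from the center is within 0.5 pixels of ρ, giving a one-pixel-wide ring.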

where *n* defines the number of bins, and *x*_min and *x*_max define the minimum and maximum of the lower and upper bins. The bin boundaries are determined from the spacing rule, where *i* is the index of the *i*th bin. The values of *x*_min and *x*_max were set to the 5th and 95th percentiles of the scene-statistics distributions for each dimension. Table 1 shows the center values of the bins.

where *N*_·(*e*) are the numbers of hits, false alarms, misses, and correct rejections, and **θ** is the parameter vector with the maximum log likelihood:

where *a* is the target amplitude, and *L*, *C*, and *S* are the background luminance, contrast, and similarity. Setting the detectability to 1.0 shows that the model observer's threshold satisfies separable Weber's law: *a*_t ∝ *L* × *C* × *S*. Text Equation 13 gives the effect of intrinsic position uncertainty on detectability; because *e* and *u* are constants, we still have separable Weber's law: *a*_t ∝ *L* × *C* × *S*.
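The separability argument can be checked numerically. With *d*′ = η*a*/(*kLCS*), setting *d*′ = 1 gives *a*_t = *kLCS*/η, so the threshold scales linearly with each background dimension (η and *k* below are illustrative constants, not fitted values).

```python
def amplitude_threshold(L, C, S, eta=0.73, k=1.0):
    """With d' = eta * a / (k * L * C * S), setting d' = 1 gives the amplitude
    threshold a_t = k * L * C * S / eta, i.e. a_t proportional to L x C x S
    (eta and k are illustrative constants)."""
    return k * L * C * S / eta
```

Doubling any one of *L*, *C*, or *S* doubles the threshold while leaving the other dimensions' effects unchanged, which is exactly the separable multidimensional Weber's law.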