**Classification image analysis is a powerful technique for elucidating linear detection and discrimination mechanisms, but it has primarily been applied to contrast detection. Here we report a novel classification image methodology for identifying linear mechanisms underlying shape discrimination. Although prior attempts to apply classification image methods to shape perception have been confined to simple radial shapes, the method proposed here can be applied to general 2-D (planar) shapes of arbitrary complexity, including natural shapes. Critical to the method is the projection of each target shape onto a Fourier descriptor (FD) basis set, which allows the essential perceptual features of each shape to be represented by a relatively small number of coefficients. We demonstrate that under this projection natural shapes are low pass, following a relatively steep power law. To efficiently identify the observer's classification template, we employ a yes/no paradigm and match the spectral density of the stimulus noise in FD space to the power law density of the target shape. The proposed method generates linear template models for animal shape detection that are predictive of human judgments. These templates are found to be biased away from the ideal, overly weighting lower frequencies. This low-pass bias suggests that higher frequency shape processing relies on nonlinear mechanisms.**

*response = signal 1* trials and subtracting the two means for the *response = signal 0* trials (Ahumada, 2002; Beard & Ahumada, 1998; Murray et al., 2002). We henceforth refer to this method for template estimation as the noise-averaging method.

(*x*, *y*) coordinates in the image. Thus, to understand shape perception, noise must be added not to the gray levels of the image but to the spatial coordinates of the shape.

(*x*, *y*) of the polygon is represented as a coordinate *x* + *yi* in the complex plane. By taking the Fourier transform of the complex vector representing these vertices, we obtain the FD representation, which represents the complex amplitude of the shape at each frequency over the index space of the vector. We stress that by coding a shape as a function of arc length rather than polar angle, the FD representation generates a *complete* (i.e., invertible) description of an arbitrary polygon in the plane and, thus, can faithfully represent general shapes, including natural shapes. This is quite distinct from radial basis functions, which can only represent shapes that are functions of polar angle (e.g., convex shapes, star-like shapes). Importantly, one can capture the main features of a natural shape using only a small number of the lowest FD frequency components, limiting the dimensionality of the stimulus to a manageable level.
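The FD construction described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names `fourier_descriptors` and `reconstruct`, the 1/N normalization, and the toy five-lobed contour are all assumptions, and the vertices are assumed to be sampled at roughly equal steps along the contour.

```python
import numpy as np

def fourier_descriptors(vertices, M):
    """Return frequencies k = -M..M-1 and the FD coefficients S_k of a closed polygon."""
    z = vertices[:, 0] + 1j * vertices[:, 1]   # each vertex (x, y) becomes x + yi
    N = z.size
    S_full = np.fft.fft(z) / N                 # assumed 1/N normalization
    k = np.arange(-M, M)
    return k, S_full[k % N]                    # negative k wrap to the end of the FFT output

def reconstruct(k, S, N):
    """Invert a (possibly truncated) FD representation back to N vertices."""
    S_full = np.zeros(N, dtype=complex)
    S_full[k % N] = S
    z = np.fft.ifft(S_full) * N
    return np.column_stack([z.real, z.imag])

# Toy closed contour (hypothetical): a unit circle modulated by five lobes.
N = 256
t = 2 * np.pi * np.arange(N) / N
z = (1 + 0.3 * np.cos(5 * t)) * np.exp(1j * t)
vertices = np.column_stack([z.real, z.imag])

k, S = fourier_descriptors(vertices, M=16)
low_pass = reconstruct(k, S, N)
max_error = np.abs(low_pass - vertices).max()
```

This toy contour contains only the frequencies *k* = −4, 1, and 6, so the 32-coefficient reconstruction is exact up to rounding; a natural shape would instead be smoothed by the truncation.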

*N*-vector **s** representing the sequence of *N* vertices $s_j$,

*M*-vector **S**, where

*k* indexes frequency: *k* = 0 is the DC component that determines the location of the stimulus (fixed to zero), the sequence *k* = −1, −2, …, −*M* represents low to high negative frequencies, and the sequence *k* = 1, 2, …, *M* − 1 represents low to high positive frequencies. (Although for real input signals Fourier coefficients for negative frequencies are simply the conjugates of the coefficients for the corresponding positive frequencies, this is not the case for complex signals.)

*M* ≪ *N*. In the section on dimensionality, we demonstrate through simulation how using a smaller value for *M* leads to more accurate estimation of template coefficients. In our psychophysical experiments, we set *M* = 16; Figure 3 shows an example. Because we must represent both real and imaginary FD coefficients, this shape is described by a total of 64 parameters. These low-pass animal shapes will be used as signal 1 for our discrimination experiments. For signal 0, we construct a stimulus consisting only of the first fundamental ($S_{-1} + S_{+1}$) of the FD representation for the corresponding animal shape with all other coefficients (including DC) set to zero. The resulting signal 0 stimulus traces an ellipse roughly approximating the animal shape. Note that, because the DC component and first fundamental are matched between signals 1 and 0, discrimination must be based on the 29 positive and negative higher frequency components *k* = ±2, …, ±(*M* − 1), −*M*.
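The parameter counts quoted above can be checked with a short sketch. The coefficient values below are placeholders (random phases with an arbitrary falling amplitude), not data from the study; only the index bookkeeping follows the text.

```python
import numpy as np

M = 16
k = np.arange(-M, M)                       # 32 frequencies: -16, ..., 15
rng = np.random.default_rng(0)

# Hypothetical signal 1: amplitude falling with |k|, random phase, DC fixed to zero.
amplitude = np.maximum(np.abs(k), 1) ** -1.5
S1 = amplitude * np.exp(1j * rng.uniform(0, 2 * np.pi, k.size))
S1[k == 0] = 0

# Signal 0 keeps only the first fundamental (k = -1 and k = +1) of signal 1.
S0 = np.where(np.abs(k) == 1, S1, 0)

n_parameters = 2 * k.size                  # real and imaginary part of each coefficient
n_discriminative = np.count_nonzero(S1 - S0)
```

With *M* = 16 this gives 64 real parameters, of which the 29 complex coefficients at *k* = ±2, …, ±15, −16 differ between the two signals.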

**S** corrupted by a complex Gaussian noise vector **N** added in the FD domain:

**n** is the inverse Fourier transform of **N**. Note that because the Fourier transform is linear, **n** will also be Gaussian.

*α* in the range 1.3 to 1.7. (Power law fits explain between 48% and 65% of the variance in these three examples.)

**W** that maximizes SNR and, hence, discriminability *d*′. If the noise were white, the ideal observer would form a real-valued scalar decision variable *R* given by

$\mathbf{W} = \mathbf{S}_1 - \mathbf{S}_0$ is the signal difference, $\mathbf{W}^H$ is the Hermitian conjugate (i.e., the conjugate transpose) of **W**, and the operator Re() takes the real part of its argument.

*k* (Kay, 1998). Because in our case the noise at each frequency is independent, Σ is diagonal, and the ideal template can be written as

**S** falls roughly as

*R* can also be computed in the spatial domain:

**w** is the inverse Fourier transform of **W**. In other words, the observer's computation can be thought of as an inner product in either the FD or spatial domains, and the corresponding templates are related through the Fourier transform.
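The equivalence of the FD-domain and spatial-domain inner products can be illustrated numerically. The sketch below assumes the decision variable has the form Re(**W**^H **X**), with **X** an assumed name for the noisy stimulus in the FD domain, and uses random placeholder vectors; the factor of *n* arises only from NumPy's convention of placing the 1/*n* normalization in the inverse transform.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 64

# Placeholder FD-domain template W and stimulus X (random, for illustration only).
W = rng.normal(size=n) + 1j * rng.normal(size=n)
X = rng.normal(size=n) + 1j * rng.normal(size=n)

# Spatial-domain counterparts via the inverse Fourier transform.
w = np.fft.ifft(W)
x = np.fft.ifft(X)

# np.vdot conjugates its first argument, so vdot(W, X) equals W^H X.
R_fd = np.real(np.vdot(W, X))
R_spatial = n * np.real(np.vdot(w, x))     # n compensates for NumPy's 1/n in ifft
```

The two values agree to floating-point precision, which is the discrete form of Parseval's theorem.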

**W** can be computed as

*i* and the observer indicated signal *j* (Ahumada, 2002; Beard & Ahumada, 1998; Murray et al., 2002).
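As a rough illustration of the noise-averaging method, the sketch below simulates a noiseless linear observer and recovers its template from the mean noise in each signal-response cell. The combination rule used here (sum of the *response = signal 1* means minus sum of the *response = signal 0* means) is the standard form in the cited literature; the template, trial count, criterion, and variable names are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
K, T = 8, 20000                            # complex template length, number of trials

# Hypothetical observer template, normalized to unit length.
W_true = rng.normal(size=K) + 1j * rng.normal(size=K)
W_true /= np.linalg.norm(W_true)
dS = 1.0 * W_true                          # signal difference (signal 0 taken as zero)

signal = rng.integers(0, 2, size=T)        # which signal was shown on each trial
noise = rng.normal(size=(T, K)) + 1j * rng.normal(size=(T, K))
stimulus = signal[:, None] * dS + noise

# Linear observer: R = Re(W^H X), compared against a fixed criterion.
R = np.real(stimulus @ np.conj(W_true))
response = (R > 0.5).astype(int)

def mean_noise(i, j):
    """Mean noise vector over trials with signal i and response j."""
    return noise[(signal == i) & (response == j)].mean(axis=0)

W_hat = (mean_noise(0, 1) + mean_noise(1, 1)) - (mean_noise(0, 0) + mean_noise(1, 0))

# Agreement in direction; the estimate is only defined up to a scale factor.
cosine = np.real(np.vdot(W_true, W_hat)) / (np.linalg.norm(W_true) * np.linalg.norm(W_hat))
```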

**ΔS**. However, for the GLM method, the estimated template

**ΔS**.

**W** is only identifiable up to a scale factor. To facilitate comparison with the ideal observer, we scale the classification image

*β* that minimizes a measure of deviation from the signal difference **ΔS**. We expect uncertainty in the estimated coefficients of the observer template **W** to scale with the standard deviation of the stimulus noise $\sigma_k$. We therefore determine the scale factor *β* that minimizes the *weighted* squared deviation between the observer template and the signal difference:
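One way to realize such a weighted fit is ordinary weighted least squares. The sketch below assumes a real-valued scale factor and weights of $1/\sigma_k^2$, neither of which is confirmed by the surviving text; under those assumptions the minimizer of $\sum_k |\beta \hat{W}_k - \Delta S_k|^2 / \sigma_k^2$ has a closed form. The function name `fit_scale` is hypothetical.

```python
import numpy as np

def fit_scale(W_hat, dS, sigma):
    """Real beta minimizing sum_k |beta * W_hat_k - dS_k|^2 / sigma_k^2 (assumed weighting)."""
    weights = 1.0 / sigma ** 2
    numerator = np.sum(weights * np.real(np.conj(W_hat) * dS))
    denominator = np.sum(weights * np.abs(W_hat) ** 2)
    return numerator / denominator

# Sanity check on synthetic values: a template that is exactly dS / 3.
rng = np.random.default_rng(6)
dS = rng.normal(size=10) + 1j * rng.normal(size=10)
sigma = np.linspace(0.5, 2.0, 10)
beta = fit_scale(dS / 3, dS, sigma)
```

For a template that is an exact scaled copy of the signal difference, the recovered factor is the inverse of that scale regardless of the weights.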

**Δs**. Note that the fundamental of the FD representation forms an ellipse in the spatial domain that acts as a kind of “scaffold” that higher FD frequencies modulate. Because the fundamentals of signal 0 and signal 1 are matched, the fundamental of the signal difference is zero. As a result, rendering the signal difference or estimated human template without the fundamental generates an uninterpretable squiggle. This is rectified by adding the fundamental to the ideal and estimated human templates before displaying them in the spatial domain.

*M* of the low-pass FD shape representation affects the accuracy of template estimation, we conducted a second simulation, again using the rabbit shape as signal but varying the dimensionality *M* of both the signal and the noise components of the stimulus (i.e., the number of low-frequency harmonics) and adjusting the gain of the stimulus noise to maintain 75% correct performance. For each value of *M*, we ran 30 simulated experiments of 1,500 trials each and used the GLM method to estimate the observer template. Figure 5b shows that, as the stimulus dimensionality *M* increases, the mean error of estimated observer template coefficients also increases for all estimated FD frequencies. This result demonstrates the importance of using a low-dimensional FD subspace to obtain accurate shape template estimation.

*d′*) (Figure 6). We found that although both FD amplitude and phase carry information, phase is, on average, somewhat more informative than amplitude (mean *d′* of 1.2 for phase vs. 0.75 for amplitude). A similar pattern is seen for the other two shapes used in this study.

($S_{-1} + S_{+1}$) of the FD representation for the corresponding animal shape (signal 1), which traces an ellipse roughly approximating the animal shape.

$\sigma_k$ is the standard deviation of the added Gaussian noise at frequency *k* and *α* is the power-law exponent of the spectral density of the natural shape being discriminated (signal 1). Note that independent noise was added to real and imaginary components of each FD coefficient.

^{1}
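A sketch of noise generation matched to a power-law spectrum is given below. The exact expression for the noise standard deviation is not recoverable from the text here, so the form $\sigma_k = g\,|k|^{-\alpha/2}$ (power falling as $|k|^{-\alpha}$, with gain *g*) is an assumption, as are the function name `fd_noise` and the parameter values.

```python
import numpy as np

def fd_noise(k, alpha, gain, rng):
    """Complex Gaussian FD noise with assumed sd sigma_k = gain * |k|**(-alpha / 2)."""
    sigma = np.zeros(k.shape)
    nonzero = k != 0                       # DC is left noise-free (location is fixed)
    sigma[nonzero] = gain * np.abs(k[nonzero]) ** (-alpha / 2)
    # Independent noise on the real and imaginary components of each coefficient.
    N = sigma * (rng.normal(size=k.size) + 1j * rng.normal(size=k.size))
    return N, sigma

k = np.arange(-16, 16)
rng = np.random.default_rng(3)
N_fd, sigma = fd_noise(k, alpha=1.5, gain=0.1, rng=rng)
```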

*α* exponents of the best-fitting power laws) for templates estimated from all trials, signal 0 trials only, or signal 1 trials only, *F*(1, 12) = 0.29, *p* = 0.6. This suggests that these nonlinearities have more to do with the phase tuning of human shape-discrimination mechanisms.

*F*(1, 12) = 23.5, *p* = 0.0004.

$M_H$ based upon the estimated human shape template and a linear model $M_I$ based upon the ideal template. To avoid overfitting, $M_H$ templates and responses were computed using leave-one-out cross-validation over the 1,500 trials of the experiment.

*t* score method employed by Morgenstern and Elder (2012), which is related to the measure of choice probability introduced by Britten, Newsome, Shadlen, Celebrini, and Movshon (1996). Although the original choice probability method was nonparametric, in our experiments the Gaussian nature of the stimulus noise means that model responses will also be Gaussian distributed, making a parametric approach appropriate.

*t* score method measures the agreement between the scalar values of the model decision variable and the binary responses of the human observer. The premise is that if the model decision variable causally determines the human responses, its value should be predictive of those responses. To assess this, the trials are first partitioned into a subset in which the stimulus contained signal 0 and a subset in which the stimulus contained signal 1. Then, for each of these subsets, the *t* score for the difference in the mean model response when the observer responded signal 1 versus when they responded signal 0 is computed:

$R_M$ is the model response, $R_H$ is the human response, and $n_1$ and $n_0$ are the numbers of trials in which the human observer responded signal 1 and signal 0, respectively. To be consistent with human judgments, the model should generate high values when the observer responds signal 1 and low values when the observer responds signal 0, thus producing a large positive *t* score.
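The *t* score computation for one signal subset might be sketched as follows. The unpooled (Welch-style) standard error used here is an assumption, since the exact expression is not recoverable from the text; the function name `t_score` and the synthetic data are likewise hypothetical.

```python
import numpy as np

def t_score(model_resp, human_resp):
    """t score for the difference in mean model response between human response classes."""
    r1 = model_resp[human_resp == 1]       # trials on which the observer responded signal 1
    r0 = model_resp[human_resp == 0]       # trials on which the observer responded signal 0
    n1, n0 = r1.size, r0.size
    # Assumed form: unpooled variance estimate of the difference of means.
    standard_error = np.sqrt(r1.var(ddof=1) / n1 + r0.var(ddof=1) / n0)
    return (r1.mean() - r0.mean()) / standard_error

# Synthetic subset: human responses driven partly by the model decision variable.
rng = np.random.default_rng(5)
model_resp = rng.normal(size=1000)
human_resp = (model_resp + rng.normal(size=1000) > 0).astype(int)
t = t_score(model_resp, human_resp)
```

In an actual analysis the function would be applied separately to the signal 0 and signal 1 subsets, as described above.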

*t* scores for a linear template model $M_H$ based on the estimated human template with those for a linear template model $M_I$ based on the ideal template; *t* scores are generally higher for $M_H$, *F*(29, 1) = 10.3, *p* = 0.0033, indicating that the model based upon the estimated human template is more consistent with human behavior than the ideal template model.

$M_H$ and $M_I$ to include added internal zero-mean Gaussian noise.

$M_H$ employs the estimated observer template, and $M_I$ employs the ideal template. The gain of the internal noise was adjusted so that the proportion correct of the model matched that of the human observer. Specifically, we measured proportion correct for the model over a range of noise gains, fit a sigmoid function, and then from this function estimated the noise gain that would generate the proportion correct attained by the human observer.

$p_c$ is proportion correct. The estimated observer template model $M_H$ is consistently more predictive of human judgments than the ideal template model $M_I$, and a three-way ANOVA (model × shape × observer) reveals that this difference (main effect of model) is significant, *F*(1, 12) = 26.42, *p* = 0.0002. This result clearly indicates the utility of the shape-classification image method: It produces a model that is significantly more predictive of human performance than could be obtained by simply degrading an ideal observer model with noise.

$M_H$ with human judgments is in the 70%–76% range. Should this be considered good? If $M_H$ were a perfect model of the human observer, we would expect its agreement with the human observers to be comparable to its *internal* consistency, i.e., the agreement between its responses to identical stimuli (same stimulus noise sample, but different internal noise samples). The results of this analysis are shown in Figure 14b. We observe that the internal consistency of the noisy ideal observer $M_I$ is at chance levels, indicating that internal noise is the dominant factor limiting its performance. In contrast, we observe much higher internal consistencies for our noisy observer template model $M_H$, in the range of 72%–90%, indicating that stimulus noise and internal noise jointly determine its performance. Importantly, we note that the internal consistency of $M_H$ is considerably higher than its agreement with our human observers. This discrepancy shows that $M_H$ is not a perfect model of human shape detection; deviations could include both inaccuracies in the estimated template and unmodeled nonlinearities in the human visual detection mechanism.

$M_H$ is from being a perfect model, it would be helpful to also know the internal consistency of the human observer when presented with repeated trials of exactly the same stimulus (same signal, same noise sample). This internal consistency yields an upper bound on the agreement any model with the same internal consistency could hope to achieve with the human data. If model $M_H$ achieves agreement close to this upper bound, it should be judged a good model. This motivates our second experiment.

*t* scores for a (noiseless) linear template model $M_H$ based on the estimated human template with those for a (noiseless) linear template model $M_I$ based on the ideal template; *t* scores are generally higher for $M_H$, *F*(29, 1) = 25.1, $p = 2.5 \times 10^{-5}$, indicating that the model based upon the estimated human template is more consistent with human behavior than the ideal template model.

$M_H$ and $M_I$ to match the performance of our human observers. Figure 17a shows the trial-by-trial agreement of the two resulting models with the human data (i.e., the proportion of model responses matching human responses) together with the average agreement between different human observers as a reference. A three-way ANOVA (model × shape × observer) reveals a significant main effect of model, *F*(2, 20) = 31.78, $p = 6.2 \times 10^{-7}$, and a (protected) least significant difference (LSD) post hoc test indicates that the model based on the estimated observer templates $M_H$ is significantly more predictive of human responses than the noisy ideal observer model $M_I$, *t*(2) = 5.36, $p = 3.12 \times 10^{-5}$. At the same time, the agreement between different human observers exceeds that between $M_H$ and human observers, *t*(2) = 2.43, *p* = 0.025. We conclude from this analysis that, although the noisy observer template model $M_H$ provides a better account of human judgments than the noisy ideal template model $M_I$, there is an important aspect of human shape discrimination, common to our three observers, that is captured by neither model.

$M_I$ hovers around chance levels, indicating that internal noise is the dominant factor limiting its performance, and the internal consistency of the noisy observer template model $M_H$ is consistently higher, indicating that stimulus and internal noise jointly limit its performance. A three-way ANOVA (model × shape × observer) reveals a significant effect of model on internal consistency, *F*(2, 20) = 33.96, $p = 3.7 \times 10^{-7}$. Post hoc tests, again using Fisher's protected LSD, reveal that model $M_H$ has significantly higher internal consistency (lower internal noise) than $M_I$, *t*(8) = 5.54, $p = 2.0 \times 10^{-5}$. At the same time, our human observers have higher internal consistency than $M_H$, *t*(8) = 2.50, *p* = 0.02.
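The notion of internal consistency used in these comparisons can be illustrated with a simple double-pass simulation: the same stimulus-driven decision values are presented twice, each time with a fresh internal noise sample, and the proportion of matching responses is recorded. The decision values, noise levels, and the function name `internal_consistency` below are assumptions for illustration only.

```python
import numpy as np

def internal_consistency(decision, noise_sd, criterion, rng):
    """Proportion of identical responses across two passes with independent internal noise."""
    pass1 = decision + rng.normal(scale=noise_sd, size=decision.size) > criterion
    pass2 = decision + rng.normal(scale=noise_sd, size=decision.size) > criterion
    return np.mean(pass1 == pass2)

rng = np.random.default_rng(7)
decision = rng.normal(size=5000)           # hypothetical noiseless template responses

# Small internal noise: responses are governed mostly by the stimulus.
consistency_low_noise = internal_consistency(decision, 0.1, 0.0, rng)
# Large internal noise: responses are nearly independent across passes.
consistency_high_noise = internal_consistency(decision, 20.0, 0.0, rng)
```

A model dominated by internal noise approaches 50% consistency, matching the chance-level behavior reported for the noisy ideal observer, whereas a model limited mainly by stimulus noise approaches 100%.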

$M_H$) curves suggests that $M_H$ does a reasonable job in capturing this systematic inefficiency.

$M_H$ approaches the internal consistency of human observers when matched to human performance (Figure 17). Finally, a plot of performance (proportion correct) versus internal consistency (Figure 18) reveals that human performance is limited by a substantial degree of systematic inefficiency, roughly matched by the systematic inefficiency of our estimated observer templates (model $M_H$).

$M_H$ model are closest to ideal for the turtle shape, which is the most low pass of the three shapes tested (Figure 4).

*kσ*, where *k* is the FD frequency. To take this into account, the observer should attenuate these higher shape frequencies in the linear template. This hypothesis could be tested in the future by measuring performance as a function of the variance of added phase noise to identify the equivalent internal phase noise for both high and low frequencies (Pelli & Farell, 1999).

$S_k$, but also their phase-invariant moduli

$S_k$ template coefficients. We found that, in fact, the agreement with the human data for the two models was very similar, and the low-pass fall off was nearly identical for the linear and phase-invariant coefficients. We conclude from this analysis that nonlinearities in human shape discrimination cannot be accounted for by a simple shift from linear to phase-invariant encoding of the stimulus at higher frequencies.

*is* based on an incoherent (phase-invariant) energy pooling but over highly *localized* linear filters. Similarly, it is possible that in our experiments higher shape frequencies are coded at least partially incoherently by localized shape mechanisms and combined through nonlinear (e.g., energy) pooling. Candidate localized shape encoding mechanisms include *shapelets* (Dubinskiy & Zhu, 2003) and *formlets* (Elder et al., 2013). It is also quite possible that higher FD frequency components are not processed independently from other components. For example, coding of higher FD frequencies may be conditioned upon phase alignment with lower FD frequencies.

*Proceedings of SPIE*, 4324 (pp. 114–122). Bellingham, WA: SPIE. https://doi.org/10.1117/12.431179.

*IEEE Transactions on Medical Imaging*, 21 (5), 429–440.

*Journal of the Optical Society of America A*, 24 (12), B110–B124.

*Journal of Vision*, 2 (1): 8, 121–131, https://doi.org/10.1167/2.1.8.

*Investigative Ophthalmology and Visual Science*, 40 (4), S572.

*Psychological Review*, 61 (3), 183–193.

*Nature*, 514 (7521), 223–227.

*Proceedings of SPIE*, 3299 (pp. 79–85). Bellingham, WA: SPIE. https://doi.org/10.1117/12.320099.

*Pattern recognition and machine learning*. New York: Springer.

*Journal of Theoretical Biology*, 38, 205–287.

*Visual Neuroscience*, 13, 87–100.

*Current Opinion in Neurobiology*, 17, 140–147.

*Journal of Experimental Psychology: Human Perception and Performance*, 36 (4), 976–993.

*Proceedings of the 9th IEEE ICCV* (pp. 249–256). Los Alamitos, CA: IEEE Society.

*Image and Vision Computing*, 31, 1–13.

*Journal of Vision*, 9 (7): 7, 1–20, https://doi.org/10.1167/9.7.7.

*Neuroreport*, 9 (2), 303–308.

*Proceedings of the National Academy of Sciences, USA*, 103 (47), 18014–18019.

*Journal of the Optical Society of America A*, 4 (12), 2379–2394.

*Journal of Vision*, 8 (9): 4, 1–15, https://doi.org/10.1167/8.9.4.

*IEEE Transactions on Computers*, C-21, 195–201.

*Signal detection theory and psychophysics*. New York: Wiley.

*IEEE Transactions on Medical Imaging*, 26 (2), 648–659.

*Cognition*, 18 (1–3), 65–96.

*IEEE Transactions on Pattern Analysis and Machine Intelligence*, 18 (3), 267–278.

*Fundamentals of statistical signal processing: Detection theory*. Englewood Cliffs, NJ: Prentice-Hall.

*International Journal of Computer Vision*, 15, 189–224.

*Journal of Vision*, 8 (16): 10, 1–19, https://doi.org/10.1167/8.16.10.

*Journal of Vision*, 14 (12): 24, 1–19, https://doi.org/10.1167/14.12.24.

*Cognitive Science*, 13, 357–387.

*IEEE Transactions on Pattern Analysis and Machine Intelligence*, 8 (1), 34–43.

*Journal of Neuroscience*, 32 (11), 3679–3696.

*Journal of Vision*, 11 (5): 2, 1–25, https://doi.org/10.1167/11.5.2.

*Vision Research*, 123, 26–32.

*Journal of Vision*, 2 (1): 6, 79–104, https://doi.org/10.1167/2.1.6.

*Journal of Vision*, 7 (2): 5, 1–26, https://doi.org/10.1167/7.2.5.

*IEEE Transactions on Pattern Analysis and Machine Intelligence*, 2, 301–312.

*Journal of the Optical Society of America A*, 16 (3), 647–653.

*The Bell System Technical Journal*, 24 (1), 46–156.

*International Journal of Computer Vision*, 70 (1), 55–75.

*Journal of Vision*, 2 (1): 7, 105–120, https://doi.org/10.1167/2.1.7.

*On growth and form*. Cambridge, UK: Cambridge University Press.

*Nature*, 381, 520–522.

*Investigative Ophthalmology and Visual Science*, 39 (Suppl. 4), S912.

*Perception & Psychophysics*, 33 (2), 113–120.

^{1}Although we defined correctness in terms of the signal (0 or 1) used to generate the stimulus and not in terms of the ideal observer response, for the range of noise gains used in our psychophysical experiments, the ideal observer always generated the correct response, and so the two were equivalent.

*z* = *x* + *iy* be a complex normal random variable, where the real and imaginary components may have different means but identical variance:

*z* can also be represented in polar coordinates (*r*, *θ*), where *x* = *r* cos *θ*, *y* = *r* sin *θ*. We wish to identify the probability density of *z* in this polar coordinate frame.

*r′* on the domain

*G*(*x*) is the cumulative distribution of a standard normal variable *x*.

*θ* is symmetric about $\theta_0$. We have verified this equation by sampling *z* and comparing against the resulting empirical density of the phase.
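A sampling check of this kind can be sketched as follows. The closed form used below is a standard expression for the phase density of a complex normal variable with mean $\mu e^{i\theta_0}$ and equal variance $\sigma^2$ on the real and imaginary parts, obtained by integrating the joint polar density over *r*; it is offered as an assumed stand-in, since the equation itself is not recoverable from the text here, and the names `phase_density`, `mu`, and `theta0` are hypothetical.

```python
import math
import numpy as np

def phase_density(theta, mu, theta0, sigma):
    """Phase density of z = mu*exp(i*theta0) + sigma*(n1 + i*n2), n1, n2 standard normal."""
    rho = mu / sigma
    q = rho * np.cos(theta - theta0)
    G = 0.5 * (1 + np.vectorize(math.erf)(q / math.sqrt(2)))   # standard normal CDF at q
    return (np.exp(-rho ** 2 / 2) / (2 * np.pi)
            * (1 + np.sqrt(2 * np.pi) * q * np.exp(q ** 2 / 2) * G))

rng = np.random.default_rng(4)
mu, theta0, sigma = 1.5, 0.7, 1.0
n_samples = 200000
z = mu * np.exp(1j * theta0) + sigma * (rng.normal(size=n_samples)
                                        + 1j * rng.normal(size=n_samples))

# Empirical phase density from a histogram over (-pi, pi].
hist, edges = np.histogram(np.angle(z), bins=60, range=(-np.pi, np.pi), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
analytic = phase_density(centers, mu, theta0, sigma)

max_abs_diff = np.abs(hist - analytic).max()
total_mass = np.sum(analytic) * (edges[1] - edges[0])
```

The analytic density depends on *θ* only through cos(*θ* − $\theta_0$), which makes its symmetry about $\theta_0$ explicit.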

**w** can be computed as

*i* and the observer indicated signal *j* (Ahumada, 2002; Murray et al., 2002).

*m*-dimensional real-valued random vector representing a visual stimulus, where **s** is a binary signal variable taking one of two values $\mathbf{s}_0$ or $\mathbf{s}_1$ and *β* is also a random variable. We let

**w** is an *m*-vector representing the observer template. Without loss of generality, we assume that

**U** be an orthonormal rotation matrix with first column **w**, so that

**n′**, respectively.

**U** and making the substitution

**w** can be obtained by premultiplying

*r* given by

**w** is the complex-valued observer template and

$\mathbf{w}_x$ and $\mathbf{w}_y$ are the real and imaginary components of **w** and

$\mathbf{w}_{xy}$ can be obtained by normalizing the biased estimate (Equation 13) by the covariance of

$\mathbf{w}_{xy}$ with the real and imaginary coefficients of the complex-valued template **w**, Equation 28 also yields an unbiased estimate

*is block-diagonal and can be written as*

_{xy}