Classification images have recently become a widely used tool in visual psychophysics. Here, I review the development of classification image methods over the past fifteen years. I provide some historical background, describing how classification images and related methods grew out of established statistical and mathematical frameworks and became common tools for studying biological systems. I describe key developments in classification image methods: use of optimal weighted sums based on the linear observer model, formulation of classification images in terms of the generalized linear model, development of statistical tests, use of priors to reduce dimensionality, methods for experiments with more than two response alternatives, a variant using multiplicative noise, and related methods for examining nonlinearities in visual processing, including second-order Volterra kernels and principal component analysis. I conclude with a selective review of how classification image methods have led to substantive findings in three representative areas of vision research, namely, spatial vision, perceptual organization, and visual search.

^{ SR }is the sample average of noise fields in a stimulus–response class of trials, e.g.,

^{12}is the average of the noise fields over all trials where the stimulus contained signal 1 but the observer identified it as signal 2. 1 gives a summary of the notation used throughout the article.

^{11}and

^{21}will show what features the observer took to be similar to signal 1 and dissimilar to signal 2, since they are averages of noise fields over trials where the observer identified the stimulus as signal 1. If we expect these two images to be similar, then we can sum them to reduce sampling noise. For the same reasons, we might expect

^{12}and

^{22}to show what features the observer took to be similar to signal 2 and dissimilar to signal 1, and we could sum them as well. If we believe that these two summed images,

^{11}+

^{21}and

^{12}+

^{22}, are on average photographic negatives of one another, since they are based on noise fields that led to opposite responses, then we can reduce sampling noise further by adding one to the negative of the other. This sequence of averages, sums, and differences leads to Equation 1.

**n**and the observer's responses

*r*(a random variable where

*r*= 1 or

*r*= 2 on each trial) is

*σ*

_{ n }is the pixelwise standard deviation of the noise field

**n**and

*σ*

_{ r }is the standard deviation of

*r*. With zero-mean noise (

*E*[

**n**] = 0) and an unbiased observer (

*E*[

*r*] = 1.5, since an unbiased observer gives responses 1 and 2 equally often), Equation 2 becomes

*σ*

_{ n }

*σ*

_{ r }to eliminate the scale factor leads to

^{*R }is the sample average of the noise fields over all trials where the observer gave response

*R*. Thus,

**c**

_{corr}is the average of the noise fields over all trials where the observer responded

*r*= 2, minus the average over all trials where the observer responded

*r*= 1, regardless of which signal was shown. Equations 1 and 7 are both weighted sums in which noise fields from trials where the observer responded

*r*= 2 are weighted positively, and noise fields from trials where the observer responded

*r*= 1 are weighted negatively. Thus, a classification image calculated using the standard method in Equation 1 is, loosely speaking, similar to a map showing the correlations between stimulus fluctuations at each pixel and the observer's responses.

^{1}

*x*(

*t*) and a time-varying output

*y*(

*t*), such as a photoreceptor whose input is the luminance at a retinal location and whose output is a membrane potential. The system may be internally complex and may have an intricate relationship between input and output, for example, showing temporal inhibition, gain control, and so on. Volterra (1930) and Wiener (1958) showed that under certain broad conditions (e.g., the system must be time-invariant and have finite memory), such a system can be approximated as a sum of simple subsystems: a zero-order subsystem, plus a first-order subsystem, plus a second-order subsystem, etc. Each subsystem responds to the input in a straightforward way. In Volterra's framework, the output of the zero-order subsystem is a constant

*H*

_{0}, independent of the input. The output

*H*

_{1}(

*t*) of the first-order subsystem is a weighted sum of past inputs, weighted according to a function

*h*

_{1}(

*t*

_{1}) called the first-order

*kernel*:

*H*

_{2}(

*t*) of the second-order subsystem is a weighted sum of pairwise products of past inputs, weighted according to the second-order kernel

*h*

_{2}(

*t*

_{1},

*t*

_{2}):

*n*th-order subsystem is a weighted sum of

*n*-wise products of past inputs, weighted according to the

*n*th-order kernel

*h*

_{ n }(

*t*

_{1},

*t*

_{2}, …,

*t*

_{ n }):

*n*-dimensional convolution.) The output of the system is approximated as the sum of the outputs of the subsystems:

*x*−

*x*

_{0}), (

*x*−

*x*

_{0})

^{2}, (

*x*−

*x*

_{0})

^{3}, etc. In fact, the Volterra series has been called a “Taylor series with memory” (Schetzen, 1980, p. 200), as it allows the estimate of

*y*(

*t*) to depend not only on powers of

*x*(

*t*) at time

*t,*but also on powers and products of past values

*x*(

*t*−

*t*

_{1}).

*G*

_{ i }based on kernels

*g*

_{ i }. The relationship between

*G*

_{ i }and

*g*

_{ i }is similar to the relationship between

*H*

_{ i }and

*h*

_{ i }, but Wiener introduced some refinements that make

*G*

_{ i }easier to use for modeling physical systems. For a thorough account of Volterra and Wiener kernel methods, see Schetzen (1980).

*g*

_{ i }simply by giving it a white noise input and measuring correlations between its input and its output. They showed that the zero-, first-, and second-order Wiener kernels can be measured as

*x*(

*t*) is the zero-mean white noise input and

*K*=

*E*[

*x*(

*t*)

^{2}] is its power spectral density. We can find these expected values by averaging over time (i.e., we assume ergodicity): we give the system a white noise input, and over many values of time

*t,*we find the averages of

*y*(

*t*),

*x*(

*t*−

*t*

_{1})

*y*(

*t*), and

*x*(

*t*−

*t*

_{1})

*x*(

*t*−

*t*

_{2})(

*y*(

*t*) −

_{0}). The discovery that Wiener kernels can be estimated this way held out the possibility of using simple physical measurements to completely characterize the input–output patterns of complex systems.

*linear observer model*is the natural starting point for understanding classification image methods (Abbey, Eckstein, & Bochud, 1999; Ahumada, 2002; Murray, Bennett, & Sekuler, 2002; Solomon, 2002).

**s**

^{1}and

**s**

^{2}, shown in a noise field

**n**that varies from trial to trial. We will let the random variable

*k*be the signal number (1 or 2) on any given trial, so the stimulus is

**g**=

**s**

^{ k }+

**n**, where the components of vectors

**g**,

**s**

^{ k }, and

**n**encode the stimulus contrast at each pixel. We will represent

**g**,

**s**

^{ k }, and

**n**as

*n*× 1 column vectors, even when the stimuli are shown as two-dimensional images in the experiment. The linear observer model assumes that the observer has two templates,

**t**

^{1}and

**t**

^{2}, that are internal representations of the signals; we also represent these as

*n*× 1 column vectors. The observer computes decision variables

*d*

^{1}and

*d*

^{2}by taking the dot product of the two templates with the stimulus. The observer may also add samples from independent, equal-variance internal noise sources,

*z*

^{1}and

*z*

^{2}, to the dot products. That is, the decision variables are

^{ T }is the matrix transpose operation, so

*p*

^{ T }

*q*=

*p*[

*i*]

*q*[

*i*] is the dot product of column vectors

*p*and

*q*. (I use brackets to refer to vector components, because later I will use subscripts to refer to samples from a random variable.) The model assumes that the observer identifies the signal as

**s**

^{1}if

*d*

^{1}plus some constant

*a*is larger than

*d*

^{2}. That is, the response variable

*r*is

*a*allows the model observer to be biased toward choosing one response more often than the other.

*r*= 2 if

**w**=

**t**

^{2}−

**t**

^{1}and an internal noise source

*z*with a variance

*σ*

_{ z }

^{2}that is twice the variance of

*z*

^{1}and

*z*

^{2}:

**t**

^{1}and

**t**

^{2}, since the model depends only on the difference template

**w**. Later, when we discuss experiments with more than two signals, it will be useful to have the multiple-template notation in place (see 12 section).

**g**=

**s**

^{ k }+

**n**and shows how the observer categorizes them. For simplicity, suppose the stimulus is an image with just two pixels, so that we can represent all possible stimuli on a two-dimensional plane (Figure 2a). If the linear observer has no internal noise, then the decision space is divided into two regions:

**w**

^{ T }

**g**<

*a,*where the observer gives response 1, and

**w**

^{ T }

**g**≥

*a,*where the observer gives response 2. The border that divides the two, where

**w**

^{ T }

**g**=

*a,*is a line that is perpendicular to the template

**w**(the black arrow in Figure 2a) and distance

*a*/∣

**w**∣ from the origin.

**s**

^{1}but the observer identified it as

**s**

^{2}(Figure 2a, large green circle). Because the noise is circularly symmetric, the expected value of the

**s**

^{1}stimuli (small green circles) on the

**s**

^{2}side of the decision line is shifted from the overall mean of stimulus (large white circle) in a direction that is perpendicular to the decision line and so in the same direction as the template. Thus, the average of the noise on all such trials is a vector that is proportional to the template (in expected value). The same is true for the averages of the other three stimulus–response classes of noise fields. Taking into account the direction of each shift (some in the direction of the template and some in the opposite direction), we can estimate the template by combining the averages as in Equation 1. That is, the expected value of the classification image is proportional to the linear observer's template.

**w**and the stimuli

**g**=

**s**

^{ k }+

**n**are

*n*-dimensional vectors, the decision surface

**w**

^{ T }

**g**=

*a*is an

*n*-dimensional hyperplane perpendicular to the template, and the conditional averages of the noise fields are vectors parallel to the template.

**w**

^{ T }

**g**<

*a*region are identified as signal 2, and some in the

**w**

^{ T }

**g**≥

*a*region are identified as signal 1 (Figure 2b). However, the conditional expected values of the noise fields are still shifted perpendicular to the decision line, so Equation 1 still gives an unbiased estimate of the template.

*efficient*way of calculating classification images under some circumstances. Suppose we wish to calculate a classification image simply by taking a weighted sum of the noise fields in the four stimulus–response categories, and we wish to choose the weights based on the signal-to-noise ratio of the noise fields in each category, in order to maximize the signal-to-noise ratio of the final classification image. If the observer is unbiased, and if the observer's performance is constant over the course of the experiment, then the standard method is the weighted sum that has the maximum signal-to-noise ratio as an estimate of the observer's template (Abbey & Eckstein, 2002b; Ahumada, 2002; Murray et al., 2002). This means, for instance, that the standard method in Equation 1 is more efficient than the correlation method in Equation 7, which is often used in auditory research and has sometimes been used in vision research as well. Conveniently, the standard method is the optimal weighted sum regardless of the variance of the observer's internal noise.

*general*linear model, the dependent variable is a normal random variable

*y*whose variance is fixed and whose mean

*μ*

_{ y }=

*E*[

*y*] depends linearly on the covariates:

**x**= (

*x*[1], …,

*x*[

*p*])

^{ T }is a column vector of covariates and

*β*= (

*β*[1], …,

*β*[

*p*])

^{ T }is a column vector of regression coefficients. (As noted earlier, I use brackets to refer to vector components.) This model underpins many common statistical methods for handling continuous data, including multiple linear regression and ANOVA. Binary response probabilities have sometimes been modeled in this framework; in such a model, the covariates

**x**represent the stimulus, the regression coefficients

*β*represent the observer's template, the dependent variable

*y*represents the observer's responses, and the expected value of the dependent variable

*E*[

*y*] represents the observer's response probabilities. However, difficulties arise from the fact that the variance of a Bernoulli random variable depends on its mean, and that probabilities are limited to the range [0, 1] whereas the dependent variable

*y*in this model can take on any value (Dobson & Barnett, 2008). Ahumada and Lovell (1971) found a partial solution to this problem by having observers make four-point confidence rating responses instead of just yes–no detection responses and calculating classification images by regressing the rating responses against noise fields. However, rating responses are not a simple linear transformation of the decision variable (Egan, Schulman, & Greenberg, 1959), so this approach does not completely resolve the mismatch between the general linear model and the linear observer model. Nevertheless, Levi and Klein (2002) reported that this method gave higher signal-to-noise ratios than a weighted sum method, although they did not give details of the comparison.

*generalized*linear model (GLM) is an extension of the general linear model that allows the dependent variable to be continuous or categorical and allows the mean of the dependent variable to be a nonlinear function of the covariates. In the GLM, the dependent variable

*y*is a random variable with mean

*μ*

_{ y }=

*E*[

*y*], and

*μ*

_{ y }is a possibly nonlinear function of a linear transformation of the covariates:

*y*is a random variable in the exponential family, which includes the Bernoulli, binomial, multinomial, Poisson, exponential, and normal distributions, among others. The function

*g*is a smooth, monotonic function, called the link function, that relates the mean of the dependent variable to the covariates. Methods for finding maximum likelihood estimates of the GLM regression coefficients

*β*are available in most statistical software packages. Dobson and Barnett (2008) give a clear introduction to the GLM, and McCullagh and Nelder (1989) give a more thorough treatment.

*i,*

**g**

_{ i }=

**n**

_{ i }, the observer responds

*r*= 2 with some probability

*p*and

*r*= 1 with probability 1 −

*p*. (Here, we subscript

**g**,

**k**, and

**n**to indicate that, as components of a stimulus shown on a specific trial number

*i,*they are now samples from random variables, not random variables.) That is, the dependent variable is a Bernoulli random variable and, thus, belongs to the exponential family. We can define

*y*=

*r*− 1, so that

*y*encodes the observer's responses as 0 and 1, and then, the mean of the dependent variable is

*μ*

_{ y }=

*E*[

*y*] =

*p,*which according to the linear observer model is

*x, μ, σ*), this becomes

^{2}

*β,*which are the observer's template and criterion, expressed as multiples of the internal noise standard deviation. This is a very different approach to calculating classification images than the weighted sum method in Equation 1. Knoblauch and Maloney (2008) compared the two methods in simulations of linear model observers. They compared the methods by examining the mean squared residual of the least-squares fit of classification images to the simulated observer's template. They found that when the model observer had no internal noise, the GLM consistently had lower residuals, but when the model observer had realistic amounts of internal noise (e.g., internal-to-external noise ratio ≥0.5 (Neri, 2010a)), the two methods performed equally well.

^{3}Abbey and Eckstein (2001) reported that a similar maximum likelihood method performed better than a weighted sum method on data from human observers, but they used a suboptimal weighted sum method similar to the correlation method in Equation 7, and their maximum likelihood method incorporated a smoothing prior, so there is not necessarily a contradiction between their results and those of Knoblauch and Maloney. In auditory research, Tang and Richards (2005) found little difference between correlation methods, least-squares regression, logistic regression, and probit regression applied to psychophysical data.

*T*

^{2}test is useful in such cases. The

*T*

^{2}test is a generalization of Student's

*t*-test and relies on the fact that classification images based on weighted sums are, to a very good approximation, multivariate normal. Abbey and Eckstein showed how to use the

*T*

^{2}test to determine whether a classification image's mean is equal to a hypothesized image (such as an ideal observer's template (Geisler, 1989)) and whether two classification images are significantly different. The latter test is useful for detecting nonlinearities in observers' decision strategies, because as we will see later, nonlinearities often lead to differences between classification images calculated from subsets of trials where different signals were shown, e.g., the averages

^{12}−

^{11}and

^{22}−

^{21}, which according to the linear observer model should have identical expected values when the observer is unbiased. Abbey and Eckstein also showed how to apply

*T*

^{2}tests to any linear transformation of a classification image, such as a classification image that has been downsampled or averaged along some dimension in order to increase its signal-to-noise ratio.

*t*-test can show which pixels in a classification image are significantly different from zero, but with hundreds or thousands of pixels, it is important to correct for multiple comparisons, ideally with a method not as conservative as Bonferroni correction. This problem is compounded by the common practice of blurring classification images to increase their signal-to-noise ratio, which introduces correlations among neighboring pixels. Chauvin, Worsley, Schyns, Arguin, and Gosselin (2005) showed that statistical tests based on random field theory, common in neuroimaging studies, can be used for this purpose. They described methods for testing whether a single pixel is significantly different from the image mean and for testing whether a cluster of pixels above some intensity level is larger than expected by chance. These methods can be applied to smoothed or unsmoothed classification images.

*shrinkage*(Duda et al., 2000)). They also tested a prior that penalized high spatial frequencies in the classification image, thereby incorporating a form of smoothing into the estimation process. They made maximum a posteriori estimates of classification images by maximizing likelihood functions that incorporated one of these two priors. In a Gaussian bump detection task with human observers, they found that both priors increased the signal-to-noise ratio of the classification image over a maximum likelihood estimate with no prior and did not introduce obvious artifacts.

*sparse*and

*smooth*. They developed a framework that can accommodate a wide range of sparse and smooth priors. To illustrate their method, they implemented a prior that represents classification images in an overcomplete Gaussian pyramid,

^{4}a multiscale representation that codes an image as a sum of two-dimensional Gaussians at various scales (Burt & Adelson, 1983). The prior penalized large coefficients in the pyramid representation (again, shrinkage), so that only Gaussians that played a strong role in explaining observers' responses appeared in the classification image. Mineault et al. made maximum a posteriori estimates of classification images by maximizing a likelihood function that incorporated this prior. Using simulated and human observer data, they showed that the prior substantially improved the signal-to-noise ratio of classification images.

*m*-alternative experiments.

*m*-alternative classification images using the GLM as follows. Suppose we have signals

**s**

^{1}, …,

**s**

^{ m }, and on each trial, the observer views a noisy stimulus

**s**

^{ k }+

**n**and tries to identify the signal. We can extend the linear observer model by giving the observer templates

**t**

^{1}, …,

**t**

^{ m }. The observer calculates decision variables

*d*

^{1}, …,

*d*

^{ m }by finding the dot product of each template with the stimulus and adding internal noise,

*d*

^{ i }=

**t**

^{ iT }(

**s**

^{ k }+

**n**) +

*z*

^{ i }. The observer identifies the stimulus by choosing the template that elicits the largest decision variable

*d*

^{ i }, adjusted by a criterion

*a*

^{ i }, so the observer's response is

*r*=

*d*

^{ i }−

*a*

^{ i }) (Van Trees, 1968/2001, p. 154). The response variable

*r*is now a multinomial random variable: for each stimulus, there are

*m*possible responses with probabilities

*p*

^{1}, …,

*p*

^{ m }(which sum to one). The template with the largest dot product is most likely to be chosen, but because the decision variables include internal noise there is always some probability that another template will be chosen instead.

*m*-alternative model is redundant, because only the differences between templates affect the observer's responses. We can remove this redundancy by taking one template, say

**t**

^{1}, as a reference

^{5}and subtracting it from all templates:

**w**

^{ i }=

**t**

^{ i }−

**t**

^{1}. Thus, the difference templates

**w**

^{2}, …,

**w**

^{ m }are free to vary, but

**w**

^{1}is always zero. The model observer with decision variables

*d*

^{ i }=

**w**

^{ iT }(

**s**

^{ k }+

**n**) +

*z*

^{ i }gives the same responses as the original model observer. Thus, the most we can expect from a classification image experiment is to recover the linear observer's difference templates

**w**

^{ i }, not the individual templates

**t**

^{ i }. For similar reasons, we can assume without loss of generality that criterion

*a*

^{1}= 0.

_{ i }to be the observer's decision variables without internal noise, adjusted by the criteria:

^{ i }=

**w**

^{ iT }(

**s**

^{ k }+

**n**) −

*a*

^{ i }. Multinomial logistic regression assumes that the response probabilities

*p*

^{ i }are softmax functions of

^{ i }:

_{ i }has the greatest probability of being chosen, but templates with lower values have some probability of being chosen instead.

*i*th classification image contains a positive image of template

*i*and a negative image of template 1. Clear estimates of the target and reference templates appear in each classification image. There are also faint images of other templates, e.g., a ghost of template 2 in classification image 3, suggesting that the estimates are weakly biased. This bias might be due to the small discrepancy between the simulated observer (whose responses were based on the max rule with Gaussian noise) and the assumptions of multinomial logistic regression (which models responses using the softmax rule).

*m*-alternative experiments are already available in statistical software packages that incorporate the GLM, such as R and the MATLAB Statistics Toolbox (R Development Core Team, 2010; The MathWorks, Natick, MA). This illustrates one benefit of formulating observer models in terms of established statistical frameworks. As with the GLM in yes–no experiments, it remains to be seen how well this method works with data from human observers.

*m*-alternative experiments. They divided trials into

*m*groups according to which signal was shown and analyzed the groups separately. Within each group, they measured the pixelwise correlation between the noise fields and the correctness of the observer's responses, coded as correct = 1 and incorrect = 0. They took the correlation map

**c**

^{ i }for each group to be an estimate of the observer's template for the signal

**s**

^{ i }shown in that group. (That is, they used linear discriminant analysis in a one-against-the-rest fashion (Duda et al., 2000).) To validate this method, they reported simulations of a linear model observer that used five templates and made responses by finding the template that had the largest dot product with the stimulus. Their simulations showed that the correlation maps were similar to the simulated observer's templates.

*m*templates. To see why, consider a group of trials that show a particular signal. Within this group, the observer will tend to give correct responses when the noise is similar to the template for that signal and incorrect responses when the noise is similar to one of the other templates. Accordingly, the classification image will have a component similar to the target template but also components similar to the negative images of the other templates. The amplitudes of the negative template images will depend on the probabilities of the various types of errors (i.e., stimulus

*k*=

*i,*response

*r*=

*j*), and so the mixture of negative templates will differ from one classification image to another. It also seems unlikely that the template images will combine additively. This means that contrary to Dai and Micheyl's suggestion, the classification image

**c**

^{ i }does not converge to template

**t**

^{ i }. Further work on this method may find a way of separating the components, but as discussed earlier, the most we can expect is to recover a linear observer's template differences

**t**

^{ i }−

**t**

^{ j }, not the templates

**t**

^{ i }themselves.

*m*− 1 other templates, so the unwanted negative images of the other templates simply changed the amplitude of the classification image and were not visible as distinct artifacts. The fourth column in Figure 3 shows classification images calculated using Dai and Micheyl's method in the simulation discussed earlier (see 2 for details). Here, it is apparent that each classification image contains both the target template and negative images of the other templates.

*bubbles*. This method uses multiplicative noise instead of additive noise and identifies the stimulus regions that help an observer to identify a stimulus correctly. In a typical bubbles experiment, the noise field is a sum of many small, randomly placed two-dimensional Gaussian bumps. The stimulus is one of

*m*signals multiplied pointwise by the noise field, with the result that most of the signal is eliminated, and only randomly placed fragments of the signal remain visible (i.e., the fragments at the locations of the Gaussian bumps). The observer attempts to identify the signal from the visible fragments. The experiment is analyzed by calculating a

*bubbles image*that shows the extent to which each visible stimulus fragment increases the probability that the observer makes a correct response. Currently, the properties of the bubbles method are best understood in the context of the linear observer model. Bubbles images have been shown to recover less information about a linear observer's template than a classification image does, so the bubbles method is most promising for investigating nonlinear decision strategies (Gosselin & Schyns, 2004; Murray & Gold, 2004a, 2004b).

*potent information*(Gosselin & Schyns, 2002). It may be possible to formalize this notion in the GLM framework, using the noise fields as covariates, the correctness of the observer's response as the dependent variable, and potent information maps as the regression coefficients to be estimated. This approach would permit maximum likelihood estimates of bubbles images and could be developed into a GAM (following Knoblauch and Maloney) or modified to incorporate a smooth and sparse prior on bubbles images (following Mineault et al.).

*variance*of the noise fields in each stimulus–response category in a manner analogous to Equation 1:

**n**

_{VAR}

^{ SR }is the variance of the noise fields in a stimulus–response category of trials, e.g.,

**n**

_{VAR}

^{12}is the variance of the noise fields on trials where the stimulus contained signal

**s**

^{1}and the observer responded

*r*= 2. Neri and Heeger's first- and second-order classification images revealed a surprising detection strategy. Three types of events made the observer more likely to say that the target was present: (a) there was a burst of high-energy noise, positive or negative in contrast, at the target location just before the time when the target might appear; (b) there was a burst of positive-contrast noise at the time and location of the target; or (c) there was a burst of negative-contrast noise at the time of and spatially adjacent to the location of the target. Neri and Heeger proposed a neural circuit composed of simple and complex cells as a detection mechanism that is consistent with these classification images. A further experiment suggested that the first-order mechanism is responsible for identification, and the second-order mechanism is responsible for detection. Other researchers have also used second-order classification images (Knoblauch & Maloney, 2008; Murray, 2002), and classification images based on the Fourier power spectrum have been informative as well (Gold, Cohen, & Shiffrin, 2006; Solomon, 2002; Taylor, Bennett, & Sekuler, 2009).

*late*, monotonic nonlinearity does not interfere with the usual method of calculating classification images. Suppose we modify the linear observer model in Equations 20 and 21 by introducing a monotonic nonlinearity

*f*on the decision variable:

*d*=

*f*

^{−1}(

*d**) and criterion

*a*=

*f*

^{−1}(

*a**), so methods appropriate for linear observers can be used to estimate the template

**w**. (The observation that a late, monotonic nonlinearity does not affect the information carried by a decision variable is sometimes called Birdsall's theorem (Lasley & Cohn, 1981).)

**w**as before and also a second-order Volterra kernel, represented as a symmetric matrix

*W*=

*W*

^{ T }. The decision rule is then

**w**), plus a weighted sum of all products of pairs of stimulus elements (with weights in

*W*), plus internal noise.

^{ SR }is the sample covariance matrix of the stimuli in a stimulus–response class of trials, e.g.,

^{12}is the sample covariance matrix over all trials where the stimulus contained signal

**s**

^{1}but the observer gave response

*r*= 2. Neri showed that this method is strictly correct only when the observer is unbiased, but using simulations of model observers he showed that it is reasonably accurate over a wide range of biases. In subsequent work, Neri demonstrated that characteristic patterns in second-order kernels can be used to understand mechanisms of brightness perception and texture perception (Neri, 2009) and to identify simple mechanisms that often appear in models of visual processing, such as divisive normalization and max-rule uncertainty mechanisms (Neri, 2010b, 2010c).

*any*phase appeared in the noise; over trials, Gabor-like noise patterns at different phases would average to zero, and the signal-absent classification image would be empty. Solomon (2002) reported further experiments along these lines (including a classification image analysis on the power spectrum of the stimuli) and reached similar conclusions.

**s**

^{1}is blank and signal

**s**

^{2}is a sine-phase Gabor. A phase-uncertain observer could use an energy detection strategy, in which the decision variable is the squared dot product of the signal with a sine-phase template

**t**

^{ S }, plus the squared dot product of the signal with an orthogonal cosine-phase template

**t**

^{ C }:

*x*

_{1}is in the direction of the sine-phase template

**t**

^{ S }, the second axis

*x*

_{2}is in the direction of the cosine-phase template

**t**

^{ C }, and the remaining axes are in orthogonal directions (Figure 4). In this representation, the observer's decision rule is to respond “present” if

*x*

_{1}

^{2}+

*x*

_{2}

^{2}≥

*a,*and “absent” otherwise. That is, the response depends on whether the stimulus is inside or outside a circle in the

*x*

_{1}

*x*

_{2}plane and does not depend on the stimulus position along the remaining axes: the decision surface is an

*n*-dimensional cylinder. (To simplify the exposition, we have assumed that the observer has no internal noise, but this does not affect the conclusions we will reach.)

^{12}−

^{11}is zero. On signal-present trials, the expected value of the noise fields on trials where the observer responds “present” (yellow points) is proportional to the signal, and the expected value on trials where the observer responds “absent” (blue points) is proportional to the negative of the signal, so the expected value of the signal-present classification image

^{22}−

^{21}is proportional to the signal.

**s**

^{2}=

**s**

^{||}+

**s**

^{⊥}, where

**s**

^{||}is the component in the

*x*

_{1}

*x*

_{2}plane and

**s**

^{⊥}is the component perpendicular to the

*x*

_{1}

*x*

_{2}plane. (Here, || means “parallel” and ⊥ means “perpendicular.”) In this case, only the component

**s**

^{||}that has a nonzero dot product with the observer's templates affects the observer's responses, and the expected value of the classification image on signal-present trials is proportional to

**s**

^{||}, the projection of the signal into the subspace spanned by the observer's templates.

*f*spectrum (Rajashekar, personal communication), but the method is intriguing nevertheless.

*x*

_{1}

*x*

_{2}plane and the remaining principal components in orthogonal directions. The “signal-present” response stimuli form a Gaussian cloud with a cylinder removed, which will have maximum variance in the

*x*

_{1}

*x*

_{2}plane and smaller variance along the remaining axes. Here, we expect to find the two largest principal components in the

*x*

_{1}

*x*

_{2}plane and smaller principal components in orthogonal directions. Rajashekar et al. reported simulations showing that PCA does in fact recover the orthogonal templates of such a phase-uncertain observer.

*m*sequences to estimate receptive fields more efficiently (Sutter, 1987) and using natural images to characterize neural responses under realistic stimulus conditions (David, Vinje, & Gallant, 2004), whereas there has been little work along these lines in psychophysics (but see Dobres & Seitz, 2010 on

*m*sequences and Abbey & Eckstein, 2001 on natural images). Victor (2005) discusses similarities between classification image methods, neural receptive field mapping methods, and neuroimaging analysis methods and also gives a clear overview of how nonlinearities, internal noise, and high dimensionality pose difficult problems that these methods must overcome. Wu et al. (2006) give an in-depth review of recent work on methods for neural receptive field mapping.

*is*the experimental manipulation. Such studies are useful for understanding perception of cluttered scenes, including natural scenes and medical images. They can also help refine our use of other methods that use visual noise, such as band-pass masking (Solomon & Pelli, 1994) and noise masking functions (Pelli & Farell, 1999). Abbey and Eckstein (2007) and Conrey and Gold (2009) measured classification images in white, low-pass, high-pass, and band-pass noise and found that observers' templates were different in different types of noise. The most notable pattern was that in low-pass noise, observers shifted to using higher spatial frequencies, but in high-pass noise they were unable to shift to lower spatial frequencies. Furthermore, Abbey and Eckstein (2009) found that observers' templates changed as a function of noise amplitude: in stronger noise, observers shifted to lower spatial frequencies. Classification image studies and other types of noise-based studies typically assume that noise does not substantially change observers' strategies, but these findings show that it can sometimes have a moderate effect. These findings are consistent with earlier performance-based studies (e.g., Burgess, 1999), but classification images have revealed in greater detail how different types of noise affect observers' templates.

*σ*= 0.175°) or large (

*σ*= 0.8°). Classification images showed that the saccadic template was a Gaussian bump with a weak inhibitory surround that was approximately the optimal size in the small-bump condition but much too small in the large-bump condition. Thus, although the template that guides saccades is flexible, it seems to be sufficiently constrained that it cannot even correctly match circular patterns of various sizes.

*first*saccade had a strong influence on the location of the

*second*saccade, indicating that the last 100 ms was dead time only for the first saccade and that information continued to be gathered during this time to target the second saccade.

*What have we learned?*The linear observer model that underpins the most straightforward interpretation of classification images, which some researchers expressed misgivings about when classification images began to be used in visual psychophysics, has turned out not to put a strong limit on the usefulness of these methods. The linearity assumption can be tested, and it often turns out to be valid, at least over the very small range of stimuli used in a typical classification image experiment (Abbey & Eckstein, 2002a; Murray et al., 2005). Furthermore, departures from linearity are sometimes unimportant, as when we draw conclusions simply from the fact that a classification image shows strong correlations between a stimulus region of interest and the observer's responses. Finally, there are many ways of modifying the method to incorporate nonlinearities in visual processing, including the general-purpose Volterra and Wiener kernel frameworks and more specific modifications based on models of nonlinearities in visual processing.

*Where to from here?*One promising line for future work, already begun by some investigators, is to explore the relationship between classification images and mainstream statistical frameworks, such as the general linear model, the generalized linear model, and the generalized additive model. This work may lead to better ways of estimating classification images and new ways of broadening the linear observer model. It should also allow psychophysicists to draw on the findings of professional statisticians and reduce the need to develop ad hoc methods.

The experiment | |

m | number of signals and response alternatives |

k | signal number (1, …, m) |

g | stimulus |

s ^{ i } | signal |

n | external noise |

r | response number (1, …, m) |

The linear observer model | |

t ^{ i } | template |

w, w ^{ i } | template difference, t ^{ a } − t ^{ b } |

z, z ^{ i } | internal noise |

d, d ^{ i } | decision variable |

a, a ^{ i } | response criterion |

The classification image | |

c, c ^{ i } | classification image |

n ― ^{ ij } | sample average of external noise images over trials where the signal was s ^{ i } and the observer gave response r = j |

n ― ^{*j } | sample average of external noise images over trials where the observer gave response r = j |

General | |

E[x] | expected value of random variable x |

corr[x, y] | Pearson correlation between random variables x and y |

Φ(x, μ, σ) | normal cumulative distribution function |

v ^{ T }, A ^{ T } | transpose of vector v or matrix A |

*Simulation.*The simulation used four 5 × 9 pixel signals (Figure 3, first column). The background pixels had value 0.0, and the foreground pixels had value 1.0. On each trial, the stimulus was one of the four signals, chosen randomly, in Gaussian white noise with mean 0.0 and standard deviation 1.0. The model observer identified the stimulus by taking its dot product with four templates (Figure 3, second column) and choosing the template with the largest dot product. The model observer had no internal noise. (Further simulations with an internal-to-external noise ratio of 1.0 gave similar results, though with more sampling noise evident in the classification images.) The simulation ran for 10,000 trials. The model observer gave 72% correct responses and was unbiased.

*Logistic regression.*One set of classification images was calculated using multinomial logistic regression (Figure 3, third column). The dependent variable was the observer's response number (1 through 4). The covariates were the 45 noise pixel values and three dummy variables that encoded the signal number. Signal 1 was encoded as (0, 0, 0), signal 2 as (1, 0, 0), signal 3 as (0, 1, 0), and signal 4 as (0, 0, 1). Thus, the covariate vectors had 48 components. The regression coefficient vector for template 1 was taken as the reference, so each classification image shows an estimate of

**t**

^{ i }−

**t**

^{1}. The regression coefficients were estimated using the mnrfit multinomial logistic regression function in the MATLAB Statistics Toolbox (The MathWorks, Natick, MA).

*Correlation method.*A second set of classification images was calculated using Dai and Micheyl's (2010) correlation method (Figure 3, fourth column). Trials were divided into four groups according to which signal was shown. The classification image for each group was calculated as the pixelwise correlation between the 45 noise pixel values and the correctness of the model observer's responses, coded as correct = 1 and incorrect = 0.

*r*= 1 or

*r*= 2 (Duda et al., 2000, p. 120, Equation 106; Fisher, 1936).

**n**as covariates and to use dummy variables to encode the signals as levels of a factor, instead of using the pixels of the full stimulus

**g**=

**s**

^{ k }+

**n**as covariates (Knoblauch & Maloney, 2008; also see 2). One reason for doing so is that if the signals do not appear in the covariates, then any signal-like patterns in the classification image must reflect the observer's decision strategy. If the signals appear in the covariates, then signal-like patterns in the classification image could be artifacts of the estimation procedure (e.g., the result of bias).

*c*′ = −0.5 (Macmillan & Creelman, 2005)), incorporated the exact value of the criterion into the model used to make the GLM estimate, and used the weighted sum estimate that is appropriate for unbiased responses, thus giving a potential advantage to the GLM estimate; and (b) they held signal contrast constant as they varied internal noise strength, so increases in internal noise were confounded with decreases in performance. I have repeated their simulations with an unbiased model observer at a constant performance level, and I found very similar results.