We consider estimation and statistical hypothesis testing on classification images obtained from the two-alternative forced-choice experimental paradigm. We begin with a probabilistic model of task performance for simple forced-choice detection and discrimination tasks. Particular attention is paid to general linear filter models because these models lead to a direct interpretation of the classification image as an estimate of the filter weights. We then describe an estimation procedure for obtaining classification images from observer data. A number of statistical tests are presented for testing various hypotheses from classification images based on some more compact set of features derived from them. As an example of how the methods we describe can be used, we present a case study investigating detection of a Gaussian bump profile.

$T^{2}$ tests. The tests include departure from a hypothesized mean classification image, two tests for differences between classification images, and a test for a nonlinear observer response function. A simple case study is presented as an example of how these methods can be used to make inferences from classification image data.

$\mathbf{g}$. We use the convention of bold lowercase symbols to indicate vector quantities, bold uppercase to indicate matrix quantities, and nonbold symbols to indicate scalars. We denote the images corresponding to each alternative of a forced-choice trial as $\mathbf{g}^{+}$ and $\mathbf{g}^{-}$ for signal-present and signal-absent images, respectively. When necessary, we use an index, $j$, to denote the experimental trial. In this case, $\mathbf{g}^{+}_{j}$ denotes the signal-present image vector for the $j$th trial.

$\mathbf{b}$, is presumed to be identical in both alternatives. In many cases the background component is simply a uniform luminance that boosts the image to the middle of the display range. However, our formulation is general enough to allow for a background that varies from trial to trial. The noise component is presumed to be independent, and hence a different vector, in each of the two alternatives. The noise field is therefore denoted $\mathbf{n}^{+}$ for the signal-present image and $\mathbf{n}^{-}$ for the signal-absent image to indicate this dependence. Finally, the signal profile is denoted $\mathbf{s}$. This profile is added only to the target image. The analysis in this work is confined to the signal-known-exactly paradigm, and hence the signal vector is fixed throughout all trials. Note that for contrast-discrimination experiments the contrast pedestal is incorporated into $\mathbf{b}$, and hence $\mathbf{s}$ is actually the difference signal. The 2AFC images can be written mathematically as

$$\mathbf{g}^{+} = \mathbf{s} + \mathbf{b} + \mathbf{n}^{+}, \qquad \mathbf{g}^{-} = \mathbf{b} + \mathbf{n}^{-}. \tag{1}$$

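$\mathbf{b}$. However, we allow for a general noise covariance matrix, $\mathbf{K}_{n}$, requiring only that this matrix be known and nonsingular. The covariance matrix governs the noise-correlation structure in each image. If white noise is used, then $\mathbf{K}_{n} = \sigma^{2}\mathbf{I}$, where $\sigma^{2}$ is the pixel variance and $\mathbf{I}$ is the identity matrix. The pdf of the noise vectors is given by (Mardia, Kent, & Bibby, 1979)

$$p(\mathbf{n}) = \frac{1}{(2\pi)^{N/2}\,|\mathbf{K}_{n}|^{1/2}} \exp\left(-\tfrac{1}{2}\,\mathbf{n}^{t}\mathbf{K}_{n}^{-1}\mathbf{n}\right),$$

where $N$ is the number of pixels. We write $\mathbf{n} \sim \mathrm{MVN}(\mathbf{0}, \mathbf{K}_{n})$ to indicate that $\mathbf{n}$ is distributed according to this pdf.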
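The image model described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' stimulus code: it assumes white Gaussian noise, a one-dimensional 16-pixel "image", and a Gaussian bump signal, and the names `make_2afc_pair`, `noise_sd`, and the numeric values are hypothetical.

```python
import numpy as np

def make_2afc_pair(signal, background, noise_sd, rng):
    """Return (g_plus, g_minus) for one 2AFC trial with white Gaussian noise."""
    # Independent noise fields for the two alternatives.
    n_plus = rng.normal(0.0, noise_sd, size=signal.shape)
    n_minus = rng.normal(0.0, noise_sd, size=signal.shape)
    g_plus = signal + background + n_plus   # signal-present image
    g_minus = background + n_minus          # signal-absent image
    return g_plus, g_minus

rng = np.random.default_rng(0)
n_pixels = 16                                # assumed image size (illustrative)
pixels = np.arange(n_pixels)
signal = np.exp(-0.5 * ((pixels - 8) / 2.0) ** 2)   # assumed Gaussian bump profile
background = np.full(n_pixels, 0.5)                  # uniform background
g_plus, g_minus = make_2afc_pair(signal, background, noise_sd=0.2, rng=rng)
```

For a general covariance matrix $\mathbf{K}_{n}$, the two `rng.normal` calls could be replaced by draws from `rng.multivariate_normal` with zero mean and the chosen covariance.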

$\lambda = w(\mathbf{g})$. The responses to the signal-present and signal-absent images are defined by $\lambda^{+} = w(\mathbf{g}^{+})$ and $\lambda^{-} = w(\mathbf{g}^{-})$, respectively. Human observers will often give different decisions from the same set of images in repeated trials, a characteristic of internal noise in the observer (Pelli, 1981; Burgess & Colborne, 1988). Internal noise is incorporated into the internal response by allowing random components in $w(\mathbf{g})$.

$\mathbf{w}$ is the set of weights used to create the response variable. As such, $\mathbf{w}$ (often called an observer template or filter) represents the summation strategy used by the observer to perform the task. The $\varepsilon^{+}$ and $\varepsilon^{-}$ terms on the right-hand side of Equation 2 are scalar internal noise components. These components are presumed to be independent, zero-mean Gaussian random variables. We will specify the variance of $\varepsilon^{+}$ and $\varepsilon^{-}$ to be $\sigma^{2}_{\varepsilon}$. The value of $\sigma^{2}_{\varepsilon}$ is not presumed to be known, nor is it necessary for computing a classification image. Even though the internal noise term is specified as a scalar random variable, it is general enough to include noise from multiple independent sources. If we adopt the approach of equating internal noise in the observer with an equivalent noise source in the stimulus (Ahumada, 1987), then the internal noise component is defined by $\varepsilon = \mathbf{w}^{t}\mathbf{n}_{\mathrm{Eqv}}$, where $\mathbf{n}_{\mathrm{Eqv}}$ is a vector of equivalent noise in the stimulus domain. In this case, the variance of the internal noise component is given by $\sigma^{2}_{\varepsilon} = \mathbf{w}^{t}\mathbf{K}_{\mathrm{Eqv}}\mathbf{w}$, where $\mathbf{K}_{\mathrm{Eqv}}$ is the covariance matrix associated with the equivalent noise.

$o$, for a given trial as one if the observer correctly identifies the signal-present image and zero if the observer makes an incorrect choice. The score is defined in terms of the internal responses by

$$o = \mathrm{step}(\lambda^{+} - \lambda^{-}), \tag{3}$$

where the step function is defined as one for arguments greater than zero and zero for arguments less than zero. We will assume continuous distributions for the internal responses, and hence the probability of a tie ($\lambda^{+} = \lambda^{-}$) can be neglected. In terms of the linear response model given in Equation 2, and the image generating equations in Equation 1, the trial score is defined as

$$o = \mathrm{step}\left(\mathbf{w}^{t}(\mathbf{s} + \Delta\mathbf{n}) + \Delta\varepsilon\right), \tag{4}$$

where $\Delta\mathbf{n} = \mathbf{n}^{+} - \mathbf{n}^{-}$ is the vector difference between the noise fields, and $\Delta\varepsilon = \varepsilon^{+} - \varepsilon^{-}$ is the difference between internal noise components. Given the Gaussian assumptions we have made on $\mathbf{n}^{+}$ and $\mathbf{n}^{-}$, the difference is $\Delta\mathbf{n} \sim \mathrm{MVN}(\mathbf{0}, 2\mathbf{K}_{n})$. For independent Gaussian-distributed internal-noise components, $\Delta\varepsilon \sim \mathrm{N}(0, 2\sigma^{2}_{\varepsilon})$.
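The trial score for a linear observer can be simulated directly from the quantities just described: the decision variable is the template applied to the signal plus the noise-field difference, plus an internal-noise difference, and the score is one when that variable is positive. The sketch below is illustrative only. It assumes white external noise, a template matched to the signal, and hypothetical names and values (`trial_scores`, `sigma_eps`, the trial count).

```python
import numpy as np

def trial_scores(w, s, delta_n, sigma_eps, rng):
    """Score each trial as 1 if w^t (s + delta_n) + delta_eps > 0, else 0."""
    # The difference of two independent N(0, sigma_eps^2) terms has variance 2 * sigma_eps^2.
    delta_eps = rng.normal(0.0, np.sqrt(2.0) * sigma_eps, size=delta_n.shape[0])
    decision = (delta_n + s) @ w + delta_eps
    return (decision > 0).astype(int)

rng = np.random.default_rng(1)
n_trials, n_pixels, sigma_n = 5000, 16, 1.0          # assumed experiment size
s = 0.5 * np.exp(-0.5 * ((np.arange(n_pixels) - 8) / 2.0) ** 2)
w = s.copy()                                          # assumption: template equals the signal
# delta_n = n_plus - n_minus has covariance 2 * K_n; for white noise that is 2 * sigma_n^2 * I.
delta_n = rng.normal(0.0, np.sqrt(2.0) * sigma_n, size=(n_trials, n_pixels))
scores = trial_scores(w, s, delta_n, sigma_eps=0.5, rng=rng)
p_c_hat = scores.mean()                               # observed proportion correct
```

Note that the background never appears in the code, which mirrors the cancellation of the background term in the linear model.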

**b**, cancels out of the expression. Hence the mean background does not directly influence the trial score in the linear model. However, this does not imply that the background is irrelevant because the observer may accommodate the background indirectly by modifying the template, or the background may influence the magnitude of the internal noise.

$P_{C}$ (Green & Swets, 1966). The proportion correct is equivalent to the ensemble mean score,

$$P_{C} = \langle o \rangle, \tag{5}$$

where the angled brackets, $\langle\ldots\rangle$, indicate a mathematical expectation of the enclosed quantity. In this case, expectation is taken with respect to random variability in the images as well as random variability due to observer internal noise. Equation 5 forms the basis for analysis of forced-choice data with human observers. With human observers, the internal response variables are not observable. But the score in each trial of the experiment can be observed, allowing the proportion correct to be estimated as the observed proportion of correct responses,

$$\hat{P}_{C} = \frac{1}{N_{T}}\sum_{j=1}^{N_{T}} o_{j}, \tag{6}$$

where $o_{j}$ is the score in the $j$th trial, and $N_{T}$ is the total number of trials in the experiment. As a sample average, it is well known that $\hat{P}_{C}$ is an unbiased estimate of the ensemble mean in Equation 5 (Dudewicz & Mishra, 1988).

$d'$ and $P_{C}$ are directly related to one another by $P_{C} = \Phi(d'/\sqrt{2})$ (Green & Swets, 1966), where $\Phi(\ldots)$ is the standard cumulative normal distribution function defined as

$$\Phi(x) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x} e^{-t^{2}/2}\,dt.$$
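The relation between proportion correct and detectability quoted above is easy to evaluate numerically. The following sketch uses only the Python standard library; the function names are hypothetical and chosen for illustration.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1

def pc_from_dprime(d_prime):
    """Proportion correct in 2AFC for a given detectability index."""
    return _STD_NORMAL.cdf(d_prime / math.sqrt(2.0))

def dprime_from_pc(p_c):
    """Invert the relation: detectability index for a given proportion correct."""
    return math.sqrt(2.0) * _STD_NORMAL.inv_cdf(p_c)

# Example: 85% correct corresponds to a d' of roughly 1.47.
d_prime_85 = dprime_from_pc(0.85)
```

The inverse is undefined at exactly 0 or 1, so an observed proportion correct at either extreme needs separate handling before conversion.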

$\mathbf{n}$) happens to look like the observer template ($\mathbf{w}$). Looking at Equation 4, we would then expect $\mathbf{w}^{t}\Delta\mathbf{n}$ to take on a large positive value, leading to a high probability of a correct response. We might then imagine that when the observer gets the trial correct, it is because the noise-field difference in the trial (on average) looks something like the observer’s template. Conversely, if the noise-field difference looks like the negative of the observer template, then we would expect $\mathbf{w}^{t}\Delta\mathbf{n}$ to take on a large negative value, leading to a high probability of an incorrect response. We might then surmise that when the observer gets the trial incorrect, it is because the noise-field difference in the trial tends to look something like the negative of the observer’s template. In this case, the negative of $\Delta\mathbf{n}$ would tend to look like $\mathbf{w}$. This heuristic suggests weighting $\Delta\mathbf{n}$ by a positive value when the observer gets the trial correct and a negative value when the observer gets the trial incorrect, and then averaging the results.

$o_{j} - a$, where $a$ is some constant between zero and one. When the observer makes a correct decision ($o_{j} = 1$), the weight assumes a positive value, and when the observer makes an incorrect decision ($o_{j} = 0$), the weight assumes a negative value as alluded to above. In previous works (Abbey et al., 1999; Abbey & Eckstein, 2000, 2001a), we have used a weighting scheme in which $a = 1/2$. However, it can be shown that letting $a = P_{C}$ minimizes the covariance matrix of the estimated classification images; in particular, it minimizes the variance of each element in the classification image. Because we do not have access to the ensemble proportion correct, we propose setting the constant to $a = \hat{P}_{C}$, the estimated proportion correct defined in Equation 6. Using this weighting scheme, we can define a score-weighted difference in noise fields as

$$\Delta\mathbf{q}_{j} = \frac{N_{T}}{N_{T} - 1}\,(o_{j} - \hat{P}_{C})\,\mathbf{K}_{n}^{-1}\Delta\mathbf{n}_{j}. \tag{9}$$

The $N_{T}/(N_{T} - 1)$ factor will be seen below (Equation 11) to be convenient for removing dependencies on the number of experimental trials from the expected value of $\Delta\mathbf{q}_{j}$. This factor is negligibly different from one in most cases because the number of trials is typically quite large in classification-image experiments.

Because $\hat{P}_{C}$ is defined over all the experimental trials, it introduces the possibility of trial-to-trial correlations among the vectors $\Delta\mathbf{q}_{j}$. However, the magnitude of these correlations can be shown to be of order $1/N_{T}^{2}$. Typically, more than 1,000 trials are used in a classification image experiment, and hence sequential correlations can be neglected for practical purposes.

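The $\mathbf{K}_{n}^{-1}$ term in Equation 9 accommodates pixel-to-pixel noise correlations. However, in the case of white noise where $\mathbf{K}_{n} = \sigma^{2}_{n}\mathbf{I}$ ($\mathbf{I}$ is the identity matrix and $\sigma^{2}_{n}$ is the pixel variance), the formula simplifies to

$$\Delta\mathbf{q}_{j} = \frac{N_{T}}{N_{T} - 1}\,(o_{j} - \hat{P}_{C})\,\frac{\Delta\mathbf{n}_{j}}{\sigma^{2}_{n}}. \tag{10}$$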

$\mathbf{q}$ becomes clearer when we assume the linear observer model of Equation 4 and compute the expectation of Equation 9. We denote this expectation by $\langle\Delta\mathbf{q}\rangle_{\Delta\mathbf{n},\Delta\varepsilon}$, where the subscripts emphasize that the expectation encompasses both the external-noise variability in $\Delta\mathbf{n}$ and internal-noise variability in $\Delta\varepsilon$. We will not derive the expectation here because the derivation is lengthy and has been published previously in a simpler form (Abbey & Eckstein, 2001b). We will simply state the value of the expectation as Equation 11, in which $d'$ is the detectability index of Equation 8. The expected value is equivalent to the observer template up to a positive scalar factor. Because the magnitude of the observer template is somewhat arbitrary (scaling the template and the internal noise component yields an equivalent detection strategy), obtaining the observer template with a normalized magnitude is an acceptable stand-in for $\mathbf{w}$. More importantly, we see below that working with a normalized version of the observer template does not hinder our ability to perform statistical inference.

$j = 1, \ldots, N_{T}$ where $N_{T}$ is the number of trials. The classification image estimate is then

$$\Delta\hat{\mathbf{q}} = \frac{1}{N_{T}}\sum_{j=1}^{N_{T}} \Delta\mathbf{q}_{j}. \tag{12}$$

$\mathbf{R}$ can be thought of as reducing the classification image to a set of linear features (specific pixels, spatial averages, etc.) of interest, and hence $\mathbf{R}$ is an $N_{y} \times N$ matrix where $N$ is the number of pixels in the stimulus and $N_{y}$ is the number of features in $\Delta\mathbf{y}$. Although it will generally be the case that $N_{y}$ will be much smaller than $N$, it is still possible to consider the case where $\mathbf{R}$ is the identity matrix. In this case, $\Delta\mathbf{y}_{j} = \Delta\mathbf{q}_{j}$ and $N_{y} = N$.
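The estimation steps described above can be sketched end to end for the white-noise case. This is a simulation under stated assumptions rather than the authors' implementation: it takes the per-trial quantity to be the noise-field difference weighted by the score minus the estimated proportion correct, scaled by $N_{T}/(N_{T}-1)$ and divided by the pixel variance, and it takes the classification image to be the sample mean of those vectors. The simulated linear observer, the pair-averaging feature matrix `R`, and all names and numeric values are hypothetical.

```python
import numpy as np

def classification_image(delta_n, scores, sigma_n):
    """Mean of score-weighted noise-field differences (white-noise case)."""
    n_trials = len(scores)
    p_c_hat = scores.mean()
    weights = (n_trials / (n_trials - 1)) * (scores - p_c_hat)
    delta_q = weights[:, None] * delta_n / sigma_n**2   # one row per trial
    return delta_q.mean(axis=0), delta_q

# Simulate a linear observer so the estimate can be compared with a known template.
rng = np.random.default_rng(2)
n_trials, n_pixels, sigma_n, sigma_eps = 20000, 16, 1.0, 0.5
w = np.exp(-0.5 * ((np.arange(n_pixels) - 8) / 2.0) ** 2)   # assumed template
s = 0.5 * w                                                  # assumed signal
delta_n = rng.normal(0.0, np.sqrt(2.0) * sigma_n, size=(n_trials, n_pixels))
delta_eps = rng.normal(0.0, np.sqrt(2.0) * sigma_eps, size=n_trials)
scores = ((delta_n + s) @ w + delta_eps > 0).astype(float)

q_hat, delta_q = classification_image(delta_n, scores, sigma_n)

# Feature reduction: each row of R averages one adjacent pixel pair (N_y = 8, N = 16).
R = np.kron(np.eye(n_pixels // 2), np.full((1, 2), 0.5))
delta_y = delta_q @ R.T          # per-trial feature vectors
template_corr = np.corrcoef(q_hat, w)[0, 1]
```

With this many simulated trials the estimate should be closely proportional to the template, which is the behavior the expectation in the text predicts for a linear observer.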

$T^{2}$ tests.

$T^{2}$ distribution is closely tied to the more commonly found $F$ distribution, and this relation is useful for obtaining significance levels from tables. If $T^{2}$ has a Hotelling’s $T^{2}_{P,M}$ distribution where $P$ and $M$ are the two degrees of freedom associated with the distribution, then $T^{2} \times (M - P + 1)/(MP)$ has an $F_{P,\,M-P+1}$ distribution. Hence we can take any of the Hotelling’s $T^{2}$ tests derived below, multiply the test statistic by $(M - P + 1)/(MP)$, and then look up critical values or $p$ values for the test from published tables (e.g., Mardia et al., 1979). Many programming environments supply procedures to compute these values as well.
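The conversion from a Hotelling statistic to an $F$ statistic is a one-line rescaling, sketched below in plain Python. The function name and the example value of the statistic are hypothetical; the degrees of freedom in the example assume a one-sample test with 18 features and 2,000 trials, so that $M = N_{T} - 1$.

```python
def hotelling_to_f(t2, p, m):
    """Rescale a Hotelling T^2_{P,M} statistic to an F statistic.

    Returns (f_stat, numerator_df, denominator_df).
    """
    denom_df = m - p + 1
    if denom_df <= 0:
        raise ValueError("need M - P + 1 > 0 for the F distribution to be defined")
    f_stat = t2 * denom_df / (m * p)
    return f_stat, p, denom_df

# Illustrative value only: t2 = 40.0 is not taken from any experiment.
f_stat, dfn, dfd = hotelling_to_f(t2=40.0, p=18, m=1999)
```

A $p$ value can then be read from $F$ tables or, if SciPy is available, from the survival function `scipy.stats.f.sf(f_stat, dfn, dfd)`.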

$T^{2}$ distribution to be defined, we must have that $N_{T} > N_{y}$. This is equivalent to requiring that the sample covariance be of full rank. Here we see the advantage of working with a reduced set of classification image features. For the full classification image, $N_{y}$ is equal to the number of pixels in each image stimulus. In the case of 64 by 64 pixel images we have 4,096 free parameters that require at least 4,097 trials in order to perform the statistical test.

$N_{T1}$ trials and the second have $N_{T2}$ trials. We will denote the sample means and covariance matrices of the two classification images by $\overline{\Delta\mathbf{y}}_{1}$, $\mathbf{S}_{\Delta y,1}$, $\overline{\Delta\mathbf{y}}_{2}$, and $\mathbf{S}_{\Delta y,2}$, respectively. For testing the null hypothesis of a common mean, we can use Hotelling’s two-sample test statistic, where $T^{2}$ has a Hotelling’s $T^{2}$ distribution.
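As a rough illustration of the two-sample comparison, the sketch below computes the standard textbook two-sample Hotelling statistic with a pooled covariance matrix. It is an assumption that this matches the exact form used in the text, since the test-statistic equation is not reproduced here; the function name and the synthetic data are hypothetical.

```python
import numpy as np

def hotelling_two_sample(y1, y2):
    """Textbook two-sample Hotelling T^2 with pooled covariance (rows are trials).

    Returns (t2, P, M), where T^2 is referred to a T^2_{P,M} distribution
    with P features and M = N1 + N2 - 2. Assumes at least two features.
    """
    n1, n2 = len(y1), len(y2)
    diff = y1.mean(axis=0) - y2.mean(axis=0)
    s_pooled = ((n1 - 1) * np.cov(y1, rowvar=False)
                + (n2 - 1) * np.cov(y2, rowvar=False)) / (n1 + n2 - 2)
    # Solve S x = diff instead of forming the inverse explicitly.
    t2 = (n1 * n2 / (n1 + n2)) * diff @ np.linalg.solve(s_pooled, diff)
    return t2, y1.shape[1], n1 + n2 - 2

# Synthetic feature vectors standing in for two observers' per-trial data.
rng = np.random.default_rng(3)
y_a = rng.normal(size=(500, 4))
y_b = rng.normal(size=(400, 4))
t2_null, p, m = hotelling_two_sample(y_a, y_b)          # common mean
t2_shift, _, _ = hotelling_two_sample(y_a, y_b + 1.0)   # shifted mean
```

The pooled covariance assumes both samples share a covariance matrix, which is worth checking before relying on the resulting significance level.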

$\mathbf{y}$ vectors. Let us define

$$\Delta^{2}\mathbf{y}_{j} = \Delta\mathbf{y}_{1,j} - \Delta\mathbf{y}_{2,j},$$

where $\Delta\mathbf{y}_{1,j}$ and $\Delta\mathbf{y}_{2,j}$ are the individual trial $\Delta\mathbf{y}$ vectors for the first and second observer. The test for differences between the two observers is now defined as a one-sample test for a significant departure from zero in $\Delta^{2}\mathbf{y}_{j}$. In this case, the test statistic is defined in terms of the sample mean of the $\Delta^{2}\mathbf{y}_{j}$ vectors and their sample covariance matrix. Under the null hypothesis of a zero mean difference, $T^{2}$ has a Hotelling’s $T^{2}$ distribution.
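The paired comparison reduces to a one-sample test on per-trial differences, which can be sketched as follows. The statistic shown is the standard one-sample Hotelling form (trial count times the quadratic form of the sample mean in the inverse sample covariance); treating it as the form intended in the text is an assumption, and the function name and synthetic data are hypothetical.

```python
import numpy as np

def hotelling_one_sample(d, mean0=None):
    """Textbook one-sample Hotelling T^2 for rows of d against mean0 (default zero).

    Returns (t2, P, M), where T^2 is referred to a T^2_{P,M} distribution
    with P features and M = N - 1. Assumes at least two features.
    """
    n, p = d.shape
    if n <= p:
        raise ValueError("need more trials than features for a full-rank covariance")
    if mean0 is None:
        mean0 = np.zeros(p)
    dev = d.mean(axis=0) - mean0
    s = np.cov(d, rowvar=False)                 # sample covariance of the rows
    t2 = n * dev @ np.linalg.solve(s, dev)
    return t2, p, n - 1

# Synthetic per-trial feature vectors for two observers viewing the same trials.
rng = np.random.default_rng(4)
y_obs1 = rng.normal(size=(2000, 6))
y_obs2 = rng.normal(size=(2000, 6))
delta2_y = y_obs1 - y_obs2                      # paired per-trial differences
t2, p, m = hotelling_one_sample(delta2_y)
```

The same function also covers a test against a hypothesized mean classification image by passing that mean as `mean0`.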

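$\mathbf{q}_{j}$ into two components arising from the signal-present noise field ($\mathbf{n}^{+}_{j}$) and the signal-absent noise field ($\mathbf{n}^{-}_{j}$). Let us define $\mathbf{q}^{+}_{j}$ and $\mathbf{q}^{-}_{j}$ as the score-weighted components built from $\mathbf{n}^{+}_{j}$ and $\mathbf{n}^{-}_{j}$, respectively, from which it can be seen that $\Delta\mathbf{q}_{j} = \mathbf{q}^{+}_{j} - \mathbf{q}^{-}_{j}$. Under the linear observer response function of Equation 2, the mathematical expectations of $\mathbf{q}^{+}_{j}$ and $\mathbf{q}^{-}_{j}$ are given by

$$\langle\mathbf{q}^{+}_{j}\rangle = \tfrac{1}{2}\langle\Delta\mathbf{q}\rangle, \qquad \langle\mathbf{q}^{-}_{j}\rangle = -\tfrac{1}{2}\langle\Delta\mathbf{q}\rangle.$$

$\mathbf{q}^{+}_{j}$ and $\mathbf{q}^{-}_{j}$ sum to zero.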

$\mathbf{y}^{+}_{j}$ vectors and the $\mathbf{y}^{-}_{j}$ vectors, respectively, together with the associated sample covariance matrix. Under the null hypothesis of a linear observer response function, $T^{2}$ has a Hotelling’s $T^{2}$ distribution.

^{2}. This signal contrast was determined from psychometric function data to give an average human observer performance of approximately 85% correct. The noise contrast (measured as the luminance standard deviation divided by the mean luminance) was fixed at 15%.

$N_{y} = 18$), and all 2,000 trials were used to compute the test statistics ($N_{T} = 2000$). The test is significant at the 1% level for both observers (D.V., $p < .0025$; C.H., $p < .005$) even when Bonferroni-corrected for multiple comparisons (Altman, 1999) across the two observers; therefore, we can conclude that the classification images of both observers depart significantly from the signal profile.

*p*> .36). It should be noted that because both observer templates are subject to estimation error, the resulting hypothesis test is generally less powerful than a test of one observer against a known classification-image profile. It seems reasonable to suppose that at some point, if we collected enough trials, we would find observer differences. Nonetheless, the fact that the templates are not significantly different after 2,000 trials does imply some degree of consistency between the two subjects.

$p > .13$), whereas subject C.H. did show a significant effect ($p < .00075$). It is possible that a significant effect for subject D.V. would have been found had a more restrictive range of visual angle been used.

$B(p, N)$ indicates the binomial probability function

$N$ be greater than one.

$p$ in Equation 23, often referred to as the link function, is based on the assumption of independent Gaussian distributions for each internal noise component. From the binomial distribution, it is possible to derive the likelihood of the observer scores given a specific choice of the observer template $\mathbf{w}$. The maximum-likelihood (ML) estimate of the classification image is then found by optimizing the likelihood function.

$\mathbf{w}$ than there are observed trials. In this case, there will not be a unique maximum of the likelihood function and hence no unique ML estimate. This problem can be reduced by using some sort of regularizing function (Abbey & Eckstein, 2001a), but it is not clear at this stage how the choice of a regularizer will influence the resulting estimates.

^{2}and generally have more statistical power. The analytic approximation was derived for a somewhat different (and less efficient) estimate of the classification image. It remains to be seen if the approximation will still be good for the estimate defined in Equation 12.

^{1}We take the definition of a 2AFC experiment (Green & Swets, 1966) as an experiment in which two stimuli are shown in a given trial, and the observer is asked to identify the stimulus that contained the target of interest. The term is sometimes used to describe experiments in which a single stimulus is shown, and the observer is asked to identify one of two target profiles as being present in the image (sometimes referred to as two-alternative forced-response experiments). However, these latter experiments are more closely related to “yes-no” tasks, and methods for estimating classification images for them fit directly into the methodology developed by Ahumada and coworkers (Ahumada & Lovell, 1971; Ahumada et al., 1975; Ahumada, 1996).

^{2}We use the term noise limited to designate visual tasks in which independent trial-to-trial stimulus variability between the two alternatives limits observer performance. A noise-limited task would yield a much higher level of performance if the external noise were removed from the stimuli. In contrast, contrast-limited tasks result in imperfect performance even in the absence of any external image noise. Additionally, background-limited tasks are limited by masking induced by variability in a background component that is common to the two alternatives (sometimes referred to as twin noise studies [Burgess & Colborne, 1988; Ahumada & Beard, 1997; Eckstein et al., 1997]).