Conventional approaches to modeling classification image data can be described in terms of a standard linear model (LM). We show how the problem can be characterized instead as a Generalized Linear Model (GLM) with a Bernoulli distribution. We demonstrate via simulation that this approach is more accurate in estimating the underlying template in the absence of internal noise. With increasing internal noise, however, the advantage of the GLM over the LM decreases until the GLM is no more accurate than the LM. We then introduce the Generalized Additive Model (GAM), an extension of the GLM that can be used to estimate smooth classification images adaptively. We show that this approach is more robust to the presence of internal noise, and finally, we demonstrate that the GAM is readily adapted to the estimation of higher order (nonlinear) classification images and to testing their significance.

Let *s*(*x*) denote a signal along a physical continuum *x*. The physical continuum can be unidimensional (e.g., time) or multidimensional (e.g., locations in an image), and we sample the signal at locations *x*_j, *j* = 1, …, *p*. The resulting samples *s*(*x*_j), *j* = 1, …, *p*, form a vector of length *p* that we denote s(*j*), *j* = 1, …, *p*, for convenience.

The noise ɛ is a *p*-dimensional vector of identically distributed random variables drawn from a distribution with mean 0 and a *p* × *p* covariance matrix C. The noise values ɛ are typically drawn from a Gaussian or uniform distribution and are typically independent, so that the covariance matrix C = *σ*²I is diagonal with diagonal entries equal to the variance *σ*² of the noise.

The linear observer is characterized by a template T(*x*) defined on the physical continuum *x*. The sampled values of the template T(*x*_j), *j* = 1, …, *p*, form a vector denoted T(*j*), *j* = 1, …, *p*. The output, *D*, of the linear mechanism is determined by the inner product between the template vector and the stimulus vector; we refer to *D* as the *decision variable*. In the simple case in which there are only two possible signals, *a* and *b*, the observer's decision rule on a given trial might be to choose *a* if *D* exceeds a criterion *c*, and otherwise choose *b*. In effect, the inner product of the template with the stimulus reduces the decision problem to one dimension. If the noise is independent, identically distributed Gaussian, the decision rule that results in the highest expected proportion of correct responses is of the form of Equation 3, as shown in Duda, Hart, and Stork (2000, Chap. 2).

In a yes–no task, the observer in effect associates the response "Yes" with the signal *a* and the response "No" with the signal *b* (Green & Swets, 1966; Macmillan & Creelman, 2005). There are only four possible types of response, referred to as *Hit* (H), when the signal is *a* and the response is *a*; *False Alarm* (FA), when the signal is *b* and the response is *a*; *Miss* (M), when the signal is *a* and the response is *b*; and *Correct Rejection* (CR), when the signal is *b* and the response is *b*. The decision rule in Equation 3 leads to a method of estimating the classification image by averaging the trial-by-trial noise profiles for each type of response, giving four mean noise images ɛ̄^H, ɛ̄^M, ɛ̄^FA, ɛ̄^CR, where the *p*-vector ɛ̄^H is the average of the noise profiles presented on Hit trials, the *p*-vector ɛ̄^FA the average for False Alarm trials, etc. (Ahumada, 2002). We then combine these mean images to form two classification images, one for stimulus present, Ē_p = ɛ̄^H − ɛ̄^M, and one for stimulus absent, Ē_a = ɛ̄^FA − ɛ̄^CR, and add these two images to get an estimated classification image based on all of the data (Equation 4). The images Ē_p and Ē_a should have the same expectation (that is, differences between them are due only to random variation). However, as we shall see below, it is useful to analyze the images Ē_p and Ē_a separately before combining them (Ahumada, 2002; Solomon, 2002a, 2002b; Thomas & Knoblauch, 2005).
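The averaging recipe of Equation 4 is easy to state in code. The following Python sketch (our own illustration; the article's analyses are in R, and all names here are ours) simulates a linear observer in external noise and forms the classification image from the four mean noise images:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 32, 20000
template = np.sin(np.linspace(0, np.pi, p))        # assumed template T
template /= np.linalg.norm(template)
signal = 0.5 * template                            # signal "a"; signal "b" is blank

noise = rng.normal(0.0, 0.3, size=(n, p))          # external noise on each trial
present = rng.random(n) < 0.5                      # half the trials contain the signal
stim = noise + np.where(present[:, None], signal, 0.0)

# Linear observer: respond "a" when the decision variable exceeds a criterion.
D = stim @ template
resp = D > 0.25

# Mean noise image for each response class: Hit, Miss, False Alarm, Correct Rejection.
eH = noise[present & resp].mean(axis=0)
eM = noise[present & ~resp].mean(axis=0)
eFA = noise[~present & resp].mean(axis=0)
eCR = noise[~present & ~resp].mean(axis=0)

E_p = eH - eM        # stimulus-present classification image
E_a = eFA - eCR      # stimulus-absent classification image
E = E_p + E_a        # combined estimate, as in Equation 4

# The estimate should correlate strongly with the generating template.
r = np.corrcoef(E, template)[0, 1]
print(round(r, 2))
```

With enough trials, the combined image is proportional to the template plus averaging noise, which is why the correlation is close to 1.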

Each term in Equation 4 is a *p*-vector, as is the resulting classification image. If the experiment consists of *n* trials, we obtain the terms of Equation 4 from the solution to Equation 5, in which the response is the *np*-vector of concatenated noise vectors ɛ_i = [*ɛ*_{i1}, …, *ɛ*_{ip}]′, *i* = 1, …, *n*, presented on successive trials, X is an *np* × 4*p* incidence matrix indicating which of the four outcomes (H, FA, M, CR) occurred on each trial, and β is a 4*p*-vector that represents the elements of the 4 components of Equation 4. The matrix X is a block matrix consisting of *n* × 4 submatrices, with one row of submatrices for each trial, each of which is either a *p* × *p* zero matrix 0_{p × p} or a *p* × *p* identity matrix I_{p × p}. In the example block matrix X of Equation 6, the identity matrix on each trial begins in column 1, *p* + 1, 2*p* + 1, or 3*p* + 1 of X depending on whether the trial is a Hit, False Alarm, Miss, or Correct Rejection, respectively. In Equation 6, for example, the first trial resulted in a Miss, and the identity matrix I_{p × p} is the third submatrix, beginning in column 2*p* + 1 of X. The least-squares solution of Equation 5 is a 4*p*-vector composed of β^H = [*β*_1^H, …, *β*_p^H]′, β^FA = [*β*_1^FA, …, *β*_p^FA]′, β^M = [*β*_1^M, …, *β*_p^M]′, and β^CR = [*β*_1^CR, …, *β*_p^CR]′, and the least-squares solutions for these four vectors are the corresponding mean vectors ɛ̄^H, ɛ̄^M, ɛ̄^FA, ɛ̄^CR in Equation 4, from which the classification image estimate follows.
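The equivalence between the least-squares solution of Equation 5 and the four class means can be verified numerically. The following Python sketch (our own construction; the article works in R) builds a small incidence matrix of the kind described above and checks the claim:

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 4, 12                                   # small example: 4 time samples, 12 trials
classes = ["H", "FA", "M", "CR"]               # column-band order of the incidence matrix
outcome = np.tile(np.arange(4), 3)             # three trials of each outcome
noise = rng.normal(size=(n, p))                # external noise vector on each trial

# Block incidence matrix X (np x 4p): one band of four p x p blocks per trial,
# with the identity block placed according to that trial's outcome.
X = np.zeros((n * p, 4 * p))
for i, k in enumerate(outcome):
    X[i * p:(i + 1) * p, k * p:(k + 1) * p] = np.eye(p)

eps = noise.reshape(-1)                        # concatenated noise vectors
beta, *_ = np.linalg.lstsq(X, eps, rcond=None)

# The least-squares solution recovers the mean noise vector of each response class.
for k, name in enumerate(classes):
    assert np.allclose(beta[k * p:(k + 1) * p], noise[outcome == k].mean(axis=0))
print("least-squares solution equals the four class means")
```

Because the blocks of X are orthogonal across response classes, the normal equations decouple into four separate averaging problems, which is exactly the structure Equation 4 exploits.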

Alternatively, the template can be estimated by maximizing the likelihood *L*, defined as a function of T, the template coefficients, and *σ*, a scale factor. Here ɛ_i is the added noise profile on the *i*th trial, Φ is the cumulative Gaussian distribution function with mean 0 and variance 1, and *R*_i is an indicator variable taking on the value 1 or 0 according as the response on the *i*th trial is that the signal is *a* or *b*, respectively. The model is identifiable with *p* parameters if one constraint is placed on the vector (T, *σ*), and with this parameterization the template is in the same units as the signal detection parameter *d*′ (Green & Swets, 1966; Macmillan & Creelman, 2005).
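For a concrete picture of the maximum likelihood approach, here is a self-contained Python sketch (not the article's R code; all names are ours). It simulates a probit observer with *σ* fixed at 1 and recovers the template with iteratively reweighted least squares (IRLS), the standard GLM fitting algorithm:

```python
import numpy as np
from math import erf

rng = np.random.default_rng(2)
p, n = 8, 4000
T = np.cos(np.linspace(0, np.pi, p))
T /= np.linalg.norm(T)                             # assumed template, unit norm

noise = 0.5 * rng.normal(size=(n, p))              # external noise covariates
present = rng.random(n) < 0.5
eta_true = noise @ T + 1.0 * present - 0.5         # decision variable
resp = (eta_true + rng.normal(size=n)) > 0         # internal noise sd 1 -> probit responses

# Design matrix: intercept, signal indicator, and the p noise covariates.
X = np.column_stack([np.ones(n), present.astype(float), noise])

def Phi(v):
    return np.array([0.5 * (1.0 + erf(x / np.sqrt(2.0))) for x in v])

def phi(v):
    return np.exp(-0.5 * v**2) / np.sqrt(2.0 * np.pi)

# IRLS for the probit GLM: repeatedly solve a weighted least-squares problem.
beta = np.zeros(X.shape[1])
for _ in range(25):
    eta = X @ beta
    mu = np.clip(Phi(eta), 1e-10, 1 - 1e-10)       # fitted response probabilities
    d = phi(eta)                                   # d mu / d eta for the probit link
    w = d**2 / (mu * (1 - mu))                     # IRLS weights
    z = eta + (resp - mu) / d                      # working response
    beta = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * z))

T_hat = beta[2:]                                   # estimated template coefficients
r = np.corrcoef(T_hat, T)[0, 1]
print(round(r, 2))
```

The noise-covariate coefficients play the role of the template, and because *σ* is fixed at 1 they are on the same scale as *d*′.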

In a GLM, a linear predictor *η* is related through a link function to the expected value of a response variable *Y* whose distribution is a member of the *exponential family* (not to be confused with the exponential distribution), which consists of distributions whose probability density functions can be written in the form f(*y*; *θ*) = exp(*a*(*y*)*b*(*θ*) + *c*(*θ*) + *d*(*y*)) for known functions *a*, *b*, *c*, and *d* (McCullagh & Nelder, 1989).

A Bernoulli random variable takes the value 1 with probability *θ* ∈ (0, 1) and is otherwise 0. The Bernoulli probability for observing *y* ∈ {0, 1} is *θ*^y(1 − *θ*)^{1−y} = exp(*y* log(*θ*/(1 − *θ*)) + log(1 − *θ*)), so that *a*(*y*) = *y*, *b*(*θ*) = log(*θ*/(1 − *θ*)), *c*(*θ*) = log(1 − *θ*) (or *n* log(1 − *θ*) for *n* independent trials), and *d*(*y*) = 0, demonstrating that the Bernoulli is a member of the exponential family. Note that *θ* is the expected value *E*[Y] of the response variable, Y.
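The decomposition can be checked numerically:

```python
import math

# Check the exponential-family form of the Bernoulli pmf:
# P(y) = theta^y (1 - theta)^(1 - y) = exp(a(y) b(theta) + c(theta) + d(y))
# with a(y) = y, b(theta) = log(theta / (1 - theta)), c(theta) = log(1 - theta), d(y) = 0.
for theta in (0.2, 0.5, 0.9):
    for y in (0, 1):
        pmf = theta**y * (1 - theta) ** (1 - y)
        expfam = math.exp(y * math.log(theta / (1 - theta)) + math.log(1 - theta))
        assert math.isclose(pmf, expfam)
print("Bernoulli matches its exponential-family form")
```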

The link function maps the expected value *θ* onto the scale of the linear predictor *η*(•). If we choose the function *b*(*θ*) = log(*θ*/(1 − *θ*)), the resulting GLM is equivalent to logistic regression (McCullagh & Nelder, 1989). We can, however, select any alternative link function that maps the interval (0, 1) onto the range (−∞, +∞). One common choice of link function is the inverse of the cumulative distribution function of the Gaussian, denoted Φ^{−1}(•), leading to the probit model (Finney, 1971). In the psychophysical setting, the linear predictor is a linear combination, with coefficients *β*, of intensity variables, and the inverse of the link function is the psychometric function.

The probit model corresponds to the maximum likelihood formulation above with *σ* = 1. Thus, we can conceptualize Equation 9 as the linear predictor, with the stimulus as the covariates and the template as the coefficients to estimate. The term X*β* can be used to model experimental designs with multiple conditions that are structured as analysis of variance designs or multiple regression designs. We pursue this point in the example below.

In the experiment providing our example data, the luminance of a target with a Gaussian spatial profile (*σ*_s = 2.5 deg) was modulated over a 640-ms interval on a calibrated monitor. The modulation consisted of 32 sequential samples, each of 20-ms duration. On a given trial, the samples were either set to random luminance values chosen from a uniform distribution centered on the mean luminance of the screen or to random luminance values added to a temporal luminance modulation following a Gabor function. In the Gabor function describing the temporal luminance modulation, *t*_i is the time of the *i*th sample, *i* = 1, …, 32, *μ* = 320 ms, and *σ*_t = 160 ms. Three different carrier frequencies were tested in the article, but here we consider only data collected from one observer at the highest carrier frequency used, 3.125 Hz.
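For concreteness, the temporal profile can be reconstructed from the parameters quoted above. In this Python sketch, the unit amplitude and cosine phase are our own assumptions; the excerpt does not reproduce the article's exact values:

```python
import numpy as np

# 32 samples of 20 ms each; t_i is the time of the i-th sample.
t = np.arange(1, 33) * 20.0                    # ms
mu, sigma_t, freq = 320.0, 160.0, 3.125e-3     # ms, ms, cycles per ms (3.125 Hz)

# Gabor: Gaussian envelope times a sinusoidal carrier (unit amplitude, cosine phase assumed).
envelope = np.exp(-0.5 * ((t - mu) / sigma_t) ** 2)
gabor = envelope * np.cos(2.0 * np.pi * freq * (t - mu))

assert gabor.shape == (32,)
print(round(float(envelope.max()), 2))         # envelope peaks at t = mu
```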

Signal contrast was set so that the observer's performance was near *d*′ = 1.

| | resp | time | Time | N | |
|---|---|---|---|---|---|
| 1 | H | 1 | 0.02 | −0.196 | Start trial 1 |
| 2 | H | 2 | 0.04 | −0.197 | |
| ⋮ | | | | | |
| 32 | H | 32 | 0.64 | −0.189 | End trial 1 |
| 33 | H | 1 | 0.02 | 0.039 | Start trial 2 |
| 34 | H | 2 | 0.04 | 0.131 | |
| ⋮ | | | | | |

To apply the GLM, we reorganize the data set, spreading the noise samples *N* into 32 columns, one for each time sample. We show the first 8 of these columns for a sample of rows of the modified data set below, with row numbers. In this new format, each row corresponds to one trial of the experiment. The time points, t1, …, t32, now enter the model as independent covariates, taking on the instantaneous values of the noise on each trial.
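Because the noise values are stored in trial-major order, the wide format is a single reshape; a Python sketch with arbitrary simulated values:

```python
import numpy as np

n_trials, n_times = 5, 32
rng = np.random.default_rng(3)

# Long format: one noise value per (trial, time sample), as in the first data set.
N_long = rng.normal(size=n_trials * n_times)      # trial-major order

# Wide format: one row per trial, with 32 noise columns t1, ..., t32.
N_wide = N_long.reshape(n_trials, n_times)

# Row i of the wide table holds the 32 consecutive samples of trial i.
assert np.allclose(N_wide[2], N_long[2 * n_times:3 * n_times])
print(N_wide.shape)
```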

| | Resp | Stim | t1 | t2 | t3 | t4 | t5 | t6 … | |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | −0.196 | −0.197 | −0.038 | −0.114 | −0.136 | 0.204 … | Trial 1 |
| 2 | 1 | 1 | 0.039 | 0.131 | 0.233 | −0.053 | −0.133 | 0.295 … | Trial 2 |
| ⋮ | | | | | | | | | |
| 32 | 0 | 0 | −0.249 | 0.219 | −0.073 | 0.257 | −0.142 | −0.235 … | Trial 32 |
| 33 | 1 | 0 | 0.022 | −0.109 | 0.076 | −0.013 | 0.044 | −0.249 … | Trial 33 |
| 34 | 1 | 1 | 0.254 | −0.193 | −0.197 | 0.162 | 0.285 | −0.264 … | Trial 34 |
| ⋮ | | | | | | | | | |

Here t_1, …, t_32 are the noise samples (vectors) at each time point, acting in the model as covariates, and the coefficients *β*_1, …, *β*_32 provide the estimated classification image. The R code is given in Appendix 1, and the resulting classification image is shown in Figure 2b. Comparing the fits in Figures 2a and 2b reveals that the two methods, LM and GLM, yield very similar results. Both methods lead to an estimated classification image that reveals the same pattern of systematic differences from the signal. We note that the main effect of the term S (see Appendix 1) provides an estimate of *d*′ (DeCarlo, 1998).

Earlier we noted that it can be useful to estimate Ē_p and Ē_a separately, from trials on which the signal is present and trials on which it is absent, respectively. We can easily do this with the GLM as well. In doing so, we test whether there is an interaction between the presence/absence of the signal and the estimated classification images Ē_p and Ē_a. If there is none, we can combine the two classification images Ē_p and Ē_a according to Equation 4. The model can be represented as above, except that the coefficients *β*_i^s now depend on whether the signal is present (*s* = *p*) or absent (*s* = *a*). The estimated values β̂_i^p, *i* = 1, …, 32, and β̂_i^a, *i* = 1, …, 32, correspond to the two estimated classification images Ē_p and Ē_a, respectively.

| Model | Residual df | Residual deviance | df | Deviance | P[>χ²] |
|---|---|---|---|---|---|
| 1 | 3550 | 4213.0 | | | |
| 2 | 3518 | 3977.2 | 32 | 235.8 | <0.00001 |

The two classification images are plotted in Figure 2c as open circles (Ē_p) and filled circles (Ē_a), respectively. For comparison, we plot Ē_p + Ē_a in Figure 2d. It can be seen that it differs little from the fit that did not include an interaction (Figure 2b).

We simulated observers performing the detection task at *d*′ ≈ 1 with varying levels of internal noise. For each simulation, the classification image was estimated using both LM and GLM. The estimated classification images were then scaled to the template, and the residual variance was calculated as a measure of goodness of fit.

Recall that we set the signal strength for the simulated observer's *d*′ to equal 1 in these experiments and that, with increasing internal noise, the observer's *d*′ decreased.

The advantage of the GLM fit diminishes as internal noise grows and is essentially gone by an internal noise level of *σ* = 0.05. While internal noise reduces the advantage of the maximum likelihood fit, the result is never worse than that obtained using LM.
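The goodness-of-fit measure used in these simulations — scale the estimated classification image to the template by least squares and take the residual variance — can be written in a few lines (Python; the helper name is ours):

```python
import numpy as np

def residual_variance(estimate, template):
    """Scale `estimate` to `template` by least squares; return the residual variance."""
    a = np.dot(estimate, template) / np.dot(estimate, estimate)
    resid = template - a * estimate
    return resid.var()

rng = np.random.default_rng(4)
template = np.sin(np.linspace(0, 2 * np.pi, 32))
noisy = 3.0 * template + rng.normal(0.0, 0.2, 32)   # estimate off by a scale factor + noise

# A rescaled copy of the template fits perfectly; a noisy estimate does not.
assert residual_variance(3.0 * template, template) < 1e-12
print(round(residual_variance(noisy, template), 4))
```

Scaling before comparing makes the measure insensitive to the arbitrary overall gain of each estimator, so LM and GLM images can be compared on an equal footing.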

- GLM corresponds to the typical underlying model that the experimenter has in mind, and it yields the maximum likelihood estimate of the classification image under this model;
- we find that GLM fits are never worse than the corresponding LM fits; and
- the computational cost of GLM fits using modern packages is not unreasonable.

Here the *f*_i are smooth functions of the continuous covariate *t*, which here is time, and *X*_P and *X*_A are the instantaneous noise values, also taken as covariates, for Present and Absent trials, respectively, extracted from the main-effects model matrix. The intercept term can be absorbed directly into the smooth terms. The smooth functions, *f*, are approximated by a sum of known basis functions, f(*t*) = Σ_i *β*_i *b*_i(*t*), where the *b*_i are basis functions evaluated at the covariate values and the *β*_i are unknown parameters. Thus, they can be added as columns to the model matrix and fitted as linear covariates. In most circumstances, spline curves are used as bases, but alternatives are possible, such as Fourier terms similar to those used by Levi and Klein (2002) and by Mangini and Biederman (2004). More basis terms are chosen than is generally necessary, and smoothness is controlled by adding a penalty for undersmoothing during the process of maximizing the likelihood. The penalty is usually implemented as a proportion, *λ*, of the integrated square of the second derivative of *f*; larger contributions of this term correspond to less smooth models. Because the function is expressed as a linear combination of basis elements, the penalty term can be written as a quadratic form *β*′S*β* in the coefficients *β*, with the entries of S determined by the basis functions *b*_i(*t*). The model is then fit by minimizing, over *β*, the penalized deviance Δ + Σ_{j=1}^{m} *λ*_j *β*′S_j*β*, where Δ is the negative of twice the logarithm of the likelihood, termed the *deviance*, and *m* is the number of smooth terms in the model.
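The penalized fitting idea can be illustrated without mgcv. This Python sketch (our own simplification) uses a deliberately generous basis of Gaussian bumps, penalizes squared second differences of the coefficients as a discrete stand-in for the integrated squared second derivative, and shows that larger *λ* yields a smoother fit. A Gaussian response is used for clarity; in the binomial case the same penalty is added to the deviance:

```python
import numpy as np

rng = np.random.default_rng(5)
t = np.linspace(0.0, 1.0, 200)
y = np.sin(2.0 * np.pi * t) + rng.normal(0.0, 0.3, t.size)   # noisy smooth signal

# A deliberately generous basis: 20 Gaussian bumps spread over the covariate range.
centers = np.linspace(0.0, 1.0, 20)
B = np.exp(-0.5 * ((t[:, None] - centers[None, :]) / 0.05) ** 2)

# Roughness penalty: squared second differences of the basis coefficients,
# a discrete stand-in for the integrated squared second derivative of f.
D = np.diff(np.eye(B.shape[1]), n=2, axis=0)
S = D.T @ D

def fit(lam):
    # Penalized least squares: minimize ||y - B beta||^2 + lam * beta' S beta.
    beta = np.linalg.solve(B.T @ B + lam * S, B.T @ y)
    return B @ beta

rough, smooth = fit(1e-8), fit(1e2)

def wiggliness(f):
    return float(np.sum(np.diff(f, n=2) ** 2))

# Larger lambda buys a smoother fitted curve at the cost of flexibility.
assert wiggliness(smooth) < wiggliness(rough)
print("larger lambda -> smoother fit")
```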

In practice, *λ* is chosen to minimize the prediction error for a subset of the data left out of the fit, averaged across all or a large number of such subsets (Wood, 2006). In the case of the binomial family, for which the scale parameter is known, the degree of smoothing is chosen to minimize a statistic called the Un-Biased Risk Estimator (UBRE; Wood, 2006, pp. 172–173), defined as Δ/*n* + 2*σ*² tr(A)/*n* − *σ*², where *σ*² is the known binomial variance, A is the influence or hat matrix, and tr is the trace function. The trace of the influence matrix corresponds to the effective number of parameters estimated in the fit. The reduction of deviance obtained by increasing the number of parameters is offset by adding back in a function of the number of parameters, reminiscent of Akaike's Information Criterion (AIC; Akaike, 1973). In fact, this criterion is effectively a linear transformation of AIC (Wood, 2006, p. 178). Like AIC, this measure favors a model that maximizes the predictability of future data rather than fit to the actual data and thus serves to reduce the tendency to overfit the data with too many parameters. The literature on model selection is growing rapidly, and AIC is only one criterion of many. For a review, see Burnham and Anderson (2003) and also Myung, Navarro, and Pitt (2006) and other articles in the same special issue.
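The "effective number of parameters," tr(A), is easy to compute for any linear smoother: for a penalized least-squares fit, the influence matrix is A = B(B′B + λS)⁻¹B′, and its trace shrinks as λ grows. A sketch, using a simple ridge-type penalty S = I for illustration:

```python
import numpy as np

rng = np.random.default_rng(6)
B = rng.normal(size=(100, 10))            # model matrix with 10 columns
S = np.eye(10)                            # simple ridge penalty for illustration

def effective_df(lam):
    # Influence (hat) matrix of the penalized fit; its trace is the effective df.
    A = B @ np.linalg.solve(B.T @ B + lam * S, B.T)
    return float(np.trace(A))

dfs = [effective_df(lam) for lam in (0.0, 1.0, 100.0)]
# The unpenalized fit uses all 10 parameters; penalization shrinks the count.
assert abs(dfs[0] - 10.0) < 1e-6
assert dfs[0] > dfs[1] > dfs[2]
print([round(d, 2) for d in dfs])
```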

The quadratic term improved the fit significantly (*p* = 0.02). However, an additional cubic component did not improve the fit significantly (*p* = 0.09). Similarly, of the three models, the UBRE score is lowest for the quadratic model.

| Model | Residual df | Residual deviance | df | Deviance | P[>χ²] |
|---|---|---|---|---|---|
| 1 | 114677 | 158295 | | | |
| 2 | 114675 | 158287 | 2.07 | 8 | 0.02 |
| 3 | 114671 | 158279 | 3.96 | 8 | 0.09 |

The fitting function takes as its first argument a *formula object* and as its second a *data frame*. The data frame is, in basic terms, a matrix whose rows correspond to trials and whose columns correspond to variables coding the response and the stimuli, including the external noise. We have seen data frames printed out as matrices in the main text. A data frame in R can also store information about the variables, including the names we wish to be used in plotting results and in referring to them in code.

The left-hand side of the formula specifies the response to be modeled, the noise values *N* in the data frame. The right-hand side of the formula gives the explanatory variables, time and resp. The first is a 32-level factor indicating the time samples of the stimulus; the second is a 4-level factor indicating the response classifications of the observer. The slash indicates that a separate linear model will be estimated for each time sample, and −1 signifies that the global intercept should be removed from the model. Thus, a mean for each of the 4 response classifications at each of the 32 time samples will be estimated. As all explanatory variables are factors in this model, it corresponds to an analysis of variance.

A *χ*² test is appropriate here because the dispersion in a binomial model is not estimated. Extracting the predicted smooth curve is accomplished with the aid of the predict method for objects of class gam. In the code, the covariate for the first time sample, t_1, will also be referred to as t1, and similarly for the other time points.

*Proceedings of SPIE*, 4324, 114–122.

*Journal of Vision*, 2(1):5, 66–78, http://journalofvision.org/2/1/5/, doi:10.1167/2.1.5.

*Proceedings of SPIE*, 4686, 25–36.

*Perception*, 18, 18.

*Journal of Vision*, 2(1):8, 121–131, http://journalofvision.org/2/1/8/, doi:10.1167/2.1.8.

*Journal of the Acoustical Society of America*, 49, 1751–1756.

*Second international symposium on inference theory* (pp. 267–281). Budapest, Hungary: Akadémia Kiadó.

*Pattern recognition and machine learning*. New York: Springer.

*Visual Neuroscience*, 21, 283–289.

*Model selection and multimodel inference*. New York: Springer.

*Journal of the Optical Society of America A, Optics, Image Science, and Vision*, 24, 3418–3426.

*Journal of Vision*, 5(9):1, 659–667, http://journalofvision.org/5/9/1/, doi:10.1167/5.9.1.

*Psychological Methods*, 3, 186–205.

*Pattern classification*. New York: Wiley.

*Probit analysis*. Cambridge, UK: Cambridge University Press.

*Current Biology*, 10, 663–666.

*Signal detection theory and psychophysics*. Huntington, NY: Robert E. Krieger Publishing.

*Journal of the Optical Society of America A, Optics, Image Science, and Vision*, 22, 2081–2089.

*Generalized additive models*. London: Chapman & Hall.

*The elements of statistical learning*. New York: Springer.

*Advances in Neural Information Processing Systems*, 19, 689–696.

*Journal of Statistical Software*, 25, 1–26.

*Vision Research*, 44, 1493–1498.

*Journal of Vision*, 2(1):4, 46–65, http://journalofvision.org/2/1/4/, doi:10.1167/2.1.4.

*Detection theory: A user's guide*. New York: Lawrence Erlbaum Associates.

*Cognitive Science*, 28, 209–226.

*Generalized linear models*. London: Chapman & Hall.

*Journal of Vision*, 8(6):271.

*Journal of Mathematical Psychology*, 50, 167–179.

*Journal of Vision*, 4(2):2, 82–91, http://journalofvision.org/4/2/2/, doi:10.1167/4.2.2.

*Vision Research*, 46, 2465–2474.

*Nature*, 401, 695–698.

*R: A language and environment for statistical computing*.

*Pattern recognition and neural networks*. Cambridge, UK: Cambridge University Press.

*Perception*, 31, 106.

*Journal of Vision*, 2(1):7, 105–120, http://journalofvision.org/2/1/7/, doi:10.1167/2.1.7.

*Journal of the Optical Society of America A, Optics, Image Science, and Vision*, 22, 2257–2261.

*Modern applied statistics with S*. New York: Springer. ISBN 0-387-95457-0.

*Nature Neuroscience*, 8, 1651–1656.

*Advances in Neural Information Processing Systems*, 17, 1489–1496.

*Generalized additive models: An introduction with R*. Boca Raton, FL: Chapman & Hall/CRC.
