Conventional approaches to modeling classification image data can be described in terms of a standard linear model (LM). We show how the problem can be characterized as a Generalized Linear Model (GLM) with a Bernoulli distribution. We demonstrate via simulation that this approach is more accurate in estimating the underlying template in the absence of internal noise. With increasing internal noise, however, the advantage of the GLM over the LM decreases and GLM is no more accurate than LM. We then introduce the Generalized Additive Model (GAM), an extension of GLM that can be used to estimate smooth classification images adaptively. We show that this approach is more robust to the presence of internal noise, and finally, we demonstrate that GAM is readily adapted to estimation of higher order (nonlinear) classification images and to testing their significance.

*s*(

*x*) denote a signal along a physical continuum

*x*. The physical continuum can be unidimensional (e.g., time) or multidimensional (e.g., locations in an image) and we sample the signal at locations

*x*

_{ j},

*j*= 1, …,

*p*. The resulting samples

*s*(

*x*

_{ j}),

*j*= 1, …,

*p*form a vector of length

*p*that we denote

**s**(

*j*),

*j*= 1, …,

*p*for convenience.

**is a**

*ɛ**p*-dimensional vector of identically distributed random variables drawn from a distribution with mean 0 and a

*p*×

*p*covariance matrix

**C**. The noise values

**are typically drawn from a Gaussian or uniform distribution and are typically independent so that the covariance matrix**

*ɛ***C**=

*σ*

^{2}

**I**is diagonal with diagonal entries equal to the variance

*σ*

^{2}of the noise.

^{1})

*x*) defined on the physical continuum

*x*. The sampled values of the template

*x*

_{ j}),

*j*= 1, …,

*p*form a vector denoted

**T**(

*j*),

*j*= 1, …,

*p*. The output,

*D,*of the linear mechanism is determined by the inner product between the template vector and the stimulus vector,

*D,*the

*decision variable*. In the simple case in which there are only two possible signals,

*a*and

*b,*the observer's decision rule on a given trial might be to choose

*a*if

*c,*and otherwise choose

*b*. In effect, the inner product of the template with the stimulus reduces the decision problem to one dimension. If the noise is independent, identically distributed Gaussian, the decision rule that results in the highest expected proportion of correct responses is of the form of Equation 3 as shown in Duda, Hart, and Stork (2000, Chap. 2).

*a,*the response “No”, with the signal

*b*(Green & Swets, 1966; Macmillan & Creelman, 2005). There are only four possible types of response referred to as

*Hit*(H), when the signal is

*a*and the response is

*a, False Alarm*(FA), when the signal is

*b*and the response is

*a, Miss*(M), when the signal is

*a*and the response is

*b,*and

*Correct Rejection*(CR), when the signal is

*b*and the response is

*b*. The decision rule in Equation 3 leads to a method of estimating the classification image by averaging the trial-by-trial noise profiles for each type of response

^{H},

^{M},

^{FA},

^{CR}, where the

*p*-vector

^{H}is the average of the noise profiles, , presented on Hit trials, the

*p*-vector

^{FA}, the average for False Alarm trials, etc. (Ahumada, 2002). We then combine these mean images to form two classification images for stimulus present

_{p}=

^{H}−

^{M}and for stimulus absent,

_{a}=

^{FA}−

^{CR}and then add these two images to get an estimated classification image based on all of the data,

_{p}and

_{a}should have the same expectation (that is, differences between them are due only to random variation). However, as we shall see below, it is useful to analyze the images

_{p},

_{a}separately before combining them (Ahumada, 2002; Solomon, 2002a, 2002b; Thomas & Knoblauch, 2005).

*p*-vector

**S**presented on each trial is a

*p*-vector. If the experiment consists of

*n*trials, we obtain the terms of Equation 4 from the solution to the equation

**E**is an

*np*vector

^{2}of concatenated noise vectors

*ɛ*_{ i}= [

*ɛ*

_{ i1}, …,

*ɛ*

_{ ip}]′,

*i*= 1, …,

*n,*presented on successive trials,

**X**is an

*np*× 4

*p*incidence matrix indicating which of four outcomes (H, FA, etc.) occurred on the trial, and

**is a 4**

*β**p*vector that represents the elements of the 4 components of Equation 4. The matrix

**X**is a block matrix consisting of

*n*× 4 submatrices with one row of submatrices for each trial, each of which is either a

*p*×

*p*zero matrix

**0**

_{ p × p}or a

*p*×

*p*identity matrix

**I**

_{ p × p}. An example of such a block matrix

**X**is

*p*+ 1, 2

*p*+ 1, or 3

*p*+ 1 of

**X**depending on whether the trial is a Hit, False Alarm, Miss, or Correct Rejection, respectively. In Equation 6, for example, the first trial resulted in a Miss and the identity matrix

**I**

_{ p × p}is the third submatrix beginning in column 2

*p*+ 1 of

**X**. The least-squares solution of Equation 5 is a 4

*p*-vector

*β*^{ H}= [

*β*

_{1}

^{ H}, …,

*β*

_{ p}

^{ H}]′,

*β*^{ FA}= [

*β*

_{1}

^{ FA}, …,

*β*

_{ p}

^{ FA}]′,

*β*^{ M}= [

*β*

_{1}

^{ M}, …,

*β*

_{ p}

^{ M}]′,

*β*^{ CR}= [

*β*

_{1}

^{ CR}, …,

*β*

_{ p}

^{ CR}]′ and the least-squares solutions of Equation 5 for these four vectors are the corresponding mean vectors

^{ H},

^{ M},

^{ FA},

^{ CR}in Equation 4. The classification image estimate

**lm**in the programming environment of R (R Development Core Team, 2008). See 1.

**X**on the right-hand side of Equation 5 where one traditionally finds the explanatory variables (regressors), and the noise samples, a fixed part of the stimulus, are on the left-hand side, where one typically finds the response. The stochastic component is the random placement of submatrices in the matrix

**X**according to observer's responses. The Generalized Linear Model (GLM) and the Generalized Additive Model (GAM) presented later in this article are set up in a more traditional manner, with the response on the left and the stimulus, explanatory variables on the right.

*L*is the likelihood defined as a function of,

**T**, the template coefficients,

*σ,*a scale factor,

_{i}is the added noise profile on the

*i*th trial, Φ the cumulative Gaussian function with mean 0 and variance 1, and

*R*

_{i}an indicator variable taking on the values 1 or 0 if the response on the

*i*th trial is that the signal is

*a*or

*b,*respectively. The model is identifiable with

*p*parameters if one constraint is placed on the vector

*σ*and with this parameterization is in the same units as the signal detection parameter

*d*′ (Green & Swets, 1966; Macmillan & Creelman, 2005).

*η*

*Y*is a member of the

*exponential family*(not to be confused with the exponential distribution), which consists of distributions whose probability density functions can be written in the form

*a, b, c,*and

*d*(McCullagh & Nelder, 1989).

**Y**is a Bernoulli (or binary) random variable, a random variable with two possible outcomes, 1 and 0, that takes on the value 1 with a fixed probability

*θ*∈ (0, 1) and is otherwise 0. The Bernoulli distribution for observing

*y*∈ {0, 1} is

*a*(

*y*) =

*y, b*(

*θ*) = log(

*θ*/ (1 −

*θ*)),

*c*(

*θ*) =

*n*log(1 −

*θ*),

*d*(

*y*) = 0, demonstrating that the Bernoulli is a member of the exponential family. Note that

*θ*is the expected value

*E*[

**Y**] of the response variable,

**Y**.

*η*(•). We can choose the function

*b*(

*θ*) = log(

*θ*/ (1 −

*θ*)) and the resulting GLM is then equivalent to logistic regression (McCullagh & Nelder, 1989). We can, however, select any alternative link function that maps the interval (0, 1) onto the range (−∞, +∞). One common choice of link function is the inverse of the cumulative distribution function of the Gaussian function, denoted Φ

^{−1}(•), leading to the probit model (Finney, 1971).

**X**

*β*of intensity variables. The inverse of the link function is the psychometric function.

*σ*= 1. Thus, we can conceptualize Equation 9 as the linear predictor with the stimulus as the covariates and the template as the coefficients to estimate. The term

**X**

*β*can be used to model experimental designs with multiple conditions that are structured as analysis of variance designs or multiple regression designs. We pursue this point in the example below.

*σ*

_{s}= 2.5 deg) was modulated over a 640 ms interval on a calibrated monitor. The modulation consisted of 32 sequential samples, each of 20 ms duration. On a given trial, the samples were either set to random luminance values chosen from a uniform distribution centered on the mean luminance of the screen or to random luminance values added to a temporal luminance modulation following a Gabor function. The temporal luminance modulation was described by

*t*

_{i}is the time of the

*i*th sample,

*i*= 1, 32,

*μ*= 320 ms, and

*σ*

_{t}= 160 ms. Three different carrier frequencies were tested in the article, but here, we consider only data collected from one observer at the highest carrier frequency used, 3.125 Hz.

*d*′ = 1.

resp | time | Time | N | |
---|---|---|---|---|

1 | H | 1 | 0.02 | −0.196 Start trial 1 |

2 | H | 2 | 0.04 | −0.197 |

. | ||||

. | ||||

. | ||||

32 | H | 32 | 0.64 | −0.189 End trial 1 |

33 | H | 1 | 0.02 | 0.039 Start trial 2 |

34 | H | 2 | 0.04 | 0.131 |

. | ||||

. |

**resp**a 4-level factor indicating the category of the response, H, FA, M, CR,

**time**a 32-level factor indicating the time points of each sample of the stimulus,

**Time**a numeric value indicating the time in seconds of the sample with respect to the beginning of the trial,

**N**a numeric value indicating the relative luminance modulation of the noise component of the stimulus at the indicated time sample.

**Stim**, a two-level factor with labels 0 and 1 indicating whether or not the signal was present and

**Resp**, a two-level factor with the same labels indicating whether or not the observer classified the trial as containing the signal. Second, we reshape the column

*N*into 32 columns, each indicating a time sample. We show the first 8 columns of a sample of rows of the modified data set below with line numbers In this new format, each row corresponds to one trial of the experiment. The time points,

**t1, …, t32**,

^{3}enter the model now as independent covariates, taking on the instantaneous values of the noise on each trial.

Resp | Stim | t1 | t2 | t3 | t4 | t5 | t6 … | |
---|---|---|---|---|---|---|---|---|

1 | 1 | 1 | −0.196 | −0.197 | −0.038 | −0.114 | −0.136 | 0.204 … Trial 1 |

2 | 1 | 1 | 0.039 | 0.131 | 0.233 | −0.053 | −0.133 | 0.295 … Trial 2 |

. | ||||||||

. | ||||||||

. | ||||||||

32 | 0 | 0 | −0.249 | 0.219 | −0.073 | 0.257 | −0.142 | −0.235 … Trial 32 |

33 | 1 | 0 | 0.022 | −0.109 | 0.076 | −0.013 | 0.044 | −0.249 … Trial 33 |

34 | 1 | 1 | 0.254 | −0.193 | −0.197 | 0.162 | 0.285 | −0.264 … Trial 34 |

. | ||||||||

. |

**S**denotes

**Stim**, a two-level factor indicating the presence or absence of the signal,

**t**

_{1},…,

**t**

_{32}are the noise samples (vectors) at each time point acting in the model as covariates, and the coefficients

*β*

_{1},…,

*β*

_{32}provide the estimated classification image. The R code is given in 1 and the resulting classification image is shown in Figure 2b. Comparing the fits in Figures 2a and 2b reveals that the two methods, LM and GLM, yield very similar results. Both methods lead to an estimated classification image that reveals the same pattern of systematic differences from the signal. We note that the main effect of the term

**S**(see 1) provides an estimate of

*d*′ (DeCarlo, 1998).

_{p}and

_{a}separately for trials on which the signal is present and trials on which it is absent, respectively. We can easily do this with GLM as well. In doing so, we test whether there is an interaction between the presence/absence of the signal and the estimated classification images

_{p}and

_{a}. If there is none, we can combine the two classification images

_{p}and

_{a}according to Equation 4. The model can be represented as

*β*

_{i}

^{s}now depend on whether the signal is present (

*s*=

*p*) or absent (

*s*=

*a*). The estimated values

_{i}

^{p},

*i*= 1,…, 32 and

_{i}

^{a},

*i*= 1,…, 32 correspond to the two estimated classification images

_{p}and

_{a}, respectively.

Residual df | Residual deviance | df | deviance | P[>χ ^{2}] | |
---|---|---|---|---|---|

1 | 3550 | 4213.0 | |||

2 | 3518 | 3977.2 | 32 | 235.8 | <0.00001 |

_{p}) and filled circles (

_{a}), respectively. For comparison, we plot

_{p}+

_{a}in Figure 2d. It can be seen that it differs little from the fit that did not include an interaction (Figure 2b).

*d*′ ≈ 1. For each simulation, the classification image was estimated using both LM and GLM. The estimated classification images were then scaled to the template and the residual variance was calculated as a measure of goodness of fit.

*d*′ to equal 1 in these experiments and that with increasing internal noise, the observer's

*d*′ decreased.

*σ*= 0.05. While internal noise reduces the advantage of the maximum likelihood fit, the result is never worse than that obtained using LM.

- GLM corresponds to the typical underlying model that the experimenter has in mind and it is the maximum likelihood estimate for classification images with this model,
- we find that GLM fits are never worse than corresponding LM fits, and
- the computational costs of GLM fits using modern packages are not unreasonable.

*f*

_{i}are smooth functions of the continuous covariate,

*t,*which here is time, and

*X*

_{P}and

*X*

_{A}are the instantaneous noise values, also taken as covariates, for Present and Absent trials, respectively, extracted from the main effects model matrix. The intercept term can be absorbed directly into the smooth terms. The smooth functions,

*f,*are approximated by the sum of known basis functions

*b*

_{i}are basis functions evaluated at the covariate values for unknown values of the parameters,

*β*

_{i}. Thus, they can be added as columns to the model matrix and fitted as linear covariates. In most circumstances spline curves are used as bases, but alternatives are possible, such as Fourier terms similar to those used by Levi and Klein (2002) and by Mangini and Biederman (2004). More basis terms are chosen than is generally necessary, and smoothness is controlled by adding a penalty for undersmoothing during the process of maximizing the likelihood. The penalty is usually implemented as a proportion,

*λ,*of the integrated square of the second derivative of

*f*. Larger contributions of this term result in less smooth models. Because the function is expressed as a linear combination of basis elements, the penalty term can be written as a quadratic form in

*β*

**S**is a known matrix obtained from the basis functions,

*b*

_{i}(

*t*). The model is fit then by minimizing the expression

*β,*where Δ is the negative of twice the logarithm of the likelihood, termed the

*deviance,*and

*m*is the number of smooth terms in the model.

*λ*is chosen to minimize the prediction error to a subset of the data left out of the fit, taken across all or a large number of subsets of the data (Wood, 2006). In the case of the binomial family for which the scale parameter is known, the degree of smoothing is chosen to minimize a statistic called the Un-Biased Risk Estimator (UBRE, Wood, 2006, pp. 172–173), which is defined as

*σ*

^{2}is the known binomial variance,

**A**is the influence or hat matrix, and tr is the trace function. The trace of the influence matrix corresponds to the effective number of parameters estimated in the fit. The reduction of deviance obtained by increasing the number of parameters is offset by adding back in a function of the number of parameters, reminiscent of Akaike's Information Criterion (AIC; Akaike, 1973). In fact, this criterion is effectively a linear transformation of AIC (Wood, 2006, p. 178). Like AIC, this measure favors a model that maximizes the predictability of future rather than the actual data and, thus, serves to reduce the tendency to overfit the data with too many parameters. The literature on model selection is growing rapidly and AIC is only one choice of criterion of many. For a review see Burnham and Anderson (2003) and also Myung, Navarro, and Pitt (2006) and other articles in the same special issue.

**mgcv**(Wood, 2006). The code fragment to fit the model is shown in 1. The function allows several different possible choices of basis functions. We chose thin plate regression splines with shrinkage (Wood, 2006, Sections 4.1.5–4.1.6). The term “shrinkage” implies that smooth terms, which do not contribute significantly to the fit, can be dropped from the model. An alternative fit using a cubic spline basis produces similar results to those reported here. Spline fits can show biases at the boundaries of the data. Such effects are possibly visible in Figure 5 but small. A detailed discussion of considerations that serve to guide the choice of basis functions can be found in Wood (2006, Section 4.1).

*p*= 0.02). However, an additional cubic component did not improve the fit significantly (

*p*= 0.09). Similarly, of the three models, the UBRE score is lowest for the quadratic model.

Residual df | Residual deviance | df | deviance | P[>χ ^{2}] | |
---|---|---|---|---|---|

1 | 114677 | 158295 | |||

2 | 114675 | 158287 | 2.07 | 8 | 0.02 |

3 | 114671 | 158279 | 3.96 | 8 | 0.09 |

**R**(R Development Core Team, 2008). We also generated all of the graphs with R. In this appendix, we provide the code fragments that we used to fit the models described in the main article.

**R**prompt and “+” indicates a continuation from the previous line.

**lm**whose first argument is a

*formula object*and whose second object is a

*data frame*. The data frame is in basic terms a matrix whose rows correspond to trials and whose columns correspond to variables coding the response and the stimuli including the external noise. We have seen data frames printed out as matrices in the main text. A data frame in R can also store information about the variables including the names we wish to be used in plotting results and in referring to them in code.

*N*in the data frame. The right-hand side of the formula gives the explanatory variables, time, and resp. The first is a 32-level factor indicating the time samples of the stimulus; the second is a 4-level factor indicating the response classifications of the observer. The slash indicates that a separate linear model will be estimated for each time sample and −1 signifies that the global intercept should be removed from the model. Thus, a mean for each of the 4 response classifications at each of the 32 time samples will be estimated. As all explanatory variables are factors in this model, it corresponds to an analysis of variance.

**Gabor.lm**.

**> Gabor.lm <**

**− lm(N ∼**

**time/resp − 1, Gabor)**

**> ClsLM <**

**− crossprod(matrix(coef(Gabor.lm),**

**+ nrow = 4, byrow = TRUE), c(0, 1, −1, −1))**

**Gabor.wide**.

**>**

**Resp <**

**− factor(unclass(Gabor$resp) <**

**3,**

**+ labels = 0:1)[seq(1, nrow(Gabor), 32)]**

**>**

**Stim <**

**− factor(unclass(Gabor$resp) %% 2,**

**+ labels = 0:1)[seq(1, nrow(Gabor), 32)]**

**>**

**Gabor.wide <**

**− data.frame(matrix(Gabor$N, ncol = 32,**

**+ byrow = TRUE))**

**>**

**names(Gabor.wide) <**

**− paste(“t”, 1:32, sep = “”)**

**>**

**Gabor.wide <**

**− cbind(Resp, Stim, Gabor.wide)**

**resp**is recoded as 2 two-level factors,

**Resp**and

**Stim**. The first codes the responses of the observer as 0 and 1 for absent and present, respectively; the second codes the absence and presence of the signal, similarly. Then, the 32 instantaneous noise values for each trial are distributed across 32 columns of a data frame, so that each may be treated as an independent covariate in the formula object. For later identification, these are given names

**t1, t2,**…

**, t32**. Finally, the vectors

**Resp**and

**Stim**are attached as the first two columns of the data frame.

**glm**and family and link functions must be specified:

**>**

**Gabor.glm <**

**− glm(Resp ∼**

**. − 1,**

**+ family = binomial(link = “probit”),**

**+ data = Gabor.wide)**

**>**

**ClsImGLM <**

**− coef(Gabor.glm)[−(1:2)]**

**Resp**is the response. The dot is a convenient shorthand for a model in which each column of the data frame except for

**Resp**enters as a covariate additively as in Equation 16, which saves writing out the 33 terms. Finally, the global intercept is excluded. The model yields 34 coefficients in all. The first two correspond to main effects of the two levels of

**Stim**that we ignore here. The next 32 specify the coefficients for each time sample. The negative indices indicate a sequence in which these coefficients are excluded from the vector.

**Stim**, the form of the model from the LM example is used.

**>**

**Gabor2.glm <**

**− glm(Resp ∼**

**Stim/. − 1,**

**+ family = binomial(link = “probit”),**

**+ data = Gabor.wide)**

**Resp.**Terms of the form

**Stim/Stim**are silently removed. As before, the global intercept is excluded. As with the LM model, the slash indicates that a coefficient is estimated for each level of

**Stim**. The model yields 66 coefficients in all. The first two correspond to main effects of the two levels of Stim ignored here. The next 64 specify the coefficients for each time sample, alternating between the estimates for the two levels of

**Stim**. The set of coefficients for each level of

**Stim**correspond to the component images on the right-hand side of Equation 4, here for stimulus absent and present, respectively.

**anova**.

**>**

**sqt <**

**− Gabor.wide[, −(1:2)]^2**

**>**

**colnames(sqt) <**

**− paste(“tt”, 1:32, sep = “”)**

**>**

**Gabor.wide <**

**− cbind(Gabor.wide, sqt)**

**Resp**and

**Stim**. The second line assigns names to distinguish these columns from those used for the first order components. The third line joins these columns to the data frame. With this organization of the data frame the call to

**glm**is the same as for the previous model.

**>**

**Gabor3.glm <**

**− glm(Resp ∼**

**Stim/. − 1,**

**+ family = binomial(link = “probit”),**

**data = Gabor.wide)**

**resp**must be decoded into responses and signal status as in fitting the GLM. Additionally, to obtain separate estimates for signal present and absent conditions, the noise profiles for each trial must be separated into separate columns as a function of signal status. This is performed with the model.matrix function that takes a one-sided formula and a data frame and outputs a model matrix specified by the formula. Here the formula specifies a two-column matrix indicating the interaction of the signal status with the noise profile with the column corresponding to the intercept removed. For convenience, we store each column of this matrix in a separate variable in the data frame.

**>**

**Gabor$Stim <**

**− factor(with(Gabor, unclass(resp) %% 2))**

**>**

**Gabor$Resp <**

**− factor(with(Gabor, unclass(resp) <**

**3),**

**+ labels = c(“0”, “1”))**

**>**

**Imat <**

**− model.matrix(∼Stim:N − 1, data = Gabor)**

**>**

**Gabor$by0 <**

**− Imat[, 1]**

**>**

**Gabor$by1 <**

**− Imat[, 2]**

**>**

**Gabor.gam <**

**− gam(Resp ∼**

**+ s(Time, bs = “ts”, by = by0, k = 25) +**

**+ s(Time, bs = “ts”, by = by1, k = 25),**

**+ family = binomial(“probit”), data = Gabor)**

**s**in the GAM model specify the smooth terms and the argument

**bs**the type of spline basis, here thin plate regression splines with shrinkage. The

**by**argument is used to specify a term by which to multiply the smooth term, generating a “variable coefficient model” (Hastie & Tibshirani, 1990; Wood, 2006). Multiplying the smooth term by the noise profile on each trial corresponds to the inner product rule of (2). The arguments

**k**specify the maximum number of parameters to allocate to each smooth term in fitting the model. Generally, specifying more than would be necessary helps in finding the model requiring fewer parameters, giving the best fit to the data. We could specify

**k**up to 32 per smooth term, in this example, the length of the covariate Time, but the results were unchanged and the smaller value resulted in a faster execution time of the code. To extend this model to include the nonlinear terms discussed in the text, we simply add additional smooth terms to the model formula that contain powers of the by argument. For example, to include the second-order terms, we fit the model.

**>**

**Gabor.gam2 <− gam(Resp ∼**

**+ s(Time, bs = “ts”, by = by0, k = 25) +**

**+ s(Time, bs = “ts”, by = by1, k = 25) +**

**+ s(Time, bs = “ts”, by = by0^2, k = 25) +**

**+ s(Time, bs = “ts”, by = by1^2, k = 25),**

**+ family = binomial(“probit”), data = Gabor)**

**anova**function that outputs a table in the typical form of an analysis of variance, although in the case of the binomial models here, it is deviance that is evaluated. For example,

**>**

**anova(Gabor.gam, Gabor.gam2, Gabor.gam3, test = “Chisq”)**

*χ*

^{2}test is appropriate here because the dispersion in a binomial model is not estimated. Extracting the predicted smooth curve is accomplished with the aid of the predict method for objects of class

**gam**as illustrated in the following code fragment.

**>**

**nd <− data.frame(Time = unique(Gabor$Time),**

**+ N = rep(1, 32),**

**+ by0 = rep(0, 32),**

**+ by1 = rep(1, 32) )**

**>**

**s1 <**

**− predict(Gabor.gam, newdata = nd)**

**nd**, is required that has all the same variables used in the fit with

**gam**, but only containing values for which predictions are desired. By default, the predictions are on the scale of the linear predictor, but a type argument can be used to specify that the predictions be on the response scale that, in this case, corresponds to the probability of responding “Present”.

**t**

_{1}will be also referred to as

**t1**.