Classification images provide compelling insight into the strategies used by observers in psychophysical tasks. However, because of the high-dimensional nature of classification images and the limited number of trials that can practically be run, classification images are often too noisy to be useful unless denoising strategies are adopted. Here we propose a method of estimating classification images through the use of sparse priors in smooth bases and generalized linear models (GLMs). Sparse priors in a smooth basis impose assumptions about the simplicity of observers' internal templates, and they naturally generalize commonly used methods such as smoothing and thresholding. The use of GLMs in this context provides a number of advantages over classic estimation techniques, including the possibility of using stimuli with non-Gaussian statistics, such as natural textures. Using simulations, we show that our method recovers classification images that are typically less noisy and more accurate for a smaller number of trials than previously published techniques. Finally, we verified the efficiency and accuracy of our approach with psychophysical data from a human observer.

A *classification image* is a visual representation of the observer's strategy in the task.

While we rarely know the *specifics* of the visual process involved, we often have *general* exploitable knowledge derived from previous classification image experiments and from neurophysiology. To give but one example (Knoblauch & Maloney, 2008b; Thomas & Knoblauch, 2005), while we may not know exactly how a human observer detects a time-modulated luminance signal in noise, we do know that humans have limited sensitivity to high temporal frequencies. This suggests that observers' internal templates will be smooth at a certain scale for such a task. This prior knowledge of smoothness can then be used to obtain more accurate classification images on this particular task (Knoblauch & Maloney, 2008b).

Bayesian estimation provides a principled way to incorporate knowledge about the process *underlying* a set of observations. In the context of classification images, this means incorporating knowledge of visual physiology. The human visual system is trained on visual images, which are themselves sparse in a basis of smooth, oriented filters (Srivastava et al., 2003). Conversely, a simple constraint of sparse representation of images is sufficient to reproduce many properties of the visual system, including V1 receptive fields (Olshausen & Field, 1996) and color opponency (Lee, Wachtler, & Sejnowski, 2002). It is thus tempting to conjecture that humans are naturally more efficient at representing sparse visual structure (Olshausen & Field, 2004), and that this induces sparse internal templates in classification image experiments.

Denote the stimulus on a given trial by a vector **x** of dimension *k*, the template by a vector **w**, and the internal noise by *ε*. The observer computes an internal decision variable, *v*, by

*v* = **w** · **x** + *ε*,

and responds *y* = +1 if *v* exceeds a threshold *c*, and *y* = −1 otherwise. The internal noise *ε* is taken from a symmetric distribution with mean 0 and standard deviation *σ*, so that *ε* ∼ Φ′(0, *σ*²), where Φ is the cumulative distribution function (cdf) for the distribution in question. The internal noise represents observers' inconsistency: the same physical stimulus can elicit different responses. Other formulations (Ahumada, 2002) assume that it is the threshold *c* that is a random variable; our formulation is equivalent.

Bayes' rule relates the *posterior p*(**w** ∣ **y**), the *likelihood p*(**y** ∣ **w**), and the *prior p*(**w**):

*p*(**w** ∣ **y**) ∝ *p*(**y** ∣ **w**) *p*(**w**).

Our goal is to find the value of **w** that maximizes the posterior. For numerical reasons, it turns out to be simpler to find the value of **w** that minimizes the negative log of the posterior:

−log *p*(**w** ∣ **y**) = −log *p*(**y** ∣ **w**) − log *p*(**w**) + const.

When the prior is flat, the resulting estimate is the *maximum likelihood* (ML) estimate of **w**; otherwise, the estimate is called *maximum a posteriori* (MAP). The negative log-likelihood is denoted by *L*, while the negative log-prior, sometimes called the *regularizer*, is denoted by *R*.

Given a stimulus **x**, the probability of observing response *y* = +1 given **w** and offset *c* is

*p*(*y* = +1 ∣ **x**, **w**, *c*) = Φ((**w** · **x** − *c*) / *σ*),

and correspondingly *p*(*y* = −1 ∣ **x**, **w**, *c*) = 1 − *p*(*y* = +1 ∣ **x**, **w**, *c*).

The observer's behavior is determined by the template **w**, which is assumed to be constant throughout the duration of the experiment. We therefore wish to find a template that captures the relationship between a series of stimuli **X** = [**x**⁽¹⁾, **x**⁽²⁾, …, **x**⁽ⁿ⁾] and the corresponding responses **y** = (*y*₁, *y*₂, …, *y*ₙ). We make the standard assumption that an observer's response on a given trial is independent of his or her responses on other trials. The likelihood function is then

*p*(**y** ∣ **X**, **w**, *c*) = ∏ᵢ Φ(*y*ᵢ(**w** · **x**⁽ⁱ⁾ − *c*) / *σ*),

where the symmetry of the noise distribution is used to fold both response signs into a single expression.

Note that for any *σ* it is possible to multiply *c* and **w** by a constant factor such that the likelihood stays the same. We resolve this ambiguity by using the standard procedure of setting *σ* = 1; noisier observers are accommodated by a smaller overall magnitude for the weight vector. The negative log-likelihood is therefore

*L*(**w**, *c*) = −∑ᵢ log Φ(*y*ᵢ(**w** · **x**⁽ⁱ⁾ − *c*)).

We augment the model with a matrix of covariates **U** with corresponding weights **u**. Later, we will impose a prior on **w** but not on **u**. This formulation allows for the inclusion of factors that affect the observer's responses other than the noise stimulus. For example, in Tadin et al. (2006), the question of interest is the time-varying influence of visual motion in the stimulus surround on judgments of motion direction in the center. Here, the time-varying surround stimulus would be put into the **X** matrix while the signal sign in the center would take one column of the **U** matrix. Another possibility, in detection tasks, is to include the signal sign in the **U** matrix, as in Knoblauch and Maloney (2008b, p. 5, Equation 16), rather than summing noise and signal in the **X** matrix. We found that the **U** strategy generally led to faster optimization while yielding equivalent estimates of internal templates. A final possibility is to include trial-delayed responses in the **U** matrix to model observers' tendency to respond similarly or dissimilarly on subsequent trials, as in the commonly used method of modelling refractory periods and burstiness in neurons (Pillow et al., 2008). We eliminate the constant *c* by adding a constant column to the matrix **U**. Under this new parameterization, the negative log-likelihood for the linear observer model becomes

*L*(**w**, **u**) = −∑ᵢ log Φ(*y*ᵢ(**Xw** + **Uu**)ᵢ).

This model is an instance of a *generalized linear model* (GLM) (Gelman, Carlin, Stern, & Rubin, 2003). Assuming that the internal noise has a Gaussian distribution, Φ becomes the cdf of a Gaussian, and we obtain the probit model (Knoblauch & Maloney, 2008b). If instead we assume that the internal noise has a logistic distribution, Φ(*x*) = 1 / (1 + exp(−*x*)), we obtain the logit or logistic regression model. In practice, the Gaussian and logistic distributions are very similar, and empirically it is difficult if not impossible to determine which provides a more accurate description of an observer's internal noise. For computational simplicity, we adopt the logistic regression model.
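To make the model concrete, the sketch below (our own illustrative code, not the software described in this article) writes down the logistic observer model's negative log-likelihood and gradient, and recovers a template by plain maximum likelihood with gradient descent; the offset *c* is absorbed as a constant column of the design matrix. All simulation parameters here are arbitrary choices.

```python
import numpy as np

def neg_log_likelihood(w, X, y):
    """Negative log-likelihood of the logistic observer model.
    X: (n_trials, k) stimuli; y: responses in {-1, +1}; w: template."""
    z = y * (X @ w)
    # log(1 + exp(-z)) computed stably via logaddexp
    return np.sum(np.logaddexp(0.0, -z))

def neg_log_likelihood_grad(w, X, y):
    """Gradient of the negative log-likelihood with respect to w."""
    z = y * (X @ w)
    sigma = 1.0 / (1.0 + np.exp(-z))       # logistic cdf evaluated at z
    return -X.T @ (y * (1.0 - sigma))

# Simulate an observer whose template has a single active element
rng = np.random.default_rng(0)
k, n = 8, 4000
w_true = np.zeros(k)
w_true[2] = 1.5
X = rng.standard_normal((n, k))
p = 1.0 / (1.0 + np.exp(-(X @ w_true)))    # p(y = +1 | x)
y = np.where(rng.random(n) < p, 1.0, -1.0)

# Maximum likelihood fit by gradient descent (no prior)
w = np.zeros(k)
for _ in range(1000):
    w -= 5e-4 * neg_log_likelihood_grad(w, X, y)
```

Adding a prior amounts to adding the regularizer's gradient to each update (or, for the sparse prior, a soft-thresholding step, as in the Appendix).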

Prior knowledge can also be brought to bear on **w**, as we have argued in the Introduction. Such prior information can effectively narrow the space of possible classification images, thus improving the efficiency with which classification images can be recovered, as well as their accuracy and the interpretability of the results. Here we propose the use of sparse priors on smooth basis coefficients, which together impose global sparseness and local smoothness on the recovered templates.

A Gaussian prior takes the form *p*(**w**) = *N*(**w**; 0, (*λ***A**ᵀ**A**)⁻¹), where *λ* is a scale factor. Common cases of Gaussian priors are shown in Table 1. When the matrix **A** is the identity matrix, we get a penalty on the magnitude of coefficients known as weight decay. When **A** is a spatial derivative operator, we get the smoothness prior, which encourages spatially smooth weights. In general, any choice of **A** for which **A**ᵀ**A** is positive definite generates a proper Gaussian prior. Choices other than weight decay and smoothness priors are discussed in some detail in Wu et al. (2006).

| Gaussian prior name | Regularizer *R* = −log *p*(**w**) |
|---|---|
| Weight decay (ridge) | (*λ*/2) **w**ᵀ**Iw** |
| Smoothness | (*λ*/2) **w**ᵀ(**D**ᵀ**D**)**w** |
| Spline smoothness (Knoblauch & Maloney, 2008b) | (*λ*/2) **w**ᵀ**Sw** |
| Arbitrary | (*λ*/2) **w**ᵀ(**A**ᵀ**A**)**w** |

A sparse prior encodes an assumption of sparseness in the weights **w**. If weights are sparse, then most are zero and few have large values. The corresponding prior distribution should have a peak at 0 and heavy tails. A simple prior embodying these assumptions is an independent sparse prior over the weights:

*p*(**w**) ∝ exp(−*λ*∥**w**∥₁),

where ∥**w**∥₁ denotes the *L*₁ norm of the vector **w**. The distribution exp(−*λ*∥**w**∥₁) is known as the Laplace distribution, a term we avoid because of the potential confusion with the unrelated Laplacian pyramid introduced later; in the following we refer to the prior induced by this distribution as the "sparse prior." This distribution has a sharp peak at the origin and heavy tails. In the context of Equation 3, the multiplication of this prior with the likelihood *p*(**y** ∣ **w**) has predictable effects on the resulting posterior density. For instance, if we assume a Gaussian likelihood (Figure 2), the resulting posterior density has a peak at the origin when the Gaussian's center is sufficiently close to 0. Hence, parameters whose presence gives only a marginal increase in the likelihood of the data will tend to be clamped at 0 and thus discarded from the model. *L*₁ regularization in the standard ("pixel") basis thus gives results similar to post hoc thresholding.

Among the family of distributions *p*(**w**) ∝ exp(−*λ*∑ᵢ∣*w*ᵢ∣^*q*), which includes both the Gaussian and Laplace distributions, *p*(**w**) is log-concave for *q* ≥ 1, which leads to a convex, tractable optimization problem when estimating **w**. Since sparseness increases as *q* decreases, *q* = 1 yields the sparsest distribution of this form that still leads to a tractable optimization problem (Seeger, 2008), which makes it generally preferable to alternative sparseness-inducing priors.

The sparse prior can also be applied to the coefficients of a basis **B**. Denoting the basis weights by a vector **a**, the template is expressed as **w** = **Ba**, and the sparse prior is placed on **a**. Note that the *L*₁ norm is not conserved under a rescaling of the columns of the basis matrix **B**, which means that a sparse prior will consider certain basis functions more likely if **B** is scaled unevenly. To avoid this, the matrix **B** should be normalized so that its component column vectors have norm 1. The associated MAP estimate is then given by the vector **w** = **Bâ**, where **â** minimizes the sum of the negative log-likelihood and *λ*∥**a**∥₁.

We use the term *basis* loosely, as the columns of **B** need not span **R**^*k*, as with the spline basis used in Knoblauch and Maloney (2008b), nor do they need to be linearly independent, as with overcomplete transforms. We can freely construct a basis matrix that embeds our assumptions for the particular classification image reconstruction problem at hand. In the case where **B** is overcomplete, that is, its columns span **R**^*k* but are not linearly independent, there are many equivalent ways of expressing the classification image. The sparse prior will tend to select the simplest way of expressing the classification image. The compatibility of sparse priors with overcomplete bases is advantageous, as it is generally simpler to construct bases with desirable properties once the restriction of linear independence is removed.

A suitable basis should allow plausible templates to be represented sparsely without imposing strong *a priori* assumptions about the degree of smoothness in the classification image. Both criteria can be met by the use of a Laplacian pyramid (Burt & Adelson, 1983), which consists of multiple Gaussian functions that are smooth on various spatial scales.

For a stimulus in *m* dimensions, the decomposition is overcomplete by a factor of 2^*m* / (2^*m* − 1) ≤ 2, i.e., only mildly overcomplete. The Laplacian pyramid can be extended by using more basis functions than is standard, for example by adding half-levels or by including more basis functions than standard within a level. Such undecimated Laplacian pyramids incur greater computational cost during model fitting but can be more powerful than standard Laplacian pyramids.
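As a concrete sketch, a mildly overcomplete multi-scale basis for a one-dimensional stimulus can be built from Gaussian bumps of increasing width placed on increasingly sparse grids. This is our own illustrative construction in the spirit of the pyramid, not the article's exact basis; the scales and spacings are arbitrary choices. Columns are normalized to unit norm, as the sparse prior requires.

```python
import numpy as np

def multiscale_gaussian_basis(k, scales=(1.0, 2.0, 4.0), spacings=(2, 4, 8)):
    """Basis matrix B of shape (k, n_functions): Gaussian bumps at several
    spatial scales; coarser scales sit on sparser grids, pyramid-style.
    Columns are normalized to unit norm so the L1 prior treats them equally."""
    t = np.arange(k)
    cols = []
    for sigma, step in zip(scales, spacings):
        for center in range(0, k, step):
            g = np.exp(-0.5 * ((t - center) / sigma) ** 2)
            cols.append(g / np.linalg.norm(g))
    return np.column_stack(cols)

# 32 + 16 + 8 = 56 basis functions for a 64-sample stimulus
B = multiscale_gaussian_basis(64)
```

A template is then expressed as **w** = **Ba**, with the sparse prior placed on the coefficient vector **a**.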

Estimating a model with a prior involves choosing the hyperparameter *λ*, much like smoothing involves the choice of the width of the smoothing kernel. This hyperparameter may be selected by assessing how well a model generalizes to untrained examples using *k*-fold cross-validation (Wu et al., 2006). Generalization performance is naturally assessed using the cross-validated log-likelihood of the data.

Straightforward *k*-fold cross-validation increases the computational burden of estimating a model by roughly a factor of *km*, where *m* is the number of regularization hyperparameter values considered. In typical use, *km* is on the order of 50–100. This limits the practical applicability of this form of cross-validation to models which take a minute or so to fit, such as models with Gaussian priors.

Fortunately, efficient algorithms exist for solving the sequence of problems

min_**w** *f*(**w**) + *λ*ᵢ∥**w**∥₁,

where *λ*ᵢ is a decreasing sequence of regularization hyperparameters and *f*(**w**) is some convex function of **w**. These algorithms generate an entire regularization path for roughly the same cost as fitting the model with the smallest *λ* considered. The cost of *k*-fold cross-validation is then roughly *k* + 1 ≪ *km* times the cost of fitting a single model, which makes cross-validation with sparse priors feasible. We chose the fixed-point continuation algorithm (Hale, Yin, & Zhang, 2007) to estimate model parameters with sparse priors. The algorithm is outlined in the Appendix.

On each trial, the simulated observer was shown a signal **S**ᵢ combined additively with a noise stimulus **X**ᵢ, which was drawn from a Gaussian distribution. The templates **w** varied depending on the task. In the one-dimensional task, the signal and templates were even Gabors. In the two-dimensional task, the signal was a Gaussian blob, and the template was a Gaussian blob in one case and a difference of Gaussians in the other. The signal was normalized so that the observer correctly categorized the signal on 81% of the trials before the addition of internal noise. The internal noise *ε*ᵢ was drawn from a Gaussian distribution *p*(*ε*) = *N*(0, *σ*²), after which the performance dropped to 75%. Since Gaussians are mapped to Gaussians under linear transformations, *p*(*v*) = *N*(*μ*, *σ*ᵣ²) for trials of the same signal sign, which relates the observer's percent correct to *μ* and *σ*ᵣ. The threshold *c* was adjusted such that the observer was unbiased. We estimated the observer's internal template with a logistic regression GLM under each of a weight decay, a smoothness, and a sparse prior. The optimal hyperparameter *λ* for each prior type was estimated through 5-fold cross-validation.
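A simulated linear observer of this kind can be sketched as follows. This is illustrative code with arbitrary parameter values (template shape, signal strength, trial count), not the exact simulation settings above.

```python
import numpy as np

def simulate_observer(w, n_trials, signal, sigma=1.0, c=0.0, seed=0):
    """Simulate responses from a linear observer with template w.
    On each trial the stimulus is the signal (with random sign) plus
    Gaussian noise; the decision variable is v = w.x + eps, and the
    response is y = +1 if v > c, -1 otherwise."""
    rng = np.random.default_rng(seed)
    k = w.size
    noise = rng.standard_normal((n_trials, k))
    sign = rng.choice([-1.0, 1.0], size=n_trials)
    stim = noise + sign[:, None] * signal          # noise + signed signal
    v = stim @ w + sigma * rng.standard_normal(n_trials)
    y = np.where(v > c, 1.0, -1.0)
    return stim, sign, y

# Gaussian-blob template on a 32-sample stimulus, weak matched signal
k = 32
t = np.arange(k)
template = np.exp(-0.5 * ((t - k / 2) / 3.0) ** 2)
stim, sign, y = simulate_observer(template, 5000, 0.3 * template)
```

The resulting (stim, y) pairs are exactly the data a classification image method would be fit to.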

The regularization path shows the estimated weights as a function of *λ* for 1000 simulated trials. For large *λ*, most weights are zero. As *λ* is decreased, more weights become active. Once a weight has become active, it tends to stay active for smaller *λ*.

The quality of the reconstruction also varies along the path. For large *λ*, only the areas of the template which have the most influence on the observer's decision are recovered. As *λ* is decreased and more weights become active, the reconstruction becomes more complex and accurate. Beyond a certain point, however, the model starts fitting to noise, and the reconstruction becomes less accurate. From this point of view, the sparse prior works by determining how influential each weight is, and only allowing into the model those weights which are more influential than a cutoff determined by the strength of the regularization *λ*.

The recovered *signal-absent* template is blurrier. This is likely due to spatial uncertainty, which strongly affects recovered signal-absent templates but not signal-present templates in detection tasks (Tjan & Nandy, 2006). This effect is visible in a similar task in Figure 4 of Abbey and Eckstein (2002), and an analogous effect was shown in the time domain in Knoblauch and Maloney (2008b). Both the signal-present and signal-absent templates show hints of a large, weak inhibitory surround.

To compare models, we use the *deviance*, defined as twice the negative log-likelihood. Since the deviance always decreases with an increasing number of free parameters, deviance values must be corrected for the number of free parameters. Here we use the cross-validated deviance (*D*^CV), the Akaike Information Criterion (AIC), and the validated deviance (*D*^val) as measures of goodness-of-fit. All three metrics measure the generalization performance of a model: the first by repeated partitions of the data, the second using an asymptotic result, and the last with a set-aside validation set. In all cases, lower values are better.
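These quantities follow directly from the fitted negative log-likelihood. A minimal sketch, using the convention (see the Appendix) that the saturated negative log-likelihood is 0 for logistic regression:

```python
def deviance(neg_log_lik, neg_log_lik_saturated=0.0):
    """Deviance: twice the negative log-likelihood, relative to the
    saturated model (whose negative log-likelihood is 0 for logistic
    regression)."""
    return 2.0 * (neg_log_lik - neg_log_lik_saturated)

def aic(neg_log_lik, df):
    """Akaike Information Criterion: deviance plus twice the (effective)
    number of free parameters."""
    return deviance(neg_log_lik) + 2.0 * df

# Example: a fit with negative log-likelihood 1008 and 3 effective
# parameters has deviance 2016 and AIC 2022, matching the ideal observer
# row of Table 2.
```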

| Model type | D | df | AIC | D^CV | D^val |
|---|---|---|---|---|---|
| Ideal observer | 2016 | 3 | 2022 | 2023 | 488.7 |
| Weight decay prior | 1748 | 166 | 2080 | 2109 | 503.2 |
| Smoothness prior | 1818 | 102 | 2022 | 2025 | 480.9 |
| Sparse prior | 1888 | 38 | 1964 | 1965 | 474.9 |
| Pseudo-ideal observer | 1985 | 4 | 1993 | 1994 | 474.0 |
| Weight decay with signal effect | 1722 | 203 | 2129 | 2134 | 492.3 |
| Smoothness with signal effect | 1680 | 173 | 2026 | 2028 | 461.9 |
| Sparse prior with signal effect | 1863 | 32 | 1927 | 1917 | 456.8 |

| Model type | D | df | AIC | D^CV | D^val |
|---|---|---|---|---|---|
| Ideal observer | 2157 | 3 | 2163 | 2168 | 545.7 |
| Weight decay prior | 1989 | 112 | 2213 | 2218 | 571.4 |
| Smoothness prior | 2051 | 61 | 2174 | 2179 | 556.1 |
| Sparse prior | 2110 | 36 | 2182 | 2161 | 545.3 |
| Pseudo-ideal observer | 2127 | 4 | 2134 | 2140 | 542.9 |
| Weight decay with signal effect | 1980 | 114 | 2207 | 2202 | 569.2 |
| Smoothness with signal effect | 1956 | 106 | 2169 | 2161 | 560.6 |
| Sparse prior with signal effect | 2041 | 45 | 2131 | 2130 | 541.9 |

Models estimated with Gaussian priors recruit many parameters, as can be seen in the *df* column. The decreased deviance is thus overshadowed by a large increase in degrees of freedom.

In the one-blob task, *D*^val favors the smoothness prior over the null hypotheses, but *D*^CV and the AIC do not, which suggests that the models estimated with the smoothness prior are performing sufficiently close to the ideal and pseudo-ideal observers to warrant running more trials. Table 3 shows a similar effect in the four-blob task, as neither the weight decay nor the smoothness prior comes close to rejecting the null hypotheses. In contrast, the model that makes use of a sparse prior supports the idea that the observer is behaving nonideally and nonlinearly, decisively in the one-blob task and to a lesser extent in the four-blob task. Again, looking at the *df* column, it is clear that the sparse prior achieves this by aggressively searching for low-dimensional models.

The *intrinsic dimensionality* of the observer's template, that is, the number of basis coefficients needed to accurately represent it, may be quite low. Sparse priors in bases use this fact to effectively estimate the parameters of seemingly complex models.

As with any prior, these gains hold only insofar as the assumptions *that one uses* are appropriate. If the assumptions are indeed warranted, then a reduction in noise during model estimation can lead to more powerful inference. The Real observer section gives an example of this, where the model fitted with a sparse prior allows us to forcefully infer that the observer is using a strategy that is both nonideal and nonlinear, in contrast to the conclusion obtained via models fitted with a weight decay or a smoothness prior.

In our experience, the particular choice of sparsity-inducing method has little effect on the *quality* of estimated models. In contrast, sparsity-inducing methods do have a decisive edge over Gaussian priors on some problems (David et al., 2007; Seeger et al., 2007; Rosset et al., 2004). Thus, it is advisable to use a sparsity-inducing method in general, while choosing the particular method which is most mathematically convenient and computationally efficient for a given application. Our proposed estimation method is competitive in terms of speed, as explained in the Appendix. For example, the model with the largest design matrix considered in this article (10,000 × 256) took roughly 40 seconds to fit on a recent desktop computer, including 5-fold cross-validation.

Selecting a fixed subset of basis functions in advance performs *hard* dimensionality reduction. In contrast, a sparse prior in a basis decides which basis functions to keep during estimation, yielding *soft* dimensionality reduction.

We thresholded coefficients by *p*-value in the truncated sinusoid basis and visualized the results by projecting the coefficients back onto the pixel basis. Only a fraction of the coefficients (less than 5%) were kept in the process. The resulting estimated templates were similar to the nonthresholded templates, but with less high-frequency noise and improved interpretability. The quality of the truncated reconstruction is likely due to the fact that faces can be represented sparsely in the truncated sinusoid basis, which, like wavelet and pyramid bases, is passband and local.

The method extends naturally to *expanded input spaces*: the stimuli, instead of being described simply as intensity over time or over time and space, are redescribed in terms of redundant features. For example, an auditory stimulus *x*(*t*) can be redescribed using a windowed Fourier transform as a spectrogram *x*(*f*, *t*), which augments the representation with the evolution of the different component frequencies *f* over time. A neuron's response can be modelled as a linear weighting of *x*(*f*, *t*), and the resulting two-dimensional weight pattern is then known as a spectro-temporal receptive field (STRF). The STRF is able to capture a range of nonlinear behaviors, such as time-varying frequency sensitivity.
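A windowed-Fourier expansion of this sort is easy to sketch; in the illustrative code below, the window length, hop size, and windowing function are arbitrary choices, not a prescription.

```python
import numpy as np

def spectrogram_features(x, win=16, hop=8):
    """Redescribe a 1-D stimulus x(t) as windowed-FFT magnitudes |x(f, t)|,
    an expanded input space in which a *linear* weighting can capture
    time-varying frequency sensitivity (an STRF-style description)."""
    frames = []
    for start in range(0, len(x) - win + 1, hop):
        seg = x[start:start + win] * np.hanning(win)   # Hann-windowed segment
        frames.append(np.abs(np.fft.rfft(seg)))
    return np.array(frames)    # shape: (n_time_frames, n_frequencies)

# Pure tone at 0.25 cycles/sample: energy concentrates in one frequency bin
x = np.sin(2 * np.pi * 0.25 * np.arange(128))
S = spectrogram_features(x)
```

Each row of S (flattened across frames) would then enter the design matrix **X** in place of the raw waveform.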

Here *f* is a convex, differentiable function of **w**. This optimization is nontrivial, as ∥**w**∥₁ is nondifferentiable at the axes, where *w*ᵢ = 0 for some *i*. The fixed-point continuation (FPC) algorithm (Hale et al., 2007) solves Equation A1 efficiently by combining two insights. First, the fixed-point iterations

**w** ← shrink(**w** − *τ*∇*f*(**w**), *τλ*),

where *shrink* is the soft-threshold operator

shrink(*x*, *α*) = sign(*x*) max(∣*x*∣ − *α*, 0)

and *τ*, the step size, is a small number, eventually converge to the solution of Equation A1, with each iteration coming at a very moderate cost.
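The soft-threshold operator and the fixed-point iteration are a few lines of code. The sketch below (our own illustration, not the authors' implementation) applies them to a toy problem, *f*(**w**) = ½∥**w** − **b**∥², whose L1-penalized minimizer is known in closed form to be shrink(**b**, *λ*):

```python
import numpy as np

def shrink(x, alpha):
    """Soft-threshold operator: sign(x) * max(|x| - alpha, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - alpha, 0.0)

def fixed_point_l1(grad_f, w0, lam, tau, n_iter=200):
    """Fixed-point iterations w <- shrink(w - tau * grad_f(w), tau * lam)
    for minimizing f(w) + lam * ||w||_1 (the iteration of Equation A2)."""
    w = w0.copy()
    for _ in range(n_iter):
        w = shrink(w - tau * grad_f(w), tau * lam)
    return w

# Toy problem: f(w) = 0.5 * ||w - b||^2, so grad_f(w) = w - b
b = np.array([3.0, 0.5, -2.0, 0.1])
w_hat = fixed_point_l1(lambda w: w - b, np.zeros(4), lam=1.0, tau=0.5)
```

For the classification image model, grad_f would be the gradient of the negative log-likelihood, evaluated in the (normalized) basis.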

The iterations are typically initialized at an arbitrary value of **w**, known as a *cold-start*. The second insight is that once we have found **w*** = argmin_**w** *E*(**w**, *λ*₁), we may use **w*** as a *warm-start* estimate for minimizing *E*(**w**, *λ*₂) when ∣*λ*₁ − *λ*₂∣ / (*λ*₁ + *λ*₂) is small. This suggests solving a series of optimization problems

for a decreasing sequence *λ*₁ > *λ*₂ > … > *λ*_end using the fixed-point iterations (Equation A2), with the result of each optimization used to start the next, a process known as *continuation*. The entire regularization path is generated as part of the process, and it may be used to determine the optimal *λ*.
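Continuation is then a simple outer loop. The sketch below reuses the same toy setup as above, with the shrink operator repeated so the block is self-contained; the lambda sequence is an arbitrary choice.

```python
import numpy as np

def shrink(x, alpha):
    """Soft-threshold operator: sign(x) * max(|x| - alpha, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - alpha, 0.0)

def regularization_path(grad_f, k, lambdas, tau=0.5, n_iter=200):
    """Minimize f(w) + lam * ||w||_1 for a decreasing sequence of lambdas,
    warm-starting each problem at the previous solution (continuation).
    Returns one estimate per lambda."""
    w = np.zeros(k)               # cold-start for the largest lambda
    path = []
    for lam in lambdas:
        for _ in range(n_iter):
            w = shrink(w - tau * grad_f(w), tau * lam)
        path.append(w.copy())     # w is carried over as the next warm-start
    return path

# Toy problem f(w) = 0.5 * ||w - b||^2: the solution at each lambda
# is shrink(b, lam), with more weights active as lambda decreases
b = np.array([3.0, 0.5, -2.0])
path = regularization_path(lambda w: w - b, 3, lambdas=[2.5, 1.0, 0.25])
```

Note how the number of nonzero weights grows monotonically as *λ* decreases, mirroring the regularization paths shown in the main text.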

We refined this scheme in three ways. First, we use an efficient line search over the step size *τ*. Second, we use insights from Park and Hastie (2007) to find good values of *λ* at which to sample the regularization path. Finally, we use a blockwise implementation of cross-validation which saves iterations when *λ*_optimal is unknown and large.

To motivate the fixed-point iterations, consider, for a fixed *λ*, gradient descent on *E* with respect to **w**:

**w** ← **w** − *τ*(∇*f*(**w**) + *λ* sign(**w**)),

where *τ* is some small number. Such iterations will reduce the objective function *E* for a sufficiently small *τ* as long as we avoid the discontinuities in the derivatives of the penalty. One way to avoid these discontinuities is simply to set *w*ᵢ to 0 as it attempts to pass through the origin. But if *w*ᵢ = 0 before the iteration, for any *τ* > 0 the weight leaves the axis and the derivative of the penalty changes. This suggests evaluating the derivative of the penalty at **w** − *τ*∇*f* instead of at **w**. We thus obtain

**w** ← shrink(**w** − *τ*∇*f*(**w**), *τλ*),

whose iterates converge to the minimizer of *E*(**w**, *λ*) (Hale et al., 2007).

The fixed-point iterations converge for a fixed *τ* as long as *τ* < *τ*_max = 2 / maxᵢ **Λ**ᵢᵢ, where **QΛQ**ᵀ = **H** is an eigendecomposition of the matrix of second derivatives of the negative log-likelihood with respect to the weights, the Hessian **H**. Note that this is the same condition as in gradient descent. The fixed-point iteration's relationship with gradient descent hints that convergence may be substantially faster if *τ* is chosen optimally on every iteration, that is, to minimize *E* as a function of *τ*. Unfortunately, a naive line search over *τ* involves multiple products of the form **Xw** and repeated computations of *E*, which are expensive; hence, overall, a naive line search does not yield a large performance improvement over a fixed *τ*. Here we propose a line search algorithm which, although rather involved, is much more efficient than a naive line search.

Viewed as a function of the step size, *E*(*τ*) has a peculiar form: its derivatives of all orders with respect to *τ* are continuous outside of a finite number of "cutpoints" *τ*₀ = 0 < *τ*₁ < *τ*₂ < … < *τ*ₙ, which occur when a weight enters or leaves the model. Thus, we approximate *E*(*τ*) as a piecewise quadratic, convex function, whose minimum is inexpensively found.

Define *η*(**w**) ≡ **Xw** + **Uu** and **w**(*α*) = shrink(**w**⁰ − *α***g**, *αλ*), where **g** = ∇*f*(**w**⁰). We may Taylor-expand *f* as a function of *α* up to second order, which by the chain rule gives

*f*(*α*) ≈ *f* + *α* (∂*f*/∂*η* · ∂*η*/∂*α*) + ½*α*² (∂²*f*/∂*η*² · (∂*η*/∂*α*)²),

where the derivatives are evaluated at *α* = *ε*, with *ε* > 0 arbitrarily small, to give right-sided derivatives at *α* = 0. This approximation is valid for 0 < *α* < *α*_max, assuming no weights enter or leave the model in this range. The derivatives of *f* with respect to *η* are straightforward to compute; they also occur in the iteratively reweighted least-squares (IRLS) algorithm often used to fit GLMs (Wood, 2006). The derivative of *η* is given by

∂*η*/∂*α* = −**X**[(**g** + *λ* sign(**w**(*ε*))) ∘ (**w**(*ε*) ≠ 0)],

where (**w**(*ε*) ≠ 0) is a vector with value 1 if *w*ᵢ(*ε*) ≠ 0 and 0 otherwise, and ∘ denotes element-wise multiplication. Similarly, the regularizer *R* = *λ*∥**w**∥₁ is exactly linear in *α* between cutpoints, with slope

∂*R*/∂*α* = *λ* ∑ᵢ sign(*w*ᵢ(*ε*)) ∂*w*ᵢ/∂*α*.

Summing the expansions of *f* and *R* yields a quadratic approximation *E*(*α*) ≈ *A* + *Bα* + ½*Cα*², with *C* > 0, and *E* has a minimum at *α*_min = −*B*/*C*. Thus, either 0 < *α*_min < *α*_max, in which case we have found a minimum for *E*, or not, in which case the minimum of *E* is located elsewhere and the approximations are no longer valid.

Since each quadratic piece approximates *E*, it follows that *E* is approximately piecewise quadratic, continuous, and convex away from the cutpoints. At a cutpoint *τ*ᵢ, *E*(*τ*ᵢ) has a local maximum if and only if the left-sided derivative of *E* at *τ*ᵢ is positive and the right-sided derivative is negative; it is straightforward to show that this is never the case. We conclude that *E*(*τ*) is piecewise quadratic, continuous, and convex, and hence it has a single minimum in the range 0 < *τ* < ∞. This suggests attempting to find a minimum of *E* between 0 < *τ* < *τ*₁. If the minimum is not in that range, we search for the minimum in the range *τ*₁ < *τ* < *τ*₂ using a new Taylor expansion at *E*(**w**(*τ*₁)), and so forth until we find the minimum of *E*. This gives the algorithm in Table A1.

• Compute the ordered set of cutpoints of *E*(*τ*), giving *τ*₀ = 0 < *τ*₁ < *τ*₂ < … < *τ*ₙ₋₁ < *τ*ₙ = ∞
• For *i* from 1 to *n*:
  • If *i* = 1: compute *η*, *f*, ∂*f*/∂*η*, ∂²*f*/∂*η*², ∂*η*/∂*α*, and ∂*R*/∂*α*, evaluated at *α* = *ε*
  • Else:
    • Δ*α* ← *τ*ᵢ₋₁ − *τ*ᵢ₋₂
    • *η* ← *η* + Δ*α* ∂*η*/∂*α*
    • *f* ← *f* + Δ*α* (∂*f*/∂*η* · ∂*η*/∂*α*) + ½Δ*α*² (∂²*f*/∂*η*² · (∂*η*/∂*α*)²)
    • ∂*f*/∂*η* ← ∂*f*/∂*η* + Δ*α* ∂²*f*/∂*η*² ∂*η*/∂*α*
    • Update ∂*η*/∂*α* and ∂*R*/∂*α*
  • Compute *α*_min
  • If 0 < *α*_min < (*τ*ᵢ − *τ*ᵢ₋₁): *τ*_min = *τ*ᵢ₋₁ + *α*_min; exit
  • Else if *α*_min < 0: *τ*_min = *τ*ᵢ₋₁; exit
• End For

The updates for *i* ≥ 2 are rather inexpensive, save for updating ∂*η*/∂*α*, which changes only in the single index *j* corresponding to the weight which enters or leaves the model. This implies that only one column of **X** needs to be accessed at each cutpoint. Moreover, *η* is saved from the previous line search, which avoids computing a product of the form **Xw**. To compute *f* and its derivatives, roughly *N* logarithms, *N* exponentials, and a few element-wise products of vectors of length *N* must be performed, where *N* is the number of trials. The line search thus avoids computing **Xw** and *f* repeatedly, and is much less expensive than a naive line search.

The outer loop decreases *λ*. After each set of inner iterations, **u** is reoptimized and a smaller value of *λ* is chosen. The next value of *λ* is set to the largest *λ* for which a new weight should appear in the model. Note that this estimate is accurate when *λ*_new is continually chosen to be very close to the current *λ*, and inaccurate when *λ*_new is far from the current *λ*. We thus bracket *λ*_new to be neither too close to nor too far from *λ*_current:

*α*_min *λ*_current ≤ *λ*_new ≤ *α*_max *λ*_current, with *α*_min = 0.9 and *α*_max = 0.98.

Optimization stops when *λ*_next < *ελ*_start for some user-defined value of *ε*, which is set, by default, to 10⁻³.

The number of *λ* values considered should be minimized within reason to obtain an efficient algorithm. To ensure that fitting is done over a tight range of *λ* values, we implemented an efficient version of *k*-fold cross-validation, which we call blockwise cross-validation. Rather than fitting each fold from 10⁻³*λ*_start < *λ* < *λ*_start in turn, we first fit the model for all folds from 0.7*λ*_start < *λ* < *λ*_start and compute the cross-validated deviance. If a minimum is found, we stop. If not, we restart the fitting process from 0.5*λ*_start < *λ* < 0.7*λ*_start, and so on, until either 10⁻³*λ*_start is reached or a minimum of the cross-validated deviance is found. All information related to fitting is saved in a structure after each block; hence, restarting an optimization has little overhead. We determine that a minimum has been found by requiring that the cross-validated deviance at the minimum *λ* probed so far be higher than the minimum cross-validated deviance by a certain number of units, by default 10. Finally, we fit the full model down to the minimum *λ* found by cross-validation.

This procedure finds the minimum of *D*^CV(*λ*) between *λ*_start and 10⁻³*λ*_start. While we cannot rule out the possibility that *D*^CV(*λ*) has several non-shallow local minima, this does not appear to be an issue in practice; *D*^CV(*λ*) is typically quite smooth and almost quadratic in shape near its minimum, and we have not observed non-shallow local minima in any of the fits performed for the purposes of this article. In the pathological scenario where *D*^CV(*λ*) has several non-shallow local minima, the cross-validation algorithm will select the one associated with the largest *λ*, or the largest level of regularization, which we consider to be a safe fallback.

A subtlety is that there is no guarantee that given values of *λ* across different folds correspond to the same level of regularization. Tibshirani (1996) and Park and Hastie (2007) suggest that corresponding regularization levels are found when the ratio *λ*/*λ*_max^fold is the same across folds. We thus perform cross-validation to find the optimal ratio *λ*/*λ*_max^fold rather than the optimal *λ*. Since the regularization paths are sampled differently across folds, we linearly interpolate cross-validated deviance values at all *λ*/*λ*_max^fold values used by any fold.

Computing the gradient ∇*f*, an O(*mn*) operation, typically accounts for 50%–90% of the time spent during optimization; roughly 5% is accounted for by miscellaneous overhead; and the remaining time is spent in the line search. The ratio of inner iterations to outer iterations varies with *λ*, being typically equal to 1 for large *λ* and 2–4 for small *λ*. Given the algorithm for determining the next *λ*, and assuming we evaluate the model at *λ* values from 0.001*λ*_start to *λ*_start, the number of outer iterations is bounded above by log(0.001)/log(0.98) ≈ 340. Given typical ratios of inner iterations to outer iterations, this might come out to about 600 evaluations of ∇*f*. This number can be cut down by a factor of 2 or 3 by stopping iterations once *λ* reaches beyond the *λ*_optimal determined by cross-validation; our software includes a cross-validation implementation, outlined above, that accomplishes this. In total, perhaps 300 evaluations of ∇*f* may be required in a typical application; with overhead, this may balloon up to a cost equivalent to computing ∇*f* 500–600 times, multiplied by (*k* + 1) during *k*-fold cross-validation. This compares favorably with boosting, and in some cases may be appreciably faster, as the cost of each boosting iteration is equal to the cost of computing ∇*f*, in addition to some overhead such as computing *f* and performing a line search in some variants (Buhlmann & Hothorn, 2007).

The design matrix **X** is kept in memory. With a 64-bit version of Windows or Linux and 4 GB of RAM, one can thus expect to be able to work with design matrices as large as 25,000 × 5,000 (1 GB in memory) before running into out-of-memory errors. For the largest design matrices tested here, which are of size 10,000 × 256, optimization including 5-fold cross-validation takes about 40 seconds on our test computer, which has 4 Intel Xeon CPUs running at 2 GHz and 3 GB of RAM.

**y**, **X**, and **U** are the response and design matrices, and stopcrit is the fraction of *λ*_{start} after which to stop optimization. folds is a matrix of booleans which defines which portion of the data is assigned to the fit set versus the validation set for each cross-validation fold. Such a matrix may be generated by the included auxiliary function getcvfolds. A structure is returned in both cases which contains the entire regularization path, deviances, AIC values, the values of *λ* used, and, if applicable, cross-validated deviances and **w** and **u** at the optimal value of *λ*. In addition to the binomial/logit model presented here, the software supports two other GLMs: a Poisson model with exponential inverse link, for count data (e.g., binned spike trains), and a Gaussian noise model with identity link (i.e., least squares). These may be invoked through supplementary arguments whose calling sequence is available from the Matlab command-line help.
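The fold matrix can be illustrated with a hypothetical Python analogue of the toolbox's getcvfolds helper (the Matlab original is not reproduced here; this sketch only shows the intended structure of the boolean masks):

```python
import random

def get_cv_folds(m, k, seed=0):
    """Return k boolean masks of length m. Fold j's mask is True for
    trials held out for validation and False for trials used to fit.
    A hypothetical Python analogue of the toolbox's getcvfolds."""
    rng = random.Random(seed)
    idx = list(range(m))
    rng.shuffle(idx)
    folds = []
    for j in range(k):
        held_out = set(idx[j::k])          # every k-th shuffled trial
        folds.append([i in held_out for i in range(m)])
    return folds

folds = get_cv_folds(m=10, k=5)
```

Each trial appears in exactly one validation set, so the k validation sets partition the data as the text requires.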

*proper* is defined as *D* = *D**·*ϕ*, where *ϕ* is a scale parameter that depends on the noise distribution used. For the binomial distribution, *ϕ* = 1 (Wood, 2006); hence, we use the terms deviance and scaled deviance interchangeably.

*L* is the negative log-likelihood of the fitted model, while *L*_{max} is the negative log-likelihood for a saturated model which contains one parameter per data point *y*_{i}. For the logistic regression model used here, *L*_{max} = 0, although this is not always the case, for example with Poisson regression. By construction, *D* ≥ 0; a small *D* implies a good fit to the data, while a large *D* implies a poor fit.
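For the binomial/logit case the deviance therefore reduces to twice the negative log-likelihood, since *L*_{max} = 0. A minimal sketch with made-up predicted probabilities:

```python
import math

def binomial_deviance(y, p):
    """D = 2*(L - L_max), with L the negative log-likelihood of the fit.
    For binary y the saturated model predicts each trial perfectly,
    so L_max = 0 and D = 2*L."""
    L = -sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
             for yi, pi in zip(y, p))
    return 2.0 * L

y = [1, 0, 1, 1]
p_good = [0.9, 0.1, 0.8, 0.7]   # probabilities close to the outcomes
p_poor = [0.5, 0.5, 0.5, 0.5]   # chance-level predictions
```

As expected, the chance-level predictions yield the larger deviance.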

*k*-fold cross-validation works by splitting the data into *k* randomly chosen nonoverlapping partitions of equal size, fitting the model with the data in all but the *i*th partition, computing the likelihood or deviance of the data in the *i*th partition, and repeating the process for *i* = 1…*k*. This yields a cross-validated deviance score *D*^{CV}. *D*^{CV} is a measure of how well a model predicts out-of-sample observations and therefore estimates its generalization performance.
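The procedure can be illustrated with an intercept-only logistic model, where the "fit" on each training set is simply the mean response. This is an illustrative stand-in for the full sparse GLM fit, and the folds are contiguous rather than random to keep the sketch deterministic:

```python
import math

def deviance(y, p):
    # Binomial deviance, 2 * negative log-likelihood (L_max = 0),
    # for a single predicted probability p applied to all trials.
    return -2.0 * sum(yi * math.log(p) + (1 - yi) * math.log(1 - p)
                      for yi in y)

def cv_deviance(y, k):
    """Sum held-out deviances over k contiguous folds."""
    m = len(y)
    d_cv = 0.0
    for i in range(k):
        test = y[i * m // k:(i + 1) * m // k]
        train = y[:i * m // k] + y[(i + 1) * m // k:]
        p_hat = sum(train) / len(train)          # intercept-only "fit"
        p_hat = min(max(p_hat, 1e-6), 1 - 1e-6)  # guard the logarithms
        d_cv += deviance(test, p_hat)
    return d_cv

y = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
D_cv = cv_deviance(y, k=5)
```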

**y** under the model is approximately given by the cross-validated likelihood of the data (Geisser, 1975; Geisser & Eddy, 1979). Hence, the relative likelihood of two models *M*_{1} and *M*_{2}, the pseudo-Bayes factor, is approximately given by *PSBF*(*M*_{1}, *M*_{2}) = exp[(*D*_{2}^{CV} − *D*_{1}^{CV})/2]. *PSBF*(*M*_{1}, *M*_{2}) > 150, corresponding to *D*_{2}^{CV} − *D*_{1}^{CV} > 10, is "very strong" (Kass & Raftery, 1995) or "decisive" (Jeffreys, 1961) evidence for *M*_{1} over *M*_{2}.
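Under this approximation, the correspondence between a cross-validated deviance advantage of 10 units and a pseudo-Bayes factor near 150 can be checked directly (assuming PSBF = exp(ΔD^{CV}/2); the deviance values are hypothetical):

```python
import math

def psbf(d_cv_1, d_cv_2):
    # Pseudo-Bayes factor for model 1 over model 2, approximated from
    # cross-validated deviances: exp((D2_cv - D1_cv) / 2).
    return math.exp((d_cv_2 - d_cv_1) / 2.0)

# A 10-unit deviance advantage gives a factor of roughly 148, i.e.
# approximately the 150 threshold for "very strong"/"decisive" evidence.
factor = psbf(100.0, 110.0)
```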

*D*^{CV} can be large; its value depends on the exact choice of folds. Therefore, when differences in the cross-validated deviances between two models are relatively small, for example less than 10 units, we do not recommend concluding that either model is better based on these scores; lower variance estimates of *D*^{CV} can be obtained by repeatedly computing *D*^{CV} for different folds.

**w**. For a model with a Gaussian prior, on the other hand (Wood, 2006), *df* = *tr*[(**H** + *λ***I**)^{−1}**H**], where **H** denotes the Hessian of the negative log-likelihood and *tr* the trace. Note that this reduces to *n*, the number of parameters in the model, as *λ* → 0. Given *df* for the GLM with either a sparse or Gaussian prior, we may define an AIC (Akaike Information Criterion) analogue (Wood, 2006): AIC = *D* + 2*df*.

AIC(*λ*) is discontinuous and has several shallow minima. However, the AIC may be useful in comparing our results with other GLMs with binomial endpoints that, for practical reasons, avoid cross-validation (Knoblauch & Maloney, 2008b; Ross & Cohen, 2009). We note that the AIC is equivalent, up to a linear transformation, to the unbiased risk estimator (UBRE) used in Knoblauch and Maloney (2008b). As with cross-validated deviance, a lower AIC is better, and large differences in AIC values between two models are interpreted as support for the model with the lower AIC. Since the AIC is based on asymptotic results, its validity is dubious when low numbers of trials are used. In cases where the AIC and the cross-validated deviance point towards incompatible conclusions, we recommend averaging the cross-validated deviance over several different random choices of folds and basing conclusions on this low-variance cross-validated deviance estimate.
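Assuming the standard ridge-type form df = tr[(**H** + λ**I**)⁻¹**H**] for a Gaussian prior (the exact penalty matrix depends on the prior; this sketch uses a diagonal Hessian with hypothetical curvatures so the trace is a simple sum):

```python
def effective_df(h_diag, lam):
    """tr[(H + lam*I)^(-1) H] for a diagonal Hessian H = diag(h_diag):
    each parameter contributes h / (h + lam) effective degrees of freedom."""
    return sum(h / (h + lam) for h in h_diag)

def aic(deviance, df):
    # AIC analogue: deviance plus twice the effective degrees of freedom.
    return deviance + 2.0 * df

h = [4.0, 2.0, 1.0, 0.5]                 # hypothetical per-parameter curvatures
df_unregularized = effective_df(h, 0.0)  # = n = 4, recovering the lam -> 0 limit
df_regularized = effective_df(h, 2.0)    # shrinks as lambda grows
score = aic(90.0, df_regularized)        # hypothetical deviance of 90
```

Stronger regularization lowers the effective degrees of freedom and hence the complexity penalty, mirroring how the prior trades fit against model complexity.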

_{1} nested inside a more complex GLM model *M*_{2}: *D*_{1} − *D*_{2} is asymptotically distributed as *χ*^{2} with *df*_{2} − *df*_{1} degrees of freedom, and a *p*-value may be obtained from this expression. This is known as a log-likelihood ratio test. Again, the *χ*^{2} approximation is based on asymptotic results and has dubious validity for a small number of trials (Wood, 2006). It is also important to keep in mind that the test is only valid for *nested* models. We recommend the use of log-likelihood ratio tests when the importance of a single predictor or an ensemble of related predictors is in question. For nonnested models, Vuong's test may be used (Vuong, 1989).
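For two nested models differing by exactly two effective parameters, the *χ*^{2} survival function has a closed form, which makes the test easy to sketch without a statistics library (the deviance values below are made up):

```python
import math

def lr_test_p_df2(d_simple, d_complex):
    """p-value of the log-likelihood ratio test when df2 - df1 = 2:
    D1 - D2 ~ chi-squared with 2 degrees of freedom, whose survival
    function is exp(-x/2)."""
    stat = d_simple - d_complex    # deviance drop from adding predictors
    return math.exp(-stat / 2.0)

# Hypothetical deviances: the richer model fits better by 12 units,
# giving p = exp(-6), i.e. about 0.0025 -- a significant improvement.
p = lr_test_p_df2(d_simple=112.0, d_complex=100.0)
```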