We describe an information-theoretic framework for fitting neural spike responses with a Linear–Nonlinear–Poisson cascade model. This framework unifies the spike-triggered average (STA) and spike-triggered covariance (STC) approaches to neural characterization and recovers a set of linear filters that maximize mean and variance-dependent information between stimuli and spike responses. The resulting approach has several useful properties, namely, (1) it recovers a set of linear filters sorted according to their informativeness about the neural response; (2) it is both computationally efficient and robust, allowing recovery of multiple linear filters from a data set of relatively modest size; (3) it provides an explicit “default” model of the nonlinear stage mapping the filter responses to spike rate, in the form of a ratio of Gaussians; (4) it is equivalent to maximum likelihood estimation of this default model but also converges to the correct filter estimates whenever the conditions for the consistency of STA or STC analysis are met; and (5) it can be augmented with additional constraints on the filters, such as space–time separability. We demonstrate the effectiveness of the method by applying it to simulated responses of a Hodgkin–Huxley neuron and the recorded extracellular responses of macaque retinal ganglion cells and V1 cells.

*all*stimulus vectors (Bialek & de Ruyter van Steveninck, 2005; Bialek et al., 1991; de Ruyter van Steveninck & Bialek, 1988; Simoncelli, Paninski, Pillow, & Schwartz, 2004). Figure 1C shows an example where we have reduced the stimulus to a 1D subspace by linear projection (i.e., by filtering the stimulus with a single linear filter). Within this subspace, the mean of the STE is significantly higher and its variance is significantly lower than that of the raw stimulus ensemble. These differences indicate that position along this axis in stimulus space (i.e., the response of this linear filter) carries information about the probability that the neuron will spike.

*P*(spike|

**x**): the probability that a neuron will elicit a spike in response to a stimulus

**x**. As illustrated in Figures 1C and D, we can compute this probability directly using Bayes rule:

*α*is a constant proportional to the probability that a spike occurs,

*P*(spike). The encoding model can therefore be computed as the ratio of two probability distributions, and in the simple cases (e.g., Figure 1), we can estimate it directly as a quotient of two histograms.

*P*(

**x**|spike) and

*P*(

*x*) do not differ except within a relatively low dimensional subspace. Our first step is thus to find the subspace that best captures these differences. We formalize this as the search for a matrix

**B**for which the true conditional probability of spiking is closely approximated by the conditional probability within the subspace spanned by the columns of

**B**:

**B**); (2) a nonlinear combination rule, which converts the filter outputs (

**B**

^{ T}

**x**) to an instantaneous probability of spiking; and (3) inhomogeneous Poisson spiking.

*P*(

**x**) has zero mean, then the STA,

*P*(

**x**|spike) and

*P*(

**x**) differ most. Similarly, the STC,

*P*(

**x**|spike) and

*P*(

**x**) differ maximally.

**B**for a reduced-dimensional model of the neural response consisting of the STA (if it differs significantly from zero) and the eigenvectors of the STC whose associated eigenvalues differ significantly from the variance of the raw ensemble. We will refer to this latter group as the “significant” eigenvectors of the STC or simply “STC axes.” For spike responses generated by an LNP model, the STA and STC axes converge asymptotically to span the correct subspace (i.e., the subspace associated with

**B**) if the raw stimulus distribution

*P*(

**x**) is Gaussian and the instantaneous nonlinearity induces a change in the mean, variance, or both along each dimension of this subspace (Bialek & de Ruyter van Steveninck, 2005; Bussgang, 1952; Paninski, 2003).

*P*(

**x**|spike), can therefore be written as

*n*is the dimensionality of the stimulus space.

*P*represents the distribution of the raw stimuli.

*Q*and

*P*are both Gaussian;

*P*is zero mean and has identity covariance or can be made so by subtracting the mean and “whitening” the stimulus space according to

**x**=

*Λ*

_{0}

**x**

_{0}−

*μ*

_{0}), where

*μ*

_{0}and

*Λ*

_{0}are the mean and covariance of the original stimulus distribution

*P*(

**x**

_{0}). Under these assumptions, Equation 6 reduces to

*P*and

*Q*within a given subspace is given by:

**B**is a matrix whose

*m*columns form an orthonormal basis for the subspace.

**B**that maximizes Equation 8. If we want a 1D subspace, this objective function reduces to

**b**(the most informative filter) is a unit vector.

**b**

^{ T}

*μ*) and standard deviation (

**b**was a 20-dimensional vector resembling the (biphasic) temporal profile of a retinal receptive field. Simulations were performed using three different point nonlinearities ( Figure 3, top row), each of which produces a change in both mean and variance in the STE. A half-wave rectified linear function (left column) shifts the mean of the STE and reduces its variance relative to the raw stimuli, meaning that both the STA and the low-variance STC axis provide consistent estimates for

**b**. For the sigmoidal and quadratic nonlinearities (center and right columns, respectively), the STE has shifted mean and increased variance relative to the raw stimulus; hence, the STA and the large-variance STC axes can provide consistent estimates for

**b**. The bottom row of plots in Figure 3 shows the convergence of the STA, STC, and iSTAC estimates as a function of the experiment duration. The iSTAC estimate was computed by directly maximizing Equation 9 for

**b**, which gives the 1D subspace maximizing the KL divergence between

*P*and

*Q*. This optimization took less than 1 s for each estimate and does not depend on the amount of data, apart from the additional time required to compute the STA and STC. Although STA, STC, and iSTAC estimates all converge (i.e., are statistically consistent) for these examples, iSTAC exhibits superior performance in all three cases, due to its sensitivity to information in both mean and covariance.

^{8}time samples and roughly 10

^{6}spikes. A substantial number of eigenvalues lie above or below 1, indicating that the computation performed by the HH neuron on its input is multidimensional (Aguera y Arcas et al., 2003).

*m,*we computed the optimal information-preserving subspace by maximizing Equation 8 for

**B**(a 100 ×

*m*matrix), constraining the columns of

**B**to be orthonormal (see 1 for details of the optimization procedure). Figure 4C shows the KL divergence between the two distributions projected into this optimal subspace, as a function of

*m*. Although the large number of “significant” eigenvalues ( Figure 4A) reveals that the HH computation is high dimensional, the information analysis indicates that total information saturates quite rapidly with dimensionality. Specifically, as shown in Figure 4C, a model using just two linear filters captures 94% of the information available.

*k*-dimensional subspace is given by the collection: {1, 2, …,

*k*}. The first two iSTAC vectors closely resemble the first two high-variance eigenvectors of the STC. The third iSTAC vector resembles the STA (orthogonalized with respect to the first two), and the remaining vectors closely match the remaining eigenvectors of the STC (with high-variance vectors preceding the low-variance vectors in importance). Note, however, that this ordering was not obvious a priori. Other V1 cells from the same data set reveal a variety of orderings: In some cells, the STA carries more information or less information than all other filters, and in some, the low-variance axes carry more information than all high-variance axes.

*k*+ 1)th dimension and the previous

*k*dimensions. Such correlations are important when the STA is not geometrically aligned with the eigenvectors of the STC matrix. For example, we often find that the second iSTAC filter carries less information by itself, as compared with the third or fourth filters, but gives rise to the most informative 2D subspace when grouped with the first.

*k*-dimensional subspace. Moving from a 1D to a 2D representation increases information nearly as much as moving from a zero-dimensional to a 1D representation (increases of 0.27 and 0.24 bits, respectively), but moving to larger dimensional subspaces does not contribute nearly as much additional information, as illustrated in Figure 5G. Note that although information continues to increase as a function of subspace dimensionality, this increase can be attributed to data limitations. The covariance matrix

*Λ*is an

*estimate*of the true covariance, computed from a finite set of samples. These samples have, by chance, slightly smaller or greater variance than the raw stimuli along most dimensions, resulting in an apparent increase in information with each dimension included in the model. This same phenomenon is responsible for the spread of “nonsignificant” eigenvalues around 1 in Figure 5B.

*m*> 8, the amount added does not exceed the 95% confidence level for the information increase we would expect due to statistical error in estimation of mean and covariance with this number of samples, and thus we conclude that the cell's response is captured by an eight-dimensional model.

*x*to a feature vector

*x** =

**B**

^{ T}

*x,*where

**B**is a basis for the feature space. For a complete model, we also need a mapping from the feature vector

*x** to the probability of observing a spike,

*P*(spike|

*x**). In low-dimensional (1D or 2D) spaces, the nonlinearity can be estimated by computing the quotient of (e.g., histograms of) the densities

*P*(

*x**|spike) and

*P*(

*x**) (see Figures 1C and D). For higher dimensional feature spaces, however, this is infeasible due to the difficulty of estimating densities.

*Q*(

*x*) is a Gaussian density with mean and covariance matching that of the spike-triggered ensemble,

*P*(

*x*) is the prior distribution over raw stimuli, and

*α*is a proportionality constant equal to

*P*(spike) =

*n*

_{sp}/

*n*

_{stim}. Reducing dimensionality by a linear projection onto

**B**preserves Gaussianity of both numerator and denominator distributions; thus, the reduced-dimensional model of the neural response, specified in feature space, is given by

**B**

^{ T}

*μ*and covariance

**B**

^{ T}

*Λ*

**B**. A bit of algebra reduces this to an exponential form:

*M*=

*I*-

^{−1}) and

*b*=

^{−1}

**b**

_{ j}

*,*and the nonlinearity was computed using the ROG (blue) and a ratio of densities estimated using histograms (dotted black; Chichilnisky, 2001). Note that, for each plot, the numerator of the ROG model is a Gaussian with mean (

**b**

^{T}

*) and variance (*

_{j}μ**b**

^{T}

_{j}Λ**b**

_{j}), whereas the histogram estimate is computed using the 1D projection of the stimuli: {

**b**

^{T}

*}. The model is seen to be equally adept at describing the asymmetric, symmetric excitatory, and symmetric suppressive behaviors found along different axes.*

_{j}x*P*and

*Q*have identical covariance but differ in mean; Paninski, 2004) and, as shown here, STC analysis corresponds to ML estimation when the nonlinearity is the ratio of two zero-mean Gaussians (

*P*and

*Q*have the same mean but differing covariance). Thus, iSTAC analysis generalizes the optimality of STA and STC. It is also consistent and unbiased under the same conditions as STA and STC analysis, meaning that it converges to the correct subspace whenever the raw stimulus is Gaussian and the nonlinearity affects STE's mean, variance, or both. The ROG description also provides an important litmus test for the possible suboptimality of these moment-based approaches. If the estimated nonlinearity is poorly fit by an ROG, there may be a significant statistical advantage to dimensionality-reduction techniques that are sensitive to higher order moments (e.g., Paninski, 2003; Sharpee et al., 2004).

**h**and a spatial filter

**g**:

*n*

_{ h}and

*n*

_{ g}are the number of spatial and temporal elements of the raw stimulus, respectively. Note that this greatly reduces the number of filter parameters from that of stimulus dimensionality

*n*(equal to

*n*

_{ h}×

*n*

_{ g}) to

*n*

_{ h}+

*n*

_{ g}. By stacking the columns of

**hg**

^{ T}to form a single column vector,

**L**

_{ h}is an

*n*×

*n*

_{ g}block-diagonal matrix, with each block given by the column vector

**h**.

**h**that preserves maximal information about the response. Filtering each spatial element of the stimulus with

**h**is equivalent to projecting each stimulus

*x*onto the columns of

**L**

_{ h}; this operation produces an

*n*

_{ g}-dimensional vector,

**L**

_{ h}

^{ T}

*x,*with one dimension for each spatial element of the stimulus. From this derivation, it is obvious that

**L**

_{ h}is a special form of the dimensionality-reducing matrix

**B**that we considered previously. Therefore, we can find the most informative

**h**simply by maximizing KL divergence, using

**L**

_{ h}in place of

**B**in Equation 8. Note that we could not directly estimate such a temporal filter using STA or STC analysis: Although we could find a space–time separable fit to either the STA or the eigenvectors of the STC, this provides no unique solution nor does it combine information from all of the filters (STA and the significant eigenvectors) simultaneously. The more general information-theoretic estimators also cannot be tractably applied to this problem. The matrix

**L**

_{ h}reduces dimensionality to an

*n*

_{ g}-dimensional feature space, which is too high-dimensional (assuming the stimulus contains more than a few spatial elements) for estimating mutual information directly.

**h**

_{ i}is inserted into a dimensionality-reducing matrix

**L**

_{ hi}

*,*and the concatenation of these matrices, [

**L**

_{ h1}…

**L**

_{ hk}], preserves more dimensions of the original stimulus space and can be inserted into Equation 8 in place of

**B**. Figure 8 shows an application of this approach to the RGC shown in Figures 6 and 7. Figure 8A shows the two most informative temporal filters, which were found to be sufficient for preserving the information about the response. We then performed an identical analysis to find a set of maximally informative spatial filters (i.e., by exchanging the roles of

**h**and

**g**in the previous analysis) and found that four filters were required to preserve spatial information about the response, as shown in Figure 8B.

*constrained*iSTAC filters; these should resemble the original filters but obey the additional constraint that they are composed only of the spatial and temporal filters obtained from the space–time separable analysis. This constraint implies that each filter

**b**can written as

*i*temporal filter with the

*j*th spatial filter, and {

*w*

_{ ij}} is a set of linear weights, which we fit by maximizing KL divergence. Figure 8C shows the set of constrained iSTAC filters obtained for this cell, each of which is constructed using the spatial and temporal components shown in Figures 8A and B and a set of weights. For comparison, we can plot these alongside the original filters obtained with iSTAC analysis ( Figure 6). Figures 8D and E show temporal and spatial sections of the original iSTAC filters (blue) and the space/time-constrained filters (red), indicating basic agreement between the two methods. The constrained filters, however, are much smoother and require far fewer parameters to describe: A set of five 300-element filters (20 temporal × 15 spatial dimensions) was reduced to a set of filters constructed from two temporal filters and four spatial filters and a set of weights (eight for each of the five filters), resulting to a reduction from 1,500 to 140 parameters.

- When the STC is the identity matrix, iSTAC recovers the same subspace as STA. The proof is simple: If STC
*Λ*is the identity matrix, then the first two terms in Equation 8 are constant, and we obtain maximal KL divergence by taking**B**as a unit vector proportional to*μ,*the STA. Note that in this case, all information is captured by this 1D subspace; hence, there is no advantage to using a higher dimensional feature space. - When the STA is 0, iSTAC recovers the same subspace as the significant eigenvectors of the STC. To prove this, let
**A**=**B**^{ T}*Λ***B**, and the KL divergence reduces to Tr[**A**] − log|*A*| plus a constant. Note that the first term of this expression is the sum of the eigenvalues of**A**and that the second is the negative sum of the log eigenvalues of**A**. These eigenvalues, in turn, represent the variance of the STE along each major axis preserved by the basis**B**. If we diagonalize*Λ*using its eigenvectors, it is easy to show that the function is maximized by setting**B**to contain the eigenvectors of*Λ*for which the corresponding eigenvalues*σ*_{ i}are greater than or less than 1. The information contributed by each eigenvector is equal to*σ*_{ i}− log(*σ*_{ i}) − 1, which is the function plotted in Figure 2B, and monotonically increases as*σ*_{ i}moves away from a value of 1. This means that extrema (high or low eigenvalues) of the STC will also be maxima in the iSTAC analysis, and thus, the iSTAC basis will be the same as the STC basis. Moreover, if we wish to preserve only*j*axes of the stimulus space, the most informative*j*-dimensional subspace is generated by the eigenvectors whose corresponding eigenvalues*σ*_{ i}give the*j*largest values of*σ*_{ i}− log(*σ*_{ i}).

*Λ*+

*μμ*

^{ T}) and that the second term is the sum of the log eigenvalues of

*Λ*within the subspace preserved by

**B**. A simple intuition (verified by hand inspection in low dimensions) suggests that this objective function has only

*n*local maxima and/or saddle points on the unit hypersphere, which are intermediate between the eigenvectors of (

*Λ*+

*μμ*

^{ T}) and

*Λ*.

**B**incrementally, starting with the maximally informative 1D basis and adding columns so that KL divergence is maximized for each dimensionality. The optimal

*k*-dimensional basis includes the (

*k*− 1)-dimensional basis provided by the previous step of the algorithm; thus, it is possible to think of the optimal basis as consisting of the first

*k*vectors from a fixed, ordered set (as shown in Figures 5, 6, and 8). To ensure that the optimization converges to the true global optimum at each step, we use several initialization points, selected from a set of the significant eigenvectors of

*Λ*and (

*Λ*+

*μμ*

^{ T}).

*k*examines whether the incremental information that arises from increasing dimensionality from

*k*− 1 to

*k*is significantly above that expected from random sampling. To quantify the latter, we performed 1,000 bootstrap resamplings of the STE by randomly time shifting the spike train relative to the stimulus (removing stimulus dependence of the response but preserving spike train statistics) and computed the STA and STC of the shifted samples. We then computed the KL divergence of the most informative

*k*-dimensional subspace while setting the mean and covariance in the first

*k*− 1 dimensions to be those given by the true STA and STC. We use these 1,000 estimates to generate an empirical distribution of the incremental information provided by the

*k*th dimension and compute a 95% confidence level (gray line plotted in Figures 5 and 6). If the incremental information computed from the actual data fails to surpass this significance level, we conclude that the neural response is captured by the first

*k*− 1 dimensions. Otherwise, we proceed by repeating the whole test for

*k*+ 1 dimensions.

*Q*and

*P*is asymptotically equivalent to finding the ROG model parameters that maximize the likelihood of the spike train given the stimuli. We assume that spikes are generated according to an inhomogeneous Poisson process, and thus, the likelihood of observing

*k*spikes for a stimulus

*x*is given by

*r*(

*x*) is the instantaneous firing rate. The average log-likelihood of set of spike data {

*k*

_{ i}

*, x*

_{ i}}, for

*i*∈ [1,

*N*] is given by

*c*is a constant that does not depend on

*r*(

*x*).

*Q*and

*P*that maximizes KL divergence will also (in the limit of large data) maximize the Poisson likelihood of the data under the model.

*r*(

*x*) is a ratio of two zero-mean Gaussians. This follows from the conjunction of the second point in the previous section (equivalence of STC and iSTAC when the expected mean of the STE is zero) and the optimality of iSTAC under an ROG model. The corollary that the STA is asymptotically optimal when

*r*(

*x*) is exponential has been shown previously (Paninski, 2004) but can be derived similarly from the fact that the ratio of two Gaussians with identical covariance but shifted means is exponential.