Abstract
A statistical model used extensively in vision research consists of a cascade of a linear operator followed by a static (memoryless) nonlinearity. Common applications include the measurement of simple-cell receptive fields in primary visual cortex and the modeling of human performance in various psychophysical tasks. It is well known that the front-end linear filter of the model can readily be recovered, up to a multiplicative constant, using reverse-correlation techniques. However, a full identification of the model also requires an estimation of the output nonlinearity. Here, we show that for a large class of static nonlinearities, one can obtain analytical expressions for the estimates. The technique works with both Gaussian and binary noise stimuli. The applicability of the method in physiology and psychophysics is demonstrated. Finally, the proposed technique is shown to converge much faster than the currently used linear-reconstruction method.
A linear-nonlinear model consists of a cascade of a linear operator acting on the input in series with a static nonlinearity. In practice, the input (or stimulus) is a finite-dimensional vector, x, and the output (or response), r, can be either 0 or 1. In this case, the model can be written simply as

P{r = 1 | x} = g(w^T x),   (1)

where g(·) is a nonlinear function and the (column) vector w represents the linear filtering operation. This equation can be considered a one-term projection pursuit regression model (Friedman & Stuetzle, 1981). Therefore, the optimization techniques developed for projection pursuit regression can be used to fit w and g(·) to the data. In the laboratory, however, the researcher has full control of the stimulus; therefore, as shown below, more efficient methods for estimating the parameters of the model can be developed.
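For concreteness, the generative model of Equation 1 can be simulated in a few lines. The following sketch (in Python; the dimension, kernel, and nonlinearity are arbitrary placeholders, not values from this paper) draws one binary response from the cascade:

```python
import numpy as np

rng = np.random.default_rng(0)

N = 64                          # stimulus dimension (illustrative)
w = rng.standard_normal(N)
w /= np.linalg.norm(w)          # unit-norm linear kernel, ||w|| = 1

def g(y):
    """A hypothetical static nonlinearity mapping w^T x to P(r = 1)."""
    return np.clip(0.5 + 0.5 * y, 0.0, 1.0)

def ln_response(x):
    """One draw from the linear-nonlinear model: r in {0, 1}."""
    p = g(w @ x)                # linear stage followed by static nonlinearity
    return rng.binomial(1, p)

x = rng.standard_normal(N)      # one Gaussian noise stimulus
r = ln_response(x)
```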
Cross-correlation methods are commonly used to estimate the front-end linear filtering (the vector w) in this model (Marmarelis & Marmarelis, 1978; Jones & Palmer, 1987; Emerson, Korenberg, & Citron, 1989; DeAngelis, Ohzawa, & Freeman, 1993a, 1993b; Reid, Victor, & Shapley, 1997; Ringach, Sapiro, & Shapley, 1997; Anzai, Ohzawa, & Freeman, 1999). A simple geometrical proof of the technique can be found in Chichilnisky (2001). In addition to estimating w, a full characterization of the model also requires that we estimate the shape of the nonlinearity.
The importance of estimating g(·) has been somewhat neglected in the past (however, see Anzai et al., 1999, and Chichilnisky, 2001). This is surprising because it is well known that the tuning properties of a neuron (such as its orientation or spatial frequency bandwidth) do not depend solely on its linear kernel; they can also be influenced by a static nonlinearity in the spike generation mechanism, represented by g(·). Specifically, it has been suggested that both direction and orientation tuning can be sharpened significantly by thresholding or by an accelerating nonlinearity (Reid, Soodak, & Shapley, 1987; DeAngelis et al., 1993a, 1993b; Anzai et al., 1999; Carandini & Ferster, 2000). Thus, estimating g(·) is clearly important when comparing the tuning properties of the model to those of the data.
Similarly, cross-correlation methods are being used in visual psychophysics to recover the “classification images” used by subjects in two-alternative forced-choice (2AFC) tasks (Ahumada & Beard, 1998; Ringach, 1998; Gold, Murray, Bennett, & Sekuler, 2000). These methods show that performance is correlated, to some extent, with some physical attribute of the image. However, in order to quantitatively evaluate the fraction of human performance explained by the model, one needs to identify the static nonlinearity as well.
One strategy to identify the nonlinearity is to perform a two-step analysis (Anzai et al., 1999; Chichilnisky, 2001): first, measure the linear kernel using reverse correlation; then, generate a scatter plot of the linear prediction (w^T x) against the actual response r and smooth it to obtain an estimate of the static nonlinearity. Here we show that, for a large class of nonlinearities, it is possible to write down a closed-form solution for the estimates of the linear kernel and the static nonlinearity in a single step. The parameters of the nonlinearity are found by matching the input-output moments of the model to the data. We refer to this technique as the moment method. The calculation avoids computing the linear prediction altogether (which is very time consuming) and can be updated recursively as new data arrive. Furthermore, our results apply to both Gaussian and binary noise inputs.
Let each component of the input be an independent normal (Gaussian) random variable with standard deviation σ. Alternatively, let each component be a binary random variable that takes the values σ or −σ with equal probability. As shown in Appendix A, if the linear kernel is normalized so that ‖w‖ = 1, the average output is

r̄ = E{r} = (1/(σ√(2π))) ∫ g(y) e^(−y²/2σ²) dy.   (2)

The correlation between the input and the output is proportional to the kernel w,

E{x r} = C w,   (3)

where E{·} denotes expected value and the constant of proportionality is given by (Bussgang, 1952)

C = σ² E{g′(w^T x)} = (σ/√(2π)) ∫ g′(y) e^(−y²/2σ²) dy.   (4)

Thus, the linear filter w can be recovered from E{x r} because w = E{x r}/‖E{x r}‖ (for C > 0), which is the usual reverse correlation result.
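The reverse correlation result of Equations 3 and 4 translates directly into code. The sketch below (with a synthetic kernel and a made-up clipped-linear nonlinearity used purely for the demonstration) recovers both the unit-norm kernel and the proportionality constant C from simulated data:

```python
import numpy as np

rng = np.random.default_rng(1)
N, n_trials, sigma = 64, 20000, 1.0

w_true = rng.standard_normal(N)
w_true /= np.linalg.norm(w_true)                 # ||w|| = 1

def g(y):                                        # hypothetical nonlinearity
    return np.clip(0.1 + 0.4 * y, 0.0, 1.0)

X = sigma * rng.standard_normal((n_trials, N))   # Gaussian noise stimuli
r = rng.binomial(1, g(X @ w_true))               # model responses

xcorr = (X * r[:, None]).mean(axis=0)            # estimate of E{x r} = C w
C_hat = np.linalg.norm(xcorr)                    # ||E{x r}|| = C since ||w|| = 1
w_hat = xcorr / C_hat                            # unit-norm kernel estimate
r_bar = r.mean()                                 # estimate of E{r}
```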
Through Equations 2 and 4, the measurement of the mean output, r̄, and the proportionality constant, C, imposes constraints on the nonlinearity g(·). The proposed moment method uses these measurements, which correspond to the first moment of the output, E{r}, and the second input-output moment (cross-correlation), to specify the nonlinearity. For example, if one considers a class of nonlinearities specified by only two parameters, Equations 2 and 4 can be used to determine g(·). If, however, one considers a class of nonlinearities with more than two parameters, additional conditions must be provided to determine g(·). We demonstrate the procedure with some classes of nonlinearities used in both electrophysiology and psychophysics: half-rectification with threshold, a power law, an error function, and the Naka-Rushton equation.
A linear-nonlinear system with a threshold half-rectifier nonlinearity has been one of the classical models of simple cells (Movshon, Thompson, & Tolhurst, 1978; Tolhurst & Heeger, 1997; Carandini & Ferster, 2000). The nonlinearity is described by

g(y) = A [y − y0]+,

where [x]+ is a half-rectifier such that [x]+ = x if x > 0 and is zero otherwise. In this case (see Appendix A), it can be shown that

r̄ = A [ (σ/√(2π)) e^(−y0²/2σ²) − (y0/2) erfc(y0/(σ√2)) ],   (5)

C = (A σ²/2) erfc(y0/(σ√2)),   (6)

where erfc(x) = 1 − erf(x) is the complementary error function and

erf(x) = (2/√π) ∫₀^x e^(−t²) dt

is the error function. The system represented by Equations 5 and 6 has two measured variables, r̄ and C, and two unknowns, A and y0. Thus, given the measurements, one can solve the system numerically to obtain estimates of A and y0.
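Under the expressions in Equations 5 and 6 above, fitting A and y0 reduces to a two-equation root-finding problem. A sketch using scipy (the function names are ours, and the initial guess may need adjustment for a particular data set):

```python
import numpy as np
from scipy.optimize import fsolve
from scipy.special import erfc

def moments_halfrect(params, sigma):
    """Mean output and correlation constant for g(y) = A [y - y0]+ ."""
    A, y0 = params
    z = y0 / (sigma * np.sqrt(2.0))
    r_bar = A * (sigma / np.sqrt(2.0 * np.pi) * np.exp(-z**2)
                 - 0.5 * y0 * erfc(z))
    C = 0.5 * A * sigma**2 * erfc(z)
    return r_bar, C

def fit_halfrect(r_bar_meas, C_meas, sigma=1.0):
    """Match the measured moments to Equations 5 and 6 for (A, y0)."""
    def residual(params):
        r_bar, C = moments_halfrect(params, sigma)
        return [r_bar - r_bar_meas, C - C_meas]
    return fsolve(residual, x0=[1.0, 0.0])   # initial guess for (A, y0)
```

A simple round-trip check is to generate (r̄, C) from known (A, y0) with moments_halfrect and confirm that fit_halfrect recovers them.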
This is an important example because power laws are used to model the output nonlinearity in simple cortical cells (Emerson et al., 1989; Albrecht & Geisler, 1991; Heeger, 1992; Anzai et al., 1999). Furthermore, they have been shown to fit the data slightly better than half-rectification (Tolhurst & Heeger, 1997). If we assume
g(y) = A [y]+^β (a half-rectified power law, so that the output is nonnegative), then, as shown in Appendix A, we obtain

r̄ = (A (σ√2)^β / (2√π)) Γ((β + 1)/2),   (7)

C = (A β σ (σ√2)^β / (2√(2π))) Γ(β/2),   (8)

where the gamma function is given by Γ(x) = ∫₀^∞ t^(x−1) e^(−t) dt. Once again, the system represented by Equations 7 and 8 has two measured variables, r̄ and C, and two unknowns, A and β. Thus, given the measurements, one can solve the system numerically to obtain estimates of A and β.
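The same root-finding approach applies to the power law. The sketch below assumes the half-rectified forms written in Equations 7 and 8 above (our reconstruction); if the nonlinearity is parameterized differently, only the two moment expressions change:

```python
import numpy as np
from scipy.optimize import fsolve
from scipy.special import gamma

def moments_powerlaw(params, sigma):
    """Mean output and correlation constant for g(y) = A [y]+^beta ."""
    A, beta = params
    s = sigma * np.sqrt(2.0)
    r_bar = A * s**beta * gamma((beta + 1) / 2) / (2.0 * np.sqrt(np.pi))
    C = A * beta * sigma * s**beta * gamma(beta / 2) / (2.0 * np.sqrt(2.0 * np.pi))
    return r_bar, C

def fit_powerlaw(r_bar_meas, C_meas, sigma=1.0):
    """Match the measured moments to Equations 7 and 8 for (A, beta)."""
    def residual(params):
        r_bar, C = moments_powerlaw(params, sigma)
        return [r_bar - r_bar_meas, C - C_meas]
    return fsolve(residual, x0=[0.1, 2.0])   # initial guess for (A, beta)
```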
Another important case is when the nonlinearity can be represented by an error function:

g(y) = (rmax/2) [1 + erf((y − y0)/ɛ)].   (9)

This class of nonlinearities is specified by three parameters: rmax, y0, and ɛ. In Appendix A, we derive the following specific forms of Equations 2 and 4 for the error function nonlinearity:

r̄ = (rmax/2) erfc(y0/ɛ̃),   (10)

C = (σ² rmax/(√π ɛ̃)) e^(−y0²/ɛ̃²),   (11)

where ɛ̃ = √(ɛ² + 2σ²). Because these equations provide only two constraints and the nonlinearity has three parameters, additional information is required to identify the system. One approach is to obtain additional constraints by estimating higher moments between the input and output of the system, which is discussed below. Another possibility is to begin by estimating rmax. For example, rmax can be taken as the reciprocal of the minimum inter-spike interval in the data record, which is the method used in this paper. Alternatively, the maximum output rate may be determined by the response to optimal stimuli in other experiments. In the case of 2AFC psychophysical experiments, rmax can be defined as 1. Thus, if we know rmax and measure r̄ and C, we can solve for y0 and ɛ analytically.
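With rmax fixed, Equations 10 and 11 (as reconstructed above; the sketch is contingent on that parameterization) invert in closed form: Equation 10 gives y0/ɛ̃ through the inverse complementary error function, Equation 11 then gives ɛ̃, and ɛ follows from ɛ̃² = ɛ² + 2σ².

```python
import numpy as np
from scipy.special import erfcinv

def fit_erf(r_bar_meas, C_meas, r_max, sigma=1.0):
    """Analytic solution for (y0, eps) of the error-function nonlinearity,
    assuming Equations 10 and 11 in the form given above."""
    q = erfcinv(2.0 * r_bar_meas / r_max)            # q = y0 / eps_tilde
    eps_tilde = sigma**2 * r_max * np.exp(-q**2) / (np.sqrt(np.pi) * C_meas)
    y0 = q * eps_tilde
    # eps_tilde^2 = eps^2 + 2 sigma^2; noisy measurements can make the
    # argument negative, in which case np.sqrt returns nan
    eps = np.sqrt(eps_tilde**2 - 2.0 * sigma**2)
    return y0, eps
```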
A final example of a sigmoidal function commonly used in models is the Naka-Rushton (NR) equation,

g(y) = rmax y^n / (y^n + c^n) for y > 0 (and zero otherwise),

with parameters rmax, c, and n. Unfortunately, the integrals in Equations 2 and 4 do not have nice analytic expressions, but we can still numerically solve the system of integral equations obtained by substituting the NR nonlinearity into Equations 2 and 4 for c and n, provided that rmax is estimated from some additional information as above.
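Because the NR integrals lack closed forms, Equations 2 and 4 can be evaluated by quadrature inside the root finder. A sketch under the NR form given above (with g = 0 for y ≤ 0, so both integrals run over y > 0); the quadrature-plus-root-finding scheme is our own illustration, not necessarily the implementation used here:

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import fsolve

def fit_naka_rushton(r_bar_meas, C_meas, r_max, sigma=1.0):
    """Numerically match Equations 2 and 4 for the NR parameters (c, n),
    with g(y) = r_max * y^n / (y^n + c^n) for y > 0 and zero otherwise."""
    def gauss(y):
        return np.exp(-y**2 / (2.0 * sigma**2)) / (sigma * np.sqrt(2.0 * np.pi))

    def residual(params):
        c, n = params
        g = lambda y: r_max * y**n / (y**n + c**n)
        dg = lambda y: r_max * n * c**n * y**(n - 1) / (y**n + c**n)**2
        r_bar, _ = quad(lambda y: g(y) * gauss(y), 0.0, np.inf)
        corr, _ = quad(lambda y: dg(y) * gauss(y), 0.0, np.inf)
        return [r_bar - r_bar_meas, sigma**2 * corr - C_meas]

    return fsolve(residual, x0=[sigma, 2.0])   # initial guess for (c, n)
```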
To demonstrate the ability of the moment method to reconstruct the nonlinearity, we tested the model in two simulated experiments. First, we modeled the response of a human subject in a 2AFC psychophysical task. Second, we modeled the receptive field of a simple cell in primary visual cortex. In each case, we calculated the mean response rate and correlation magnitude using the method described in detail in Appendix B. To evaluate the performance of the method, the parameters of the simulated nonlinear system were compared with the estimated values. We also compared the performance of the proposed technique with variants of the linear reconstruction method. Anzai et al. (1999) used a straightforward implementation of the linear reconstruction (LR) method, whereas Chichilnisky (2001) implemented a cross-validated version of the technique (CLR). Details about the implementations of these algorithms are described in Appendix C. To anticipate the results, we find that the moment method clearly outperforms the LR and CLR techniques.
We simulated the study of Ahumada and Beard (1998), who measured classification images in a Vernier acuity task. In our simulations, we assume a spatial kernel (the vector w) similar to those obtained in the actual measurements (Figure 1a) and an error function nonlinearity with rmax = 1, y0 = 0.5, and ɛ = 1.0. In each trial of the experiment, one of two possible stimuli, x0 or x1, is presented in the presence of additive white Gaussian noise with σ = 1. The subject is asked to identify the stimulus. We code the subject’s response as r = 0 for the x0 choice and r = 1 for the x1 choice. The probability that the subject will select x1 is modeled as P{r = 1 | x} = g(w^T x). If we pool all the trials in which only x1 was presented, we obtain response probabilities of the form g(w^T x1 + w^T n) (where n represents the noise in the stimulus). Similarly, for the trials in which only x0 was presented, we obtain g(w^T x0 + w^T n). The “classification images” obtained by cross-correlation between response and stimulus in these two cases should be the same and proportional to w. Similarly, the nonlinearities should be identical up to a translation. In fact, these constraints could be used as a test for the validity of the model in a 2AFC task.
We ran simulations that contained between 300 and 100,000 trials per stimulus. We selected one of the stimuli and calculated the linear kernel (or “classification image”) (Figure 1b and 1c) and the estimated error function nonlinearity (Figure 2). The nonlinearity calculations show that the proposed method converges faster than the LR/CLR methods (the result from the CLR method is not shown, but it was no better than the LR method). Our technique gives a reasonable approximation to the nonlinearity after only 500 trials, whereas the LR method yields a gross underestimate (Figure 2a). After 2,400 trials, the moment method has determined the nonlinearity almost perfectly, but the LR method still underestimates it (Figure 2b). After 75,000 trials, both methods have converged to the correct solution (Figure 2c).
To study the convergence of the algorithm, we simulated 60 runs and estimated the average relative error of the parameters for increasing lengths of the data record, ranging from 300 to 100,000 trials. A summary of the average relative error in y0 and ɛ versus the length of the data record is shown in Figure 2d. After 2,500 trials, the error is down to 10 percent, and it converges toward zero as the number of trials is increased further.
The second simulation was that of a simple cell in visual cortex. To mimic a cell’s response, we used a linear kernel obtained from cross-correlation analysis of actual experimental data. The kernel was discretized into 45 time slices with ΔT = 2 ms and a 40 × 40 grid in space (see Figure 3a). In each 2-ms interval, the probability of a spike was given by Equation 1, and the nonlinearity was a power law with parameters A = 0.01 and β = 2. The input was binary noise taking the values 1 and −1, giving a standard deviation of σ = 1.
We simulated the response of the simple cell to noise stimuli with experimental time ranging from 1 minute to 8 hours. For each case, we estimated the kernel (Figure 3b and 3c) and the power law nonlinearity (Figure 4). These simulations show the strength of the moment method with large kernels, such as the 40 × 40 × 45 kernel simulated here. Unlike the LR/CLR methods, our technique closely approximates the simulated nonlinearity after only 15 minutes, or fewer than 5,000 spikes (Figure 4a). Even the CLR method required 8 simulated hours (more than 140,000 spikes) to converge. The bias in the LR method is large and is still substantial in the 8-hour simulation.
The relatively poor performance of the CLR method can be explained by the noise in the estimate of a large kernel. If the estimate of w were extremely noisy, the linear prediction y = w^T x would be mostly noise, and the CLR method would give a flat line at the mean rate. Figure 5b illustrates that the CLR results lie between the mean rate and the true nonlinearity after 15 minutes of data collection, as would be expected from a noisy estimate of y. As the simulation length increases, the results approach the correct nonlinearity. A more detailed discussion of the noise in the CLR method, as well as of the bias in the LR method, can be found in Appendix C.
Figure 4d shows the rate of convergence of the estimated nonlinearity as a function of simulated time between 1 minute and 8 hours. For each data point of 1 hour or less, we computed 12 simulations and calculated the average relative error in the nonlinearity parameters. For longer trials, the relative error estimates were based on four simulations of 2 hours, two simulations of 4 hours, and one simulation of 8 hours. After only 10 minutes of simulation (around 3000 spikes), the estimated parameters of the power law are within 10 percent of the actual parameters.
To obtain the above results, we assumed we knew that the nonlinearity was a power law. Normally, the investigator does not know the shape of the nonlinearity in advance. A natural question is what would happen if we were to fit other nonlinearities to data generated by a power law function. Figure 5a shows the results of the moment method if we assume a power law, an error function, or a Naka-Rushton nonlinearity. The different estimates roughly agree for small values of y but diverge as y increases. Thus, confidence in any extrapolation to large y values requires precise knowledge of the shape of the nonlinearity. Because the LR and CLR methods do not make any assumptions about the shape of the nonlinearity, they could be used to determine the proper class of nonlinearities to use in the moment method.
The moment method is a parametric technique that assumes a particular class of nonlinearities. The LR and CLR techniques, on the other hand, are nonparametric algorithms. A natural question is whether the performance of the two approaches would be similar if we added information about the class of relevant nonlinearities to the LR/CLR methods. This could be done by fitting a parametric nonlinear function to the scatter plot of the linear prediction versus the actual response in the LR/CLR algorithms. In the simulation of the psychophysics experiment (Figure 2), we fit an error function, and in the simulation of the simple-cell receptive field (Figure 4), we fit a power law. Figure 6 shows the maximum relative error of the parameters obtained from these fits, as well as of the parameters from the moment method. In all cases, the moment method is dramatically superior to the LR/CLR methods. The better performance of the LR method relative to the CLR method in Figure 6a is understandable because removing bias can sometimes decrease accuracy. The erratic behavior of the LR results in Figure 6b is due to its deviation from a power-law shape (as seen in Figure 4a and 4b).
The moment method is a fast and accurate way to identify a linear-nonlinear system. In this method, we estimate the linear filter using cross-correlation and identify the output nonlinearity by matching the mean response and cross-correlation amplitude to the data. The method is fast because it does not require the calculation of a linear prediction to estimate the static nonlinearity. Simulations demonstrate that the rate of convergence of the moment technique is faster than that of the LR and CLR methods. The advantage of the moment method is particularly evident when the kernels are large, as in the simulation of the simple cell (Figure 4). Here, the moment method obtains a reasonable estimate of the nonlinearity after 15 minutes, whereas the LR/CLR estimates are inaccurate for this amount of data. Another benefit of the technique is that the estimates can be updated recursively as new data arrive. This provides a means to stop an experiment when a sufficient signal-to-noise ratio is achieved. We note that both the moment method and the LR/CLR methods depend on the assumption of a linear-nonlinear system as described by Equation 1. Thus, these methods will give accurate results only to the extent that a system is well represented by this lumped phenomenological model.
As mentioned above, one advantage of the LR/CLR methods is that no assumptions are made about the shape of the nonlinearity. This observation suggests using the LR/CLR methods initially when studying a new system. These experiments will be time consuming, as large amounts of data are required for accurate estimates of g(·). However, once a family of parametric functions is found to fit the measured nonlinearities, the more efficient moment method can be used to conduct further experiments.
Several nonlinearities of interest are parameterized by more than two variables. In these cases, additional conditions must be provided to fully determine g(·). One possibility is to obtain independent information about the remaining parameters (as done with rmax above). In the present study, we calculate two moments to estimate a two-parameter nonlinearity. We are currently investigating the possibility of calculating higher-order moments to constrain the additional parameters.
We thank E. J. Chichilnisky and J. D. Victor for helpful comments and criticisms on an earlier version of this paper. This work was supported by an NSF Mathematical Sciences Postdoctoral Research Fellowship (D.Q.N.) and by NSF-IBN-9720305 and NIH-EY-12816 (D.L.R.).
Appendix A. Derivation of Equations for Nonlinearity
In this appendix, we derive the equations for the mean response r̄ = E{r} and the correlation E{x r} for the linear-nonlinear system in response to a discrete approximation of Gaussian white noise. In this derivation, we ignore the distinction between space and time and assume that the probability of a response is given by Equation 1, where each component of x is a normal random variable with mean zero and standard deviation σ. Let N be the dimension of w and x; thus, x has the probability density function

p(x) = (2πσ²)^(−N/2) e^(−‖x‖²/2σ²).   (A1)

In the derivation, we let u = Ox, where O is any orthogonal matrix whose first row is w^T. (Recall that w is a unit vector.) Then w^T x is the first component of u, which we denote by y.
To derive an expression for the input-output correlation, we first look at the vector

B = E{u g(y)}, with components B_j = E{u_j g(y)}.

Note that B_j = 0 for j ≠ 1 because B_j is an odd integral in this case. The first component of B is

B1 = E{y g(y)} = (1/(σ√(2π))) ∫ y g(y) e^(−y²/2σ²) dy = (σ/√(2π)) ∫ g′(y) e^(−y²/2σ²) dy = σ² E{g′(y)},

where we integrated by parts to obtain the last expression. Thus,

B = σ² E{g′(y)} e1,

where e1 is the first unit vector. The input-output correlation is then

E{x r} = E{x g(w^T x)} = O^T B = C w,   (A2)

where

C = σ² E{g′(y)}.   (A3)
Let the nonlinearity be

g(y) = (rmax/2) [1 + erf((y − y0)/ɛ)],

where erf(x) is the error function erf(x) = (2/√π) ∫₀^x e^(−t²) dt. The derivative of g(y) is

g′(y) = (rmax/(ɛ√π)) e^(−(y − y0)²/ɛ²).

Then, the mean response is

r̄ = (1/(σ√(2π))) ∫ g(y) e^(−y²/2σ²) dy = (rmax/2) erfc(y0/ɛ̃),

where ɛ̃ = √(ɛ² + 2σ²) and erfc(x) = 1 − erf(x). To obtain the last expression, we integrated by parts, computed a change of variables with one of the variables parallel to the line u = (y + y0)/(σ√2), and completed the square in the resulting integrand. We have thus obtained Equation 10.

The magnitude of the input-output correlation is obtained by a completion of the square:

C = σ² E{g′(y)} = (σ² rmax/(√π ɛ̃)) e^(−y0²/ɛ̃²),

which is Equation 11.
One challenge in using the reverse-correlation technique is getting enough data for a reasonable signal-to-noise ratio. Typically, one must run long experiments in order to obtain good results. This problem is exacerbated in physiological experiments because cells do not respond well to a white noise stimulus. Thus, one would like to develop ways to increase the output rate.

If g′(y) ≥ 0, the output rate increases with the standard deviation σ of the input [to see this, integrate Equation 2 by parts]. Therefore, one would like to maximize the standard deviation of the input. However, physical devices, like computer monitors, have limited dynamic ranges. The discrete approximation to Gaussian white noise described above will have a relatively small σ given constraints on the magnitude of the stimulus. We thus extended the theory to more general forms of noise stimuli that increase the stimulus standard deviation.

Let M denote the maximum stimulus magnitude that can be produced by an input device, i.e., each component of the stimulus must lie in the interval [−M, M]. In that case, a preferable input would be one in which each component of the stimulus is a binary random variable that takes on the values M and −M with equal likelihood. This input would have a standard deviation of M.
Generalizations of the central limit theorem permit the use of our reconstruction method even with this input. Because ‖w‖ = 1, the sum w^T x = Σ_j w_j x_j converges in distribution to a normal random variable with standard deviation M as the number of components increases (for example, see the discussion on page 700 of Port, 1994). This convergence is valid as long as the sum is not dominated by one component of w. Because the argument of the nonlinearity g(·) would be approximately a normal random variable with σ = M, the output rate should closely match the white noise case.
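This central-limit argument is easy to check numerically: for a kernel without a dominant component, the projection w^T x of binary noise is statistically close to N(0, M²). A quick sanity check (illustrative values only):

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(2)
N, M, n_samples = 400, 1.0, 10000

w = rng.standard_normal(N)
w /= np.linalg.norm(w)                   # ||w|| = 1, no dominant component

x = M * rng.choice([-1.0, 1.0], size=(n_samples, N))
y = x @ w                                # the sum of w_j x_j over components

print(y.mean(), y.std())                 # approximately 0 and M
print(kstest(y, "norm", args=(0.0, M)))  # compare with N(0, M^2)
```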
Furthermore, to compute E{x r}, one really computes an average of the input conditioned on the presence of an output event [Equation B1]. The central limit theorem states that this average converges in distribution to a normal random variable. The results for non-Gaussian input, such as binary input, should thus closely approximate the above results. Numerical simulations showed virtually the same results for both types of input.
Appendix B. Computation of Correlation Magnitude
To compute the magnitude of the input-output correlation, C, we first compute the average stimulus conditioned on the presence of a response,

p = E{x | r = 1}.   (B1)

This average stimulus is simply the input-output correlation divided by the mean response rate:

p = E{x | r = 1} = ∫ x p(x | r = 1) dx = (1/P{r = 1}) ∫ x P{r = 1 | x} p(x) dx = E{x g(w^T x)}/r̄ = C w/r̄.   (B2)

We used Bayes’ theorem to complete the third step and Equations 1 and A2 in the fourth step. Here, p(x | r = 1) is the probability density function of x conditioned on r = 1. The mean output rate is simply r̄ = E{r}. Because ‖w‖ = 1, the correlation magnitude is thus

C = r̄ ‖p‖.   (B3)
There is a subtlety regarding how one should calculate ‖p‖ to reduce the bias in the estimate of C. If p̂ is a noisy estimate of p, then ‖p̂‖ overestimates the magnitude of ‖p‖. Thus, a naïve calculation of C will overestimate the input-output correlation.

Assume p̂ is an unbiased estimator of p, so that E{p̂} = p. Let p_j be the jth component of p. If δp_j is the standard deviation of p̂_j, then

E{p̂_j²} = p_j² + δp_j²,   (B4)

and

E{‖p̂‖²} = ‖p‖² + ‖δp‖².   (B5)

Thus, an unbiased estimate of ‖p‖² is

‖p‖² ≈ ‖p̂‖² − ‖δp‖².   (B6)
To calculate δp, we divide the results into multiple trials and calculate an approximation to p for each trial. We let p̂ be the average of these values and δp be the sample standard deviation of the mean. We then use Equations B6 and B3 to calculate the input-output correlation.
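A sketch of this bias correction (Equation B6), with the per-trial estimates of p stacked in rows; the function name is our own:

```python
import numpy as np

def corrected_p_norm_sq(p_trials):
    """Bias-corrected estimate of ||p||^2 from per-trial estimates of p.

    p_trials: array of shape (n_trials, N), one estimate of p per trial.
    Implements ||p||^2 ~ ||p_hat||^2 - ||delta_p||^2 (Equation B6),
    where delta_p is the standard error of the mean in each component.
    """
    p_hat = p_trials.mean(axis=0)
    # sample standard deviation of the mean, component by component
    delta_p = p_trials.std(axis=0, ddof=1) / np.sqrt(p_trials.shape[0])
    return np.dot(p_hat, p_hat) - np.dot(delta_p, delta_p)
```

The corrected value feeds into Equation B3 as C = r̄ √(‖p‖²); with little data the correction can go negative, which is the failure mode discussed below.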
The bias with the naïve estimate of the input-output correlation increases with the size of the kernel. The magnitude ‖p‖ is fixed for a given nonlinearity, but ‖δp‖ increases with the number of components. Thus, the bias in the naïve estimate of C was much larger with the 40 × 40 × 45 kernel of the simple cell than with the 32 × 32 kernel of the psychophysical task.
Calculations with only small amounts of data can be so noisy that ‖p̂‖² < ‖δp‖². In that case, the estimate of ‖p‖ is the square root of a negative number, and we cannot compute an estimate of the nonlinearity. Thus, in Figure 2d, we could not include the results of a few simulations with 500 or fewer trials. Similarly, we could not include one simulation that was a minute long in Figure 4d.
Appendix C. Linear Reconstruction Methods
For a benchmark by which to evaluate the moment method, we also calculated the static nonlinearity with a linear reconstruction (LR) method similar to that employed by Anzai et al. (1999) and Chichilnisky (2001). This two-step method involves first calculating the linear kernel w via reverse correlation and then calculating the average response as a function of the linear prediction y = w^T x. In the latter calculation, we discretized y into intervals, using narrower intervals around zero and wider intervals away from zero, because y will have a normal distribution centered around zero. For each interval, we calculated the mean and standard error of the response if the interval contained three or more data points.
We present results from both the LR method and a cross-validated implementation of the linear-prediction method (CLR). In the LR method, we used all the data for both parts of the calculation. In the CLR method, we divided the data into N parts, where N ≈ 200. We calculated N − 1 versions of the linear kernel, where each version used all the data except two consecutive parts. Thus, each version of the kernel was based on approximately 99% of the data. We then used all the data to calculate the average response as a function of the linear prediction. The key to the CLR method was to switch the kernels used for the linear prediction y = w^T x so that the same data were never used simultaneously for both the linear kernel estimate w and the corresponding stimulus x. Thus, the CLR method removes the bias in y = w^T x created by covariance between the stimulus x and the noisy estimate of w.
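Our reading of this kernel-switching step can be sketched as follows. For simplicity, this version excludes only the held-out block itself (the text above excludes two consecutive parts per kernel); the function name and interface are illustrative:

```python
import numpy as np

def clr_linear_prediction(X, r, n_parts=200):
    """Cross-validated linear predictions y = w^T x (CLR-style sketch).

    For each held-out block of trials, the kernel is estimated by
    reverse correlation on the remaining data, so a trial's stimulus
    never contributes to the kernel used for its own prediction.
    """
    n_trials = X.shape[0]
    bounds = np.linspace(0, n_trials, n_parts + 1, dtype=int)
    y = np.empty(n_trials)
    total = (X * r[:, None]).sum(axis=0)          # full-data sum of x * r
    for i in range(n_parts):
        held = slice(bounds[i], bounds[i + 1])
        partial = total - (X[held] * r[held, None]).sum(axis=0)
        w = partial / np.linalg.norm(partial)     # leave-out kernel estimate
        y[held] = X[held] @ w
    return y
```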
The bias in y = w^T x is similar to the bias in ‖p‖² = p^T p discussed above. As with ‖p‖², the bias in w^T x increases with the size of the kernel. Thus, for small kernels, the CLR method may not give better results than the LR method, because removing bias can sometimes increase the error. The bias with large kernels, on the other hand, may greatly skew the results of the LR method. In this case, the CLR method will be significantly more accurate. For the small 32 × 32 kernel of the psychophysical simulations, the bias in the LR method did produce an underestimate of the nonlinearity (Figure 2a and 2b), but the bias was small. As shown in Figure 6a, the LR method outperformed the CLR method in this case. However, the large 40 × 40 × 45 kernel of the simple-cell simulations produced significant bias in the LR method even after 8 hours (Figure 4). In this case, the CLR method was necessary to obtain accurate results.
Although the CLR method removes the bias caused by covariance between the stimulus and the kernel estimate, it is still sensitive to noise in the estimate of w, as discussed in connection with Figure 5b. When w is high dimensional, small noise in each component of w can build up to make the unbiased estimate of y = w^T x very noisy, as observed in the simple-cell simulations.