A statistical model used extensively in vision research consists of a cascade of a linear operator followed by a static (memoryless) nonlinearity. Common applications include the measurement of simple-cell receptive fields in primary visual cortex and the modeling of human performance in various psychophysical tasks. It is well known that the front-end linear filter of the model can readily be recovered, up to a multiplicative constant, using reverse-correlation techniques. However, a full identification of the model also requires an estimation of the output nonlinearity. Here, we show that for a large class of static nonlinearities, one can obtain analytical expressions for the estimates. The technique works with both Gaussian and binary noise stimuli. The applicability of the method in physiology and psychophysics is demonstrated. Finally, the proposed technique is shown to converge much faster than the currently used linear-reconstruction method.

**x**, and the output (or response), *r*, can be either 0 or 1. In this case, the model can be written simply as *P*(*r* = 1 | **x**) = *g*(**w**^{T}**x**), where *g*(·) is a nonlinear function and the (column) vector **w** represents the linear filtering operation. This equation can be considered a one-term projection pursuit regression model (Friedman & Stuetzle, 1981). Therefore, the optimization techniques developed for projection pursuit regression can be used to fit **w** and *g*(·) to the data. In the laboratory, however, the researcher has full control of the stimulus; therefore, as shown below, more efficient methods for estimating the parameters of the model can be developed.
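The cascade just described can be sketched in a few lines of code. The following is a minimal simulation of the model; the logistic form of *g*(·), the kernel, the dimension, and the steepness are all illustrative choices standing in for quantities the model leaves unspecified, not values from this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 64                               # stimulus dimension (illustrative)
w = rng.standard_normal(N)
w /= np.linalg.norm(w)               # normalize the kernel so ||w|| = 1

def g(y):
    """Stand-in static nonlinearity (logistic); the model only requires
    a fixed memoryless function mapping the linear output to [0, 1]."""
    return 1.0 / (1.0 + np.exp(-4.0 * (y - 0.5)))

def respond(x):
    """One trial of the cascade: linear filtering, then a static
    nonlinearity giving the probability of the binary response r."""
    y = w @ x                        # linear stage
    return int(rng.random() < g(y))  # r is 0 or 1

x = rng.standard_normal(N)           # Gaussian white-noise stimulus
r = respond(x)
print(r)
```

The two stages are deliberately separated so that the linear kernel and the nonlinearity can be estimated independently, as discussed below.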

**w**) in this model (Marmarelis & Marmarelis, 1978; Jones & Palmer, 1987; Emerson, Korenberg, & Citron, 1989; DeAngelis, Ohzawa, & Freeman, 1993a, 1993b; Reid, Victor, & Shapley, 1997; Ringach, Sapiro, & Shapley, 1997; Anzai, Ohzawa, & Freeman, 1999). A simple geometrical proof of the technique can be found in Chichilnisky (2001). In addition to estimating **w**, a full characterization of the model also requires that we estimate the shape of the nonlinearity.

**w**^{T}**x** against the actual response *r* and smooth it to obtain an estimate of the static nonlinearity. Here we show that for a large class of nonlinearities it is possible to write down a closed-form solution for the estimates of the linear kernel and the static nonlinearity in a single step. The parameters of the nonlinearity are found by matching the input-output moments of the model to the data. We refer to this technique as the moment method. The calculation avoids computing the linear prediction altogether (which is very time-consuming) and can be updated recursively as new data arrive. Furthermore, our results apply to both Gaussian and binary noise inputs.

**σ**. Alternatively, let each component be a binary random variable that takes the values **σ** or −**σ** with equal probability. As shown in Appendix A, if the linear kernel is normalized so that ‖**w**‖ = 1, the input-output cross-correlation is proportional to the kernel, *E*{**x***r*} = *C***w**, where *E*{·} denotes expected value and the constant of proportionality *C* is given by Bussgang's theorem (Bussgang, 1952). Consequently, **w** can be recovered from *E*{**x***r*} by normalization, which is the usual reverse-correlation result.
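This recovery is easy to verify numerically. The sketch below assumes, for illustration only, a clipped half-rectifier for *g*(·) and white Gaussian stimuli; all sizes are arbitrary. The cross-correlation of stimulus and binary response, once normalized, points along the true kernel.

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 16, 200_000
w = rng.standard_normal(N)
w /= np.linalg.norm(w)                    # unit-norm linear kernel

def g(y):
    """Assumed nonlinearity for the sketch: half-rectifier clipped at 1,
    so g(y) is a valid probability."""
    return np.clip(y, 0.0, 1.0)

X = rng.standard_normal((T, N))           # Gaussian white-noise stimuli
p = g(X @ w)                              # spike probabilities g(w^T x)
r = (rng.random(T) < p).astype(float)     # binary responses

w_hat = X.T @ r / T                       # reverse correlation E{x r}
w_hat /= np.linalg.norm(w_hat)            # the kernel is known up to scale
print(float(w_hat @ w))                   # direction cosine, close to 1
```

Note that only the direction of **w** is identified here; the multiplicative constant *C* is absorbed by the normalization.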

*x*]^{+} is a half-rectifier such that [*x*]^{+} = *x* if *x* > 0 and is zero otherwise. In this case (see Appendix A), it can be shown that the mean response and the correlation constant satisfy Equations 5 and 6, where erfc(*x*) = 1 − erf(*x*) is the complementary error function and erf(*x*) = (2/√π) ∫_{0}^{*x*} *e*^{−*t*^{2}} d*t* is the error function. The system represented by Equations 5 and 6 has two measured variables (the mean response and *C*) and two unknowns, *A* and *y*_{0}. Thus, given the measurements, one can solve the system numerically to obtain estimates of *A* and *y*_{0}.
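As a concrete illustration of this inversion, the sketch below re-derives the two moments for *g*(*y*) = *A*[*y* − *y*_{0}]^{+} from standard Gaussian integrals and Bussgang's theorem (our own restatement, not a transcription of Equations 5 and 6) and then inverts them: the ratio of mean rate to *C* does not depend on *A*, so *y*_{0} can be found by one-dimensional bisection, after which *A* follows directly.

```python
import math

SQRT2PI = math.sqrt(2.0 * math.pi)

def phi(a):                        # standard normal density
    return math.exp(-0.5 * a * a) / SQRT2PI

def Q(a):                          # upper-tail probability P(Z > a)
    return 0.5 * math.erfc(a / math.sqrt(2.0))

def moments_rect(A, y0, sigma=1.0):
    """Mean rate and correlation constant for g(y) = A*[y - y0]^+ with
    y = w^T x ~ N(0, sigma); derived here, under our assumptions."""
    a = y0 / sigma
    rbar = A * (sigma * phi(a) - y0 * Q(a))      # E{r}
    C = sigma**2 * A * Q(a)                      # E{x r} = C w (Bussgang)
    return rbar, C

def solve_rect(rbar, C, sigma=1.0, lo=-10.0, hi=10.0):
    """Invert the two moment equations for (A, y0).  The ratio rbar/C
    is independent of A and strictly decreasing in y0, so bisect."""
    target = rbar / C
    def h(y0):
        a = y0 / sigma
        return (sigma * phi(a) - y0 * Q(a)) / (sigma**2 * Q(a))
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if h(mid) > target:
            lo = mid
        else:
            hi = mid
    y0 = 0.5 * (lo + hi)
    A = C / (sigma**2 * Q(y0 / sigma))
    return A, y0

rbar, C = moments_rect(A=2.0, y0=0.5)   # forward: "measure" the moments
print(solve_rect(rbar, C))              # inverts back to (2.0, 0.5)
```

The monotonicity of the ratio follows because it equals the inverse Mills ratio minus *y*_{0}, whose derivative is strictly negative.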

*g*(*y*) = *Ay*^{β}, then, as shown in Appendix A, we obtain Equations 7 and 8, where the gamma function is given by Γ(*z*) = ∫_{0}^{∞} *t*^{*z*−1}*e*^{−*t*} d*t*. Once again, the system represented by Equations 7 and 8 has two measured variables (the mean response and *C*) and two unknowns, *A* and β. Thus, given the measurements, one can solve the system numerically to obtain estimates of *A* and β.

*r*_{max}, *y*_{0}, and ɛ. In Appendix A we derive the specific forms of Equations 2 and 4 for the error-function nonlinearity. Because these equations provide only two constraints and the nonlinearity has three parameters, additional information is required to identify the system. One approach is to obtain additional constraints by estimating higher moments between the input and output of the system, which is discussed below. Another possibility is to begin by estimating *r*_{max}. For example, *r*_{max} can be taken as the reciprocal of the minimum inter-spike interval in the data record, which is the method used in this paper. Alternatively, the maximum output rate may be determined by the response to optimal stimuli from other experiments. In the case of 2AFC psychophysical experiments, *r*_{max} can be defined as 1. Thus, if we know *r*_{max} and measure the mean response and *C*, we can solve for *y*_{0} and ɛ analytically.
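For concreteness, here is one way such an analytic solution can look. Because the paper's exact forms of Equations 2 and 4 are not reproduced above, this sketch assumes the common parameterization *g*(*y*) = (*r*_{max}/2)[1 + erf((*y* − *y*_{0})/ɛ)] and derives the two moments from it: the mean rate fixes *y*_{0}/*s* with *s*^{2} = ɛ^{2} + 2**σ**^{2}, and *C* then fixes *s*.

```python
import math
from statistics import NormalDist

def erfinv(x):
    # inverse error function via the normal quantile (Python >= 3.8)
    return NormalDist().inv_cdf((1.0 + x) / 2.0) / math.sqrt(2.0)

def moments_erf(rmax, y0, eps, sigma=1.0):
    """Mean rate and correlation constant for the assumed form
    g(y) = (rmax/2)*(1 + erf((y - y0)/eps)), with y ~ N(0, sigma)."""
    s = math.sqrt(eps**2 + 2.0 * sigma**2)
    rbar = 0.5 * rmax * (1.0 + math.erf(-y0 / s))
    C = sigma**2 * rmax * math.exp(-(y0 / s)**2) / (math.sqrt(math.pi) * s)
    return rbar, C

def solve_erf(rbar, C, rmax, sigma=1.0):
    """Given rmax, invert the two moments for (y0, eps) in closed form."""
    u = erfinv(2.0 * rbar / rmax - 1.0)           # u = -y0 / s
    s = sigma**2 * rmax * math.exp(-u * u) / (math.sqrt(math.pi) * C)
    y0 = -s * u
    eps = math.sqrt(s**2 - 2.0 * sigma**2)
    return y0, eps

rbar, C = moments_erf(rmax=1.0, y0=0.5, eps=1.0)
print(solve_erf(rbar, C, rmax=1.0))               # recovers (0.5, 1.0)
```

Under a different parameterization of the error-function nonlinearity the algebra changes, but the same two-moments-plus-*r*_{max} structure applies.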

*r*_{max}, *c*, and *n*. Unfortunately, the integrals in Equations 2 and 4 do not have closed-form expressions, but we can still solve the system of integral equations numerically for *c* and *n*, provided that *r*_{max} is estimated from additional information as above.

**w**) similar to those obtained in the actual measurements (Figure 1a) and an error-function nonlinearity with *r*_{max} = 1, *y*_{0} = 0.5, and ɛ = 1.0. In each trial of the experiment, one of two possible stimuli, *x*_{0} or *x*_{1}, is presented in the presence of additive white Gaussian noise with **σ** = 1. The subject is asked to identify the stimulus. We code the subject’s response as *r* = 0 for the *x*_{0} choice and *r* = 1 for the *x*_{1} choice. The probability that the subject will select *x*_{1} is modeled by the linear-nonlinear cascade. If we pool all the trials in which only *x*_{1} was presented, we obtain one such model (where **n** represents the noise in the stimulus); similarly for the trials in which only *x*_{0} was presented. The “classification images” obtained by cross-correlation between response and stimulus in these two cases should be the same and proportional to **w**. Similarly, the nonlinearities should be identical up to a translation. In fact, these constraints could be used as a test of the validity of the model in a 2AFC task.
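A small simulation of this consistency check might look as follows. The two signals are placed, purely for illustration, along the detector direction **w**, and the dimensions and trial counts are arbitrary; the nonlinearity uses the parameters given in the text (*r*_{max} = 1, *y*_{0} = 0.5, ɛ = 1). Both classification images should align with **w**.

```python
import numpy as np
from math import erf

rng = np.random.default_rng(2)
N, T = 16, 50_000
w = rng.standard_normal(N)
w /= np.linalg.norm(w)
x0, x1 = -0.3 * w, 0.3 * w          # hypothetical signals along the detector

def g(y, rmax=1.0, y0=0.5, eps=1.0):
    """Error-function nonlinearity with the parameters from the text."""
    return 0.5 * rmax * (1.0 + erf((y - y0) / eps))

def classification_image(signal):
    """Cross-correlate the binary choice with the stimulus noise, pooling
    only the trials on which `signal` was shown."""
    noise = rng.standard_normal((T, N))      # additive noise, sigma = 1
    y = (noise + signal) @ w                 # linear predictions
    p = np.array([g(v) for v in y])          # P(choose x1)
    r = (rng.random(T) < p).astype(float)
    img = noise.T @ (r - r.mean()) / T       # mean-subtracted correlation
    return img / np.linalg.norm(img)

ci0, ci1 = classification_image(x0), classification_image(x1)
print(float(ci0 @ w), float(ci1 @ w))        # both close to 1
```

Subtracting the mean response before correlating removes the constant offset contributed by the signal itself, so only the noise-driven component remains.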

*y*_{0} and ɛ vs. the length of the data record is shown in Figure 2d. After 2500 trials, the error is down to 10 percent, and it converges toward zero as the number of trials is increased further.

*T* = 2 ms and by a 40 × 40 grid in space (see Figure 3a). In each 2-ms interval, the probability of a spike was given by Equation 1, and the nonlinearity was a power law with parameters *A* = 0.01 and β = 2. The input was binary noise taking the values 1 and −1, giving a standard deviation of **σ** = 1.

**w** were extremely noisy, the linear prediction *y* = **w**^{T}**x** would be mostly noise, and the CLR method would give a flat line at the mean rate. Figure 5b illustrates that the CLR results lie between the mean rate and the true nonlinearity after 15 minutes of data collection, as would be expected from a noisy estimate of *y*. As the simulation length increases, the results approach the correct nonlinearity. A more detailed discussion of the noise in the CLR method, as well as of the bias in the LR method, can be found in Appendix C.

*y* but diverge as *y* increases. Thus, confidence in any extrapolation to large *y* values requires precise knowledge of the shape of the nonlinearity. Because the LR and CLR methods do not make any assumptions about the shape of the nonlinearity, they could be used to determine the proper class of nonlinearities to use in the moment method.

*g*(·). However, once a family of parametric functions is found to fit the measured nonlinearities, the more efficient moment method can be used to conduct further experiments.

*g*(·). One possibility is to obtain independent information about the remaining parameters (as done with *r*_{max} above). In the present study, we calculate two moments to estimate a two-parameter nonlinearity. We are currently investigating the possibility of calculating higher-order moments to constrain the additional parameters.

*E*{**x***r*} for the linear-nonlinear system in response to a discrete approximation of Gaussian white noise. In this derivation, we ignore the distinction between space and time and assume that the probability of a response is given by Equation 1, where each component of **x** is a normal random variable with mean zero and standard deviation **σ**. Let *N* be the dimension of **w** and **x**; thus **x** has the probability density function *p*(**x**) = (2π**σ**^{2})^{−*N*/2} exp(−‖**x**‖^{2}/(2**σ**^{2})). Let **u** = **Ox**, where **O** is any orthogonal matrix whose first row is **w**^{T}. (Recall that **w** is a unit vector.) Then **w**^{T}**x** is the first component of **u**, which we replace by *y*.

*B*_{j} = 0 for *j* ≠ 1, because *B*_{j} is an odd integral in this case. The first component of **B** follows, where we integrated by parts to obtain the last expression. Thus **B** is proportional to **e**_{1}, the first unit vector, and the input-output correlation is then obtained, where erf(*x*) is the error function. The derivative of *g*(*y*) is computed next, and the mean response follows. We integrated by parts in the second step. To obtain the last step, we computed a change of variables with one of the variables parallel to the line *u* = (*y* + *y*_{0})/(**σ**√2) and completed the square in the resulting integrand. We have thus obtained Equation 10.

*g*′(*x*) ≥ 0, the output rate increases with the standard deviation **σ** of the input [to see this, integrate Equation 2 by parts]. Therefore, one would like to maximize the standard deviation of the input. However, physical devices, such as computer monitors, have limited dynamic ranges. The discrete approximation to Gaussian white noise described above will have a relatively small **σ** given constraints on the magnitude of the stimulus. We thus extended the theory to more general forms of noise stimuli that increase the stimulus standard deviation.

*M* denote the maximum stimulus magnitude that can be produced by an input device, i.e., each component of the stimulus must lie in the interval [−*M*, *M*]. In that case, a preferable input would be one in which each component of the stimulus is a binary random variable that takes the values *M* and −*M* with equal likelihood. This input would have a standard deviation of *M*.

**w**‖ = 1, the sum converges in distribution to a normal random variable with standard deviation *M* as the number of components increases (see, for example, the discussion on page 700 of Port, 1994). This convergence is valid as long as the sum is not dominated by one component of **w**. Because the argument of the nonlinearity *g*(*x*) would be approximately a normal random variable with **σ** = *M*, the output rate should closely match the white-noise case.
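This convergence is easy to check numerically. The sketch below (illustrative sizes) projects binary ±*M* stimuli onto a unit-norm kernel with no dominant component and compares the projection's spread and central mass with the Gaussian values.

```python
import numpy as np

rng = np.random.default_rng(3)
N, M, T = 100, 1.0, 20_000
w = rng.standard_normal(N)
w /= np.linalg.norm(w)                 # unit kernel, no dominant entry

x = M * rng.choice([-1.0, 1.0], size=(T, N))   # binary components +/- M
y = x @ w                                      # projections w^T x

# For a normal variable with sigma = M: std = M and P(|y| < M) ~ 0.683
print(float(y.std()), float(np.mean(np.abs(y) < M)))
```

Both statistics land close to the Gaussian predictions, consistent with the central-limit argument above.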

*E*{**x***r*}, one really computes an average of the input conditioned on the presence of an output event [Equation B1]. The central limit theorem states that this average converges in distribution to a normal random variable. The results for non-Gaussian input, such as binary input, should thus closely approximate the above results. Numerical simulations showed virtually the same results for both types of input.

**x** conditioned on *r* = 1. The mean output rate and the correlation magnitude *C* then follow. If the measured average is a noisy estimate of **p**, then its norm overestimates the magnitude of **p**. Thus, a naïve calculation of *C* will overestimate the input-output correlation.

**δp**, we divide the results into multiple trials and calculate an approximation to **p** for each trial. We let the average of these values be our estimate of **p** and let **δp** be the sample standard deviation of the mean. We then use Equations B6 and B3 to calculate the input-output correlation.
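The idea behind the correction can be sketched as follows. We assume, hypothetically, that per-block estimates of **p** carry independent noise; the squared norm of the averaged estimate is then biased upward by ‖**δp**‖^{2}, and subtracting that term (in the spirit of the Equations B6/B3 correction, which is not reproduced here) largely removes the bias. All sizes and noise levels below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
N, K = 1000, 50                      # kernel size, number of trial blocks
p = 0.01 * rng.standard_normal(N)    # "true" correlation vector (small)
noise_sd = 0.05                      # per-block estimation noise

p_hats = p + noise_sd * rng.standard_normal((K, N))   # per-block estimates
p_bar = p_hats.mean(axis=0)                           # averaged estimate
dp = p_hats.std(axis=0, ddof=1) / np.sqrt(K)          # std error of the mean

naive = float(np.sum(p_bar**2))              # biased upward by ||dp||^2
corrected = naive - float(np.sum(dp**2))     # debiased estimate
print(naive, corrected, float(np.sum(p**2)))
```

The naive value exceeds the true squared norm by roughly *N*·(noise variance)/*K*, which is exactly what the subtracted term estimates.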

**p**‖ is fixed for a given nonlinearity, but ‖**δp**‖ increases with the number of components. Thus, the bias in the naïve estimate of *C* was much larger with the 40 × 40 × 45 kernel of the simple cell than with the 32 × 32 kernel of the psychophysical task.

**p**‖ is the square root of a negative number, and we cannot compute an estimate of the nonlinearity. Thus, in Figure 2d we could not include the results for a few simulations with 500 or fewer trials. Similarly, we could not include one simulation that was a minute long in Figure 4d.

**w** via reverse correlation and then calculating the average response as a function of the linear prediction *y* = **w**^{T}**x**. In the latter calculation, we discretized *y* into intervals, using narrower intervals around zero and wider intervals away from zero, because *y* has a normal distribution centered around zero. For each interval, we calculated the mean and standard error of the response if the interval contained three or more data points.
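In code, this binned estimate of the nonlinearity might look like the following sketch; the bin edges and the toy clipped-linear nonlinearity used to generate data are illustrative choices, not the paper's.

```python
import numpy as np

def lr_nonlinearity(y, r, edges):
    """Average response within each bin of the linear prediction y.
    Bins with fewer than three points are skipped (left as NaN)."""
    centers = 0.5 * (edges[:-1] + edges[1:])
    means = np.full(len(centers), np.nan)
    for i in range(len(centers)):
        mask = (y >= edges[i]) & (y < edges[i + 1])
        if mask.sum() >= 3:
            means[i] = r[mask].mean()
    return centers, means

# narrow bins near zero, wide bins in the tails, since y ~ N(0, 1)
edges = np.concatenate([[-4.0], np.linspace(-1.5, 1.5, 13), [4.0]])

rng = np.random.default_rng(5)
y = rng.standard_normal(10_000)
r = (rng.random(10_000) < np.clip(y, 0.0, 1.0)).astype(float)  # toy LN data
centers, means = lr_nonlinearity(y, r, edges)
print(means)
```

The recovered curve tracks the generating nonlinearity: near-zero rates in the negative-*y* bins, a rising middle section, and saturation in the rightmost bin.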

*N* parts, where *N* ≈ 200. We calculated *N* − 1 versions of the linear kernel, where each version used all the data *except* two consecutive parts. Thus, each version of the kernel was based on approximately 99% of the data. We then used all the data to calculate the average response as a function of the linear prediction. The key to the CLR method was to switch the kernels used for the linear prediction *y* = **w**^{T}**x** so that the same data were never used simultaneously for both the linear kernel estimate **w** and the corresponding stimulus **x**. Thus, the CLR method removes the bias in *y* = **w**^{T}**x** created by the covariance between the stimulus **x** and the noisy estimate of **w**.
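The kernel-switching scheme can be sketched as below. For brevity this version leaves out one part per kernel rather than two consecutive parts, and it uses plain reverse correlation for each kernel; the sizes and the toy nonlinearity are illustrative.

```python
import numpy as np

def clr_predictions(X, r, n_parts=10):
    """Cross-validated linear predictions: the kernel applied to each
    part of the data was estimated without that part, so the stimulus
    and the kernel estimate it is paired with are independent."""
    T = len(r)
    idx = np.array_split(np.arange(T), n_parts)
    y = np.empty(T)
    for part in idx:
        mask = np.ones(T, dtype=bool)
        mask[part] = False
        w_hat = X[mask].T @ r[mask] / mask.sum()   # reverse correlation
        w_hat /= np.linalg.norm(w_hat)
        y[part] = X[part] @ w_hat                  # unbiased prediction
    return y

rng = np.random.default_rng(6)
N, T = 16, 20_000
w = rng.standard_normal(N)
w /= np.linalg.norm(w)
X = rng.standard_normal((T, N))
r = (rng.random(T) < np.clip(X @ w, 0.0, 1.0)).astype(float)  # toy LN data
y = clr_predictions(X, r)
print(float(np.corrcoef(y, X @ w)[0, 1]))   # close to 1 for a good kernel
```

Because each prediction uses a kernel estimated from different trials, the covariance term that biases the plain LR prediction is absent by construction.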

*y* = **w**^{T}**x** is similar to the bias in ‖**p**‖^{2} = **p**^{T}**p** discussed above. As with ‖**p**‖^{2}, the bias in **w**^{T}**x** increases with the size of the kernel. Thus, for small kernels, the CLR method may not give better results than the LR method, because removing bias can sometimes increase the error. With large kernels, on the other hand, the bias may greatly skew the results of the LR method; in this case, the CLR method is significantly more accurate. For the small 32 × 32 kernel of the psychophysical simulations, the bias in the LR method did give an underestimate of the nonlinearity (Figures 2a and 2b), but the bias was small. As shown in Figure 6a, the LR method outperformed the CLR method in this case. However, the large 40 × 40 × 45 kernel of the simple-cell simulations produced significant bias in the LR method even after 8 hours (Figure 4). In this case, the CLR method was necessary to give accurate results.

**w**, as discussed in connection with Figure 5b. When **w** is high-dimensional, small noise in each component of **w** can build up to make the unbiased estimate of *y* = **w**^{T}**x** very noisy, as observed in the simple-cell simulations.