The response of visual cells is a *nonlinear* function of their stimuli. In addition, an increasing amount of evidence shows that visual cells are optimized to process *natural images*. Hence, finding good nonlinear models to characterize visual cells using natural stimuli is important. The Volterra model is an appealing nonlinear model for visual cells. However, its large number of parameters and the limited size of physiological recordings have hindered its application. Recently, a hypothesis with substantial experimental support has been proposed: the response of each visual cell may depend only on a low-dimensional subspace of the image space. We use this low-dimensional subspace in the *Volterra relevant-space technique* to allow the estimation of high-order Volterra models. Most laboratories characterize the response of visual cells as a nonlinear function on the low-dimensional subspace. They estimate this nonlinear function using histograms and by fitting parametric functions to them. Here, we compare the Volterra model with these histogram-based techniques. We use simulated data from cortical simple cells as well as simulated and physiological data from cortical complex cells. Volterra models yield equal or superior predictive power in all conditions studied. Several methods have been proposed to estimate the low-dimensional subspace. In this article, we test projection pursuit regression (PPR), a nonlinear regression algorithm. We compare PPR with two popular techniques used in vision: the spike-triggered average (STA) and spike-triggered covariance (STC). We observe that PPR has advantages over these alternative algorithms. Hence, we conclude that PPR is a viable algorithm to recover the relevant subspace from natural images and that the Volterra model, estimated through the Volterra relevant-space technique, is a compelling alternative to histogram-based techniques.

This article develops the *Volterra relevant-space technique* to estimate Volterra models of visual cells from natural images.

The *r*th-order kernel represents the weight that nonlinear interactions among *r* inputs have on the response.

The response *y*(*t*) of a stationary, stable, causal system can be expressed in terms of its input signal *x*(*t*) through its Volterra series expansion as

$$y_{Q,k}(t) = k_0 + \sum_{r=1}^{Q} \int \cdots \int \mathbf{k}_r(\tau_1, \ldots, \tau_r)\, x(t - \tau_1) \cdots x(t - \tau_r)\, d\tau_1 \cdots d\tau_r + \varepsilon.$$

The subscript *Q* in *y*_{ Q,k} indicates that the Volterra series has been truncated at order *Q*, the subscript *k* reflects the dependence of the Volterra model on the particular choice of kernels {*k*_{0}, **k**_{1}, …, **k**_{ Q}}, and *ɛ* stands for possible errors due to truncation and to noise in the data.

For visual cells, the input is a spatial image rather than a time signal, and the corresponding Volterra model (Equation 3) is

$$y_{Q,k}(\mathbf{x}) = k_0 + \sum_{r=1}^{Q} \sum_{i_1, j_1} \cdots \sum_{i_r, j_r} \mathbf{k}_r(i_1, j_1, \ldots, i_r, j_r)\, x(i_1, j_1) \cdots x(i_r, j_r) + \varepsilon,$$

where **x** ∈ ℝ^{ N × N} is the input image. Any Volterra model can be represented with kernels **k**_{ r} that are symmetric with respect to permutations of the index pairs (*i*_{ p}, *j*_{ p}). Hence, the number of parameters in Volterra kernels is smaller than their number of kernel coefficients. For example, for **k**_{2}, we have the symmetry *k*_{2}(*i*_{1}, *j*_{1}, *i*_{2}, *j*_{2}) = *k*_{2}(*i*_{2}, *j*_{2}, *i*_{1}, *j*_{1}), which implies that the number of parameters is *N*^{2}(*N*^{2} + 1)/2 instead of *N*^{4}. The number of parameters of the Volterra model in Equation 3, *n*_{V}, is

$$n_V = \sum_{r=0}^{Q} \binom{N^2 + r - 1}{r} \qquad \text{(Equation 4)}.$$

A fourth-order model (*Q* = 4), using images of size 16 × 16 as input, contains *n*_{V} = 186,043,585 parameters. From the limited data that can be recorded in physiological experiments, we cannot estimate such a large number of parameters. The Volterra relevant-space technique described in the next section addresses this problem.

In the low-dimensional Volterra model, the response is computed from projections of the image onto a few basis functions. Here **x** is an image, {**b**_{1}, …, **b**_{ L}: **b**_{ l} ∈ ℝ^{ N × N}} is a set of orthonormal basis functions spanning an *L*-dimensional subspace of the image space, and *α*_{ l}(**x**) = 〈**b**_{ l}, **x**〉 (Equation 6) is the projection of **x** onto the basis function **b**_{ l}. Using these coefficients, the low-dimensional Volterra model (Equation 7) is

$$y_{Q,k}(\mathbf{x}) = q_0 + \sum_{r=1}^{Q} \sum_{l_1 = 1}^{L} \sum_{l_2 = l_1}^{L} \cdots \sum_{l_r = l_{r-1}}^{L} q_r(l_1, \ldots, l_r)\, \alpha_{l_1}(\mathbf{x}) \cdots \alpha_{l_r}(\mathbf{x}) + \varepsilon.$$

The motivation for this model is the following: *if a vector space is such that the response of the cell to any image equals the response to the projection of the image onto that space, then the original Volterra model can be rewritten as the low-dimensional Volterra model*. A vector space with this property is called a *relevant space*. That is, a relevant space 𝒱 satisfies

$$y(\mathbf{x}) = y(P_{\mathcal{V}}(\mathbf{x})) \quad \text{for every image } \mathbf{x} \in \mathbb{R}^{N \times N} \qquad \text{(Equation 8)},$$

where *P*_{𝒱} denotes the orthogonal projection onto 𝒱. A set of orthonormal basis functions {**b**_{1}, …, **b**_{ L}: **b**_{ l} ∈ ℝ^{ N × N}} spanning a relevant space is called a set of *relevant dimensions*.

The number of parameters of the low-dimensional Volterra model, *n*_{LV}, is

$$n_{LV} = \sum_{r=0}^{Q} \binom{L + r - 1}{r} = \binom{L + Q}{Q}.$$

With *L* = 10 relevant dimensions, a fourth-order (*Q* = 4) low-dimensional Volterra model will contain *n*_{LV} = 1,001 parameters. In contrast, the number of parameters in the original Volterra model (*n*_{V}) for a 16 × 16 image is 186,043,585 (Equation 4). Therefore, if a small relevant subspace can be identified, the Volterra relevant-space technique achieves an enormous dimensionality reduction.

How are the coefficients, *q*_{ i}, of the low-dimensional Volterra model (Equation 7) estimated? We seek coefficients that minimize the difference between true responses and those predicted by the Volterra model. To obtain a convenient method for this minimization, we begin by expressing Equation 7 as the linear Volterra equation (Equation 10)

$$\mathbf{y} = \mathbf{A}\mathbf{q} + \boldsymbol{\varepsilon}.$$

For *K* image–response pairs and *M* parameters, **y** ∈ ℝ^{ K} is the response vector whose *i*th element, *y*(**x**_{ i}), is the response of the cell to the *i*th image, **A** ∈ ℝ^{ K × M} is the data matrix, **q** ∈ ℝ^{ M} is the vector of parameters, and *ɛ* ∈ ℝ^{ K} is the error vector. For a second-order model (*Q* = 2), the *i*th row of matrix **A**, **a**_{ i}′, is (Equation 11)

$$\mathbf{a}_i' = \big(1,\ \alpha_1(\mathbf{x}_i), \ldots, \alpha_L(\mathbf{x}_i),\ \alpha_1(\mathbf{x}_i)\,\alpha_1(\mathbf{x}_i),\ \alpha_1(\mathbf{x}_i)\,\alpha_2(\mathbf{x}_i), \ldots, \alpha_L(\mathbf{x}_i)\,\alpha_L(\mathbf{x}_i)\big),$$

with each product *α*_{ l1}(**x**_{ i}) *α*_{ l2}(**x**_{ i}), *l*_{1} ≤ *l*_{2}, appearing once. The least squares estimate of the parameter vector, **q**, is

$$\hat{\mathbf{q}} = \mathbf{A}^{+}\mathbf{y},$$

where **A**^{+} is the pseudoinverse of **A**. This estimate minimizes the mean squared error (*MSE*; Equation 13),

$$MSE(\mathbf{q}) = \frac{1}{K}\,\lVert \mathbf{y} - \mathbf{A}\mathbf{q} \rVert^{2},$$

with respect to **q**. The use of linear techniques in the estimation of Volterra models removes the whiteness requirement of cross-correlation techniques. Therefore, we can use natural images to estimate Volterra models, without the need to whiten them or to modify image statistics in any way.
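The construction of the data matrix and the least squares fit can be sketched as follows on synthetic data (the dimensions *L* = 3, *Q* = 2 and all variable names are illustrative, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
L, K = 3, 500                                 # relevant dimensions, image-response pairs

def design_row(alpha):
    # Row of the data matrix A for a second-order (Q = 2) low-dimensional
    # Volterra model: a constant, the linear terms, and each symmetric
    # quadratic term alpha_i * alpha_j (i <= j) exactly once.
    quad = [alpha[i] * alpha[j] for i in range(L) for j in range(i, L)]
    return np.concatenate(([1.0], alpha, quad))

alphas = rng.standard_normal((K, L))          # projections of images onto the b_l
A = np.vstack([design_row(a) for a in alphas])

q_true = rng.standard_normal(A.shape[1])      # ground-truth coefficients
y = A @ q_true + 0.01 * rng.standard_normal(K)

# Least squares estimate via the SVD-based pseudoinverse, as in the text.
q_hat = np.linalg.pinv(A) @ y
```

With *L* = 3 and *Q* = 2, the model has 1 + 3 + 6 = 10 parameters, so `A` is 500 × 10.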

Conversely, suppose that we approximate the true kernels, **k**_{ i}, in Equation 3, with the kernels reconstructed from a low-dimensional Volterra model (Equations 14–16). Then, for any set of true relevant dimensions {*b*_{ i}} (Equation 8), there exist coefficients *q*_{0}, …, **q**_{ Q}, such that a Volterra model using the true kernels, **k**_{ i}, generates the same responses as a Volterra model using the approximated kernels in Equations 14–16. Thus, from a predictive perspective, these approximated kernels are indistinguishable from the true kernels.

One choice remains: the truncation order *Q*. Here, we selected the order *Q* by cross-validation, as indicated in the Volterra relevant-space technique section under the Receptive-field estimation section. In summary, the Volterra relevant-space technique uses the following procedure to estimate Volterra models from natural images:

- Estimate the relevant space.
- Project the images onto the estimated relevant dimensions (Equation 6), obtaining the coefficients *α*_{ l}.
- Use the coefficients *α*_{ l} to construct the data matrix (Equation 11) of the linear Volterra equation (Equation 10).
- Estimate the coefficient vector, **q**, of the linear Volterra equation by minimizing the *MSE* in Equation 13.
- Reconstruct the Volterra kernels using Equations 14–16.

The simulated simple cell comprised a linear spatial filter, a sigmoidal rectification, *N*(*v*) = saturation / [1 + exp(−slope × (*v* − threshold))], and a Poisson spike generator. We used a sigmoidal rectification instead of a halfway rectification because we wanted a model that was differentiable at every point. This differentiability allowed derivation of the true Volterra kernels (see 2). The simple-cell model is then given by

$$y(\mathbf{x}) = N(\langle \mathbf{f}, \mathbf{x} \rangle) + \varepsilon,$$

where **f** is the spatial filter. For the spatial filter, we used a two-dimensional Gabor function. It had a preferred orientation of 45 deg from vertical, a preferred spatial frequency of 2 cycles per receptive field, a spatial bandwidth of 1.6 octaves, and an even phase. For the sigmoidal rectification of the filtered image, we used slope = 5.0 and threshold = 1.0 and varied the saturation parameter (setting it to 458, 276, 92, and 46) to obtain different mean numbers of spikes per image (5, 3, 1, and 0.5, respectively). Because the responses of the simulated simple cell followed a Poisson distribution, the signal-to-noise ratio in the responses increased monotonically as a function of the mean number of spikes per image.
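A minimal simulation sketch of this simple-cell model; the Gabor parameterization and the filter normalization below are simplified stand-ins for the exact settings described above:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 16

def gabor(N, theta=np.pi / 4, freq=2.0, sigma=0.3, phase=0.0):
    # Even-phase 2-D Gabor: Gaussian envelope times a cosine grating.
    ax = np.linspace(-0.5, 0.5, N)
    X, Y = np.meshgrid(ax, ax)
    Xr = X * np.cos(theta) + Y * np.sin(theta)
    return np.exp(-(X**2 + Y**2) / (2 * sigma**2)) * np.cos(2 * np.pi * freq * Xr + phase)

def simple_cell_rate(image, f, slope=5.0, threshold=1.0, saturation=92.0):
    # Linear filtering followed by the sigmoidal rectification
    # N(v) = saturation / (1 + exp(-slope * (v - threshold))).
    v = np.sum(f * image)
    return saturation / (1.0 + np.exp(-slope * (v - threshold)))

f = gabor(N)
f /= np.sum(np.abs(f))          # crude amplitude normalization (stand-in)
image = rng.standard_normal((N, N))
rate = simple_cell_rate(image, f)
spikes = rng.poisson(rate)      # Poisson spike generator
```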

Truncation errors become significant when the projections of the images onto the filter **f** (the coefficients *α*_{ l}(**x**) in Equation 7) are large. In the opposite case, when these projections are small, the problem is not serious: a model with kernels of sufficiently high order makes the truncation errors negligible. To minimize the effect of truncation errors, we selected the amplitude of the filter **f** in such a way that the absolute value of the largest filtered image was 1.

The simulated complex cell responded with the sum of the squared outputs of two linear filters (Equation 18),

$$y(\mathbf{x}) = \langle \mathbf{f}_1, \mathbf{x} \rangle^2 + \langle \mathbf{f}_2, \mathbf{x} \rangle^2 + \varepsilon,$$

where **f**_{1} and **f**_{2} are the linear filters in quadrature phase. To vary the mean number of spikes in the responses, we scaled the magnitude of the Gabor filters, generating responses with 5.0, 3.0, 1.0, and 0.5 spikes/image.

Only the second-order derivatives of the response with respect to **x** of Equation 18, evaluated at zero, are different from zero. Hence, from Equation B3, only the second-order Volterra kernels of the complex cell are different from zero. Therefore, the complex cell can be represented with a second-order Volterra model.
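The complex-cell model above can be sketched as an energy model with a quadrature pair of Gabor filters (the filter parameters below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 16
ax = np.linspace(-0.5, 0.5, N)
X, Y = np.meshgrid(ax, ax)
Xr = (X + Y) / np.sqrt(2)                     # 45-deg preferred orientation
env = np.exp(-(X**2 + Y**2) / (2 * 0.3**2))   # Gaussian envelope
f1 = env * np.cos(2 * np.pi * 2.0 * Xr)       # even-phase Gabor
f2 = env * np.sin(2 * np.pi * 2.0 * Xr)       # odd-phase quadrature partner

def complex_cell_rate(image):
    # Energy model: the sum of the squared outputs of the quadrature pair,
    # which makes the response invariant to stimulus polarity.
    return np.sum(f1 * image) ** 2 + np.sum(f2 * image) ** 2

rate = complex_cell_rate(rng.standard_normal((N, N)))
```

Squaring makes the response an even function of the stimulus, which is why only even-order (here, second-order) Volterra kernels are nonzero.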

*Estimating the relevant dimensions*. To develop low-dimensional Volterra models, one must estimate a set of relevant dimensions satisfying Equation 8. Methods for doing this rely on different assumptions about the statistics of the input, the properties of the response-generation function, or both. The methods of reverse correlation (de Boer & Kuyper, 1968), STC (de Ruyter van Steveninck & Bialek, 1988), and maximally informative dimensions (Sharpee et al., 2004) have been introduced in the field of neuroscience. These methods are powerful but either impose constraints on the dimensionality of the relevant space and the type of stimulus used or require an arduous high-dimensional nonlinear optimization that constrains the dimensionality of estimable relevant spaces. (An exception may be the modification of STC for natural images explained in the STC for natural images section.) Other interesting methods involve perceptron neural networks (Hertz, Palmer, & Krogh, 1991) and partial least squares (Geladi & Kowalski, 1986; Helland, 1988). The former suffers from the curse of dimensionality, and the latter relies on a linear assumption about the response-generation function. Two further methods, sliced inverse regression and PPR, were introduced in statistics. Sliced inverse regression (Li, 1991) makes no assumptions about the response-generation mechanism but relies on the assumption of a spherical distribution of the stimuli. In turn, PPR (Friedman & Stuetzle, 1981) makes no assumption about the distribution of the inputs and uses the following generalized linear model for the response-generation mechanism:

$$y(\mathbf{x}) = \bar{y} + \sum_{m=1}^{M_0} \beta_m\, \varphi_m(\langle \mathbf{a}_m, \mathbf{x} \rangle) \qquad \text{(Equation 19)},$$

where the functions *φ*_{ m} are smooth and standardized to have zero mean and unit variance over the data. If *M*_{0} is sufficiently large, any smooth function can be represented by Equation 19 (universal-approximator property; Diaconis & Shahshahani, 1984). Another interesting property of PPR is that its estimation algorithm explicitly addresses the curse of dimensionality by separately estimating each term in the generalized linear model. In the following experiments, we evaluated PPR for the estimation of the relevant space. In 3, we outline the PPR algorithm (see Friedman, 1984, for further details). The estimated relevant dimensions for the low-dimensional Volterra model are computed by orthonormalizing the set of projection directions, {**a**_{1}, …, **a**_{M0}}, in Equation 19. To understand why this procedure works, let 𝒜 = 〈**a**_{1}, …, **a**_{M0}〉 be the space spanned by the projection directions, and let *P* denote the orthogonal projection onto 𝒜. Because 〈**a**_{i}, **x**〉 = 〈**a**_{i}, *P*(**x**)〉 for every image **x**, Equation 19 yields *y*(**x**) = *y*(*P*(**x**)). This implies that 𝒜 is a relevant space for *y* (Equation 8). Consequently, a set of relevant dimensions can be computed by orthonormalizing the set of projection directions {**a**_{1}, …, **a**_{M0}}, as indicated above.
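The orthonormalization step can be sketched with a QR factorization (the function name is ours):

```python
import numpy as np

def relevant_dimensions(directions):
    # Orthonormalize the projection directions {a_1, ..., a_M0} (one per
    # row): the QR factorization of the transpose yields orthonormal
    # vectors spanning the same subspace.
    q, _ = np.linalg.qr(np.asarray(directions, dtype=float).T)
    return q.T   # one relevant dimension per row

dirs = np.array([[1.0, 0.0, 0.0],
                 [1.0, 1.0, 0.0]])
basis = relevant_dimensions(dirs)
```

The returned rows are orthonormal, and any vector in the span of the original directions is reproduced exactly by projecting onto them.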

The R implementation of PPR initializes the projection directions, **a**_{ i}, in Equation 19 by cross-correlating the input images with the residuals of the responses. Consequently, it might fail to estimate the relevant dimensions of cells whose responses are completely uncorrelated with the input images. For these cases, one can use the implementation of Automatic Smoothing Spline Projection Pursuit (Roosen & Hastie, 1994) developed by Roosen (1994), which allows the specification of initial projection directions. However, for all the conditions reported below, there was enough correlation between images and responses for the R implementation of PPR to estimate good relevant dimensions.

We ran PPR twice. The first run selects the number, *L*, of projection directions needed to characterize the cell. Then, the second run estimates the *L* projection directions. In the first run, for each jackknifed data set in our training set (Data partitioning section), we ran the PPR algorithm with parameters *M*_{0} = 1 and *M* = 6. These parameters allowed assessing the predictive error of PPR models with one to six projection directions (see 3). For each jackknifed data set, PPR generates a predictive-error-versus-number-of-projection-directions curve. The curves of the different jackknifed data sets were averaged, and the averaged curve was used to select the number of projection directions. These curves follow a common pattern: before a critical number of terms, the predictive error decreases substantially as the number of terms increases; beyond the critical value, additional terms yield only small reductions in predictive error. We selected the number of projection directions as the largest number, between one and six, for which the standard-error bars did not intersect those of the previous number of projection directions.

Having selected *L*, we proceeded to estimate the projection directions. We ran the PPR algorithm, with parameters *M*_{0} = *L* and *M* = 6, for the *N* (2 ≤ *N* ≤ 10) jackknifed data sets, estimating *N* sets of *L* noisy projection directions. To attenuate the effect of noise, we averaged these sets of projection directions, as described in the Averages of sets of relevant dimensions section.

*Estimating the low-dimensional Volterra model*. Having estimated the relevant dimensions, we constructed the low-dimensional Volterra model (Equation 7). We estimated its parameters, *q*_{ i}, by minimizing the *MSE* between the responses of the cell model and the predictions of the low-dimensional Volterra model. As noted in the Volterra relevant-space technique section, the solution to Equation 13 is a linear problem. We solved it using the pseudoinverse computed from the singular value decomposition (SVD; Mendel, 1995).

*Selecting the order of the Volterra model*. For each training jackknife data set (Data partitioning section), we estimated low-dimensional Volterra models of different orders. We then evaluated the predictive power of these models using the 10% portion of the training data not included in the jackknife data set. The predictive power was the cross-correlation between cell responses and Volterra model predictions. We then generated a curve of predictive power versus model order. Curves of different data sets were averaged. The least order maximizing the average predictive power was selected as the order of the Volterra model.

STA assumes that the response of the cell is generated as *y*(**x**) = *N*(〈**h**, **x**〉) + *ɛ*, where *y*(**x**) is the response of the cell to image **x**, **h** is a linear filter, 〈.,.〉 is the inner-product operation, and *N* is a static nonlinearity. When the images are Gaussian white noise, the filter **h** can be estimated from pairs of images–responses by cross-correlating the inputs with the responses (Marmarelis & Marmarelis, 1978).

For a linear cell, the cross-correlation between inputs and responses satisfies **C**_{ yx} = **C**_{ xx}**h**. Then, for nonwhite inputs and a linear cell, the filter **h** can be recovered as **h** = **C**_{ xx}^{−1}**C**_{ yx}. The autocorrelation matrix, **C**_{ xx}, for natural stimuli is nearly singular. Therefore, its true inverse tends to amplify noise. To avoid this problem, we regularized the autocorrelation matrix using the truncated SVD (Hansen, 1987) and computed the pseudoinverse (Ben-Israel & Greville, 1980) from this regularized matrix. The computation of the truncated SVD uses a threshold to decide how many singular values to include in the regularized matrix. For each condition in the studies reported below, we selected the optimal threshold by using *k*-fold cross-validation (Efron & Tibshirani, 1993). We used the jackknifed data sets from the training data (Data partitioning section) to estimate different STA filters. To minimize the effect of noise, we used the average of these filters as the final STA estimate.

The histogram-based technique for simple cells models the response of the cell to an image **x** as a linear filter, **h**, followed by a static one-dimensional nonlinearity (Chichilnisky, 2001). Precisely, *y*(**x**) = *N*(〈**h**, **x**〉) + *ɛ*. We estimated the linear filter using STA, and we estimated the static nonlinearity by fitting the parameters of the same sigmoidal function used to generate the simulated data to a histogram of responses as a function of the projections of the input images onto the estimated filter. We used histograms with equi-spaced bins and selected the bin size from the data range using the Sturges rule (Sturges, 1926).

STC for natural images first whitens the stimuli. Denote by *U* the matrix of eigenvectors of the autocovariance of the stimuli (one eigenvector per column), and denote by *λ*_{i} its eigenvalues. The matrix of normalized eigenvectors is defined as *U*_{n} = *U*Λ^{−1/2}, with Λ = diag(*λ*_{i}) (Equation 21), and the whitened stimuli are *X*_{w} = *XU*_{n}. STC for natural images now performs a classical STC analysis on the whitened natural images, estimating a set of relevant dimensions, *V*_{w} (one relevant dimension per column). Finally, the estimated relevant dimensions of the cell, *V* (one relevant dimension per column), are *V* = *U*_{n}*V*_{w}.

For natural stimuli, the trailing eigenvalues (*λ*_{ i}, *i* ≫ 1) will be very small, and normalizing by them tends to amplify noise. To avoid this effect, we selected a threshold and set eigenvalues below this threshold to 1. In this way, only the eigenvectors corresponding to large eigenvalues are normalized in Equation 21. With physiological and simulated data, we used different threshold values for different cells and selected the value maximizing the predictive power. Best predictions were obtained using thresholds that normalized around 35% of the eigenvectors. Thus, for the results reported below, we used this criterion to select the threshold. The same criterion was used by Touryan et al. (2005). Multiple sets of STC relevant dimensions were estimated using the jackknifed data sets from the training data (Data partitioning section). We then averaged these sets (Averages of sets of relevant dimensions section) to obtain the final STC estimates.
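The eigenvalue-thresholded whitening can be sketched as follows (the `frac` parameterization of the threshold is ours; the demonstration uses full whitening):

```python
import numpy as np

rng = np.random.default_rng(4)

def whitening_basis(X, frac=0.35):
    # Matrix of normalized eigenvectors U_n (Equation 21): eigenvectors of
    # the stimulus autocovariance divided by the square roots of their
    # eigenvalues. Only the top `frac` fraction of the eigenvalues is
    # normalized; the rest are set to 1 so that near-zero eigenvalues do
    # not amplify noise.
    lam, U = np.linalg.eigh(np.cov(X, rowvar=False))
    order = np.argsort(lam)[::-1]
    lam, U = lam[order], U[:, order]
    k = int(np.ceil(frac * len(lam)))
    lam_reg = np.where(np.arange(len(lam)) < k, lam, 1.0)
    return U / np.sqrt(lam_reg)

# Correlated stimuli; frac=1.0 normalizes every eigenvector (full
# whitening), so the whitened stimuli X_w = X U_n have identity covariance.
M = np.eye(6) + 0.5 * rng.standard_normal((6, 6))
X = rng.standard_normal((5000, 6)) @ M
Un = whitening_basis(X, frac=1.0)
Xw = X @ Un
```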

For complex cells, the histogram-based technique fits a power function, *f*(*p*) = *β*|*p*|^{γ} + *r*_{0}, to each histogram, where *f*(*p*) is the response of the cell to a given image, *p* is the projection of the image onto an estimated relevant dimension, and *β*, *γ*, and *r*_{0} are free parameters. The reconstructed nonlinearity is then the sum of the power functions fitted to each histogram, *y*(**x**) = *f*_{1}(〈*φ*_{1}, **x**〉) + *f*_{2}(〈*φ*_{2}, **x**〉), where *y*(**x**) is the response of the cell to image **x**, *f*_{1} and *f*_{2} are the power functions fitted to the histograms, *φ*_{1} and *φ*_{2} are the first and second relevant dimensions, and 〈.,.〉 is the inner-product operation.
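Fitting the power function *f*(*p*) = *β*|*p*|^{γ} + *r*_{0} can be sketched with a simple exponent scan (a stand-in for whatever optimizer the authors used), exploiting the fact that *β* and *r*_{0} enter linearly once *γ* is fixed:

```python
import numpy as np

def fit_power(p, r, gammas=np.linspace(0.5, 4.0, 71)):
    # Fit f(p) = beta * |p|**gamma + r0 by scanning a grid of exponents;
    # for each candidate gamma, beta and r0 follow from linear least squares.
    best = None
    for g in gammas:
        A = np.column_stack([np.abs(p) ** g, np.ones_like(p)])
        beta_g, r0_g = np.linalg.lstsq(A, r, rcond=None)[0]
        sse = np.sum((A @ np.array([beta_g, r0_g]) - r) ** 2)
        if best is None or sse < best[0]:
            best = (sse, beta_g, g, r0_g)
    return best[1], best[2], best[3]   # beta, gamma, r0

rng = np.random.default_rng(5)
p = rng.uniform(-2.0, 2.0, 1000)          # projections onto a relevant dimension
r = 3.0 * np.abs(p) ** 2.0 + 0.5          # noiseless synthetic responses
beta, gamma, r0 = fit_power(p, r)
```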

Consider *N* sets of *L* estimated relevant dimensions, where each set has been estimated from a different portion of the training data. All sets describe, with noise, a common relevant space. The goal is to estimate this common space. Because different sets might represent the common space in different coordinate systems, we cannot average them element by element. Instead, we collected the *N* sets into a matrix *A*. In *A*, columns {(*i* − 1)*L* + 1, …, *iL*} correspond to the relevant dimensions of the *i*th set. The column space of *A* spans an *L*-dimensional relevant space (signal) and a noise space. We used the SVD to separate the column space of *A* into a signal space and an orthogonal noise space (Scharf, 1991). Specifically, we computed the SVD of *A*, *A* = *U*Σ*V*^{T}, and used the first *L* left singular vectors (first *L* columns of matrix *U*) to represent the signal space. Thus, the first *L* columns of matrix *U* are the average of the *N* sets of *L* relevant dimensions.
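This SVD-based average can be sketched as follows on synthetic sets of relevant dimensions (all names are ours):

```python
import numpy as np

rng = np.random.default_rng(6)

def average_relevant_dimensions(sets, L):
    # Collect the N sets (one relevant dimension per column of each set)
    # into one matrix and split its column space with the SVD: the first L
    # left singular vectors span the signal space and are returned as the
    # averaged relevant dimensions.
    A = np.hstack(sets)
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :L]

true = np.linalg.qr(rng.standard_normal((20, 2)))[0]   # true 2-D relevant space
sets = []
for _ in range(5):
    # Each set spans the same space in a rotated coordinate system, plus noise.
    R = np.linalg.qr(rng.standard_normal((2, 2)))[0]
    sets.append(true @ R + 0.02 * rng.standard_normal((20, 2)))
avg = average_relevant_dimensions(sets, L=2)
```

Because each set may sit in its own coordinate system, comparing the result to the truth must be done through the projectors onto the subspaces, not element by element.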

We scaled the kernels by the typical stimulus amplitude *z*, multiplying each kernel of order *r* by *z*^{ r}: *k*_{1}^{ s}(*i*, *j*) = *k*_{1}(*i*, *j*) *z* and *k*_{2}^{ s}(*i*_{1}, *j*_{1}, *i*_{2}, *j*_{2}) = *k*_{2}(*i*_{1}, *j*_{1}, *i*_{2}, *j*_{2}) *z*^{2}.

First-order kernels, *k*_{1}(*i*, *j*), have two independent variables (the dimensions of space) and one dependent variable (the amplitude). The amplitude is represented on the vertical axis of the 3D plots, with *i* and *j* along the *X*- and *Y*-axes, respectively. Second-order kernels, *k*_{2}(*i*_{1}, *j*_{1}, *i*_{2}, *j*_{2}), have four independent variables and one dependent variable. We display second-order kernel slices with respect to different reference positions. That is, we fix the reference position (*i*_{1}, *j*_{1}) and plot the value *k*_{2}(*i*_{1}, *j*_{1}, *i*_{2}, *j*_{2}) on the vertical axis, with *i*_{2} and *j*_{2} on the *X*- and *Y*-axes, respectively. These plots show contributions to the response of interactions between individual pixels of the image and pixel *x*(*i*_{1}, *j*_{1}). The kernels shown in the Results section have been scaled (Scaling kernels section). This allows the reader to appreciate visually the average contribution of a kernel to the cell response and to compare the contributions from different kernels. For predicting responses, on average, larger kernels are more relevant than smaller ones.

To measure the goodness of fit between true and estimated kernels, we used the *MSE*. The *MSE* is a scale-sensitive measure: if true and estimated kernels have small amplitude, then the *MSE* between them will be small, independent of how well they correlate. For measuring errors between scaled kernels, this scale sensitivity is a good property, because scaled kernels with small amplitudes contribute little to the response, and thus it matters little how well the estimated kernels correlate with the true kernels. Moreover, the *MSE* allows the comparison of errors across different kernels. This is because the amplitude of scaled kernels indicates their mean contribution to the response. Therefore, a similar *MSE* for different scaled kernels indicates that they contribute similar errors to the response.

We defined the contribution of the *i*th term to the response *y*(**x**) to image **x** as contrib_{ i}(**x**) = |*y*_{ i}(**x**)| / ∑_{ j = 0}^{ Q} |*y*_{ j}(**x**)|.

We used the coefficient of determination (*r*^{2} statistic) of the fitted linear model as the goodness-of-fit measure. The *r*^{2} statistic is the ratio between the reduction in variance achieved by the regression model (the total variance of the outputs minus the variance of the residuals) and the total variance of the outputs. Without a relationship between the true relevant dimension and its projection onto the estimated relevant space, the *r*^{2} statistic would be 0. In contrast, if the true relevant dimension were identical to its projection, then the *r*^{2} statistic would be 1. For the simple cell, the *r*^{2} statistic for the analytically derived relevant dimension and its projection onto the estimated relevant space was .94.

We quantified the goodness of fit between true and estimated kernels with the *MSE* (Goodness of fit of kernels section). For the simple cell, the *MSE* between scaled true and estimated kernels was 2.42E−03 for the first-order kernel (Figure 6) and 1.83E−03 for the second-order kernel slice with respect to Position (6, 6) (Figure 7). This slice had the largest *MSE* among all second-order kernel slices (the *MSE* averaged across all second-order kernel slices was 1.74E−04). Therefore, despite the noise, the estimated kernels accurately approximate their true values.

Figure 8a plots the *r*^{2} statistic for the regression between the coefficients of the true and estimated relevant dimensions as a function of the number of inputs used in the estimation. For this plot, the average response of the simulated simple cell was 5 spikes/image. Figure 8b plots the same *r*^{2} statistic as a function of the number of spikes per image in the simulated responses, when using 5,000 inputs in the estimation. These figures show that the two PPR relevant dimensions approximate the true relevant dimension of the simple-cell model better than the relevant dimension estimated by STA (black vs. green curves). The superiority of PPR over STA is not due to PPR using two relevant dimensions to approximate the true relevant dimension while STA uses only one: PPR also outperforms STA when using a single relevant dimension (red vs. green curves).

For the complex cell, the *r*^{2} statistic between the true relevant dimensions and their projections onto the estimated relevant space was also high (first relevant dimension, *r*^{2} = .96; second relevant dimension, *r*^{2} = .95).

To quantify the goodness of fit between true and estimated kernels, we computed the *MSE* between them (Goodness of fit of kernels section). The *MSE* between scaled true and estimated kernels was 5.6E−04 for the first-order kernel (Figure 13) and 1.42E−03 for the second-order kernel slice with respect to Position (6, 6) (Figure 14). This slice was the one with maximum *MSE* among all slices (the mean *MSE* across all slices was 3.06E−04). Thus, despite the noise, the estimated kernels accurately approximate their true values.

Figure 15a plots the average *r*^{2} statistic as a function of the number of images for several conditions involving PPR and STC. Figure 15b plots the average *r*^{2} statistic as a function of the average number of spikes in the responses. STC achieves its best performance with responses to natural images with equal variance (red curve), as in Touryan et al. (2005). When the responses correspond to nonequal-variance images, STC for natural images performs poorly (blue curve), and the variance-equalization preprocessing step significantly improves its performance (green curve). With nonequal-variance data, PPR outperforms STC for natural images for any number of inputs or signal-to-noise ratio (black vs. green and blue curves). However, STC has an advantage over PPR in running time. PPR is a time-consuming iterative algorithm, requiring several estimations of smooth functions and least squares minimizations. STC is a fast algorithm that only requires the computation of covariance matrices and their spectral decomposition. A fast algorithm was essential for the online estimation of the relevant dimensions of a cell, an impressive task performed by Touryan et al. (2005).

The *kernels-expansion technique*, initially proposed by Wiener for the estimation of Wiener kernels, has been the method of choice for the estimation of Volterra models (Amorocho & Brandstetter, 1971; Bose, 1956; Lee, 1964; Marmarelis, 1993; Watanabe & Stark, 1975). It reduces the number of parameters by compacting the representation of kernels, and it avoids the computation of cross-correlations, performing instead least squares fitting of the actual data to model parameters. The representation of Volterra kernels is compacted by expressing them as linear combinations of a few basis functions. One then uses the compact representation to build a low-dimensional Volterra model. Then, the kernels of the original Volterra model are reconstructed from the estimated parameters of the low-dimensional Volterra model.

In the kernels-expansion technique, the experimenter must select the number of basis functions, corresponding to parameter *L* in Equations 15 and 16, to represent the kernels. The experimenter must also select the parameter *α* of the Laguerre basis functions. Marmarelis recommended a judicious selection of *L* and *α* based on trial and error. Later, Mitsis and Marmarelis (2002) proposed a method to learn the parameter *α* from input–output data.

- The effectiveness of the low-dimensional representation of Volterra models depends upon using a small set of basis functions. In the kernels-expansion technique, the assumption for effectiveness is that the kernels of the Volterra model can be represented using a small set of basis functions. This assumption cannot be tested experimentally a priori. In the Volterra relevant-space technique, the assumption is that, for a given cell, the relevant space is low-dimensional. As demonstrated in previous studies (de Ruyter van Steveninck & Bialek, 1988; Rust et al., 2005; Sharpee et al., 2006), this assumption has experimental support.
- Optimization of Equation 8 allows the Volterra relevant-space technique to estimate the relevant dimensions from the data. Consequently, the Volterra relevant-space technique avoids the bias introduced in the estimates of kernels by an improper ad hoc selection of basis functions.
- As described in the Volterra relevant-space technique section under the Receptive-field estimation section, the solution of Equation 8 is a well-studied problem in different fields of science. We can then use the expertise from these fields in the estimation of Volterra models.

- Convergence of the Volterra series requires certain smoothness conditions on the function being approximated (Palm & Poggio, 1977). This requirement is not severe because the firing rate of visual cells can be adequately approximated by smooth functions.
- The response could depend on very high order Volterra terms. Consequently, even with the use of the Volterra relevant-space technique, we might not be able to estimate them from a limited amount of physiological data. However, because many cells in the early visual system have been adequately characterized with linear or quadratic models, this scenario does not seem to be a strong limitation. Furthermore, studies of the statistics of natural images find that the distributions of variables that modulate responses tend to have maxima at values that generate little response. For example, the distribution of contrasts tends to be maximal at zero (Balboa & Grzywacz, 2000; Ruderman & Bialek, 1994; Tadmor & Tolhurst, 2000; Zhu & Mumford, 1997). Therefore, responses to natural images may be low, engaging high-order nonlinearities relatively little.
- A cell response may admit a Volterra representation, with all relevant terms of moderate order, but it may not depend on a low-dimensional subspace of the image space. In such a case, the Volterra relevant-space technique will not provide substantial dimensionality reduction to allow the estimation of high-order Volterra models. However, physiological experiments with some cells in the early visual system come to our aid (de Ruyter van Steveninck & Bialek, 1988; Rust et al., 2005; Sharpee et al., 2004), showing that the responses of these cells depend on such a low-dimensional subspace.

*Proposition*: If a Volterra model *y*_{ Q,k}(**x**): ℝ^{ N × N} → ℝ and a set of orthonormal basis functions {*b*_{ 1}, …, *b*_{ L}: *b*_{ l} ∈ ℝ^{ N × N}} fulfill Equation 8, then Equation 3 can be expressed as Equation 7, where the coefficients q_{ 0}, …, *q*_{ Q} are obtained by projecting the kernels onto the basis functions.

*Proof*: Assume that the Volterra model *y*_{ Q,k}(**x**): ℝ^{ N × N} → ℝ and the space spanned by {*b*_{ 1}, …, *b*_{ L}: *b*_{ l} ∈ ℝ^{ N × N}} satisfy Equation 8, and consider the projection coefficients of the kernels in this space.

If the response of the Volterra model, *y*_{ Q,k}(**x**), is analytic on an open set |**x**| < *R*_{0} centered at zero with radius *R*_{0}, then at each point **x** in that set, *y*_{ Q,k}(**x**) has the Taylor-series representation (Brown & Churchill, 1996)

$$y_{Q,k}(\mathbf{x}) = \sum_{r=0}^{\infty} \frac{1}{r!} \sum_{i_1, j_1} \cdots \sum_{i_r, j_r} \frac{\partial^r y_{Q,k}}{\partial x(i_1, j_1) \cdots \partial x(i_r, j_r)}(\mathbf{0})\; x(i_1, j_1) \cdots x(i_r, j_r).$$

PPR models the response of a system, *y*, to its input, **x** = (*x*_{1}, …, *x*_{ N})′, using the nonlinear transformation given in Equation 19 (Friedman, 1984; Friedman & Stuetzle, 1981). The *projection* part of the name indicates that the input vector **x** is projected onto the direction vectors **a**_{1}, **a**_{2}, …, **a**_{M0}. In turn, the *pursuit* part indicates that an optimization technique is used to find a good set of direction vectors. The estimation of the PPR model parameters is a two-stage process, comprising a forward stepwise procedure followed by a backward stepwise procedure. The PPR algorithm is controlled by two parameters, *M* and *M*_{0}. Parameter *M* controls the forward procedure, and parameter *M*_{0} controls the backward procedure.

In the forward procedure, an *M*-term model of the form given by Equation 19 is constructed. First, a trial direction **a**_{1} is used to compute the values *z*_{ i1} = 〈**a**_{1}, **x**_{ i}〉, *i* = 1, …, *n*, where **x**_{ i} = (*x*_{ i1}, …, *x*_{ iN})′ is the *i*th input and *n* is the number of inputs. Then, denoting the response of the system to input **x**_{ i} by *y*_{ i} and the residual by *r*_{ i}^{1} = *y*_{ i} − ȳ, a scatter plot of the residuals against the projections (*r*_{ i}^{1} vs. *z*_{ i1}) is constructed. This scatter plot is smoothed to obtain a first estimate of *φ*_{1} in Equation 19. Next, **a**_{1} is varied to minimize the sum of squares ∑_{ i}(*r*_{ i}^{1} − *φ*_{1}(*z*_{ i1}))^{2}, where for each **a**_{1} in the optimization procedure, a new *φ*_{1} is computed using the smoothing procedure. The result of this estimation, the pair (**a**_{1}, *φ*_{1}), is then standardized according to Equation 19, and the corresponding value *β*_{1} is calculated. We now have the approximation *y*_{ i} ≈ *β*_{1}*φ*_{1}(〈**a**_{1}, **x**_{ i}〉), *i* = 1, …, *n*. Next, we treat *y*_{ i} − *β*_{1}*φ*_{1}(*z*_{ i1}) as the response. We then fit a second term *β*_{2}*φ*_{2}(*z*_{ i2}) to this modified response in the same manner that we fitted *β*_{1}*φ*_{1}(*z*_{ i1}) to *y*_{ i}, obtaining the approximation *y*_{ i} ≈ *β*_{1}*φ*_{1}(〈**a**_{1}, **x**_{ i}〉) + *β*_{2}*φ*_{2}(〈**a**_{2}, **x**_{ i}〉), *i* = 1, …, *n*. Continuing in this fashion, we arrive at the forward stepwise estimated model with *M* terms.
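A toy sketch of one forward step in two dimensions; the brute-force direction scan and the polynomial smoother are stand-ins for the optimization and smoothing procedures described above:

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.standard_normal((2000, 2))
a_true = np.array([0.6, 0.8])
y = np.tanh(X @ a_true)            # one-term projection pursuit target

def smooth(z, y, deg=5):
    # Stand-in smoother: a low-order polynomial fit of y against z.
    # (Friedman's algorithm uses a local "supersmoother" instead.)
    c = np.polyfit(z, y, deg)
    return lambda t: np.polyval(c, t)

def forward_step(X, y, n_angles=360):
    # One forward step of PPR for 2-D inputs: scan trial unit directions,
    # smooth y against each projection z = <a, x>, and keep the direction
    # whose smooth gives the smallest sum of squared residuals.
    best = None
    for ang in np.linspace(0.0, np.pi, n_angles, endpoint=False):
        a = np.array([np.cos(ang), np.sin(ang)])
        z = X @ a
        phi = smooth(z, y)
        sse = np.sum((y - phi(z)) ** 2)
        if best is None or sse < best[0]:
            best = (sse, a, phi)
    return best[1], best[2]

a_hat, phi_hat = forward_step(X, y)
```

At the true direction, the only residual error is the smoother's approximation of tanh; at any other direction, the unexplained mixture inflates the sum of squares, so the scan recovers the direction (up to sign).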

In the backward procedure, models with *m* = *M* − 1, *M* − 2, …, *M*_{0} terms are fit in a backward stepwise manner. For each term in the model, the sum of squared residuals is minimized with respect to the parameters *β*_{ l}, **a**_{ l}, and *φ*_{ l}. The initial values for these parameters are the solutions for the *m* most important terms of the *m* + 1 terms in the previous order-(*m* + 1) model, where importance is measured by |*β*_{ l}|.