Probing the visual system with the ensemble of signals that occur in the natural environment may reveal aspects of processing that are not evident in the neural responses to artificial stimulus sets, such as conventional bars and sinusoidal gratings. However, unsolved is the question of how to use complex natural stimulation, many aspects of which the experimenter cannot completely specify, to study neural processing. Here a method is presented to investigate the structure of a neuron’s receptive field based on its response to movie clips and other stimulus ensembles. As a particular case, the technique provides an estimate of the conventional first-order receptive field of a neuron, similar to what can be obtained with other reverse-correlation schemes. This is demonstrated experimentally and with computer simulations. Our analysis also revealed that the receptive fields of both simple and complex cells had regions where image boundaries, independent of their contrast sign, would enhance or suppress the cell’s response. In some cases, these signals were tuned for the orientation of the boundary. This demonstrates for the first time that it might be feasible to investigate the receptive field structure of visual neurons from their responses to natural image sequences.

*Macaca fascicularis*) in compliance with National Institutes of Health guidelines as described elsewhere (Ringach et al., 1997a). Natural image sequences were generated by digitally sampling commercially available videotapes in VHS/NTSC format. A Silicon Graphics R10000 Solid Impact was used to sample frames at a spatial resolution of 320 × 240 pixels (6 deg × 4.5 deg of visual angle) and at a temporal rate of 15 Hz. The selected movies included both man-made and natural landscape scenes. Six segments of 30-s duration were sampled from eight different movies, making a total of 24 minutes of video. The movies were compressed using Silicon Graphics’ MVC2 compression scheme (proprietary) and stored on a disk. The compressed data fitted in 480 megabytes of memory. A Silicon Graphics O2 R5000 computer played back the images during the experiment on a computer screen that measured 34.3 cm wide by 27.4 cm high. The refresh rate of the monitor was 100 Hz and each movie image was presented for six consecutive frames. Thus, the effective playback rate was 16.6 Hz—slightly faster than the sampling rate. The mean luminance of the display was 56 cd/m

^{2}. Stimulation was monocular to the dominant eye (the other eye was occluded). Movie clips were effective in evoking responses from V1 cells; the mean spike rate to natural image stimulation was ≈10 spikes/s. We think these movie clips have statistics similar to those used in other studies of natural image sequences, such as those employed by van Hateren and van der Schaaf, (1998) who sampled videos from Dutch, British and German TV broadcasts.

**I**(

*x,y,t*) denote the value of a pixel at location (

*x,y*) and time

*t*. This is normally a three-dimensional vector representing the values of the red, green and blue components of the pixel. For the response of the cell we consider the total number of spikes occurring within a time window

*Δt*centered at time

*t*. This value is denoted by

*r*(

*t*).

*r*(

*t*), depends on the recent history of the visual stimulus

*s*() = {

**I**(

*x,y,t′*)‖

*t*−

*T*<

*t′*<

*t*{ where

*T*is the width of the analysis window. This relationship is fully characterized by the joint probability of the stimulus and the response

*P*(

*s,r*) (Rieke, Warland, van Steveninck, & Bialek, 1997). Due to the high dimensionality of the stimulus space, however, estimating this probability distribution is not possible in the given experimental time. Instead, methods that make specific assumptions about the relationship between stimulus and response are required. Here we consider a general class of models described by where is the mean response rate,

**Φ**(

*x,y,t*) represents a feature map sequence, and

*w*(

*x,y.t*) are weights representing a spatio-temporal kernel of the receptive field. The feature map is a function (linear or nonlinear) of the input image sequence,

**I**(

*x,y,t*). Therefore, the cell’s modulation away from its mean response is modeled as a linear spatio-temporal filter acting on the feature map sequence. The choice of

**Φ**is limited only by our intuition about what “features” of the image sequence the cell at hand may be representing.

*L*(

*x,y,t*) = w

^{T}I(

*x,y,t*) is the luminance of the pixel at location (

*x,y*) at time

*t*, and is the mean luminance of the frame at time

*t*. The luminance of a pixel is obtained by weighting the values of the red, green, and blue guns appropriately (which is achieved by multiplying with a vector w obtained from the calibration of the display). As an example of this calculation, we present original color frames from the stimulus in Figure 1a, and their associated luminance contrast map Figure 1b. With this definition of the feature map, the modulation of the cell’s response is modeled as a linear function of the luminance contrast values within its receptive field—a commonly used model for simple cells.

**Φ**(

*x,y,z*) is selected, we want to find the optimal spatio-temporal weighting function, or kernel,

*w*(

*x,y,t*) that predicts the response of the cell in the least squares sense according to the model in Equation (1). When the input is white noise stimuli, one computes this kernel by calculating the mean input before a spike (Lee & Schetzen, 1965; deBoer & Kuyper, 1968). This computation does not apply to natural images because there are strong spatio-temporal correlations in the input sequence and resulting feature maps. The autocorrelation of the input must be taken into account. To do this, we used a standard technique, recursive least-squares (RLS), to calculate the optimal kernel. The input to the algorithm is the feature map sequence and the response of the cell. The output is the optimal kernel that transforms the feature map into the response. This is done via a recursive procedure that refines our guess of the kernel as more and more data are added to the calculation. At the first step of the calculation, the weights (kernel values) are all set to zero. At the

*n*th step of the calculation, the

*old*estimate of the kernel, at step

*n*− 1, is used to predict the response of the cell. The error between the predicted and true response is used to make a correction to the weighting function and generate a new estimate. The correction in the RLS algorithm is computationally complex but basically it is the present input image filtered so as to correct for image correlation, and is weighted by the magnitude of the error. It can be shown that the expected value of the RLS algorithm’s estimate is equal to the true value of the kernel (Haykin, 1991).

*y*(

*n*) (Figure 2). Next, in an attempt to make the simulation realistic,

*y*(

*n*) was perturbed by a large amount of additive Gaussian noise,

*z*(

*n*). The standard deviation of

*y*(

*n*) and

*z*(

*n*) were equal, i.e., the signal-to-noise ratio was one. Finally, the resulting signal

*w*(

*n*) =

*y*(

*n*) +

*z*(

*n*) was passed through a hard rectifier (Figure 2, right). The threshold was set at a value that caused the model cell to “fire” (i.e., generate a nonzero output) only 12% of the time. This is equivalent to a mean response rate of ≈ 2 spikes/s. The output variance was 2.1 (spikes/s).

^{2}These numbers are close to the median values for our data: median response 2.4 spikes/sec and variance (2.3 spikes/s).

^{2}The movies used in the simulation, and the length of the data record, were identical to those in the actual experiment. The simulated receptive field had two symmetric subfields, one excitatory (indicated in red) and one inhibitory (indicated in blue), and was defined on a square grid of 17 × 17 pixels representing 0.65 deg × 0.65 deg of visual angle. These parameters were selected to test the proposed method under stringent conditions: the algorithm had to estimate 289 parameters from very noisy thresholded data in the presence of highly correlated input signals (the condition number [Golub & van Loan, 1989]) of the luminance covariance matrix was ≈ 3 × 10

^{3}. The resulting estimate of the receptive field is very good (Figure 2, lower receptive field): the correlation coefficient between the true and estimated weights equals 0.88. Thus, the algorithm can perform very well even in the presence of strong output nonlinearity and large additive noise levels.

*t*. A fixed delay of τ = 60 ms, which corresponds to the median time-to-peak in our V1 population (Ringach et al., 1997a), was used for all cells. Representative results are shown in Figure 3. Each panel in the figure corresponds to a different cell and depicts the luminance contrast kernel on the left, and edge kernel on the right. Regions in red correspond to positive values of the kernel; those in blue represent negative values. For comparison, the optimal stimulus orientation obtained with drifting sinusoidal gratings is shown by the orientation of the bar on top of the kernels for each cell.

*p*< 7 × 10

^{−5}.

**Φ**

_{1}and

**Φ**

_{2}, a next step would be to build a compound model by defining a new feature map that represents the concatenation of

**Φ**

_{1}and

**Φ**

_{2},

**Φ**= [

**Φ**

_{1}

**Φ**

_{2}, and run the same algorithm which will compute new kernels with respect to these two maps taking into account any possible cross-correlations. If the feature maps are approximately orthogonal (i.e., uncorrelated), the resulting kernels are clearly the same as those obtained by doing a regression on each feature map individually. This is the case in this study, as the maximal cross-covariance between the luminance-contrast and “edge” map was very small, 0.04, meaning that the maps are nearly orthogonal. As a consequence, estimating the maps separately is justified in our case. We note that the responses of cells to image attributes other than luminance contrast is consistent with previous data showing that cortical cells may respond to image boundaries that are not defined by luminance cues alone, such as illusory contours (Grosof, Hawken, & Shapley, 1993) and second order motion (Mareschal & Baker, 1999). Thus, we do not believe this phenomenon arises only when using natural image stimulation.

*t*+ τ from the feature map at time

*t*. The response at time

*t*+ τ was defined as the total number of spikes in the segment. We picked a window width of Δ

*t*= 60 ms and used a fixed delay τ =60 ms (this is the average delay in our population of V1 cells (Ringach et al., 1997a). The central portion of the feature map was subsampled on a square grid of 17 × 17. The visual area represented by this grid was varied from cell to cell to make sure it covered their receptive fields. These data were arranged in a column data vector,

**u**(

*t*), having 289 entries.

**I**(

*i,j,n*) represents the image at location (

*i,j*) for the

*n*-th stimulus frame in the movie sequence and similarly for the other variables. The variable

**w**(

*n*) is an

*N*× 1 vector representing the estimate of the weights at time step

*n*. When we begin the process we have no data, so we set the initial value of

**w**to zero (line 1).

*N*is the total number of parameters to be estimated. In our case we have

*N*= 289 parameters.

**P**(

*n*) is an

*N*×

*N*matrix representing a recursive estimate of the inverse of the correlation matrix, and δ = 0.00001 is a small number. Two modifications were done to the standard RLS algorithm. First, we added a recursive estimate of the mean response of the cell µ that is subtracted from the response

*r*(

*n*) at each step (lines 8 and 9). This is done to factor out slow trends in the excitability of the neuron, as we are only interested in explaining departures in the response of the cell from its mean. The forgetting factor β = 0.99 corresponds to a time constant of ≈ 6 s. A second modification is the spatial smoothing of the estimated coefficients in lines 18 and 19. The standard RLS algorithm does not include any knowledge about the spatial relationship between the different coordinates. Here, we chose to smooth the estimates with a 3 × 3 pixel Gaussian kernel every

*Q*= 450 frames (equivalent to 30 s of video). The smoothing kernel had a time-varying width given by σ(

*n*) = 0.4 + 5

*Q/n*pixels. Our simulations indicated that adding this sort of “annealing” smoothing step increases the convergence rate of the algorithm.

**w**is expected to converge on the mean. In other words, the estimation of the receptive field is unbiased. Second, the variance of the prediction error converges to the variance of the noise in the system, i.e., under the assumption that the response of the neuron is contaminated by independent noise, the variance of the response prediction and the true response are equal. Thus, we are guaranteed that the model in Equation (1) will match both the mean and variance of the neural response.

**w**will never decrease beyond a lower bound set by the noise in the system. Thus, we expect Δ

**w**(

*n*) to decrease and asymptote at some finite value. At this point we considered the algorithm to have converged on the mean. After this time the values of

**w**(

*n*) may be averaged to yield more accurate estimates. In the population of cells studied the algorithm converged, on average, after 15 minutes of video. Finally, in those cases where the calculation of the feature map required an estimate of the gradient, Sobel operators were used (Pratt, 1991

*Nature Neuroscience*, 4, 409–416. [PubMed]