Determining the features of natural stimuli that are most useful for specific natural tasks is critical for understanding perceptual systems. A new approach is described that involves finding the optimal encoder for the natural task of interest, given a relatively small population of noisy “neurons” between the encoder and decoder. The optimal encoder, which necessarily specifies the most useful features, is found by maximizing accuracy in the natural task, where the decoder is the Bayesian ideal observer operating on the population responses. The approach is illustrated for a patch identification task, where the goal is to identify patches of natural image, and for a foreground identification task, where the goal is to identify which side of a natural surface boundary belongs to the foreground object. The optimal features (receptive fields) are intuitive and perform well in the two tasks. The approach also provides insight into general principles of neural encoding and decoding.

$\boldsymbol{\omega}$ that can take one of a discrete number of possible values, indexed by the integer variable $k$, and we represent the received stimulus by a vector $\mathbf{s}$ (e.g., a patch of image or a sample of sound) that also can take one of a discrete number of possible values, indexed by the pair of integer variables $(k, l)$, where $l$ is the specific exemplar from category $k$. Thus, the natural scene statistics for the identification task can be represented by a joint probability distribution, $p_0(k, l)$, and any given randomly sampled stimulus in the task can be regarded as a random sample $(K, L)$ from this distribution. It is the task-relevant structure of this unknown joint probability distribution that we wish to characterize.

$p_0(k, l)$, we suppose stimuli sampled from this distribution are encoded in the responses of a population of $q$ neurons. The responses to a stimulus $\mathbf{s}(k, l)$ can be represented by a random vector, $\mathbf{R}_q(k, l) = [R_1(k, l), \ldots, R_q(k, l)]$, and the observer's guess of the category (i.e., the observer's response) based on these random responses can be represented by $\hat{\boldsymbol{\omega}}[\mathbf{R}_q(k, l)]$. The mean response functions, $\mathbf{r}_q(k, l) = [r_1(k, l), \ldots, r_q(k, l)]$, describe the mapping between the stimulus and the mean response of each neuron in the population and can be regarded as the encoding functions. For example, each encoding function might be defined by a unique receptive field or tuning function. The aim of AMA is to find encoding functions that maximize identification performance in a specific categorization task.

$\mathbf{N}_q = [N_1, \ldots, N_q]$ that may be correlated across neurons (Gawne & Richmond, 1993; Zohary, Shadlen, & Newsome, 1994) and may depend on the mean response of the neuron (e.g., in cortical neurons the variance of the response is proportional to the mean response; Geisler & Albrecht, 1997; Tolhurst, Movshon, & Dean, 1983).

$[r_1(k, l), \ldots, r_q(k, l)]$, that maximize accuracy in the identification task. To do this, we must consider the ideal observer whose input is the neural population response. The decision rule of the ideal observer is the optimal decoder. If the goal is to maximize accuracy, then the ideal decision rule is to pick the category that is most likely given the observed neural population response (i.e., the category with the greatest posterior probability):
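The maximum-posterior rule described in words above can be written compactly. This is a reconstruction from the surrounding definitions, not a copy of the original display, and the symbol $\hat{\omega}$ for the observer's guess is an assumed notation:

```latex
\hat{\omega}\big[\mathbf{R}_q(k, l)\big] = \arg\max_{i} \; p\big(i \mid \mathbf{R}_q(k, l)\big)
```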

$\mathbf{s}(k, l)$. The posterior probability distribution computed by the ideal observer from the population response to this stimulus is $p(i \mid \mathbf{R}_q(k, l))$. This posterior probability distribution varies randomly because of the randomness of the neural population responses to the same stimulus, and thus the ideal observer will be accurate if this posterior probability distribution is on average as close as possible to a probability distribution $f(i)$ that is 1 at the correct category $k$ and is 0 elsewhere. A principled measure of the difference between two probability distributions $f(x)$ and $g(x)$ is the *relative entropy*, $D$, also known as the Kullback–Leibler divergence (Cover & Thomas, 2006):
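For reference, the standard definition of relative entropy is given first, followed by the special case that arises when $f$ is 1 at the correct category $k$ and 0 elsewhere and $g$ is the ideal observer's posterior. The second line is a worked consequence of the definition (using the convention $0 \log 0 = 0$); labelling it $D_q(k, l)$ is an assumption based on how that symbol is used in the text:

```latex
D(f \,\|\, g) = \sum_{x} f(x) \log \frac{f(x)}{g(x)}
% With f(i) = 1 at i = k and 0 elsewhere, only the i = k term survives:
D_q(k, l) = -\log p\big(k \mid \mathbf{R}_q(k, l)\big)
```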

$D_q(k, l)$ decreases toward zero monotonically as the posterior probability at the correct category approaches 1.0. It follows intuitively that the overall accuracy of the ideal observer will be maximized when the average difference $\bar{D}_q$ over all possible stimuli is minimized, where

$(K_i, L_i)$ of natural stimuli, the optimal encoding functions can be obtained by minimizing the sample mean:

$\mathbf{x}(k, l)$ is the un-normalized image patch, $\bar{x}(k, l)$ is the mean gray level of the patch, $sd(k, l)$ is the standard deviation of the gray level of the patch, and $n_{\text{pixels}}$ is the number of pixels in the patch.
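The normalization equation itself is not visible in this text, so the following is a minimal sketch under an assumed form: subtract the mean gray level, then divide by the standard deviation times $\sqrt{n_{\text{pixels}}}$. That form uses exactly the quantities defined above and yields a vector of length 1.0, which matches the later statement that the stimulus is normalized to unit length. The function name `normalize_patch` and the random test patch are hypothetical.

```python
import numpy as np

def normalize_patch(x):
    """Return a zero-mean, unit-length version of an image patch (assumed form)."""
    x = np.asarray(x, dtype=float).ravel()   # flatten, e.g. 12 x 12 -> 144 values
    n_pixels = x.size
    mean = x.mean()                          # mean gray level of the patch
    sd = x.std()                             # population standard deviation (ddof=0)
    if sd == 0:
        raise ValueError("a uniform patch has no contrast to normalize")
    return (x - mean) / (sd * np.sqrt(n_pixels))

rng = np.random.default_rng(0)
patch = rng.integers(0, 256, size=(12, 12))  # hypothetical 8-bit gray-level patch
s = normalize_patch(patch)
```

With the population standard deviation, the sum of squared deviations equals $n_{\text{pixels}} \cdot sd^2$, so the result has unit length by construction.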

$r_{\max}$, and (c) the neural noise is independent Gaussian with variance proportional to the mean response. The first two constraints are implemented by the following equation for the mean response of the $t$th neuron:

$\mathbf{w}_t$ is a vector of weights (normalized to a length of 1.0) that defines the 12 × 12 receptive field, and $\mathbf{s}(k, l) \cdot \mathbf{w}_t$ is the dot product of the stimulus with the receptive field. Because the stimulus and receptive field (RF) are both normalized to a vector length of 1.0, their maximum dot product is 1.0 (when the receptive field matches the stimulus) and hence the maximum possible response is $r_{\max}$. The third constraint is implemented by requiring that the probability of response $r$ from the $t$th neuron in the population is given by:

$\alpha$ is the Fano factor and $\sigma_0^2$ is the small baseline variability.^{1} The dynamic range and noise parameters of the neurons were set to the mean values for neurons in monkey V1 reported in Geisler and Albrecht (1997), for 200 ms (fixation-like) stimulus presentations: $r_{\max} = 5.7$ spks, $\alpha = 1.36$, $\sigma_0^2 = 0.23$.
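The encoding constraints can be illustrated with a short simulation. This is a sketch under stated assumptions, not the authors' implementation: the mean response is taken to be $r_{\max}$ times the dot product of stimulus and receptive field, and the Gaussian noise variance is taken to be $\alpha |r| + \sigma_0^2$, which follows the footnote's description of variance proportional to the absolute value of the mean response. The function names and the random stimulus and receptive fields are hypothetical.

```python
import numpy as np

R_MAX = 5.7        # maximum mean response (spikes), value quoted in the text
ALPHA = 1.36       # Fano factor, value quoted in the text
SIGMA0_SQ = 0.23   # baseline variance, value quoted in the text

def mean_responses(s, W):
    """Mean response of each neuron: r_max times the stimulus/RF dot product."""
    return R_MAX * (W @ s)

def sample_responses(s, W, rng):
    """Draw one noisy population response with independent Gaussian noise."""
    r = mean_responses(s, W)
    variance = ALPHA * np.abs(r) + SIGMA0_SQ   # assumed variance model
    return rng.normal(loc=r, scale=np.sqrt(variance))

rng = np.random.default_rng(1)
s = rng.standard_normal(144)
s -= s.mean()
s /= np.linalg.norm(s)                          # zero-mean, unit-length stimulus
W = rng.standard_normal((4, 144))
W /= np.linalg.norm(W, axis=1, keepdims=True)   # unit-length receptive fields
W[0] = s                                        # first RF matches the stimulus exactly
r_mean = mean_responses(s, W)
r_noisy = sample_responses(s, W, rng)
```

Because both vectors have unit length, the matched receptive field in row 0 produces the maximum mean response $r_{\max}$, and no other row can exceed it in magnitude.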

$p(k \mid \mathbf{r}_q(k, l))$ into a recursive formula using Bayes' rule (see 1), we have:

$n_k$ is the number of training samples from category $k$, $p(k \mid \mathbf{r}_0(k, l)) = n_k / n$, $n$ is the total number of training stimuli, and $Z$ is a normalization factor. In keeping with the approximation in Equation 4, the logarithm of this formula gives the average relative entropy when the stimulus is $\mathbf{s}(k, l)$. Thus, Equations 5–10 provide a closed-form expression for the average relative entropy of the posterior probability distribution (that the ideal observer computes) for arbitrary samples from the joint probability distribution of environmental categories and associated stimuli, $p_0(k, l)$.

$r_1(k, l)$ that minimizes $\bar{D}_1$ (see Equation 5); then we substitute the resulting expected posterior probability distribution $p(i \mid \mathbf{r}_1(k, l))$ into Equation 10 and find the encoding function $r_2(k, l)$ that minimizes $\bar{D}_2$; then we substitute the resulting expected posterior probability distribution $p(i \mid \mathbf{r}_2(k, l))$ into Equation 10, and so on. A consequence of this procedure is that the neural encoding functions tend to be rank ordered in how much they reduce the relative entropy, with the first function producing the largest decrease. (It is possible that a simultaneous rather than greedy procedure could lead to better performance, but we have not yet explored this more computationally intensive approach.)
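The greedy, one-function-at-a-time structure of this procedure can be sketched in code. This is a simplified illustration and not the authors' algorithm. Three things are assumptions made for the example: each new receptive field is picked from a fixed pool of candidate vectors, the posterior is evaluated at the mean response of each training stimulus, and the noise variance is $\alpha |r| + \sigma_0^2$. The names `cost` and `greedy_select` and the toy data are hypothetical.

```python
import numpy as np

R_MAX, ALPHA, SIGMA0_SQ = 5.7, 1.36, 0.23   # parameter values quoted in the text

def cost(stimuli, labels, filters):
    """Mean of -log p(correct category | mean response) over the training set."""
    r = R_MAX * stimuli @ filters.T                  # (n, q) mean responses
    var = ALPHA * np.abs(r) + SIGMA0_SQ              # (n, q) assumed noise variances
    diff = r[:, None, :] - r[None, :, :]             # response j minus mean of stimulus m
    # Gaussian log-likelihood of response j under each training stimulus m
    loglik = -0.5 * np.sum(diff**2 / var[None] + np.log(2 * np.pi * var[None]), axis=2)
    lik = np.exp(loglik - loglik.max(axis=1, keepdims=True))   # numerically stabilized
    same = labels[:, None] == labels[None, :]        # mask of same-category stimuli
    p_correct = (lik * same).sum(axis=1) / lik.sum(axis=1)
    return -np.mean(np.log(p_correct))

def greedy_select(stimuli, labels, candidates, q):
    """Pick q candidates one at a time, each minimizing the cost given earlier picks."""
    chosen, costs = [], []
    for _ in range(q):
        remaining = [i for i in range(len(candidates)) if i not in chosen]
        trial = {i: cost(stimuli, labels, candidates[chosen + [i]]) for i in remaining}
        best = min(trial, key=trial.get)
        chosen.append(best)
        costs.append(trial[best])
    return chosen, costs

# Toy data: three categories of noisy, unit-length 16-dimensional stimuli.
rng = np.random.default_rng(2)
centers = rng.standard_normal((3, 16))
labels = np.repeat(np.arange(3), 20)
stimuli = centers[labels] + 0.3 * rng.standard_normal((labels.size, 16))
stimuli -= stimuli.mean(axis=1, keepdims=True)
stimuli /= np.linalg.norm(stimuli, axis=1, keepdims=True)
candidates = rng.standard_normal((10, 16))
candidates /= np.linalg.norm(candidates, axis=1, keepdims=True)
chosen, costs = greedy_select(stimuli, labels, candidates, q=3)
```

The candidate pool stands in for whatever optimizer is used to find each encoding function. Weighting every training stimulus equally gives each category a prior of $n_k / n$, in line with the text.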

Spatial Correlations of Receptive Fields

| | RF2 | RF3 | RF4 | RF5 | RF6 |
|---|---|---|---|---|---|
| RF1 | −0.04 | −0.10 | 0.17 | −0.09 | 0.08 |
| RF2 | | 0.10 | 0.03 | −0.04 | 0.00 |
| RF3 | | | 0.08 | −0.01 | −0.02 |
| RF4 | | | | 0.01 | 0.04 |
| RF5 | | | | | −0.04 |

_{2}200). The value of the slope depends in part upon the dynamic range and reliability of the neurons; the larger the dynamic range and the lower the neural noise level, the steeper the slope.

Spatial Correlations of Receptive Fields

| | RF2 | RF3 | RF4 | RF5 | RF6 |
|---|---|---|---|---|---|
| RF1 | −0.10 | −0.60 | −0.04 | −0.11 | 0.18 |
| RF2 | | −0.15 | −0.02 | 0.96 | −0.08 |
| RF3 | | | 0.10 | −0.21 | −0.08 |
| RF4 | | | | −0.04 | −0.60 |
| RF5 | | | | | −0.07 |

$n$ principal components) might provide a near optimal encoding given a limited number of feature dimensions. Indeed, the similarity of the PCA and AMA receptive fields provides evidence that PCA, which is computationally much simpler and faster than AMA, produces receptive fields that are very effective for uniquely identifying image patches. This result also suggests that AMA is finding approximately the global optimum for this task.

$\gamma(i, j)$ that gives the utility of picking category $i$ when the correct category is $j$. This utility function is specified in the definition of the identification task. In the standard Bayesian approach, a rational observer is defined to be one that picks the category with the maximum expected utility (i.e., minimum risk):
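The maximum-expected-utility rule can be written out from the definitions above. This is a reconstruction in assumed notation (with $\hat{\omega}$ for the observer's choice), not a copy of the original display:

```latex
\hat{\omega} = \arg\max_{i} \sum_{j} \gamma(i, j)\, p\big(j \mid \mathbf{R}_q(k, l)\big)
```

When $\gamma(i, j)$ is 1 for $i = j$ and 0 otherwise, the sum reduces to $p(i \mid \mathbf{R}_q(k, l))$, so the rule picks the category with the greatest posterior probability, which is the accuracy-maximizing case.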

$\gamma(i, j)$, affect the optimal encoding functions? We have not yet systematically explored this question, but it is possible that the optimal encoding functions are relatively insensitive to modest variations of the utility function. For example, the primary effect of variations of the utility function in many identification tasks is to change decision boundaries, and such changes presumably do not change the stimulus dimensions (features) that are optimal for performance of the task, but this remains to be explored.

$\mathbf{R}_q(k, l)$ to a presentation of stimulus $\mathbf{s}(k, l)$. (Keep in mind that the ideal observer does not know that the stimulus is $\mathbf{s}(k, l)$, but does know the mean response of each neuron in the population to each stimulus in the training set.) According to Bayes' rule:
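In the notation used here, Bayes' rule for the posterior probability of category $i$ takes the following generic form. This is the textbook statement of the rule applied to the population response, not a reconstruction of the specific display that followed in the original:

```latex
p\big(i \mid \mathbf{R}_q(k, l)\big)
  = \frac{p\big(\mathbf{R}_q(k, l) \mid i\big)\, p(i)}
         {\sum_{j} p\big(\mathbf{R}_q(k, l) \mid j\big)\, p(j)}
```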

^{1} Note that making the variance proportional to the absolute value of the mean response allows negative responses. We could easily have half-wave rectified the responses to be more consistent with real neurons, but allowing negative responses reduces the number of optimal encoding functions that need to be estimated.