**Understanding how nervous systems exploit task-relevant properties of sensory stimuli to perform natural tasks is fundamental to the study of perceptual systems. However, there are few formal methods for determining which stimulus properties are most useful for a given natural task. As a consequence, it is difficult to develop principled models for how to compute task-relevant latent variables from natural signals, and it is difficult to evaluate descriptive models fit to neural response. Accuracy maximization analysis (AMA) is a recently developed Bayesian method for finding the optimal task-specific filters (receptive fields). Here, we introduce AMA–Gauss, a new, faster form of AMA that incorporates the assumption that the class-conditional filter responses are Gaussian distributed. Then, we use AMA–Gauss to show that its assumptions are justified for two fundamental visual tasks: retinal speed estimation and binocular disparity estimation. Next, we show that AMA–Gauss has striking formal similarities to popular quadratic models of neural response: the energy model and the generalized quadratic model (GQM). Together, these developments deepen our understanding of why the energy model of neural response has proven useful, improve our ability to evaluate results from subunit model fits to neural data, and should help accelerate psychophysics and neuroscience research with natural stimuli.**

**f**|| = 1.0). The expression for the cost requires the specification of five factors (see Figure 3A). These factors are (1) a well-defined task (i.e., a latent variable to estimate from high-dimensional stimuli), (2) a labeled training set of stimuli, (3) a set of encoding filters, (4) a response noise model, and (5) a cost function. The training set specifies the joint probability distribution

*X* and the stimuli **s** (Figure 3B) and implicitly defines the prior

**s**_{kl} from the labeled training set, (2) obtain a set of noisy filter responses **R**(*k, l*) from a particular (possibly nonoptimal) set of filters, (3) use the optimal non-linear decoder g(.) to obtain the optimal estimate

**f**^{opt} are those that minimize the cost (Figure 3B).

*X*_{u} given the noisy filter responses **R**(*k, l*) to stimulus **s**_{kl} is given by Bayes' rule

*N*_{lvl} is the number of latent variable levels, and *l* indexes the stimuli having latent variable value *X*_{k}. The conditional distribution of noisy responses given the latent variable is

*N*_{u} is the number of stimuli having latent variable level *X*_{u}, and *v* indexes training stimuli having that latent variable value. Conveniently,

*v* with latent variable *X*_{u} given that there are *N*_{u} such stimuli, and

*X*_{u}. Therefore, Equation 1 reduces to

**f**_{t} from set of filters

*q* is the number of filters), the mean response *r*_{t}, noisy response *R*_{t}, and noise variance

**s**_{uv} having latent variable value *X*_{u} are

*η* is a noise sample, *α* is the Fano factor, and **x** is a (possibly noise-corrupted) intensity stimulus. If *q* filters are considered simultaneously, the response distributions

*q*-dimensional: mean response vector

*γ*(.) is an arbitrary cost function and

**R**(*k, l*). The overall cost for a set of filters is the expected cost for each stimulus averaged over all stimuli

**f** that minimize the overall cost

**f**^{opt} are the optimal filters.

*N* is the total number of stimuli and *N*_{lvl} is the number of latent variable levels in the training set. As noted earlier, this compute time makes AMA impractical for large-scale problems without specialized computing resources.

*L*_{2} and *L*_{0} cost functions, and their gradients. We believe this is a valuable step toward making AMA a more practical tool in vision research.

**R** are responses to stimuli having latent variable level *X*_{u},

*X*_{u} is

*N*_{lvl} is the number of latent variable levels. The AMA–Gauss posterior (Equation 14) has a simpler form than the AMA posterior (Equation 3). Hence, whereas a single evaluation of the AMA posterior probability distribution requires

*N* is the number of stimuli in the training set (see Results). This reduction in compute time substantially improves the practicality of AMA when the Gaussian assumption is justified. Even if the Gaussian assumption is not justified, AMA–Gauss is guaranteed to make the best possible use of first- and second-order conditional response statistics, and could thus provide a decent initialization at low computational cost.
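As an illustration of why the Gaussian assumption makes the posterior cheap to evaluate, the following is a minimal sketch, not the authors' implementation. It assumes a flat prior over latent variable levels, and the function names (`fit_class_gaussians`, `gauss_posterior`) and the toy data are hypothetical. Once the class-conditional means and covariances have been estimated, each posterior evaluation loops over the levels only and never revisits the individual training stimuli.

```python
import numpy as np

def fit_class_gaussians(R, labels, n_levels):
    """Mean and covariance of the filter responses for each latent variable level.

    R: (N, q) array of filter responses; labels: (N,) integer level indices.
    """
    means = np.array([R[labels == u].mean(axis=0) for u in range(n_levels)])
    covs = np.array([np.atleast_2d(np.cov(R[labels == u], rowvar=False))
                     for u in range(n_levels)])
    return means, covs

def gauss_posterior(r, means, covs):
    """Posterior over levels for one response vector r (flat prior assumed)."""
    log_lik = np.empty(len(means))
    for u, (mu, cov) in enumerate(zip(means, covs)):
        d = r - mu
        _, logdet = np.linalg.slogdet(2 * np.pi * cov)
        log_lik[u] = -0.5 * d @ np.linalg.solve(cov, d) - 0.5 * logdet
    log_lik -= log_lik.max()  # subtract the max to avoid underflow in exp
    post = np.exp(log_lik)
    return post / post.sum()  # normalize by the sum of the likelihoods

# Toy data (hypothetical): 3 levels, 2 filters, class means spaced 5 apart.
rng = np.random.default_rng(0)
n_levels, n_per_level = 3, 500
labels = np.repeat(np.arange(n_levels), n_per_level)
R = rng.normal(size=(labels.size, 2)) + 5.0 * labels[:, None] * np.array([1.0, 0.0])
means, covs = fit_class_gaussians(R, labels, n_levels)
post = gauss_posterior(np.array([10.0, 0.0]), means, covs)
```

In this toy case the observed response sits at the mean of the third class, so nearly all posterior mass lands on that level.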

*L*_{2} cost, *L*_{0} cost, and their gradients.

*X*_{k} across all stimuli in the training set. Stimuli having latent variable value *X*_{k} are indexed by *l*, and the *i*^{th} stimulus in the training set is denoted (*k*_{i}, *l*_{i}). The likelihood function of the AMA–Gauss filters is

*L*_{2} and *L*_{0} cost. The cost function specifies the penalty assigned to different types of error. For the *L*_{2} (i.e., squared error) cost function, the expected cost for each stimulus **s**_{kl} (Equation 9) is

*L*_{0} (i.e., 0,1) cost function, the expected cost across all stimuli is closely related to the KL-divergence of the observed posterior and an idealized posterior with all its mass at the correct latent variable *X*_{k}; in both cases, cost is determined only by the posterior probability mass at the correct level of the latent variable (Burge & Jaini, 2017; Geisler et al., 2009). Here, the expected KL-divergence per stimulus is equal to the negative log-posterior probability at the correct level (Geisler et al., 2009)

*L*_{0} or KL-divergence cost.

*L*_{2} and *L*_{0} cost functions are given in Appendix B and Appendix C, respectively.
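The two per-stimulus costs discussed above can be sketched as follows. This is an illustrative sketch under stated assumptions: for the squared-error cost, the estimate is taken to be the posterior mean (the estimate that minimizes expected squared error), and the function names and example posterior are hypothetical. The KL-divergence cost follows the description in the text: the negative log-posterior probability at the correct level.

```python
import numpy as np

def l2_cost(posterior, levels, k):
    """Squared error between the posterior-mean estimate and the correct level X_k."""
    x_hat = float(posterior @ levels)  # posterior mean (assumed optimal estimate)
    return (x_hat - levels[k]) ** 2

def kl_cost(posterior, k):
    """Negative log-posterior probability at the correct level X_k."""
    return float(-np.log(posterior[k]))

levels = np.array([-1.0, 0.0, 1.0])       # hypothetical latent variable levels
posterior = np.array([0.1, 0.2, 0.7])     # hypothetical posterior for one stimulus
cost_l2 = l2_cost(posterior, levels, k=2)  # estimate is 0.6, so the cost is 0.4**2
cost_kl = kl_cost(posterior, k=2)          # -log(0.7)
```

Both costs fall as posterior mass concentrates at the correct level, which is why the two cost functions tend to favor similar filters.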

*L*_{0} cost, *L*_{2} cost) exert pressure on the filters to produce class-conditional response distributions that are as different as possible given the constraints imposed by the stimuli. Hence, the optimal filters will (1) maximize the differences between the class-conditional means or covariances and (2) minimize the generalized variance for each class-conditional response distribution. (Generalized variance is a measure of overall scatter, represents the squared volume of the ellipse, and is given by the determinant of the covariance matrix.)
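For concreteness, generalized variance can be computed directly from a covariance matrix. The sketch below uses two hypothetical 2 × 2 covariance matrices with equal marginal variances; the strongly correlated one has the smaller determinant and hence the smaller overall scatter.

```python
import numpy as np

def generalized_variance(cov):
    """Determinant of the covariance matrix."""
    return float(np.linalg.det(np.atleast_2d(cov)))

correlated = np.array([[1.0, 0.9], [0.9, 1.0]])    # elongated, thin ellipse
uncorrelated = np.array([[1.0, 0.0], [0.0, 1.0]])  # circular scatter
gv_correlated = generalized_variance(correlated)      # 1 - 0.9**2
gv_uncorrelated = generalized_variance(uncorrelated)  # 1
```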

*R*, the likelihood of a particular level of the latent variable *X*_{u} is found by evaluating its response distribution at the observed response (blue dot; Figure 4A, D). The posterior probability of latent variable *X*_{u} is obtained by normalizing with the sum of the likelihoods (blue, red, and green dots; Figure 4B, E). With two filters, the response distributions are two-dimensional (red, blue, and green ellipses with corresponding marginals; Figure 4C, F). The second filter will increase the posterior probability mass at the correct value of the latent variable (not shown) because the second filter selects for useful stimulus features that the first filter does not. These hypothetical cases illustrate why cost is minimized when mean or covariance differences are maximized between classes and generalized variance is minimized within classes. The filters that make the response distributions as different as possible make it as easy as possible to decode the latent variable.

^{1} to find the receptive fields that are optimal for estimating speed and disparity from local patches of natural images. Second, we compare AMA–Gauss and AMA and show that both methods (1) learn the same filters and (2) converge to the same cost for both tasks. Third, we verify that AMA–Gauss achieves the expected reductions in compute-time: filter-learning with AMA–Gauss is linear whereas AMA is quadratic in the number of stimuli in the training set. Fourth, we show that the class-conditional filter responses are approximately Gaussian, thereby justifying the Gaussian assumption for these tasks. Fifth, we show how contrast normalization contributes to the Gaussianity of the class-conditional responses. Sixth, we explain how the filter response distributions determine the likelihood functions and optimal pooling rules. Seventh, we explain how these results provide a normative explanation for why energy-model-like computations describe the response properties of neurons involved in these tasks. Eighth, and last, we establish the formal relationship between AMA–Gauss and the GQM, a recently developed method for neural systems identification.

*L*_{0} cost function and constant, additive, independent filter response noise. In general, we have found that the optimal filters are quite robust to the choice of cost function when trained with natural stimuli (Burge & Jaini, 2017). Figure 5 shows results for the retinal speed estimation task. Figure 6 shows results for the disparity estimation task. AMA and AMA–Gauss learn nearly identical encoding filters (Figures 5A and 6A; *ρ* > 0.96) and exhibit nearly identical estimation costs (Figures 5B and 6B); note, however, that these filter and performance similarities are not guaranteed (see Appendix D). AMA–Gauss also dramatically reduces compute time (Figures 5C, D and 6C, D). With AMA, the time required to learn filters increases quadratically with the number of stimuli in the training set. With AMA–Gauss, filter learning time increases linearly with the number of stimuli. Finally, the class-conditional filter responses are approximately Gaussian (Figures 5E, F and 6E, F), indicating that the Gaussian assumption is justified for both tasks. Quadratic computations are therefore required to determine the likelihood of a particular value of the latent variable. The posterior probability distribution over the latent variable

**s** is a contrast normalized (

**x** with mean intensity

*c*_{50} is an additive constant and *n* is the dimensionality of (e.g., number of pixels defining) each stimulus. Here, we assumed that the value of the additive constant is *c*_{50} = 0.0. The effect of the value of *c*_{50} has been studied previously (Burge & Geisler, 2014).
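The normalization equation itself is not reproduced in this text, so the following sketch assumes one common form rather than restating the authors' equation: the intensity stimulus is converted to Weber contrast and then divided by the square root of its squared norm plus *n* times *c*_{50} squared. The function name `contrast_normalize` and the example patch are hypothetical.

```python
import numpy as np

def contrast_normalize(intensity, c50=0.0):
    """Contrast-normalize an intensity stimulus (assumed form, see lead-in)."""
    intensity = np.asarray(intensity, dtype=float).ravel()
    mean_intensity = intensity.mean()
    x = (intensity - mean_intensity) / mean_intensity  # Weber contrast (assumed)
    n = x.size                                         # stimulus dimensionality
    return x / np.sqrt(x @ x + n * c50 ** 2)

patch = np.array([10.0, 12.0, 8.0, 14.0, 6.0])  # hypothetical intensities
s_no_constant = contrast_normalize(patch, c50=0.0)
s_with_constant = contrast_normalize(patch, c50=0.1)
```

Under this assumed form, setting `c50 = 0.0` gives every stimulus unit norm, and any positive constant gives a norm below one. A uniform patch has zero contrast energy, so it would need a positive constant to avoid division by zero.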

**R** to an arbitrary stimulus. When the class-conditional response distributions are Gaussian, as they are here, the log-likelihood of latent variable value *X*_{u} is quadratic in the encoding filter responses

*q* is the number of filters and where the weights are functions of the class-conditional mean and covariance for each value *X*_{u} of the latent variable (Burge & Geisler, 2014, 2015). Specifically,

*diag*(.) is a function that returns the diagonal of a matrix and **1** is a column vector of ones. [Note that in these equations *i* and *j* index different filters (see Figure 2), not different latent variables and stimuli, as they do elsewhere in this manuscript.] These equations (Equations 20–24) indicate that the log-likelihood of latent variable value *X*_{u} is obtained by pooling the squared (and linear) responses of each receptive field with weights determined by the mean **μ**_{u} and covariance Σ_{u} of the subunit responses to stimuli with latent variable *X*_{u}.

**w**_{34}(*X*) on the sum-squared response of filter 3 and filter 4 peak at 0°/s (see Figure 9A). This peak results from the fact that the filter 3 and filter 4 response covariance is highest at 0°/s (see Figure 5F; Equation 23). In contrast, very little information is carried by the class-conditional means; the mean filter responses to natural stimuli are always approximately zero. Hence, the weights on the linear subunit responses are approximately zero (see Equation 21, Figure 4A–C).
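The pooling rule described above can be checked numerically. The sketch below is not a transcription of Equations 20–24, which are not reproduced in this text. The weights are derived here by expanding the Gaussian exponent and rewriting each cross term through the identity R_i R_j = [(R_i + R_j)^2 − R_i^2 − R_j^2] / 2. The function names and example parameters are hypothetical.

```python
import numpy as np

def quadratic_weights(mu, cov):
    """Pooling weights for one latent variable level, from its mean and covariance."""
    A = np.linalg.inv(cov)
    w_lin = A @ mu                                   # weights on linear responses
    w_sq = -np.diag(A) + 0.5 * A @ np.ones(len(mu))  # weights on squared responses
    w_pair = -0.5 * A                                # weights on sum-squared responses (i < j)
    _, logdet = np.linalg.slogdet(2 * np.pi * cov)
    const = -0.5 * mu @ A @ mu - 0.5 * logdet
    return w_lin, w_sq, w_pair, const

def log_likelihood_quadratic(r, w_lin, w_sq, w_pair, const):
    """Log-likelihood obtained by pooling linear, squared, and sum-squared responses."""
    total = const + w_lin @ r + w_sq @ r ** 2
    q = len(r)
    for i in range(q):
        for j in range(i + 1, q):
            total += w_pair[i, j] * (r[i] + r[j]) ** 2
    return float(total)

mu = np.array([0.2, -0.1, 0.0])  # hypothetical class-conditional mean
cov = np.array([[1.0, 0.3, 0.0],
                [0.3, 2.0, -0.4],
                [0.0, -0.4, 0.5]])  # hypothetical class-conditional covariance
r = np.array([0.5, -1.0, 0.8])
ll_quad = log_likelihood_quadratic(r, *quadratic_weights(mu, cov))

# Direct evaluation of the Gaussian log-density for comparison.
d = r - mu
ll_direct = float(-0.5 * d @ np.linalg.solve(cov, d)
                  - 0.5 * np.linalg.slogdet(2 * np.pi * cov)[1])
```

When the class-conditional mean is zero, `w_lin` vanishes and only the squared and sum-squared terms contribute, which matches the observation that the linear weights are approximately zero for natural stimuli.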

*X*_{u} of the latent variable elicited the observed filter responses **R**. This latent variable value *X*_{u} would be the preferred stimulus of the likelihood neuron. We refer to this hypothetical neuron as an AMA–Gauss likelihood neuron (see Equation 20).

*X*_{k}

*y* is the neural response, *P*(.) is the noise model, *f*(.) is a nonlinearity, and

*X*_{u} is given by

**μ**_{u} and Σ_{u} are the class-conditional response mean and covariance and *ζ*_{u} is a constant. The noisy filter response vector **R** is given by the projection of the stimulus onto the filters **f** plus noise (Equations 4, 5). Hence, Equation 27 can be rewritten as

*q* matrix where *q* is the number of filters,

**f** and their responses to natural stimuli, conditional on latent variable *X*_{u}. Given a hypothesis about the functional purpose of a neuron's activity, AMA–Gauss could predict the parameters that the GQM would recover via response-triggered analyses.
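To make the link to a quadratic model in stimulus space concrete, the sketch below rewrites the Gaussian log-likelihood of the filter responses as a quadratic function of the stimulus. It is a sketch under assumptions: response noise is omitted (the responses are taken to be the noiseless projections of the stimulus onto the filters), the constant *ζ*_{u} defaults to zero, and the names `stimulus_space_quadratic`, `C`, `b`, and `a` are hypothetical rather than the GQM's own notation.

```python
import numpy as np

def stimulus_space_quadratic(f, mu, cov, zeta=0.0):
    """Quadratic, linear, and constant terms in stimulus space.

    f: (n, q) matrix whose columns are the filters; mu, cov: class-conditional
    response mean (q,) and covariance (q, q) for one latent variable level.
    """
    A = np.linalg.inv(cov)
    C = -0.5 * f @ A @ f.T           # (n, n) quadratic term, rank at most q
    b = f @ A @ mu                   # (n,) linear term
    a = -0.5 * mu @ A @ mu + zeta    # scalar constant
    return C, b, a

rng = np.random.default_rng(1)
n, q = 6, 2                          # hypothetical stimulus and filter counts
f = rng.normal(size=(n, q))
mu = np.array([0.1, -0.2])
cov = np.array([[1.0, 0.4], [0.4, 0.8]])
s = rng.normal(size=n)

C, b, a = stimulus_space_quadratic(f, mu, cov)
quad_in_stimulus = float(s @ C @ s + b @ s + a)

# Same quantity computed from the filter responses R = f^T s.
d = f.T @ s - mu
quad_in_response = float(-0.5 * d @ np.linalg.solve(cov, d))
```

The quadratic term is built from only *q* filters, so its rank cannot exceed *q*; this is the sense in which a small set of subunit filters determines the full stimulus-space quadratic form.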

*Journal of the Optical Society of America A*, 2 (2), 284–299.

*Visual Neuroscience*, 7 (6), 531–546.

*Vision Research*, 37 (23), 3327–3338.

*Proceedings of the National Academy of Sciences, USA*, 108 (40), 16849–16854.

*Proceedings of the IS&T/SPIE 47th annual meeting*. Proceedings of SPIE.

*Journal of Vision*, 14 (2): 1, 1–18, doi:10.1167/14.2.1.

*Nature Communications*, 6, 7900.

*PLoS Computational Biology*, 13 (2), e1005281.

*Journal of Vision*, 16 (13): 2, 1–25, doi:10.1167/16.13.2.

*Neural Computation*, 24 (4), 827–866.

*Nature Reviews Neuroscience*, 13 (1), 51–62.

*PLoS Computational Biology*, 8 (3), e1002405.

*Statistica Sinica*, 20 (1), 235–238.

*Journal of the American Statistical Association*, 104 (485), 197–208.

*Annual Review of Neuroscience*, 24 (1), 203–238.

*Trends in Cognitive Sciences*, 4 (3), 80–90.

*Neural Computation*, 24 (9), 2384–2421.

*Vision Research*, 32 (2), 203–218.

*Nature*, 415 (6870), 429–433.

*Journal of the Optical Society of America A, Optics and Image Science*, 4 (12), 2379–2394.

*Neural Computation*, 26 (10), 2103–2134.

*Neural Computation*, 21 (1), 239–271.

*Visual Neuroscience*, 14 (5), 897–919.

*Journal of Vision*, 9 (13): 17, 1–16, doi:10.1167/9.13.17.

*Nature Neuroscience*, 14 (7), 926–932.

*Visual Neuroscience*, 9 (2), 181–197.

*Journal of Educational Psychology*, 24 (6), 417–441.

*Journal of Neurophysiology*, 58 (6), 1233–1258.

*Journal of Neurophysiology*, 58 (6), 1187–1211.

*Neural Computation*, 25 (7), 1870–1890.

*Nature Neuroscience*, 5 (4), 356–363.

*IEEE Transactions on Pattern Analysis and Machine Intelligence*, 31 (4), 693–706.

*Nature Neuroscience*, 9 (11), 1432–1438.

*PLoS Computational Biology*, 9 (7), e1003143.

*Journal of Vision*, 14 (4): 10, 1–15, doi:10.1167/14.4.10.

*The Journal of Neuroscience*, 25 (43), 10049–10060.

*Current Opinion in Neurobiology*, 8 (4), 509–515.

*Science*, 249 (4972), 1037–1041.

*Journal of Neurophysiology*, 77 (6), 2879–2909.

*Nature*, 381 (6583), 607–609.

*Vision Research*, 37 (23), 3311–3325.

*Neural Computation*, 29, 2291–2319.

*Advances in Neural Information Processing Systems* (pp. 2454–2462). Red Hook, NY: Curran Associates.

*Vision Research*, 50 (2), 181–192.

*Technical University of Denmark*, 7, 15.

*Physical Review Letters*, 73 (6), 814–817.

*Neuron*, 46 (6), 945–956.

*Journal of Vision*, 6 (4): 13, 484–507, doi:10.1167/6.4.13.

*Proceedings of the National Academy of Sciences, USA*, 90 (22), 10749–10753.

*Annual Review of Neuroscience*, 24 (1), 1193–1216.

*Journal of Neuroscience*, 31 (22), 8295–8305.

*Journal of the Royal Statistical Society: Series B (Statistical Methodology)*, 61 (3), 611–622.

*The Journal of Neuroscience*, 35 (44), 14829–14841.

*IEEE Transactions on Image Processing*, 13 (4), 600–612.

*Nature Neuroscience*, 18 (10), 1509–1517.

*Journal of the Optical Society of America A*, 2 (7), 1087–1093.

*Advances in Neural Information Processing Systems* (pp. 793–801). Red Hook, NY: Curran Associates.

*N* stimuli, each stimulus is associated with some category *k* and an associated stimulus from that category *l*. Let us denote this pair (*k*, *l*) for the *i*^{th} sample point with (*k*_{i}, *l*_{i}). Then, assuming that the response distribution conditioned on the classes is Gaussian, the likelihood function can be written as

*Tr*(*a*) = *a*, *Tr*(**AB**) = *Tr*(**BA**), and *Tr*(**A**) = *Tr*(**A**^{T}) yields

*l*(**f**) stated in Equation 30 can therefore be found by combining Equations 34–37.

_{2}

**A**_{kl,u} = **s**_{kl} – **s**_{u}. The gradient of the posterior probability can then be evaluated using the following relation with the gradient of the logarithm of the posterior probability

*L*_{0} / KL-divergence cost function

**s** with three stimulus dimensions, from each of two categories *X*_{1} and *X*_{2}. (For comparison, in the main text the speed and disparity stimuli had 256 and 64 stimulus dimensions, respectively.) The simulated stimuli are shown in Figure A1A–C. The first two dimensions of the simulated stimuli contain the information for discriminating the categories; the third stimulus dimension is useless. Specifically, the stimulus distributions are given by

*R* is a 45° rotation matrix that operates on the first two dimensions. Stimuli in both categories are therefore distributed as mixtures of Gaussians with identical first- and second-order stimulus statistics (i.e., same mean and covariance). Thus, all information for discriminating the categories exists in higher-order statistical moments of the first two stimulus dimensions. Hence, because AMA–Gauss is sensitive only to class-conditional mean and covariance differences, it will be blind to the stimulus differences that define the categories.

*X*_{u} using Equation 28 can be written as