Open Access
Article  |   October 2017
Linking normative models of natural tasks to descriptive models of neural response
Author Affiliations
  • Priyank Jaini
    Cheriton School of Computer Science, Waterloo, Ontario, Canada
    Department of Psychology, University of Pennsylvania, Philadelphia, PA, USA
    http://cs.uwaterloo.ca/pjaini/home/
  • Johannes Burge
    Department of Psychology, University of Pennsylvania, Philadelphia, PA, USA
    Neuroscience Graduate Group, University of Pennsylvania, Philadelphia, PA, USA
    Bioengineering Graduate Group, University of Pennsylvania, Philadelphia, PA, USA
    jburge@sas.upenn.edu
    http://burgelab.psych.upenn.edu/
Journal of Vision October 2017, Vol.17, 16. doi:10.1167/17.12.16
Abstract

Understanding how nervous systems exploit task-relevant properties of sensory stimuli to perform natural tasks is fundamental to the study of perceptual systems. However, there are few formal methods for determining which stimulus properties are most useful for a given natural task. As a consequence, it is difficult to develop principled models for how to compute task-relevant latent variables from natural signals, and it is difficult to evaluate descriptive models fit to neural responses. Accuracy maximization analysis (AMA) is a recently developed Bayesian method for finding the optimal task-specific filters (receptive fields). Here, we introduce AMA–Gauss, a new, faster form of AMA that incorporates the assumption that the class-conditional filter responses are Gaussian distributed. Then, we use AMA–Gauss to show that its assumptions are justified for two fundamental visual tasks: retinal speed estimation and binocular disparity estimation. Next, we show that AMA–Gauss has striking formal similarities to popular quadratic models of neural response: the energy model and the generalized quadratic model (GQM). Together, these developments deepen our understanding of why the energy model of neural response has proven useful, improve our ability to evaluate results from subunit model fits to neural data, and should help accelerate psychophysics and neuroscience research with natural stimuli.

Introduction
Perceptual systems capture and process sensory stimuli to obtain information about behaviorally relevant properties of the environment. Characterizing the features of sensory stimuli and the processing rules that nervous systems use is central to the study of perceptual systems. Most sensory stimuli are high-dimensional, but only a small set of stimulus features are relevant for any particular task. Thus, perceptual and neural processing in particular tasks is driven by sets of features that occupy a lower dimensional space (i.e., can be described more compactly) than the stimuli. These considerations have motivated perception and neuroscience researchers to develop methods for dimensionality reduction that characterize the statistical properties of proximal stimuli, that describe the responses of neurons to those stimuli, and that specify how those responses could be decoded (Bell & Sejnowski, 1997; Cook & Forzani, 2009; Cook, Forzani, & Yao, 2010; Hotelling, 1933; Lewicki, 2002; McFarland, Cui, & Butts, 2013; Olshausen & Field, 1996; Pagan, Simoncelli, & Rust, 2016; Park, Archer, Priebe, & Pillow, 2013; Ruderman & Bialek, 1994; Rust, Schwartz, Movshon, & Simoncelli, 2005; Schwartz, Pillow, Rust, & Simoncelli, 2006; Tipping & Bishop, 1999; Vintch, Movshon, & Simoncelli, 2015). However, many of these methods are task-independent; that is, they do not explicitly consider the sensory, perceptual, or behavioral tasks for which the encoded information will be used. Empirical studies in psychophysics and neuroscience often focus on the behavioral limits and neurophysiological underpinnings of performance in specific tasks. Thus, there is a partial disconnect between task-independent theories of encoding and common methodological practices in psychophysics, and sensory and systems neuroscience. 
Task-specific normative models prescribe how best to perform a particular task. Task-specific normative models are useful because they provide principled hypotheses about (1) the stimulus features that nervous systems should encode and (2) the processing rules that nervous systems should use to decode the encoded information. Many normative models in widespread use are not directed at specific tasks. Methods for fitting neural response cannot generally be interpreted with respect to specific tasks. Accuracy maximization analysis (AMA) is a Bayesian method for finding the stimulus features that are most useful for specific tasks (Burge & Jaini, 2017; Geisler, Najemnik, & Ing, 2009). In conjunction with carefully calibrated natural stimulus databases, AMA has contributed to the development of normative models of several fundamental tasks in early- and mid-level vision (Burge & Geisler, 2011; Burge & Geisler, 2012; Burge & Geisler, 2014; Burge & Geisler, 2015), by determining the encoding filters (receptive fields) that support optimal performance in each task. These task-specific normative models have, in turn, predicted major aspects of primate neurophysiology and human psychophysical performance with natural and artificial stimuli (Burge & Geisler, 2014, 2015). 
The primary theoretical contribution of this manuscript is to establish formal links between normative models of specific tasks and popular descriptive models of neural response (Figure 1). To do so, we first develop a new form of AMA called AMA–Gauss, which incorporates the assumption that the latent-variable-conditioned filter responses are Gaussian distributed. Then, we use AMA–Gauss to find the filters (receptive fields) and pooling rules that are optimal with natural stimuli for two fundamental tasks: estimating the speed of retinal image motion and estimating binocular disparity (Burge & Geisler, 2014, 2015). For these two tasks, we find that the critical assumption of AMA–Gauss is justified: the optimal filter responses to natural stimuli, conditioned on the latent variable (i.e., speed or disparity), are indeed Gaussian distributed. Then, we show that this empirical finding provides a normative explanation for why neurons that select for motion and disparity have been productively modeled with energy-model-like (i.e., quadratic) computations (Cumming & DeAngelis, 2001; DeAngelis, 2000; Ohzawa, 1998; Ohzawa, DeAngelis, & Freeman, 1990; Ohzawa, DeAngelis, & Freeman, 1997). Finally, we recognize and make explicit the formal similarities between AMA–Gauss and the generalized quadratic model (GQM) (Park et al., 2013; Wu, Park, & Pillow, 2015), a recently developed method for neural systems identification. These advances may help bridge the gap between empirical studies of psychophysical and neurophysiological tasks, methods for neural systems identification, and task-specific normative modeling (Figure 1). 
Figure 1
 
Linking scientific subfields. Perception science benefits when links are drawn between psychophysical and neuroscience studies of particular tasks, task-agnostic statistical procedures that fit models to data, and task-specific normative methods that determine which models are best. The current work develops formal links between the energy model for describing neural response, the generalized quadratic model (GQM) for fitting neural response, and AMA–Gauss for determining the neural response properties that best serve a particular task.
In addition to these theoretical contributions, the development of AMA–Gauss represents a technical advance. The major drawback of AMA is its computational expense. Its compute-time for filter learning is quadratic in the number of stimuli in the training set, rendering the method impractical for large-scale problems without specialized computing resources. We demonstrate, both analytically and empirically, that AMA–Gauss reduces the compute-time for filter learning from quadratic to linear in the number of stimuli. Thus, for tasks for which the critical assumption of AMA–Gauss is justified, AMA–Gauss can be of great practical benefit. 
Background
Energy model
Energy models have been remarkably influential in visual neuroscience. The standard energy model posits two Gabor-shaped subunit receptive fields, the responses of which are squared and then summed (Figure 2A). These computations yield decreased sensitivity to the local position of stimulus features (i.e., spatial phase) and increased sensitivity to the task-relevant latent variable. Energy models have been widely used to describe the computations of neurons involved in coding retinal image motion and binocular disparity (Adelson & Bergen, 1985; Cumming & DeAngelis, 2001; DeAngelis, 2000). However, the motion–energy and disparity–energy computations are primarily descriptive models of a neuron's response properties. The energy model does not make explicit how neural responses should be decoded into estimates. 
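To make the standard energy-model computation concrete, here is a minimal sketch: two quadrature (90°-phase-shifted) Gabor subunit filters, whose responses are squared and summed. The particular Gabor parameters and stimuli are illustrative choices, not those of any fitted neuron; the sketch simply demonstrates the phase insensitivity that the squaring-and-summing computation produces.

```python
import numpy as np

def gabor(n, freq, phase):
    """1-D Gabor: Gaussian-windowed sinusoid (illustrative parameters only)."""
    x = np.linspace(-1.0, 1.0, n)
    return np.exp(-x**2 / 0.18) * np.cos(2 * np.pi * freq * x + phase)

def energy_response(stimulus, f1, f2):
    """Standard energy model: sum of the squared subunit filter responses."""
    return (f1 @ stimulus) ** 2 + (f2 @ stimulus) ** 2

n = 64
f_even = gabor(n, freq=4.0, phase=0.0)        # even-symmetric subunit
f_odd  = gabor(n, freq=4.0, phase=np.pi / 2)  # odd-symmetric (quadrature) subunit

# Phase insensitivity: shifting the grating's phase leaves the energy
# response nearly unchanged, although each subunit response changes a lot.
x = np.linspace(-1.0, 1.0, n)
s1 = np.cos(2 * np.pi * 4.0 * x)
s2 = np.cos(2 * np.pi * 4.0 * x + 0.7)        # same grating, different phase
R1 = energy_response(s1, f_even, f_odd)
R2 = energy_response(s2, f_even, f_odd)
```

Because the even and odd subunits form an approximate quadrature pair, the sum of their squared responses depends on the stimulus contrast and frequency content but not (to close approximation) on its spatial phase.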
Figure 2
 
Computations of an energy model neuron, a GQM model neuron, and an AMA–Gauss likelihood neuron. All three have quadratic computations at their core. The energy model and the GQM describe the computations that neurons perform. AMA–Gauss prescribes the computations that neurons should perform to optimize performance in a specific task. (A) The standard energy model assumes two Gabor-shaped orthogonal subunit filters (receptive fields) fgbr to account for a neuron's response. The response of an energy model neuron RE is obtained by adding the squared responses of the filters. (B) The GQM fits multiple arbitrarily-shaped orthogonal subunit receptive fields f that best account for a neuron's response. The response of a GQM model neuron RGQM is obtained by pooling the squared (and linear, not shown) responses of the subunit filters via a weighted sum, and passing the sum through an output nonlinearity. (C) AMA–Gauss finds the optimal subunit filters fopt and quadratic pooling rules for a specific task. Unlike standard forms of the energy model and the GQM, AMA–Gauss incorporates contrast normalization and finds subunit filters that are not necessarily orthogonal. 
The response \(R_u^L\) of an AMA–Gauss likelihood neuron represents the likelihood of latent variable Xu. The likelihood is obtained by pooling the squared (and linear, not shown) subunit filter responses, indexed by i and j, via a weighted sum (Equation 20).
Under what circumstances would energy-model-like computations be optimal? Energy-model-like computations are optimal if quadratic pooling is necessary for determining the likelihood of the task-relevant latent variable. We show in the following material that for retinal speed and binocular disparity estimation, two tasks classically associated with the energy model, quadratic pooling is indeed necessary to optimally decode the task-relevant latent variable (see Results). Therefore energy-model-like computations are optimal for these tasks with natural stimuli. AMA–Gauss is specifically designed to find the receptive fields and pooling rules that optimize performance under these conditions. It is thus likely to help accelerate the development of normative models of other tasks for which the energy model has provided a useful description. 
Generalized quadratic model (GQM)
The standard energy model assumes that the responses of certain neurons can be accounted for by two Gabor-shaped subunit receptive fields. Real neurons are not constrained to have only two subunit receptive fields, nor are their shapes constrained to be Gabor-shaped. The generalized quadratic model (GQM) fits multiple arbitrarily-shaped subunit filters and quadratic pooling rules that best account for a neuron's response (Figure 2B; Park et al., 2013). The GQM is a specific example of a large class of models designed for neural systems identification, collectively known as “subunit models.” The spike-triggered average (STA), spike-triggered covariance (STC), and the generalized linear model (GLM) are popular examples of this class of models. The goal of these models is to provide a computational level description of a neuron's computations that can predict its responses to arbitrary stimuli. 
Unfortunately, a tight description of a neuron's computations does not necessarily provide insight about how (or whether) that neuron and its computations subserve a specific task; after a subunit model has been fit, the purpose of the neuron's computations is often unclear. Thus, although methods for neural systems identification are essential for determining what the components of nervous systems do, they are unlikely to determine why they do what they do. One way to address this issue is to develop normative frameworks (1) that determine the computations that are optimal for particular tasks and (2) that share the same or similar functional forms as popular methods for describing neural response. 
AMA–Gauss is a normative method that is designed to find the filters (receptive fields) and quadratic pooling rules that are optimal for specific sensory-perceptual tasks (Figure 2C; see Methods). AMA–Gauss has a functional form that is closely related to the energy model and the GQM, but it has a different aim. Rather than describing what a neuron does, it prescribes what neurons should do. In fact, given a hypothesis about the function of a particular neuron, AMA–Gauss can predict the subunit filters and pooling rules that will be recovered by the GQM. The development of closely related normative models and methods for neural systems identification is likely to enhance our ability to interpret fits to neural data and accelerate progress in psychophysical and neuroscientific research. 
Methods
This section formally develops AMA–Gauss. To provide context for this technical contribution, we first review the setup and main equations for AMA (Geisler et al., 2009). Then, we derive the main equations for AMA–Gauss, provide a geometric intuition for how it works, and discuss best practices for its use. Readers who are more interested in the scientific implications, and less interested in the mathematical formalisms, can skip ahead to the Results. 
Accuracy maximization analysis
The goal of AMA is to find the filters (receptive fields) that extract the most useful stimulus features for a particular task. Consistent with real biological systems, AMA places no constraints on the orthogonality of its filters, and its filter responses are corrupted by noise. AMA searches for the optimal filters with a closed-form expression for the cost that relies on a Bayes optimal decoder. The filters are constrained to have unit magnitude (||f|| = 1.0). The expression for the cost requires the specification of five factors (Figure 3A): (1) a well-defined task (i.e., a latent variable to estimate from high-dimensional stimuli), (2) a labeled training set of stimuli, (3) a set of encoding filters, (4) a response noise model, and (5) a cost function. The training set specifies the joint probability distribution \({\cal P}(X,{\bf{s}})\) between the latent variable X and the stimuli s (Figure 3B) and implicitly defines the prior \({\cal P}(X) = \sum\nolimits_{\bf{s}} {\cal P} (X,{\bf{s}})\) over the latent variable (see Discussion). If the training set is representative, results will generalize well to stimuli outside the training set. 
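The prior implied by a labeled training set is simply the empirical frequency of each latent-variable level. A minimal sketch, using hypothetical toy labels (the variable names are ours, not the paper's):

```python
import numpy as np

# Hypothetical labeled training set: one latent-variable level per stimulus.
labels = np.array([0, 0, 1, 1, 1, 2])     # level index X_u of each stimulus

# The implied prior P(X_u) = N_u / N is the empirical frequency of each level.
levels, counts = np.unique(labels, return_counts=True)
prior = counts / labels.size
```

With these labels the prior is [1/3, 1/2, 1/6]: level 1 appears in half the training stimuli, so it receives half the prior mass.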
Figure 3
 
Accuracy maximization analysis. (A) Factors determining the optimal decoder used by AMA during filter learning. (B) Steps for finding optimal task-specific filters via AMA.
For any particular filter set, the matched Bayes optimal decoder provides the cost by computing the posterior probability \({\cal P}(X|{\bf{R}})\) over the latent variable, reading out the optimal estimate from the posterior, and then assigning a cost to the error. The steps for finding the optimal task-specific filters are: (1) select a particular stimulus skl from the labeled training set, (2) obtain a set of noisy filter responses R(k, l) from a particular (possibly nonoptimal) set of filters, (3) use the optimal nonlinear decoder g(.) to obtain the optimal estimate \({\hat X^{opt}}\) and its expected cost \({\bar C_{kl}}\), (4) repeat for each stimulus and compute the average cost across all stimuli in the training set, (5) update the filters to reduce the cost, and (6) repeat until the average cost is minimized. The optimal task-specific filters fopt are those that minimize the cost (Figure 3B). 
Bayes optimal decoder and filter response model
The Bayes optimal decoder gives a closed form expression for the cost for any filter or set of filters, given the training stimuli. The posterior probability of latent variable Xu given the noisy filter responses R(k, l) to stimulus skl is given by Bayes' rule  
\begin{equation}\tag{1}{\cal P}\left({X_u}\left| {{\bf{R}}\left(k,l\right)} \right.\right) = {{{\cal P}\left({\bf{R}}\left(k,l\right)|{X_u}\right){\cal P}\left({X_u}\right)} \over {\sum\nolimits_{i = 1}^{{N_{lvl}}} {\cal P} \left({\bf{R}}\left(k,l\right)|{X_i}\right){\cal P}\left({X_i}\right)}}\end{equation}
where Nlvl is the number of latent variable levels, k indexes the latent variable value, and l indexes the stimuli having latent variable value Xk. The conditional distribution of noisy responses given the latent variable is  
\begin{equation}\tag{2}{\cal P}\left({\bf{R}}|{X_u}\right) = \sum\limits_{v = 1}^{{N_u}} {\cal P} \left({\bf{R}}|{{\bf{s}}_{uv}}\right){\cal P}\left({{\bf{s}}_{uv}}|{X_u}\right)\end{equation}
where Nu is the number of stimuli having latent variable level Xu, and v indexes training stimuli having that latent variable value. Conveniently, \({\cal P}({{\bf{s}}_{uv}}|{X_u})\) and \({\cal P}({X_u})\) are determined by the training set; \({\cal P}({{\bf{s}}_{uv}}|{X_u}) = {1 \over {{N_u}}}\) is the probability of particular stimulus v with latent variable Xu given that there are Nu such stimuli, and \({\cal P}({X_u}) = {{{N_u}} \over N}\) is the fraction of all stimuli having latent variable Xu. Therefore, Equation 1 reduces to  
\begin{equation}\tag{3}{\cal P}\left({X_u}|{\bf{R}}\left(k,l\right)\right) = {{\sum\nolimits_{v = 1}^{{N_u}} {\cal P} \left({\bf{R}}\left(k,l\right)|{{\bf{s}}_{uv}}\right)} \over {\sum\nolimits_{i = 1}^{{N_{lvl}}} {\sum\nolimits_{j = 1}^{{N_i}} {\cal P} } \left({\bf{R}}\left(k,l\right)|{{\bf{s}}_{ij}}\right)}}\end{equation}
Equation 3 indicates that the posterior probability is given by the sum of the within-level stimulus likelihoods, normalized by the sum of all stimulus likelihoods.  
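Equation 3 can be sketched directly: the posterior for each latent-variable level is the sum of the stimulus likelihoods within that level, normalized by the sum over all stimuli. The likelihood values below are made up for illustration; in AMA they would come from the response model (Equations 4–7).

```python
import numpy as np

def posterior_over_levels(likelihoods, level_of_stim):
    """Equation 3: posterior over latent-variable levels from stimulus likelihoods.

    likelihoods   : P(R | s_uv) for every training stimulus (1-D array)
    level_of_stim : level index u of each training stimulus
    """
    n_lvl = level_of_stim.max() + 1
    # Sum the likelihoods within each level, then normalize by the grand total.
    within = np.array([likelihoods[level_of_stim == u].sum() for u in range(n_lvl)])
    return within / within.sum()

# Toy example with made-up likelihood values (illustration only).
lik = np.array([0.8, 0.2, 0.1, 0.05])   # P(R | s) for four training stimuli
lvl = np.array([0, 0, 1, 1])            # stimuli 0,1 have level X0; 2,3 have X1
post = posterior_over_levels(lik, lvl)
```

Here the two stimuli at level X0 jointly carry most of the likelihood, so the posterior concentrates on X0.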
Our aim is to understand task-specific information processing in biological systems. Thus, the response noise model should be consistent with the properties of biological encoders. AMA uses scaled additive (e.g., Poisson-like) Gaussian noise, a broadly used model of neural noise in early visual cortex (Geisler & Albrecht, 1997). Equations 4–7 define the response model, and specify the distribution of noisy filter responses \({\cal P}({\bf{R}}|{{\bf{s}}_{uv}})\) to each stimulus. For an individual filter ft from the set of filters \({\bf{f}} = [{{\bf{f}}_1},{{\bf{f}}_2},...,{{\bf{f}}_q}]\) (where q is the number of filters), the mean response rt, noisy response Rt, and noise variance \(\sigma _t^2\) to stimulus suv having latent variable value Xu are  
\begin{equation}\tag{4}{r_{uv,t}} = {\bf{f}}_t^T{{\bf{s}}_{uv}}\end{equation}
 
\begin{equation}\tag{5}{R_{uv,t}} = {r_{uv,t}} + {\eta _t}\end{equation}
 
\begin{equation}\tag{6}{\eta _t}\sim {\cal N}\left(0,\sigma _{uv,t}^2\right)\end{equation}
 
\begin{equation}\tag{7}\sigma _{uv,t}^2 = \alpha |{r_{uv,t}}| + \sigma _0^2\end{equation}
where ηt is a noise sample, α is the Fano factor, and \(\sigma _0^2\) is the baseline noise variance. The proximal stimulus \({\bf{s}} = {{{\bf{x}} - \bar{\bf{x}}} \over {||{\bf{x}} - \bar{\bf{x}}||}}\) is contrast-normalized, consistent with standard models (Albrecht & Geisler, 1991; Heeger, 1992), where x is a (possibly noise-corrupted) intensity stimulus. If q filters are considered simultaneously, the response distribution \({\cal P}({\bf{R}}|{\bf{s}})\) and the variables in Equations 4–7 become q-dimensional: mean response vector \({\bf{r}} = [{r_1},{r_2},...,{r_q}]\), noisy response vector \({\bf{R}} = [{R_1},{R_2},...,{R_q}]\), and response noise covariance matrix Λ.  
The posterior probability distribution over the latent variable given the noisy filter responses to any stimulus in the training set is fully specified by Equations 3–7. The next step is to define the cost associated with a noisy response to an individual stimulus. The cost is given by  
\begin{equation}\tag{8}{C_{kl}} = \sum\limits_X \left[ \gamma \left(\hat X_{kl}^{opt},X\right){\cal P}\left(X|{\bf{R}}\left(k,l\right)\right)\right]\end{equation}
where γ(.) is an arbitrary cost function and \(\hat X_{kl}^{opt}\) is the optimal estimate associated with noisy response R(k, l). The overall cost for a set of filters is the expected cost for each stimulus averaged over all stimuli  
\begin{equation}\tag{9}\bar C = {1 \over N}\sum\limits_{k,l}^N {{{\Bbb E}_{{\bf{R}}(k,l)}}} \left[{C_{kl}}\right]\end{equation}
The goal of AMA is to obtain the filters f that minimize the overall cost  
\begin{equation}\tag{10}{{\bf{f}}^{opt}} = \mathop {{\mathop{\rm argmin}\nolimits} }\limits_{\bf{f}} \>\bar C\end{equation}
where fopt are the optimal filters.  
A single evaluation of the posterior probability distribution (Equation 3) for each stimulus in the training set requires \({\cal O}({N^2}{N_{lvl}})\) operations, where N is the total number of stimuli and Nlvl is the number of latent variable levels in the training set. As noted earlier, this compute time makes AMA impractical for large-scale problems without specialized computing resources. 
There are at least two methods for achieving significant computational savings in optimization problems: employing models with strong parametric assumptions, and employing stochastic gradient descent routines. Both methods have drawbacks. Models with strong parametric assumptions are appropriate only for cases in which the assumptions approximately hold. Stochastic gradient descent routines are noisy and may not converge to the optimal filters. We have previously developed AMA–SGD, a stochastic gradient descent routine for AMA (Burge & Jaini, 2017). Here, we develop AMA–Gauss, a model with strong parametric assumptions. 
AMA–Gauss
In this section, we first introduce AMA–Gauss and highlight its advantages over AMA. We then provide expressions for the AMA–Gauss likelihood function, the L2 and L0 cost functions, and their gradients. We believe this is a valuable step toward making AMA a more practical tool in vision research. 
AMA–Gauss: Class-conditional Gaussian assumption
AMA–Gauss is a version of AMA that makes the parametric assumption that the filter responses are Gaussian distributed when they are conditioned on a particular value of the latent variable  
\begin{equation}\tag{11}{\cal P}\left({\bf{R}}|{X_u}\right) = {\cal N}\left({\bf{R}};{\mu _u},{\Sigma _u}\right)\end{equation}
where R are responses to stimuli having latent variable level Xu,  
\begin{equation}\tag{12}{\mu _u} = {1 \over {{N_u}}}\sum\limits_{v = 1}^{{N_u}} {{{\bf{f}}^T}} {{\bf{s}}_{uv}} = {{\bf{f}}^T}{\bar{\bf{s}}_u}\end{equation}
is the class-conditional mean of the noisy filter responses and  
\begin{equation}\tag{13}{\Sigma _u} = {1 \over {{N_u}}}\left[\sum\limits_{v = 1}^{{N_u}} {\left( {\left({{\bf{f}}^T}{{\bf{s}}_{uv}} - {{\bf{f}}^T}{\bar{\bf{s}}_u}\right){{\left({{\bf{f}}^T}{{\bf{s}}_{uv}} - {{\bf{f}}^T}{\bar{\bf{s}}_u}\right)}^T}} \right)} \right] + \Lambda \end{equation}
is the class-conditional covariance of the noisy filter responses. The first term in Equation 13 is the class-conditional covariance of the expected filter responses. The second term in Equation 13, Λ, is the covariance matrix of the filter response noise \(\eta \sim {\cal N}({\bf{0}},\Lambda )\). There are two major reasons for making the Gaussian assumption. First, if the response distributions are Gaussian, then AMA–Gauss will return the same filters as AMA while simultaneously providing huge savings in compute time. Second, the assumption is justified for at least two fundamental visual tasks in early vision (see Results; Burge & Geisler, 2014, 2015). With time, we speculate that similar statements will prove justified for other sensory-perceptual tasks.  
Under the AMA–Gauss assumption, the posterior probability (Equation 1) of latent variable Xu is  
\begin{equation}\tag{14}{\cal P}\left({X_u}|{\bf{R}}\left(k,l\right)\right) = {{{\cal N}\left({\bf{R}}\left(k,l\right);{\mu _u},{\Sigma _u}\right)} \over {\sum\nolimits_{i = 1}^{{N_{lvl}}} {\cal N} \left({\bf{R}}\left(k,l\right);{\mu _i},{\Sigma _i}\right)}}\end{equation}
where Nlvl is the number of latent variable levels. The AMA–Gauss posterior (Equation 14) has a simpler form than the AMA posterior (Equation 3). Hence, whereas a single evaluation of the AMA posterior probability distribution requires \({\cal O}({N^2}{N_{lvl}})\) operations (Equation 3), the AMA–Gauss posterior requires only \({\cal O}(N{N_{lvl}})\) operations, where N is the number of stimuli in the training set (see Results). This reduction in compute time substantially improves the practicality of AMA when the Gaussian assumption is justified. Even if the Gaussian assumption is not justified, AMA–Gauss is guaranteed to make the best possible use of first- and second-order conditional response statistics, and could thus provide a decent initialization at low computational cost.  
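Under the Gaussian assumption, evaluating the posterior of Equation 14 reduces to evaluating Nlvl Gaussian densities per stimulus, which is what yields the \({\cal O}(N{N_{lvl}})\) cost. A sketch (numpy/scipy; a flat prior over levels is assumed, as in the training sets described in Results):

```python
import numpy as np
from scipy.stats import multivariate_normal

def ama_gauss_posterior(R, mus, Sigmas):
    """Eq. 14: posterior over latent-variable levels given noisy response R.

    mus, Sigmas : per-level class-conditional moments (Eqs. 12-13).
    """
    lik = np.array([multivariate_normal.pdf(R, mean=m, cov=S)
                    for m, S in zip(mus, Sigmas)])
    return lik / lik.sum()   # normalize the likelihoods (flat prior)
```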
AMA–Gauss: Derivations of the likelihood function, costs, and gradients
Closed-form solutions for the optimal AMA–Gauss (and AMA) filters are not available. Here, we provide expressions for the AMA–Gauss likelihood function, L2 cost, L0 cost, and their gradients. 
The maximum likelihood AMA–Gauss encoding filters \({{\bf{f}}_{\cal L}}\) are those that simultaneously maximize the likelihood of the correct latent variable Xk across all stimuli in the training set. Stimuli having latent variable value Xk are indexed by l, and the ith stimulus in the training set is denoted (ki, li). The likelihood function of the AMA–Gauss filters is  
\begin{equation}\tag{15}{\cal L}\left({\bf{f}}\right) = \prod\limits_{i = 1}^N \left[{\left(2\pi \right)^{ - {d \over 2}}}|{\Sigma _{{k_i}}}{|^{ - {1 \over 2}}} \exp \left[ - {1 \over 2}{\left({\bf{R}}\left({k_i},{l_i}\right) - {\mu _{{k_i}}}\right)^T}\Sigma _{{k_i}}^{ - 1}\left({\bf{R}}\left({k_i},{l_i}\right) - {\mu _{{k_i}}}\right)\right]\right] \end{equation}
The maximum likelihood filters can be determined by maximizing the likelihood function or, equivalently, minimizing the negative log-likelihood function  
\begin{equation}{{\bf{f}}_{\cal L}} = \mathop {{\rm{argmin}}}\limits_{\bf{f}} \>\left[ - \log {\cal L}\left({\bf{f}}\right)\right]\end{equation}
In practice, the expected negative log-likelihood is easier to minimize. Complete derivations of the likelihood function, the expected log-likelihood function, and closed form expressions for the associated gradients are provided in Appendix A. These expressions can be used to estimate the maximum-likelihood filters via gradient descent.  
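To make the recipe concrete, the following sketch learns a single filter by minimizing a noise-free-response approximation to the negative log-likelihood with a generic numerical optimizer on synthetic data. In a serious implementation, the closed-form gradients of Appendix A would replace the numerical ones; all names and data here are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(f_flat, S, labels, noise_var=0.1):
    """Negative log-likelihood of a single filter (q = 1; cf. Eq. 15).

    Uses noise-free responses plus an additive noise variance; the
    class-conditional moments (Eqs. 12-13) are recomputed from the
    current filter at every evaluation.
    """
    f = f_flat / np.linalg.norm(f_flat)   # keep the filter unit-norm
    r = S @ f                             # expected responses f^T s
    nll = 0.0
    for u in np.unique(labels):
        ru = r[labels == u]
        mu = ru.mean()
        var = ru.var() + noise_var        # class variance + response noise
        nll += 0.5 * np.sum(np.log(2 * np.pi * var) + (ru - mu) ** 2 / var)
    return nll

# usage: learn a filter on synthetic two-class "stimuli"
rng = np.random.default_rng(0)
S = rng.normal(size=(200, 8))
labels = np.repeat([0, 1], 100)
S[labels == 1, 0] += 2.0                  # classes differ along dimension 0
res = minimize(neg_log_likelihood, rng.normal(size=8), args=(S, labels))
```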
Next, we derive the AMA–Gauss cost for the two popular cost functions for which the minimum mean squared error (MMSE) and maximum a posteriori (MAP) estimates are respectively optimal: the L2 and L0 costs. The cost function specifies the penalty assigned to different types of error. For the L2 (i.e., squared-error) cost function, the expected cost for each stimulus skl (Equation 9) is  
\begin{equation}\tag{16}{\bar C_{kl}} = {{\Bbb E}_{{\bf{R}}\left(k,l\right)}}\left[{\left(\hat X_{kl}^{opt} - {X_k}\right)^2}\right]\end{equation}
where the optimal estimate \(\hat X_{kl}^{opt} = \sum\nolimits_{u = 1}^{{N_{lvl}}} {{X_u}} {\cal P}({X_u}|{\bf{R}}(k,l))\) is the mean of the posterior.  
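For a discrete set of latent-variable levels, the posterior-mean estimate and its squared error (cf. Equation 16) can be sketched as follows (numpy; function names are ours):

```python
import numpy as np

def posterior_mean_estimate(posterior, X_levels):
    """Optimal estimate under the L2 cost: the posterior mean."""
    return float(np.dot(posterior, X_levels))

def l2_cost(posterior, X_levels, X_true):
    """Squared error of the optimal estimate for one noisy response."""
    return (posterior_mean_estimate(posterior, X_levels) - X_true) ** 2
```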
For the L0 (i.e., 0-1) cost function, the expected cost across all stimuli is closely related to the KL-divergence between the observed posterior and an idealized posterior with all its mass at the correct latent variable Xk; in both cases, cost is determined only by the posterior probability mass at the correct level of the latent variable (Burge & Jaini, 2017; Geisler et al., 2009). Here, the expected KL-divergence per stimulus is equal to the negative log-posterior probability at the correct level (Geisler et al., 2009)  
\begin{equation}\tag{17}{\bar C_{kl}} = {{\Bbb E}_{{\bf{R}}(k,l)}}\left[ - \log {\cal P}\left({X_k}|{\bf{R}}\left(k,l\right)\right)\right]\end{equation}
 
In a slight abuse of terminology, we refer to this divergence as the L0 or KL-divergence cost. 
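The L0 cost of Equation 17 for a single noisy response is just the negative log of the posterior mass at the correct level; a one-line sketch (numpy):

```python
import numpy as np

def l0_cost(posterior, k):
    """Eq. 17 for one noisy response: negative log posterior
    probability at the correct latent-variable level (index k)."""
    return float(-np.log(posterior[k]))
```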
The gradient of the total expected cost across all stimuli can be evaluated by calculating the gradient of the cost for each stimulus \({\nabla _{\bf{f}}}{\bar C_{kl}}\) (see Equation 9). Hence, the gradient of the total expected cost is  
\begin{equation}\tag{18}{\nabla _{\bf{f}}}\bar C = {1 \over N}\sum\limits_{k,l}^N {{\nabla _{\bf{f}}}} {\bar C_{kl}}\end{equation}
The gradient of the cost for each stimulus can be evaluated by calculating the gradient of the posterior probability. Complete derivations of the cost and the gradient of the cost for the L2 and L0 cost functions are given in Appendix B and Appendix C, respectively.  
Cost is minimized when responses to stimuli having different latent variable values overlap as little as possible. The cost functions (i.e., max-likelihood, L0 cost, L2 cost) exert pressure on the filters to produce class-conditional response distributions that are as different as possible given the constraints imposed by the stimuli. Hence, the optimal filters will (1) maximize the differences between the class-conditional means or covariances and (2) minimize the generalized variance of each class-conditional response distribution. (Generalized variance is a measure of overall scatter; it corresponds to the squared volume of the covariance ellipsoid and is given by the determinant of the covariance matrix.) 
AMA–Gauss: Geometric intuition
Figure 4 provides a geometric intuition for the relationship between the filter response distributions, the likelihood, and the posterior probability distribution for two simple hypothetical cases. Both cases have three latent variable values. In one case, the information about the latent variable is carried by the class-conditional mean (Figure 4A–C). In the other case, the information about the latent variable is carried by the class-conditional covariance (Figure 4D–F). In all cases, the class-conditional responses to stimuli having the same latent variable value are Gaussian distributed. With a single filter, the response distributions are one-dimensional (Figure 4A, D). For any observed noisy response R, the likelihood of a particular level of the latent variable Xu is found by evaluating its response distribution at the observed response (blue dot; Figure 4A, D). The posterior probability of latent variable Xu is obtained by normalizing with the sum of the likelihoods (blue, red, and green dots; Figure 4B, E). With two filters, the response distributions are two-dimensional (red, blue, and green ellipses with corresponding marginals; Figure 4C, F). The second filter will increase the posterior probability mass at the correct value of the latent variable (not shown) because the second filter selects for useful stimulus features that the first filter does not. These hypothetical cases illustrate why cost is minimized when mean or covariance differences are maximized between classes and generalized variance is minimized within classes. The filters that make the response distributions as different as possible make it as easy as possible to decode the latent variable. 
Figure 4
 
Relationship between conditional filter response distributions, likelihood, and posterior probability. Two hypothetical cases are considered, each with three latent variable values. (A) One-dimensional (i.e., single filter) Gaussian conditional response distributions, when information about the latent variable is carried only by the class-conditional mean; distribution means, but not variances, change with the latent variable. The blue distribution represents the response distribution to stimuli having latent variable value Xk. The red and green distributions represent response distributions to stimuli having different latent variable values \({X_u} \ne {X_k}\). The blue dot represents the likelihood \({\cal L}({X_u};{R_1}(k,l)) = {\cal N}({R_1}(k,l);{\mu _u},{\Sigma _u})\) that observed noisy filter response R1(k, l) to stimulus sk,l was elicited by a stimulus having latent variable level Xu = Xk. Red and green dots represent the likelihoods that the response was elicited by a stimulus having latent variable \({X_u} \ne {X_k}\) (i.e., by a stimulus having the incorrect latent variable value). (B) Posterior probability over the latent variable given the noisy observed response in (A). The posterior probability of the correct latent variable value (in this case, Xk) is given by the likelihood of the correct latent variable value normalized by the sum of all likelihoods. Colored boxes surrounding entries in the inset equation indicate the likelihood of each latent variable. (C) Two-dimensional (i.e., two-filter) Gaussian response distributions. Each ellipse represents the joint filter responses to all stimuli having the same latent variable value. The second filter improves decoding performance by selecting for useful stimulus features that the first filter does not. The black dot near the center of the blue ellipse represents an observed noisy joint response R(k, l) to stimulus sk,l. 
The likelihood \({\cal L}({X_u};{\bf{R}}(k,l)) = {\cal N}({\bf{R}}(k,l);{\mu _u},{\Sigma _u})\) that the observed response was elicited by a stimulus having latent variable value Xu is obtained by evaluating the joint Gaussian at the noisy response; in this case, the product of the likelihoods represented by the blue dots on the single filter response distributions. (D–F) Same as A–C, but where information about the latent variable is carried by the class-conditional covariance instead of the mean; ellipse orientation, but not location, changes with the latent variable. AMA–Gauss finds the filters yielding conditional response distributions that are as different from each other as possible, given stimulus constraints.
AMA–Gauss: Best practices
The AMA–Gauss method developed here does not automatically determine the number of stimuli to train on, or the number of task-specific filters to learn; these choices are left to the user. 
To obtain representative results (i.e., to minimize sampling error), the training set must be of sufficient size. AMA–Gauss uses the sample mean and covariance to approximate the Gaussian distributions of filter responses conditional on each value of the latent variable (Equation 11). Training sets with at least 250 stimuli per level tend to give representative results. 
To extract all task-relevant information from each stimulus, a sufficient number of receptive fields must be learned. In general, the best practice is to learn filters until the change in the value of the total cost is negligible (Geisler et al., 2009). The current paper aims to demonstrate the properties and usefulness of AMA–Gauss rather than to determine the best number of filters; for clarity, we show only four filters for each task (see Results). Previous work has shown that, for the two tasks considered here, eight filters are required to capture nearly all task-relevant information (Burge & Geisler, 2014, 2015). The results presented in this paper hold for all eight filters, but we show only four for ease of presentation. 
Results
Retinal speed estimation and binocular disparity estimation are canonical visual tasks. Accurate and precise estimation of retinal image motion is critical for the accurate estimation of object motion and self-motion through the environment. Accurate and precise estimation of binocular disparity is critical for the accurate estimation of depth and the control of fixational eye movements. Although of fundamental importance for mobile seeing organisms, both tasks are difficult in natural conditions because of the enormous variability and complexity in natural images. 
The plan for the results section is as follows. First, we use AMA–Gauss to find the receptive fields that are optimal for estimating speed and disparity from local patches of natural images. Second, we compare AMA–Gauss and AMA and show that both methods (1) learn the same filters and (2) converge to the same cost for both tasks. Third, we verify that AMA–Gauss achieves the expected reductions in compute-time: filter-learning with AMA–Gauss is linear whereas AMA is quadratic in the number of stimuli in the training set. Fourth, we show that the class-conditional filter responses are approximately Gaussian, thereby justifying the Gaussian assumption for these tasks. Fifth, we show how contrast normalization contributes to the Gaussianity of the class-conditional responses. Sixth, we explain how the filter response distributions determine the likelihood functions and optimal pooling rules. Seventh, we explain how these results provide a normative explanation for why energy-model-like computations describe the response properties of neurons involved in these tasks. Eighth, and last, we establish the formal relationship between AMA–Gauss and the GQM, a recently developed method for neural systems identification. 
For each task, we obtained an existing labeled training set of natural photographic stimuli consisting of approximately 10,000 randomly sampled stimuli. All stimuli subtended 1° of visual angle. Perspective projection, physiological optics, and the wavelength sensitivity, spatial sampling, and temporal integration functions of the foveal cones were accurately modeled. Input noise was added to each stimulus with a noise level just high enough to mask retinal image detail that would be undetectable by the human visual system (Williams, 1985). Both training sets had flat prior probability distributions \({\cal P}(X)\) over the latent variable (see Discussion). The training set for speed estimation consisted of 10,500 stimuli [10,500 stimuli = 500 stimuli/level × 21 levels; Burge & Geisler, 2015]. Retinal speeds ranged from −8°/s to +8°/s; negative and positive speeds correspond to leftward and rightward drifting movies. Each stimulus had a duration of 250 ms. The training set for disparity estimation consisted of 7,600 stimuli [7,600 stimuli = 400 stimuli/level × 19 levels; Burge & Geisler, 2014]. Binocular disparities ranged from −16.875 arcmin to +16.875 arcmin; negative and positive disparities correspond to uncrossed and crossed disparities. (Note that although these training sets have a discrete number of latent variable values, AMA filters can be learned with discrete- or real-valued latent variables.) For extensive additional details on these training sets and for ideal observer performance in these tasks, please see Burge and Geisler, 2014, 2015. One important limitation of these datasets is that all motion signals were rigid and all disparity signals were planar. Future work will examine the impact of nonrigid motion (e.g., looming) and local depth variation (e.g., occlusion) on performance (see Discussion). 
Before processing, retinal image stimuli for both tasks were vertically averaged under a raised cosine window (0.5° at half-height). Vertically oriented linear receptive fields respond identically to the original and vertically averaged stimuli, and canonical receptive fields for both tasks are vertically oriented (Burge & Geisler, 2014, 2015). Thus, the vertically averaged signals represent the signals available to the orientation column that would be most useful to the task. Future work will examine the impact of off-vertical image features on performance. 
Next, we used AMA–Gauss to find the optimal filters for both tasks. The results presented below were obtained using the L0 cost function and constant, additive, independent filter response noise. In general, we have found that the optimal filters are quite robust to the choice of cost function when trained with natural stimuli (Burge & Jaini, 2017). Figure 5 shows results for the retinal speed estimation task. Figure 6 shows results for the disparity estimation task. AMA and AMA–Gauss learn nearly identical encoding filters (Figures 5A and 6A; ρ > 0.96) and exhibit nearly identical estimation costs (Figures 5B and 6B); note, however, that these filter and performance similarities are not guaranteed (see Appendix D). AMA–Gauss also dramatically reduces compute time (Figures 5C, D and 6C, D). With AMA, the time required to learn filters increases quadratically with the number of stimuli in the training set. With AMA–Gauss, filter learning time increases linearly with the number of stimuli. Finally, the class-conditional filter responses are approximately Gaussian (Figures 5E, F and 6E, F), indicating that the Gaussian assumption is justified for both tasks. Quadratic computations are therefore required to determine the likelihood of a particular value of the latent variable. The posterior probability distribution over the latent variable \({\cal P}(X|{\bf{R}})\) can be obtained from the likelihoods by straightforward application of Bayes' rule. 
Figure 5
 
Speed estimation task: filters, cost, compute time, and class-conditional response distributions. (A) AMA–Gauss and AMA filters for estimating speed (−8 to +8°/s) from natural image movies are nearly identical; ρ > 0.96 for all filters. (B) The cost for all the filters for both the models is identical. (C) Compute time for 50 evaluations of the posterior probability distribution is linear with AMA–Gauss, and quadratic with the full AMA model, in the training set size. (D) Same data as in (C) but on log–log axes. (E), (F) Joint filter responses, conditioned on each level of the latent variable, are approximately Gaussian. Different colors indicate different speeds. Individual symbols represent responses to individual stimuli. Thin black curves show that the filter response distributions, marginalized over the latent variable \({\cal P}(R) = \sum\nolimits_{u = 1}^{{N_{lvl}}} {\cal P} (R|{X_u}){\cal P}({X_u})\), are heavier-tailed than Gaussians (see Results and Discussion).
Figure 6
 
Disparity estimation task: filters, cost, compute time, and class-conditional response distributions. (A) AMA–Gauss and AMA filters for estimating disparity from natural stereo-images (−15 to +15 arcmin). (B–F) Caption format same as Figure 5B–F.
Response normalization, response Gaussianity, and decoding performance
Contrast varies significantly in natural stimuli. How does contrast normalization affect the filter responses? For the class of problems considered here (e.g., retinal speed estimation, binocular disparity estimation, and other energy-model-related tasks), neurophysiologically plausible contrast normalization (Albrecht & Geisler, 1991; Heeger, 1992) must be built into the filter response model (Equation 4) for the class-conditional filter responses \({\cal P}({\bf{R}}|{X_u})\) to be Gaussian distributed. [Note that some standard models of normalization are computationally equivalent (Albrecht & Geisler, 1991; Heeger, 1992), but that other more specialized forms of normalization are not (Carandini & Heeger, 2012; Coen-Cagli, Dayan, & Schwartz, 2012; Gao & Vasconcelos, 2009).] In AMA–Gauss, the input stimulus s is a contrast-normalized (\(||{\bf{s}}||\, \le 1.0\)) version of a (possibly noise-corrupted) intensity stimulus x with mean intensity \(\bar{\bf{x}}\). Luminance normalization converts the intensity stimulus to a contrast stimulus \({\bf{c}} = {{{\bf{x}} - \bar{\bf{x}}} \over \bar{\bf{x}}}\) by subtracting off and dividing by the mean. Contrast normalization converts the contrast stimulus to a contrast-normalized signal with unit magnitude (or less) \({\bf{s}} = {{\bf{c}} \over {\sqrt {nc_{50}^2 + \sum\nolimits_i {{\bf{c}}_i^2} } }}\), where c50 is an additive constant and n is the dimensionality of (e.g., number of pixels defining) each stimulus. Here, we assumed that the value of the additive constant is c50 = 0.0. The effect of the value of c50 has been studied previously (Burge & Geisler, 2014). 
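The two normalization steps just described can be sketched as follows (numpy; x is assumed to be a strictly positive intensity stimulus, and the function name is ours):

```python
import numpy as np

def normalize_stimulus(x, c50=0.0):
    """Luminance normalization followed by contrast normalization.

    Returns s = c / sqrt(n * c50**2 + sum(c**2)), where c is the Weber
    contrast stimulus. With c50 = 0 (the value assumed in the text),
    s has exactly unit magnitude; with c50 > 0, magnitude is below 1.
    """
    c = (x - x.mean()) / x.mean()                  # luminance normalization
    n = c.size                                     # stimulus dimensionality
    return c / np.sqrt(n * c50**2 + np.sum(c**2))  # contrast normalization
```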
To examine the effect of contrast normalization on the class-conditional filter response distributions, we computed the filter responses to the same stimuli with and without contrast normalization. With contrast normalization, filter response distributions are approximately Gaussian (Figure 7A, B, E, F). Without contrast normalization, filter response distributions have tails much heavier than Gaussian (Figure 7C, D, G, H). (Note that AMA–Gauss learns very similar filters with and without contrast normalization. Normalization does not change which stimulus features should be selected; it changes only how the selected features are represented.) Thus, biologically realistic normalization helps Gaussianize the conditional response distributions. Related results have been reported by other groups (Lyu & Simoncelli, 2009; Wang, Bovik, Sheikh, & Simoncelli, 2004). 
Figure 7
 
Filter responses with and without contrast normalization. (A) Class-conditional filter response distributions \({\cal P}({R_t}|{X_u})\) to contrast-normalized stimuli for each individual filter and each level of the latent variable in the speed estimation task. For visualization, responses are transformed to Z-scores by subtracting off the mean and dividing by the standard deviation. Gaussian probability density is overlaid for purposes of comparison. (B) Kurtosis of the two-dimensional conditional response distributions from filters 1 and 2 (violet; also see Figure 5E) and filters 3 and 4 (green; also see Figure 5F) for all levels of the latent variable. A two-dimensional Gaussian has a kurtosis of 8.0. Kurtosis was estimated by fitting a multidimensional generalized Gaussian via maximum likelihood methods. (C, D) Same as A, B but without contrast normalization. (E–H) Same as (A–D), but for the task of disparity estimation.
Contrast normalization not only Gaussianizes the response distributions; it also improves performance. If response distributions are heavy-tailed and have strong peaks at zero, then the Gaussian assumption is violated and attempts to decode the latent variable from those responses suffer. Contrast normalization reduces the peak at zero, thereby reducing decoding difficulty. Figure 8 compares decoding cost in the speed and disparity tasks with and without contrast normalization and shows that failing to normalize harms performance. Thus, contrast normalization improves task performance by decreasing kurtosis and increasing response Gaussianity. 
Figure 8
 
Decoding performance with and without contrast normalization. (A) Contrast normalization decreases decoding cost for the speed estimation task. (B) Percentage increase in cost without contrast normalization. (C, D) Same as (A, B) but for the disparity estimation task. The same result holds for different cost functions (e.g., squared error) and larger number of filters. If eight filters are used in the disparity task, failing to contrast normalize can decrease performance by ∼40%.
Subunit response models (e.g., the standard energy model, the GQM, and other LN models) are widely used to describe and fit neurons. They do not generally incorporate normalization (Adelson & Bergen, 1985; Park et al., 2013; Rust et al., 2005; Vintch et al., 2015). This fact is unsurprising. Many laboratory experiments use high-contrast white noise stimuli to map neural receptive fields (Jones & Palmer, 1987a, 1987b). Linear subunit receptive field responses to Gaussian noise are guaranteed to be Gaussian, so the lack of contrast normalization does not hurt performance in common laboratory conditions. With natural signals, the failure to normalize can hurt performance. Perhaps this is one reason why subunit models tend to generalize poorly to natural stimuli (but see Eickenberg, Rowekamp, Kouh, & Sharpee, 2012). It may be useful to incorporate response normalization in future instantiations of these models. 
Data-constrained likelihood functions
The class-conditional response distributions fully determine the likelihood function over the latent variable for any joint filter response R to an arbitrary stimulus. When the class-conditional response distributions are Gaussian, as they are here, the log-likelihood of latent variable value Xu is quadratic in the encoding filter responses  
\begin{equation}\tag{19}{ \log {\cal L}\left({X_u};{\bf{R}}\right) = \log {\cal P}\left({\bf{R}}|{X_u}\right) = - {1 \over 2}{\left({\bf{R}} - {\mu _u}\right)^T}\Sigma _u^{ - 1}\left({\bf{R}} - {\mu _u}\right) + {\zeta _u}} \end{equation}
where \({\zeta _u} = - {1 \over 2}\log |2\pi {\Sigma _u}|\). (Note that the likelihood function over the latent variable (Equation 19) is distinct from the likelihood function over the AMA–Gauss filters; Equation 15.) Carrying out the matrix multiplication shows that the log-likelihood can be re-expressed as a weighted sum of linear, squared, and sum-squared filter responses  
\begin{equation}\tag{20}{ \log \left[{\cal P}\left({\bf{R}}|{X_u}\right)\right] = \sum\limits_{i = 1}^q {{{\bf{w}}_{i,u}}} {R_i} + \sum\limits_{i = 1}^q {{{\bf{w}}_{ii,u}}} R_i^2 + \sum\limits_{i = 1}^{q - 1} {\sum\limits_{j = i + 1}^q {{{\bf{w}}_{ij,u}}} } {\left({R_i} + {R_j}\right)^2} + {{\zeta ^{\prime} }_u} } \end{equation}
where q is the number of filters and where the weights are functions of the class-conditional mean and covariance for each value Xu of the latent variable (Burge & Geisler, 2014, 2015). Specifically,  
\begin{equation}\tag{21}{{\bf{w}}_{i,u}} = \Sigma _u^{ - 1}{\mu _u}\end{equation}
 
\begin{equation}\tag{22}{{\bf{w}}_{ii,u}} = - diag\left(\Sigma _u^{ - 1}\right) + 0.5\Sigma _u^{ - 1}{\bf{1}}\end{equation}
 
\begin{equation}\tag{23}{{\bf{w}}_{ij,u}} = - 0.5\Sigma _{ij,u}^{ - 1},\forall ij{\rm{\ where}}\;j \gt i\end{equation}
 
\begin{equation}\tag{24}{\zeta ^{\prime} _u} = - 0.5\mu _u^T\Sigma _u^{ - 1}{\mu _u} + {\zeta _u}\end{equation}
where diag(.) is a function that returns the diagonal of a matrix and 1 is a column vector of ones. [Note that in these equations i and j index different filters (see Figure 2), not different latent variables and stimuli, as they do elsewhere in this manuscript.] These equations (Equations 20–24) indicate that the log-likelihood of latent variable value Xu is obtained by pooling the squared (and linear) responses of each receptive field with weights determined by the mean μu and covariance Σu of the subunit responses to stimuli with latent variable Xu.  
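Equations 19 through 24 can be checked numerically. The sketch below uses arbitrary toy values (not fit to any data) to compute the Gaussian log-likelihood directly (Equation 19) and via the weighted sum of linear, squared, and sum-squared responses (Equations 20–24); the two agree to machine precision.

```python
import numpy as np

rng = np.random.default_rng(1)
q = 4                                        # number of filters

# Toy class-conditional response statistics and a response vector; the
# values are arbitrary, chosen only to check the algebraic identity.
mu = rng.standard_normal(q)
A = rng.standard_normal((q, q))
Sigma = A @ A.T + q * np.eye(q)              # positive-definite covariance
R = rng.standard_normal(q)

P = np.linalg.inv(Sigma)                     # precision matrix
zeta = -0.5 * np.log(np.linalg.det(2 * np.pi * Sigma))

# Equation 19: Gaussian log-likelihood computed directly.
ll_direct = -0.5 * (R - mu) @ P @ (R - mu) + zeta

# Equations 21-24: pooling weights.
w_lin = P @ mu                               # weights on linear responses
w_sq = -np.diag(P) + 0.5 * P @ np.ones(q)    # weights on squared responses
w_ss = -0.5 * P                              # weights on sum-squared (j > i)
zeta_p = -0.5 * mu @ P @ mu + zeta

# Equation 20: weighted sum of linear, squared, sum-squared responses.
ll_pooled = w_lin @ R + w_sq @ R**2 + zeta_p
for i in range(q - 1):
    for j in range(i + 1, q):
        ll_pooled += w_ss[i, j] * (R[i] + R[j])**2

print(ll_direct, ll_pooled)
```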
In the speed and disparity estimation tasks, nearly all of the information about the latent variable is carried by the class-conditional covariance; the covariance of the filter responses to natural stimuli changes significantly with changes in the latent variable (see Figures 4D–F, 5E, F, and 6E, F). Thus, the weights on the squared and the sum-squared filter responses change dramatically with the value of the latent variable (Figure 9). In the speed estimation task, for example, the weights w34(X) on the sum-squared response of filter 3 and filter 4 peak at 0°/s (see Figure 9A). This peak results from the fact that the filter 3 and filter 4 response covariance is highest at 0°/s (see Figure 5F; Equation 23). In contrast, very little information is carried by the class-conditional means; the mean filter responses to natural stimuli are always approximately zero. Hence, the weights on the linear subunit responses are approximately zero (see Equation 21, Figure 4A–C). 
Figure 9
 
Quadratic pooling weights for computing the likelihood. Weights on squared and sum-squared filter responses (wii(X) and wij(X)) as a function of the latent variable. Weights on the linear filter responses are all approximately zero and are not shown. (A) Weights for speed estimation task. (B) Weights for disparity estimation task. Weights on squared responses are at maximum magnitude when the variance of the corresponding filter responses is at minimum. Weights on sum-squared responses are at maximum magnitude for latent variables yielding maximum response covariance (see Figures 5E, F and 6E, F).
The filter response distributions determine the computations (i.e., quadratic pooling rules and weights) required to compute the likelihood of different latent variable values. If these computations (Equations 20–24) are paired with an exponential output nonlinearity and implemented in a neuron, the neuron's response \(R_u^L = {\cal L}({X_u};{\bf{R}})\) would represent the likelihood that a stimulus having a particular value Xu of the latent variable elicited the observed filter responses R. This latent variable value Xu would be the preferred stimulus of the likelihood neuron. We refer to this hypothetical neuron as an AMA–Gauss likelihood neuron (see Equation 20). 
Four example likelihood functions are shown in Figure 10A, one for each of four stimuli having a true speed of −4°/s. Figure 10B shows four likelihood functions for stimuli having a true speed of 0°/s. Figure 10C, D show likelihood functions for stimuli having −15 arcmin and 0 arcmin of binocular disparity, respectively. These plots show likelihood functions directly, but this is not the standard way of assessing the response properties of neurons in cortex. 
Figure 10
 
Likelihood functions for speed and disparity tasks. (A) Likelihood functions for four randomly chosen natural image movies having true speeds of 4°/s. Each likelihood function represents the population response of the set of likelihood neurons, arranged by their preferred speeds. To ease the visual comparison, the likelihood functions are normalized to a constant volume by the sum of the likelihoods. (B) Same as (A), but for movies with a true speed of 0°/s. (C, D) Same as (A, B) but for stereo-images with −7.5 arcmin and 0.0 arcmin of disparity, respectively. (E) Tuning curves of speed-tuned likelihood neurons. For speeds sufficiently different from zero, tuning curves are approximately log-Gaussian and increase in width with speed. For near-zero speeds, tuning curves are approximately Gaussian. Each curve represents the mean response (i.e., tuning curve) of a likelihood neuron having a different preferred speed, normalized to a common maximum. Gray areas indicate 68% confidence intervals. (F) Tuning curves of disparity-tuned likelihood neurons.
The response properties of neurons in cortex are most commonly assessed by their tuning curves. Likelihood neuron tuning curves are obtained by first computing the mean likelihood neuron response across all natural stimuli having latent variable value Xk  
\begin{equation}\tag{25}\bar R_u^L\left({X_k}\right) = {1 \over {{N_k}}}\sum\limits_{l = 1}^{{N_k}} {\cal L} \left({X_u};{\bf{R}}\left(k,l\right)\right)\end{equation}
and then repeating for all values of the latent variable. Tuning curves for a population of likelihood neurons \({\overline {\bf{R}} ^L}(X)\) having a range of preferred speeds are shown in Figure 10E. The speed tuning curves are approximately Gaussian for preferred speeds near 0°/s and approximately log-Gaussian otherwise. Consistent with these results, neurons in the middle temporal (MT) area have approximately log-Gaussian speed tuning curves with bandwidths that increase systematically with speed (Nover, Anderson, & DeAngelis, 2005). It is also interesting to note that while quadratic computations are required to optimally decode the latent variable directly from the filter responses (see Figure 5E, F), likelihood neuron responses are linearly separable in speed. Similar points can be made about the disparity likelihood neurons (Figure 10F). The computations reported here thus constitute a general recipe for constructing selective, invariant neurons having an arbitrary preferred stimulus (latent variable) from the responses of a small, well-chosen set of receptive fields.  
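Equation 25 can be illustrated with a drastically simplified, one-filter sketch in which the class-conditional responses are zero-mean Gaussians whose variance grows with the latent variable (mimicking the speed task), and likelihoods are normalized per stimulus as in Figure 10 before averaging. All numbers are illustrative, and expectations are computed by quadrature rather than over natural stimuli.

```python
import numpy as np

# Toy 1-D version of Equation 25: one "filter," and classes that are
# zero-mean Gaussians whose variance grows with the latent variable.
sigmas = np.array([0.5, 1.0, 2.0, 4.0])   # class-conditional response SDs
r = np.linspace(-20, 20, 4001)            # response grid for quadrature
dr = r[1] - r[0]

def gauss(r, s):
    return np.exp(-0.5 * (r / s) ** 2) / (s * np.sqrt(2 * np.pi))

lik = np.stack([gauss(r, s) for s in sigmas])    # likelihood per neuron
lik_norm = lik / lik.sum(axis=0, keepdims=True)  # normalize across neurons

# Equation 25: tuning curve = mean (normalized) response of neuron u
# across stimuli drawn from class k, computed here by quadrature.
tuning = np.array([[np.sum(lik_norm[u] * gauss(r, sk)) * dr for sk in sigmas]
                   for u in range(len(sigmas))])

print(tuning.round(3))
```

In this sketch the neurons preferring the extreme classes respond maximally to their preferred class, even though each class has the same (zero) mean response.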
Linking AMA–Gauss and the energy model
Neural activity involved in many visual tasks has been productively modeled by energy-model-like (i.e., quadratic) computations (Cumming & DeAngelis, 2001; Emerson, Bergen, & Adelson, 1992; Peng & Shi, 2010). We have shown that in two classic tasks (retinal speed and binocular disparity estimation), the class-conditional filter responses to natural stimuli are approximately Gaussian distributed. In such cases, quadratic combinations of the filter responses are the optimal computations and yield the likelihood of a particular value of the latent variable (Equation 20). The weights are determined by the filter responses to natural stimuli (see Methods). Thus, if these computations were instantiated in a neuron, its response would represent the likelihood of the latent variable (Figure 2C). The current results therefore constitute a normative explanation for why energy-model-like computations account for response properties of neurons involved in these tasks. 
Interestingly, in recent years, discrepancies have emerged between the properties of neurons in cortex and the energy models that are often used to describe them (Cumming & DeAngelis, 2001; Rust et al., 2005; Tanabe, Haefner, & Cumming, 2011). Many of these discrepancies are a natural consequence of the optimal computations for estimating disparity and motion from natural stimuli (Burge & Geisler, 2014, 2015). For example, the responses of motion- and disparity-selective neurons have both been found to depend on multiple excitatory and suppressive subunit receptive fields, rather than the two exclusively excitatory subunit receptive fields posited by the energy model. Multiple subunit receptive fields have increased potential to select task-relevant information from each stimulus. Excitatory and inhibitory weighting schemes are required to use the selected information optimally. The quadratic computations in Equation 20 specify exactly how to optimally weight and sum the responses from multiple receptive fields to achieve selectivity for particular latent variable values (also see Figure 9). These computations yield more selective, invariant tuning curves (and improved estimation performance) than those of the standard energy model (Burge & Geisler, 2014, 2015), and follow directly from the normative framework employed here. 
Linking AMA–Gauss and the GQM: Connecting normative and response-triggered analyses
In this section, we establish the formal similarities between AMA–Gauss and the generalized quadratic model (GQM), a recently developed subunit model for neural systems identification (Park et al., 2013; Wu et al., 2015). The goal of the GQM is to identify (fit) the subunit receptive fields that drive a neuron's response (Figure 2B). The goal of AMA–Gauss is to find the subunit receptive fields and quadratic pooling rules that are best for a particular task (Figure 2C). AMA can thus generate predictions about the subunit receptive fields that the GQM will recover from a neuron, under the hypothesis that the neuron computes the likelihood of a task-relevant latent variable. 
The GQM assumes that a neuron's spiking or intracellular voltage response to a stimulus is given by  
\begin{equation}\tag{26}y\sim P\left(f\left(Q\left({\bf{x}}\right)\right)\right)\quad {\rm{where}}\>Q\left({\bf{x}}\right) = {{\bf{x}}^T}C{\bf{x}} + {{\bf{b}}^T}{\bf{x}} + a\end{equation}
where y is the neural response, P(.) is the noise model, f(.) is a nonlinearity, and \(\lambda = f(Q({\bf{x}}))\) is the mean response. Park et al. (2013) use maximum likelihood methods to recover the parameters of the model given a set of stimuli, the neuron's response to each stimulus, and a noise model. In AMA–Gauss, the log-likelihood of latent variable Xu is given by  
\begin{equation}\tag{27}l\left({X_u}\right) = - {1 \over 2}{\left({\bf{R}} - {\mu _u}\right)^T}\Sigma _u^{ - 1}\left({\bf{R}} - {\mu _u}\right) + {\zeta _u}\end{equation}
where μu and Σu are the class-conditional response mean and covariance and ζu is a constant. The noisy filter response vector R is given by the projection of the stimulus onto the filters f plus noise (Equations 4, 5). Hence, Equation 27 can be rewritten as  
\begin{equation}\tag{28}l\left({X_u}\right) = - {1 \over 2}\bigg({{\bf{x}}^T}{\bf{f}}\Sigma _u^{ - 1}{{\bf{f}}^T}{\bf{x}} - 2\left(\mu _u^T\Sigma _u^{ - 1} - \eta _u^T\Sigma _u^{ - 1}\right){{\bf{f}}^T}{\bf{x}} + \mu _u^T\Sigma _u^{ - 1}{\mu _u} - \eta _u^T\Sigma _u^{ - 1}{\eta _u} + 2\eta _u^T\Sigma _u^{ - 1}{\mu _u}\bigg) + {\zeta _u}\end{equation}
 
\begin{equation}{\rm{or}}\quad l\left({X_u}\right) = {{\bf{x}}^T}C{\bf{x}} + {{\bf{b}}^T}{\bf{x}} + a\end{equation}
where \(C = - {1 \over 2}{\bf{f}}\Sigma _u^{ - 1}{{\bf{f}}^T}\) is a rank-q matrix (q is the number of filters), \({{\bf{b}}^T} = \mu _u^T\Sigma _u^{ - 1}{{\bf{f}}^T} - \eta _u^T\Sigma _u^{ - 1}{{\bf{f}}^T}\), and \(a = - {1 \over 2}\mu _u^T\Sigma _u^{ - 1}{\mu _u} + {1 \over 2}\eta _u^T\Sigma _u^{ - 1}{\eta _u} - \eta _u^T\Sigma _u^{ - 1}{\mu _u} + {\zeta _u}\). (Parameter values under the expected log-likelihood are provided in Appendix E.) The parameters of the GQM are therefore simple functions of the AMA–Gauss encoding filters f and their responses to natural stimuli, conditional on latent variable Xu. Given a hypothesis about the functional purpose of a neuron's activity, AMA–Gauss can predict the parameters that the GQM would recover via response-triggered analyses.  
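The correspondence can be verified numerically. The sketch below uses toy filters and response statistics (not fit to any data), with the noise mean η set to zero so the η terms in Equation 28 drop out, and checks that the AMA–Gauss log-likelihood equals the GQM quadratic form x^T C x + b^T x + a.

```python
import numpy as np

rng = np.random.default_rng(2)
d, q = 16, 3                     # stimulus dimension, number of filters

# Toy filters and class-conditional response statistics. The noise mean
# (eta) is set to zero here, so the eta terms in Equation 28 drop out.
f = rng.standard_normal((d, q))  # filters as columns; responses R = f^T x
mu = rng.standard_normal(q)
A = rng.standard_normal((q, q))
Sigma = A @ A.T + q * np.eye(q)  # positive-definite response covariance
P = np.linalg.inv(Sigma)
zeta = -0.5 * np.log(np.linalg.det(2 * np.pi * Sigma))

x = rng.standard_normal(d)       # an arbitrary stimulus

# Equation 27: log-likelihood from the filter responses.
R = f.T @ x
ll = -0.5 * (R - mu) @ P @ (R - mu) + zeta

# GQM form: l = x^T C x + b^T x + a, with parameters mapped from AMA-Gauss.
C = -0.5 * f @ P @ f.T           # rank-q, d x d
b = f @ P @ mu                   # i.e., b^T = mu^T Sigma^-1 f^T
a = -0.5 * mu @ P @ mu + zeta

ll_gqm = x @ C @ x + b @ x + a
print(ll, ll_gqm)
```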
The primary formal distinction between AMA–Gauss and the GQM is that AMA–Gauss explicitly models noise in the encoding filter responses, whereas the GQM models noise only after quadratic pooling of the filter responses; that is, the GQM implicitly assumes noiseless filter responses. When subunit responses are noiseless, all subunit receptive fields spanning the same subspace (i.e., all linear filter combinations) provide an equivalent encoding. When responses are noisy (as they are in all biological systems), the stimulus encodings provided by different filters spanning the same subspace are no longer equivalent (Burge & Jaini, 2017). Future work will examine whether this distinction between AMA and the GQM can be leveraged to overcome a limitation common to all standard subunit models, namely, that their descriptions of neurons are unique only up to the subspace spanned by the subunit receptive fields (but see Kaardal, Fitzgerald, Berry, & Sharpee, 2013). 
Discussion
Accuracy maximization analysis (AMA) is a supervised Bayesian method for task-specific dimensionality reduction; it returns the encoding filters (receptive fields) that select the stimulus features that provide the most useful information about the task-relevant latent variable (Geisler et al., 2009). In conjunction with carefully collected databases of natural images and scenes and psychophysical experimental techniques, AMA has contributed to the development of ideal observers for several fundamental sensory-perceptual tasks in early- and mid-level vision (Geisler et al., 2009; Burge & Geisler, 2011, 2014, 2015). Unfortunately, AMA's compute time is high enough to render the method impractical for large problems without specialized computing resources. 
We have developed AMA–Gauss, which makes the assumption that the class-conditional filter responses are Gaussian distributed, and have shown that AMA–Gauss markedly reduces compute time without compromising performance when the assumption is justified. We show that the assumption is justified for two fundamentally important visual tasks with natural stimuli (see Figure 5 and Figure 6; Burge & Geisler, 2014, 2015). These results provide a normative explanation for why energy-model-like computations have proven useful in the study of motion and disparity estimation and discrimination. We speculate that the assumption will prove justified for other energy-model-related tasks in early vision (e.g., motion-in-depth estimation). AMA–Gauss also has the same formal structure as the generalized quadratic model (GQM), a recently developed method for neural systems identification, raising the possibility that a single framework could be used both to predict and estimate the properties of neurons involved in particular tasks. 
There are several important implications of these results. First, the optimal filters and the optimal pooling rules for decoding the latent variable are all determined by the properties of natural stimuli. If the training sets are representative of stimuli encountered in natural viewing, then the computations reported here should be optimal for the tasks of speed and disparity estimation. Second, at the right level of abstraction, the optimal solutions to these two different tasks share deep similarities, thereby raising the possibility that the same normative computational framework will apply to all energy-model related tasks. 
Response distributions: Gaussian vs. heavy-tailed
The results reported here may appear to conflict with the widely reported finding that receptive field responses to natural images are highly non-Gaussian, with heavy tails and sharp peaks at zero (Cadieu & Olshausen, 2012; Field, 1987; Olshausen & Field, 1997). There are two explanations for this apparent discrepancy. First, previous analyses generally have not incorporated contrast normalization. Second, previous analyses are generally unsupervised and therefore do not condition on relevant latent variables (e.g., motion; Cadieu & Olshausen, 2012). Note that even when contrast normalization is incorporated and the class-conditional responses are Gaussian, the filter responses, marginalized over the latent variable, tend to be heavy-tailed because the marginals are mixtures of Gaussians \({\cal P}({R_t}) = \sum\nolimits_u {\cal P} ({R_t}|{X_u}){\cal P}({X_u})\) with different variances (see black curves in Figure 5E, F and Figure 6E, F). Therefore, our results are more similar to previous results than may appear at first glance (Ruderman & Bialek, 1994). In general, heavy-tailed response distributions are yielded by response models that do not incorporate biologically plausible contrast normalization and by response analyses that do not condition on the latent variable (Lyu & Simoncelli, 2009; Wang et al., 2004). Incorporating response normalization and latent variable conditionalization, as we have here, may help reveal statistical properties of receptive field responses to complex natural stimuli that have not yet been fully appreciated. 
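The heavy-tailed marginal falls out of the mixture algebra: for a mixture of zero-mean Gaussians, the fourth moment is 3·E[var²] while the variance is E[var], so kurtosis exceeds the Gaussian value of 3 whenever the class variances differ. A minimal check with an assumed flat prior and illustrative variances:

```python
import numpy as np

# Mixture-of-Gaussians marginal: P(R) = sum_u P(R|X_u) P(X_u), with
# zero-mean Gaussian class-conditionals of unequal variance. Its kurtosis
# is 3 * E[var^2] / E[var]^2, which exceeds the Gaussian value of 3
# unless all class variances are equal.
p = np.array([0.25, 0.25, 0.25, 0.25])   # flat prior over the classes
var = np.array([0.25, 1.0, 4.0, 16.0])   # class-conditional variances

mix_var = p @ var                        # variance of the marginal
mix_m4 = 3.0 * (p @ var**2)              # fourth moment of the marginal
kurt = mix_m4 / mix_var**2

print(kurt)                              # > 3: heavier tails than Gaussian
```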
Likelihood functions: Data-constrained vs. assumed
Evolution selects organisms because they perform certain critical sensory, perceptual, and behavioral tasks better than their evolutionary competitors. Certain features of sensory stimuli are more useful for some tasks than others. The stimulus features that are most useful to encode thus depend on the task-relevant latent variables that will be decoded from the stimuli. However, many models of neural encoding do not explicitly consider the tasks for which the encoded information will be decoded (Olshausen & Field, 1997; Simoncelli & Olshausen, 2001) and many task-specific models of neural decoding do not explicitly consider how sensory stimuli and neural encoders constrain the information available for decoding (Ernst & Banks, 2002; Ma, Beck, Latham, & Pouget, 2006). 
The approach advanced here is an early attempt to address both issues simultaneously. By performing task-specific analyses using thousands of individual natural stimuli, learning the optimal filters, and characterizing the class-conditional responses to natural stimuli, we determined the likelihood functions that optimize performance in natural viewing. The likelihood functions that result from the filter response distributions are (on average) log-Gaussian in speed and disparity, with widths that increase with the value of the latent variable. In previous work with natural stimuli, we showed that the optimal receptive fields, response distributions, and resulting likelihood functions are robust to significant variation in the shape of the prior, cost function, and noise power (Burge & Jaini, 2017). It is reasonable to conclude that the task and the constraints imposed by natural stimuli are the most important determinants of the width and shape of the likelihood functions. 
Some prominent theories of neural processing operate on the assumption that likelihood functions can take on arbitrary widths and shapes via flexible allocation of neural resources (Ganguli & Simoncelli, 2014; Girshick, Landy, & Simoncelli, 2011; Seung & Sompolinsky, 1993; Wei & Stocker, 2015). Some reports have gone further to suggest that, in the context of Bayesian efficient coding, the prior probability distribution over the latent variable is the primary factor determining the widths and shapes of the likelihood functions (Ganguli & Simoncelli, 2014; Wei & Stocker, 2015). These reports predict that if the prior probability distribution is flat, the likelihood functions will be symmetric and have widths that remain constant with changes in the value of the latent variable. These reports also predict that if the prior probability distribution is nonuniform (e.g., peaked at zero), the likelihood functions will be asymmetric with widths that change systematically with the latent variable. 
In the tasks that we examined, we found that asymmetric likelihood functions optimize performance despite the fact that the training sets from which the optimal filters were learned had flat priors over the latent variable (see Results; Burge & Geisler, 2014, 2015; Burge & Jaini, 2017). These results appear at odds with the predictions of previous reports. However, these previous reports do not model the impact of natural stimulus variation. The implicit assumption is that task-irrelevant (“nuisance”) stimulus variation can be ignored (Ganguli & Simoncelli, 2014; Wei & Stocker, 2015). If the goal is to understand optimal task-specific processing of natural signals, our results indicate that such variation cannot be ignored. Indeed, task-relevant and irreducible task-irrelevant natural stimulus variability are almost certainly the most important determinants of likelihood function shapes and widths. 
In natural viewing, visual estimates are driven primarily by stimulus measurements (likelihood functions), not by prior distributions. If estimates were driven only by the prior, observers could not respond to spatial or temporal changes in the environment. A full account of task-specific perceptual processing and its underlying neurophysiology must therefore incorporate natural stimulus variability. Future studies on the efficient allocation of neural resources should verify that the likelihood functions used in modeling efforts can be constructed by nervous systems given the constraints imposed by natural stimuli. 
Natural vs. artificial stimuli
The problem of estimating speed and disparity from natural images differs from the problem of estimating speed and disparity from artificial laboratory stimuli in at least one important respect. Variability among natural stimuli having the same latent variable value is substantially greater than variability among the artificial stimuli commonly used in vision and visual neuroscience experiments. In motion experiments (Figure 11A), drifting Gabors and random-dot kinematograms are common artificial stimuli. In disparity experiments, phase-shifted binocular Gabors and random-dot stereograms are common artificial stimuli (Figure 11B). The statistical properties of these artificial stimuli are notably different from those of natural stimuli. Gabors have Gaussian amplitude spectra, and random-dot stereograms have delta auto-correlation functions. Natural stimuli have a rich variety of textures and shapes that cause significant variation in their 1/f amplitude spectra and auto-correlation functions. 
Figure 11
 
Natural stimuli, artificial stimuli, and class-conditional responses. Many different retinal images are consistent with a given value of the task-relevant latent variable. These differences cause within-class (task-irrelevant) stimulus variation. Within-class stimulus variation is greater for natural stimuli than for typical artificial stimuli used in laboratory experiments. (A) Stimuli for speed estimation experiments. Two different example stimuli are shown for each stimulus type: natural stimuli (represented by cartoon line drawings), Gabor stimuli, and random-dot stimuli. Both example stimuli for each stimulus type drift at exactly the same speed, but create different retinal images. Natural stimuli cause more within-class retinal stimulus variation than artificial stimuli. (B) Same as (A), but for disparity. (C) Speed task: class-conditional responses to contrast-fixed 1.0 cpd drifting Gabors with random phase (speed task). Colors indicate different speeds. Ellipses represent filter responses to natural stimuli having the same speeds. (D) Disparity task: Class-conditional responses to contrast-fixed 1.5 cpd binocular Gabors with random phase. Class-conditional responses no longer have Gaussian structure, and instead have ring structure.
To examine the impact of artificial stimuli on the class-conditional responses, we created artificial stimulus sets composed of contrast-fixed, phase-randomized Gabors drifting at different speeds and having different amounts of disparity. For each task, the spatial frequency of the carrier was closely matched to the preferred spatial frequency of the first two optimal filters (1.0 cpd for speed, 1.5 cpd for disparity). Joint filter responses to these artificial stimuli are shown in Figure 11C, D; they are notably different from the filter responses to natural stimuli. Although the class-conditional responses to Gabors are approximately aligned with the major axis of the Gaussian characterizing responses to corresponding natural stimuli, the responses themselves are no longer Gaussian distributed, exhibiting ring-shaped structure instead. Thus, determining the optimal rules for processing natural stimuli by analyzing only artificial stimuli is likely to be a difficult enterprise. 
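The ring structure has a simple source: a contrast-fixed sinusoid with random phase projects onto a quadrature filter pair as (cos φ, −sin φ) scaled by a fixed gain, so the joint response magnitude is constant. A toy sketch, with plain sinusoids standing in for the Gabors used in the actual analysis:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 128
t = np.arange(d)
omega = 2 * np.pi * 8 / d        # 8 full cycles across the window

# Quadrature filter pair (cosine- and sine-phase at the same frequency),
# scaled so a unit-contrast sinusoid at this frequency gives unit radius.
f1 = np.cos(omega * t) * (2.0 / d)
f2 = np.sin(omega * t) * (2.0 / d)

# Contrast-fixed, phase-randomized sinusoids (stand-ins for the Gabors).
phases = rng.uniform(0.0, 2.0 * np.pi, size=1000)
stimuli = np.cos(omega * t[None, :] + phases[:, None])

r1, r2 = stimuli @ f1, stimuli @ f2
radius = np.hypot(r1, r2)        # joint response magnitude per stimulus

# Every joint response lies on a ring: the radius is constant.
print(radius.min(), radius.max())
```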
These results suggest another conclusion that may be somewhat counterintuitive given the history of the field. The tradition in vision science has been to eliminate irrelevant stimulus variation from experimental protocols by using simple artificial stimuli. These stimuli are easy to characterize mathematically and manipulate parametrically. But artificial stimuli lack the richness and variability that visual systems evolved to process. Analyzing complex, variable natural stimuli may reveal simple (e.g., Gaussian) statistical structure that might otherwise be missed. We believe that the results presented here highlight the importance of conducting rigorous, well-controlled, task-focused computational and behavioral investigations with natural stimuli. These investigations complement classic studies with artificial stimuli, and provide a fuller picture of how visual systems function in natural circumstances. 
Limitations and future directions
The results presented here represent the first in what we hope is a series of steps to link normative models for natural tasks and descriptive models of neural response. However, although we believe that developing AMA-Gauss and demonstrating its links to methods for neural systems identification are useful advances, several limitations should be kept in mind. Here, we address the drawbacks of the natural stimulus sets, the general applicability of AMA–Gauss, and the importance of the links that we have drawn to descriptive models of neural response. 
The natural image sets used in this manuscript had natural contrast distributions and photographic textures, but they lacked natural depth structure. All motion signals were rigid and all disparity signals were planar. Future work will examine the impact of non-rigid motion (e.g., looming) and local depth variation (e.g., occlusion) on performance. We have recently collected a dataset of stereo images that addresses this limitation (Burge, McCann, & Geisler, 2016). Each stereo image has co-registered distance data from which ground truth disparity patterns can be computed. Pilot analyses suggest that the results presented in the current manuscript hold for natural stereo images with local depth variation. We suspect, but we are not yet well-positioned to show, that the same will be true of motion signals having natural depth variation. 
AMA–Gauss is the appropriate normative framework for understanding energy-model-related tasks, but the general usefulness of AMA–Gauss is unknown. AMA–Gauss makes the best possible use of the first- and second-order filter response statistics, but it is blind to higher-order response statistics that may exist in natural motion (Nitzany & Victor, 2014) and natural disparity signals (see Appendix D). To increase generality, one could develop a further variant of AMA that incorporates rectification into the response model. This modification would confer the ability, at least in principle, to pick up on potentially useful higher-order motion and disparity cues, and provide a normative model that complements other methods for neural systems identification (McFarland et al., 2013). 
Conclusion
In this paper, we develop AMA–Gauss, a new form of AMA that incorporates the assumption that the class-conditional filter responses are Gaussian distributed. We use AMA–Gauss to establish links between task-specific normative models of speed and disparity estimation and the motion- and disparity-energy models, two popular descriptive models of neurons that are selective for those quantities. Our results suggest that energy-model-like (i.e., quadratic) computations are optimal for these tasks in natural scenes. We also establish the formal similarities between AMA–Gauss and the generalized quadratic model (GQM), a recently developed model for neural systems identification. The developments presented here forge links between normative task-specific modeling and powerful statistical tools for describing neural response, and demonstrate the importance of analyzing natural signals in perception and neuroscience research. 
Acknowledgments
P. J. was supported by the Cheriton Graduate Scholarship and startup funds from the University of Pennsylvania. J. B. was supported by National Institutes of Health Grant EY011747 and startup funds from the University of Pennsylvania. 
Commercial relationships: none. 
Corresponding author: Johannes Burge. 
Address: Department of Psychology, University of Pennsylvania, Philadelphia, PA, USA. 
References
Adelson, E. H., & Bergen, J. R. (1985). Spatiotemporal energy models for the perception of motion. Journal of the Optical Society of America A, 2 (2), 284–299.
Albrecht, D. G., & Geisler, W. S. (1991). Motion selectivity and the contrast-response function of simple cells in the visual cortex. Visual Neuroscience, 7 (6), 531–546.
Bell, A. J., & Sejnowski, T. J. (1997). The “independent components” of natural scenes are edge filters. Vision Research, 37 (23), 3327–3338.
Burge, J., & Geisler, W. S. (2011). Optimal defocus estimation in individual natural images. Proceedings of the National Academy of Sciences, USA, 108 (40), 16849–16854.
Burge, J., & Geisler, W. S. (2012). Optimal defocus estimates from individual images for autofocusing a digital camera. In Proceedings of the IS&T/SPIE 47th annual meeting. Proceedings of SPIE.
Burge, J., & Geisler, W. S. (2014). Optimal disparity estimation in natural stereo images. Journal of Vision, 14 (2): 1, 1–18, doi:10.1167/14.2.1. [PubMed] [Article]
Burge, J., & Geisler, W. S. (2015). Optimal speed estimation in natural image movies predicts human performance. Nature Communications, 6, 7900.
Burge, J., & Jaini, P. (2017). Accuracy maximization analysis for sensory-perceptual tasks: Computational improvements, filter robustness, and coding advantages for scaled additive noise. PLoS Computational Biology, 13 (2), e1005281.
Burge, J., McCann, B. C., & Geisler, W. S. (2016). Estimating 3d tilt from local image cues in natural scenes. Journal of Vision, 16 (13): 2, 1–25, doi:10.1167/16.13.2. [PubMed] [Article]
Cadieu, C. F., & Olshausen, B. A. (2012). Learning intermediate-level representations of form and motion from natural movies. Neural Computation, 24 (4), 827–866.
Carandini, M., & Heeger, D. J. (2012). Normalization as a canonical neural computation. Nature Reviews Neuroscience, 13 (1), 51–62.
Coen-Cagli, R., Dayan, P., & Schwartz, O. (2012). Cortical surround interactions and perceptual salience via natural scene statistics. PLoS Computational Biology, 8 (3), e1002405.
Cook, R., Forzani, L., & Yao, A. (2010). Necessary and sufficient conditions for consistency of a method for smoothed functional inverse regression. Statistica Sinica, 20 (1), 235–238.
Cook, R. D., & Forzani, L. (2009). Likelihood-based sufficient dimension reduction. Journal of the American Statistical Association, 104 (485), 197–208.
Cumming, B., & DeAngelis, G. (2001). The physiology of stereopsis. Annual Review of Neuroscience, 24 (1), 203–238.
DeAngelis, G. C. (2000). Seeing in three dimensions: The neurophysiology of stereopsis. Trends in Cognitive Sciences, 4 (3), 80–90.
Eickenberg, M., Rowekamp, R. J., Kouh, M., & Sharpee, T. O. (2012). Characterizing responses of translation-invariant neurons to natural stimuli: maximally informative invariant dimensions. Neural Computation, 24 (9), 2384–2421.
Emerson, R. C., Bergen, J. R., & Adelson, E. H. (1992). Directionally selective complex cells and the computation of motion energy in cat visual cortex. Vision Research, 32 (2), 203–218.
Ernst, M. O., & Banks, M. S. (2002). Humans integrate visual and haptic information in a statistically optimal fashion. Nature, 415 (6870), 429–433.
Field, D. J. (1987). Relations between the statistics of natural images and the response properties of cortical cells. Journal of the Optical Society of America. A, Optics and Image Science, 4 (12), 2379–2394.
Ganguli, D., & Simoncelli, E. P. (2014). Efficient sensory encoding and Bayesian inference with heterogeneous neural populations. Neural Computation, 26 (10), 2103–2134.
Gao, D., & Vasconcelos, N. (2009). Decision-theoretic saliency: computational principles, biological plausibility, and implications for neurophysiology and psychophysics. Neural Computation, 21 (1), 239–271.
Geisler, W. S., & Albrecht, D. G. (1997). Visual cortex neurons in monkeys and cats: Detection, discrimination, and identification. Visual Neuroscience, 14 (5), 897–919.
Geisler, W. S., Najemnik, J., & Ing, A. D. (2009). Optimal stimulus encoders for natural tasks. Journal of Vision, 9 (13): 17, 1–16, doi:10.1167/9.13.17. [PubMed] [Article]
Girshick, A. R., Landy, M. S., & Simoncelli, E. P. (2011). Cardinal rules: Visual orientation perception reflects knowledge of environmental statistics. Nature Neuroscience, 14 (7), 926–932.
Heeger, D. J. (1992). Normalization of cell responses in cat striate cortex. Visual Neuroscience, 9 (02), 181–197.
Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24 (6), 417–441.
Jones, J. P., & Palmer, L. A. (1987a). An evaluation of the two-dimensional Gabor filter model of simple receptive fields in cat striate cortex. Journal of Neurophysiology, 58 (6), 1233–1258.
Jones, J. P., & Palmer, L. A. (1987b). The two-dimensional spatial structure of simple receptive fields in cat striate cortex. Journal of Neurophysiology, 58 (6), 1187–1211.
Kaardal, J., Fitzgerald, J. D., Berry, M. J., & Sharpee, T. O. (2013). Identifying functional bases for multidimensional neural computations. Neural Computation, 25 (7), 1870–1890.
Lewicki, M. S. (2002). Efficient coding of natural sounds. Nature Neuroscience, 5 (4), 356–363.
Lyu, S., & Simoncelli, E. P. (2009). Modeling multiscale sub-bands of photographic images with fields of Gaussian scale mixtures. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31 (4), 693–706.
Ma, W. J., Beck, J. M., Latham, P. E., & Pouget, A. (2006). Bayesian inference with probabilistic population codes. Nature Neuroscience, 9 (11), 1432–1438.
McFarland, J. M., Cui, Y., & Butts, D. A. (2013). Inferring nonlinear neuronal computation based on physiologically plausible inputs. PLoS Computational Biology, 9 (7), e1003143.
Nitzany, E. I., & Victor, J. D. (2014). The statistics of local motion signals in naturalistic movies. Journal of Vision, 14 (4): 10, 1–15, doi:10.1167/14.4.10. [PubMed] [Article]
Nover, H., Anderson, C. H., & DeAngelis, G. C. (2005). A logarithmic, scale-invariant representation of speed in macaque middle temporal area accounts for speed discrimination performance. The Journal of Neuroscience, 25 (43), 10049–10060.
Ohzawa, I. (1998). Mechanisms of stereoscopic vision: The disparity energy model. Current Opinion in Neurobiology, 8 (4), 509–515.
Ohzawa, I., DeAngelis, G. C., & Freeman, R. D. (1990). Stereoscopic depth discrimination in the visual cortex: Neurons ideally suited as disparity detectors. Science, 249 (4972), 1037–1041.
Ohzawa, I., DeAngelis, G. C., & Freeman, R. D. (1997). Encoding of binocular disparity by complex cells in the cat's visual cortex. Journal of Neurophysiology, 77 (6), 2879–2909.
Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381 (6583), 607–609.
Olshausen, B. A., & Field, D. J. (1997). Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37 (23), 3311–3325.
Pagan, M., Simoncelli, E. P., & Rust, N. C. (2016). Neural quadratic discriminant analysis: Nonlinear decoding with V1-like computation. Neural Computation, 28 (11), 2291–2319.
Park, I. M., Archer, E. W., Priebe, N., & Pillow, J. W. (2013). Spectral methods for neural characterization using generalized quadratic models. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, & K. Q. Weinberger (Eds.), Advances in neural information processing systems (pp. 2454–2462). Red Hook, NY: Curran Associates.
Peng, Q., & Shi, B. E. (2010). The changing disparity energy model. Vision Research, 50 (2), 181–192.
Petersen, K. B., & Pedersen, M. S. (2008). The matrix cookbook. Technical University of Denmark, 7, 15.
Ruderman, D. L., & Bialek, W. (1994). Statistics of natural images: Scaling in the woods. Physical Review Letters, 73 (6), 814–817.
Rust, N. C., Schwartz, O., Movshon, J. A., & Simoncelli, E. P. (2005). Spatiotemporal elements of macaque V1 receptive fields. Neuron, 46 (6), 945–956.
Schwartz, O., Pillow, J. W., Rust, N. C., & Simoncelli, E. P. (2006). Spike-triggered neural characterization. Journal of Vision, 6 (4): 13, 484–507, doi:10.1167/6.4.13. [PubMed] [Article]
Seung, H. S., & Sompolinsky, H. (1993). Simple models for reading neuronal population codes. Proceedings of the National Academy of Sciences, USA, 90 (22), 10749–10753.
Simoncelli, E. P., & Olshausen, B. A. (2001). Natural image statistics and neural representation. Annual Review of Neuroscience, 24 (1), 1193–1216.
Tanabe, S., Haefner, R. M., & Cumming, B. G. (2011). Suppressive mechanisms in monkey V1 help to solve the stereo correspondence problem. Journal of Neuroscience, 31 (22), 8295–8305.
Tipping, M. E., & Bishop, C. M. (1999). Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61 (3), 611–622.
Vintch, B., Movshon, J. A., & Simoncelli, E. P. (2015). A convolutional subunit model for neuronal responses in macaque V1. The Journal of Neuroscience, 35 (44), 14829–14841.
Wang, Z., Bovik, A. C., Sheikh, H. R., & Simoncelli, E. P. (2004). Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13 (4), 600–612.
Wei, X.-X., & Stocker, A. A. (2015). A Bayesian observer model constrained by efficient coding can explain ‘anti-Bayesian' percepts. Nature Neuroscience, 18 (10), 1509–1517.
Williams, D. R. (1985). Visibility of interference fringes near the resolution limit. Journal of the Optical Society of America A, 2 (7), 1087–1093.
Wu, A., Park, I. M., & Pillow, J. W. (2015). Convolutional spike-triggered covariance analysis for neural subunit models. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, & R. Garnett (Eds.), Advances in neural information processing systems (pp. 793–801). Red Hook, NY: Curran Associates.
Footnotes
1  AMA–Gauss software (MATLAB) is available at https://www.github.com/BurgeLab/AMA.
Appendix A: Gradient of the likelihood function
In any given training set having N stimuli, each stimulus belongs to some category k and is indexed within that category by l; we denote this pair for the ith stimulus by (ki, li). Then, assuming that the class-conditional response distributions are Gaussian, the likelihood function can be written as  
\begin{equation}{\cal L}\left({\bf f}\right)=\prod_{i=1}^N\left(2\pi\right)^{-{d \over 2}}\left\vert\Sigma_{k_i}\right\vert^{- {1 \over 2}}\exp\left[-{1 \over 2}\left({\bf R}\left(k_i,l_i\right)-{\bf \mu}_{k_i}\right)^T\Sigma_{k_i}^{-1}\left({\bf R}\left(k_i,l_i\right)-{\bf \mu}_{k_i}\right)\right]\end{equation}
Substituting the expression for the noisy responses (Equation 5) and defining \(l({\bf{f}}) = \log {\cal L}({\bf{f}})\) yields the log-likelihood function of the AMA–Gauss filters  
\begin{equation}l\left({\bf{f}}\right) = {\zeta _u} - {1 \over 2}\sum\limits_{i = 1}^N \left[ \log |{{\bf{f}}^T}{B_{{k_i}}}{\bf{f}} + \Lambda | + {\left({{\bf{f}}^T}{{\bf{s}}_{{k_i}{l_i}}} - {{\bf{f}}^T}{{\bf{s}}_{{k_i}}} + \eta \right)^T}{\left({{\bf{f}}^T}{B_{{k_i}}}{\bf{f}} + \Lambda \right)^{ - 1}} \left({{\bf{f}}^T}{{\bf{s}}_{{k_i}{l_i}}} - {{\bf{f}}^T}{{\bf{s}}_{{k_i}}} + \eta \right)\right]\end{equation}
where \({{\bf{s}}_{{k_i}}} = {1 \over {{N_{{k_i}}}}}\sum\nolimits_{{m_i} = 1}^{{N_{{k_i}}}} {{{\bf{s}}_{{k_i}{m_i}}}} \) and \({B_{{k_i}}} = {1 \over {{N_{{k_i}}}}}\sum\nolimits_{{m_i} = 1}^{{N_{{k_i}}}} ( {{\bf{s}}_{{k_i},{m_i}}} - {{\bf{s}}_{{k_i}}}){({{\bf{s}}_{{k_i},{m_i}}} - {{\bf{s}}_{{k_i}}})^T}\) are the class-conditional stimulus mean and covariance matrix, respectively, and \({\zeta _u} = - {1 \over 2}\log |2\pi {\Sigma _u}|\) is a constant.  
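To make the quantities in the log-likelihood above concrete, the following Python sketch builds the class-conditional response statistics for a single category (the mean \({{\bf{f}}^T}{{\bf{s}}_{k}}\) and covariance \({{\bf{f}}^T}{B_{k}}{\bf{f}} + \Lambda \)) and evaluates the Gaussian log-likelihood of a noisy filter response. All dimensions, sample counts, and the noise level are arbitrary illustrative choices, not values from the paper (the published AMA–Gauss implementation is in MATLAB; see Footnote 1).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): d-pixel stimuli, q filters.
d, q = 16, 2

# Fake stimuli standing in for one category k; in practice these would be
# contrast-normalized natural stimuli sharing the same latent-variable value.
S = rng.standard_normal((100, d))
s_bar = S.mean(axis=0)              # class-conditional stimulus mean
B = np.cov(S.T, bias=True)          # class-conditional stimulus covariance

f = rng.standard_normal((d, q))     # filters (columns of f)
Lam = 0.1 * np.eye(q)               # response-noise covariance Lambda

# Class-conditional response statistics implied by the Gaussian assumption.
mu = f.T @ s_bar                    # mean response  f^T s_k
Sig = f.T @ B @ f + Lam             # response covariance  f^T B_k f + Lambda

def log_lik(r, mu, Sig):
    """Gaussian log-likelihood of a noisy response vector r (cf. Appendix A)."""
    dev = r - mu
    _, logdet = np.linalg.slogdet(Sig)
    return -0.5 * (len(r) * np.log(2 * np.pi) + logdet
                   + dev @ np.linalg.solve(Sig, dev))

# A noisy response to the first stimulus, per the additive-noise response model.
r = f.T @ S[0] + rng.multivariate_normal(np.zeros(q), Lam)
print(log_lik(r, mu, Sig))
```

The product of such terms over all N stimuli gives the likelihood of the filters.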
Rearranging to segregate terms that do not depend on noise samples  
\begin{equation}\tag {29}\eqalign{ l\left({\bf{f}}\right) = {\zeta _u} - {1 \over 2}\sum\limits_{i = 1}^N \big[ &\log |{{\bf{f}}^T}{B_{{k_i}}}{\bf{f}} + \Lambda | + {\left({{\bf{s}}_{{k_i}{l_i}}} - {{\bf{s}}_{{k_i}}}\right)^T}{\bf{f}}{\left({{\bf{f}}^T}{B_{{k_i}}}{\bf{f}} + \Lambda \right)^{ - 1}}{{\bf{f}}^T}\left({{\bf{s}}_{{k_i}{l_i}}} - {{\bf{s}}_{{k_i}}}\right) \cr&+ {\eta ^T}{\left({{\bf{f}}^T}{B_{{k_i}}}{\bf{f}} + \Lambda \right)^{ - 1}}{{\bf{f}}^T}\left({{\bf{s}}_{{k_i}{l_i}}} - {{\bf{s}}_{{k_i}}}\right) + {\left({{\bf{s}}_{{k_i}{l_i}}} - {{\bf{s}}_{{k_i}}}\right)^T}{\bf{f}}{\left({{\bf{f}}^T}{B_{{k_i}}}{\bf{f}} + \Lambda \right)^{ - 1}}\eta + {\eta ^T}{\left({{\bf{f}}^T}{B_{{k_i}}}{\bf{f}} + \Lambda \right)^{ - 1}}\eta \big]} \end{equation}
where \({{\bf{f}}^T}B{\bf{f}} + \Lambda \) is a symmetric matrix. Recognizing that each term in Equation 29 is a scalar, and rewriting using the properties that \(Tr(a) = a\) for scalar \(a\), \(Tr(AB) = Tr(BA)\), and \(Tr(A) = Tr({A^T})\) yields  
\begin{equation}\tag {30}\eqalign{ l\left({\bf{f}}\right) = {\zeta _u} - {1 \over 2}\sum\limits_{i = 1}^N \biggr[& \log |{{\bf{f}}^T}{B_{{k_i}}}{\bf{f}} + \Lambda | + Tr\biggr({\left({{\bf{f}}^T}{B_{{k_i}}}{\bf{f}} + \Lambda \right)^{ - 1}}{{\bf{f}}^T} \left({{\bf{s}}_{{k_i}{l_i}}} - {{\bf{s}}_{{k_i}}}\right) {\left({{\bf{s}}_{{k_i}{l_i}}} - {{\bf{s}}_{{k_i}}}\right)^T}{\bf{f}}\biggr) \cr&+ 2Tr\left({\left({{\bf{s}}_{{k_i}{l_i}}} - {{\bf{s}}_{{k_i}}}\right)^T}{\bf{f}}{\left({{\bf{f}}^T}{B_{{k_i}}}{\bf{f}} + \Lambda \right)^{ - 1}}\eta \right) + Tr\left({\left({{\bf{f}}^T}{B_{{k_i}}}{\bf{f}} + \Lambda \right)^{ - 1}}\eta {\eta ^T}\right)\biggr] }\end{equation}
To determine the gradient of the log-likelihood \({\nabla _{\bf{f}}}l({\bf{f}})\), we derive the gradient of each term in Equation 30 separately as follows. Before doing so, we state some standard matrix results that will be used in the derivation (Petersen & Pedersen, 2008).  
\begin{equation}\tag{31}{{\partial \log \left(det\left({\bf{X}}\right)\right)} \over {\partial {\bf{X}}}} = {\left({{\bf{X}}^T}\right)^{ - 1}}\end{equation}
 
\begin{equation}\tag{32}{{\partial \over {\partial {\bf{X}}}}Tr\left({\left(A + {{\bf{X}}^T}{\bf{CX}}\right)^{ - 1}}{{\bf{X}}^T}{\bf{BX}}\right) = - 2{\bf{CX}}{\left(A + {{\bf{X}}^T}{\bf{CX}}\right)^{ - 1}}{{\bf{X}}^T}{\bf{BX}}{\left(A + {{\bf{X}}^T}{\bf{CX}}\right)^{ - 1}} + 2{\bf{BX}}{\left(A + {{\bf{X}}^T}{\bf{CX}}\right)^{ - 1}} } \end{equation}
 
\begin{equation}\tag{33}{\partial \over {\partial {\bf{X}}}}Tr{\left({{\bf{X}}^T}{\bf{CX}}\right)^{ - 1}}{\bf{A}} = - {\bf{CX}}{\left({{\bf{X}}^T}{\bf{CX}}\right)^{ - 1}}\left({\bf{A}} + {{\bf{A}}^T}\right){\left({{\bf{X}}^T}{\bf{CX}}\right)^{ - 1}}\end{equation}
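Identities of this kind are easy to check numerically. The Python sketch below verifies Equation 31 by central finite differences on a randomly generated, well-conditioned positive-definite matrix (the matrix is illustrative; any invertible matrix with positive determinant would do):

```python
import numpy as np

rng = np.random.default_rng(1)

# Check Equation 31, d log det(X) / dX = (X^T)^{-1}, by finite differences.
n = 4
A = rng.standard_normal((n, n))
X = A @ A.T + n * np.eye(n)         # well-conditioned, positive-definite X

analytic = np.linalg.inv(X.T)

eps = 1e-6
numeric = np.zeros_like(X)
for i in range(n):
    for j in range(n):
        Xp, Xm = X.copy(), X.copy()
        Xp[i, j] += eps
        Xm[i, j] -= eps
        numeric[i, j] = (np.log(np.linalg.det(Xp))
                         - np.log(np.linalg.det(Xm))) / (2 * eps)

print(np.max(np.abs(numeric - analytic)))   # small finite-difference error
```

The same finite-difference strategy can be used to check Equations 32 and 33.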
The gradient of the first term in Equation 30 is obtained by using Equation 31 and the chain rule of differentiation  
\begin{equation}\tag {34}\eqalign{&{\nabla _{\bf{f}}}\log |\overbrace {{{\bf{f}}^T}{B_{{k_i}}}{\bf{f}} + \Lambda }^{\bf{Y}}|\; = {{\partial \log |{\bf{Y}}|} \over {\partial {\bf{Y}}}}{{\partial \left(\overbrace {{{\bf{f}}^T}{B_{{k_i}}}{\bf{f}} + \Lambda }^{\bf{Y}}\right)} \over {\partial {\bf{f}}}} \cr&{\nabla _{\bf{f}}}\log |{{\bf{f}}^T}{B_{{k_i}}}{\bf{f}} + \Lambda |\; = 2{B_{{k_i}}}{\bf{f}}{\left({{\bf{f}}^T}{B_{{k_i}}}{\bf{f}} + \Lambda \right)^{ - 1}} \cr} \end{equation}
 
The gradient of the second term in Equation 30 is obtained using Equation 32  
\begin{equation}\tag{35} {{\nabla _{\bf{f}}}Tr\left({\left({{\bf{f}}^T}{B_{{k_i}}}{\bf{f}} + \Lambda \right)^{ - 1}}{{\bf{f}}^T}\left({{\bf{s}}_{{k_i}{l_i}}} - {{\bf{s}}_{{k_i}}}\right){\left({{\bf{s}}_{{k_i}{l_i}}} - {{\bf{s}}_{{k_i}}}\right)^T}{\bf{f}}\right) = 2\left({{\bf{s}}_{{k_i}{l_i}}} - {{\bf{s}}_{{k_i}}}\right){\left({{\bf{s}}_{{k_i}{l_i}}} - {{\bf{s}}_{{k_i}}}\right)^T}{\bf{f}}{\left({{\bf{f}}^T}{B_{{k_i}}}{\bf{f}} + \Lambda \right)^{ - 1}} - 2{B_{{k_i}}}{\bf{f}}{\left({{\bf{f}}^T}{B_{{k_i}}}{\bf{f}} + \Lambda \right)^{ - 1}}{{\bf{f}}^T}\left({{\bf{s}}_{{k_i}{l_i}}} - {{\bf{s}}_{{k_i}}}\right){\left({{\bf{s}}_{{k_i}{l_i}}} - {{\bf{s}}_{{k_i}}}\right)^T}{\bf{f}}{\left({{\bf{f}}^T}{B_{{k_i}}}{\bf{f}} + \Lambda \right)^{ - 1}}} \end{equation}
The gradient of the third term is obtained using Equation 33 and the chain rule of differentiation  
\begin{equation}\tag{36}{\nabla _{\bf{f}}}2Tr\left({\left({{\bf{s}}_{{k_i}{l_i}}} - {{\bf{s}}_{{k_i}}}\right)^T}{\bf{f}}{\left({{\bf{f}}^T}{B_{{k_i}}}{\bf{f}} + \Lambda \right)^{ - 1}}\eta \right) = 2\left({{\bf{s}}_{{k_i}{l_i}}} - {{\bf{s}}_{{k_i}}}\right){\eta ^T}{\left({{\bf{f}}^T}{B_{{k_i}}}{\bf{f}} + \Lambda \right)^{ - 1}} - 2{B_{{k_i}}}{\bf{f}}{\left({{\bf{f}}^T}{B_{{k_i}}}{\bf{f}} + \Lambda \right)^{ - 1}}\left(\eta {\left({{\bf{s}}_{{k_i}{l_i}}} - {{\bf{s}}_{{k_i}}}\right)^T}{\bf{f}} + {{\bf{f}}^T}\left({{\bf{s}}_{{k_i}{l_i}}} - {{\bf{s}}_{{k_i}}}\right){\eta ^T}\right){\left({{\bf{f}}^T}{B_{{k_i}}}{\bf{f}} + \Lambda \right)^{ - 1}} \end{equation}
The gradient of the fourth term is similarly obtained using Equation 33  
\begin{equation}\tag{37}{\nabla _{\bf{f}}}Tr\left({\left({{\bf{f}}^T}{B_{{k_i}}}{\bf{f}} + \Lambda \right)^{ - 1}}\eta {\eta ^T}\right) = - 2{B_{{k_i}}}{\bf{f}}{\left({{\bf{f}}^T}{B_{{k_i}}}{\bf{f}} + \Lambda \right)^{ - 1}}\left(\eta {\eta ^T}\right){\left({{\bf{f}}^T}{B_{{k_i}}}{\bf{f}} + \Lambda \right)^{ - 1}}\end{equation}
The full gradient of the AMA–Gauss filter log-likelihood l(f) stated in Equation 30 can therefore be found by combining Equations 34–37.  
The gradient of the expected log-likelihood follows directly from the gradient of the log-likelihood. The response noise \(\eta \sim {\cal N}({\bf{0}},\Lambda )\) is normally distributed (Equation 6); therefore, \({{\Bbb E}_\eta }[{\eta ^T}{({{\bf{f}}^T}{B_{{k_i}}}{\bf{f}} + \Lambda )^{ - 1}}\eta ] = Tr({({{\bf{f}}^T}{B_{{k_i}}}{\bf{f}} + \Lambda )^{ - 1}}\Lambda )\). Substituting into Equation 30 yields the expected log-likelihood of the AMA–Gauss filters  
\begin{equation}\tag{38}{{\Bbb E}_\eta }\left[l\left({\bf{f}}\right)\right] = {\zeta _u} - {1 \over 2}\sum\limits_{i = 1}^N \left[ \log |{{\bf{f}}^T}{B_{{k_i}}}{\bf{f}} + \Lambda | + {\left({{\bf{s}}_{{k_i}{l_i}}} - {{\bf{s}}_{{k_i}}}\right)^T}{\bf{f}}{\left({{\bf{f}}^T}{B_{{k_i}}}{\bf{f}} + \Lambda \right)^{ - 1}}{{\bf{f}}^T}\left({{\bf{s}}_{{k_i}{l_i}}} - {{\bf{s}}_{{k_i}}}\right) + Tr\left({\left({{\bf{f}}^T}{B_{{k_i}}}{\bf{f}} + \Lambda \right)^{ - 1}}\Lambda \right)\right]\end{equation}
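The expectation identity used in this step, \({{\Bbb E}_\eta }[{\eta ^T}{M^{ - 1}}\eta ] = Tr({M^{ - 1}}\Lambda )\) for \(\eta \sim {\cal N}({\bf{0}},\Lambda )\) and any fixed matrix \(M\), can be checked by Monte Carlo simulation. The matrices below are illustrative stand-ins, not quantities computed from stimuli:

```python
import numpy as np

rng = np.random.default_rng(2)

# Monte Carlo check of E_eta[eta^T M^{-1} eta] = Tr(M^{-1} Lambda)
# for eta ~ N(0, Lambda); M stands in for the response covariance.
q = 3
A = rng.standard_normal((q, q))
M = A @ A.T + q * np.eye(q)                 # symmetric positive-definite
Lam = np.diag([0.5, 1.0, 1.5])              # noise covariance Lambda

Minv = np.linalg.inv(M)
eta = rng.multivariate_normal(np.zeros(q), Lam, size=200_000)
mc = np.einsum('ni,ij,nj->n', eta, Minv, eta).mean()   # sample average
exact = np.trace(Minv @ Lam)
print(mc, exact)                            # the two should nearly agree
```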
The gradient of the expected log-likelihood, using Equations 34, 35, and 37, is given by  
\begin{equation}\tag{39}{\nabla _{\bf{f}}}{{\Bbb E}_\eta }\left[ {l\left( {\bf{f}} \right)} \right] = - \sum\limits_{i = 1}^N {\bigg[ {{B_{{k_i}}}{\bf{f}}{{\left( {{{\bf{f}}^T}{B_{{k_i}}}{\bf{f}} + \Lambda } \right)}^{ - 1}} - {B_{{k_i}}}{\bf{f}}{{\left( {{{\bf{f}}^T}{B_{{k_i}}}{\bf{f}} + \Lambda } \right)}^{ - 1}}{{\bf{f}}^T}\left( {{{\bf{s}}_{{k_i}{l_i}}} - {{\bf{s}}_{{k_i}}}} \right){{\left( {{{\bf{s}}_{{k_i}{l_i}}} - {{\bf{s}}_{{k_i}}}} \right)}^T}{\bf{f}}{{\left( {{{\bf{f}}^T}{B_{{k_i}}}{\bf{f}} + \Lambda } \right)}^{ - 1}} + \left( {{{\bf{s}}_{{k_i}{l_i}}} - {{\bf{s}}_{{k_i}}}} \right){{\left( {{{\bf{s}}_{{k_i}{l_i}}} - {{\bf{s}}_{{k_i}}}} \right)}^T}{\bf{f}}{{\left( {{{\bf{f}}^T}{B_{{k_i}}}{\bf{f}} + \Lambda } \right)}^{ - 1}} - {1 \over 2}{B_{{k_i}}}{\bf{f}}{{\left( {{{\bf{f}}^T}{B_{{k_i}}}{\bf{f}} + \Lambda } \right)}^{ - 1}}\left( {\Lambda + {\Lambda ^T}} \right){{\left( {{{\bf{f}}^T}{B_{{k_i}}}{\bf{f}} + \Lambda } \right)}^{ - 1}}} \bigg]} \end{equation}
 
Appendix B: Gradient of L2 cost function
The average expected cost across all the stimuli is  
\begin{equation}\tag{40}\bar C = {1 \over N}\sum\limits_{k,l} {{{\bar C}_{kl}}} \end{equation}
Given the squared-error loss function, the expected cost per stimulus can be written as  
\begin{equation}\tag{41}{\bar C_{kl}} = {{\Bbb E}_{{\bf{R}}\left(k,l\right)}}\left[{\left(\hat X_{kl}^{opt} - {X_k}\right)^2}\right]\end{equation}
where \(\hat X_{kl}^{{\rm{opt}}} = \sum\nolimits_{u = 1}^{{N_{lvl}}} {{X_u}} {\cal P}({X_u}|{\bf{R}}(k,l))\), because the optimal estimate under the squared-error loss is the mean of the posterior, i.e., \({\Bbb E}[{X_u}|{\bf{R}}(k,l)]\). Using the approximation that the expected cost of each stimulus is equal to the cost given the expected response (Geisler et al., 2009) yields  
\begin{equation}\tag{42}{\bar C_{kl}} \cong {\left(\sum\limits_{u = 1}^{{N_{lvl}}} {{X_u}} {\cal P}\left({X_u}|{\bf{r}}\left(k,l\right)\right) - {X_k}\right)^2}\end{equation}
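The posterior-mean readout in Equation 42 can be sketched for a hypothetical one-filter (scalar response) observer. The category levels, class-conditional response means, and variances below are hand-picked illustrative numbers, not statistics estimated from natural stimuli:

```python
import numpy as np

# Posterior-mean readout (cf. Equation 42) for a toy one-filter example.
X = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])       # N_lvl category values X_u
mu = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])      # class-conditional means
var = np.full(5, 0.3)                           # class-conditional variances

def posterior(r):
    """P(X_u | r) via Bayes' rule with a flat prior over categories."""
    loglik = -0.5 * (r - mu) ** 2 / var - 0.5 * np.log(2 * np.pi * var)
    w = np.exp(loglik - loglik.max())           # subtract max for stability
    return w / w.sum()

def estimate(r):
    """Optimal (posterior mean) estimate for the squared-error loss."""
    return posterior(r) @ X

print(estimate(0.5))    # pulled toward categories whose means are near r
```

By symmetry of these illustrative statistics, a response of zero yields an estimate of zero.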
Therefore, to evaluate the gradient of the total cost, we need only evaluate the gradient of the expected cost of each stimulus. Hence,  
\begin{equation}\tag{43}{ {\nabla _{{{\bf{f}}_q}}}{{\bar C}_{kl}} = {\nabla _{{{\bf{f}}_q}}}{\left(\hat X_{kl}^{opt} - {X_k}\right)^2} = 2\left(\hat X_{kl}^{opt} - {X_k}\right){\nabla _{{{\bf{f}}_q}}}\hat X_{kl}^{opt} } \end{equation}
The gradient of the optimal estimate given the mean response is  
\begin{equation}\tag{44}{\nabla _{{{\bf{f}}_q}}}\hat X_{kl}^{opt} = \sum\limits_{u = 1}^{{N_{lvl}}} {{X_u}} \left[{\nabla _{{{\bf{f}}_q}}}{\cal P}\left({X_u}|{\bf{r}}\left(k,l\right)\right)\right]\end{equation}
Hence, the problem reduces to finding \({\nabla _{{{\bf{f}}_q}}}{\cal P}({X_u}|{\bf{r}}(k,l))\), where the posterior probability is given by  
\begin{equation}\tag{45}{\cal P}\left({X_u}|{\bf{r}}\left(k,l\right)\right) = {{{\cal N}\left({\bf{r}}\left(k,l\right);{\mu _u},{\Sigma _u}\right)} \over {\sum\limits_{i = 1}^{{N_{lvl}}} {\cal N} \left({\bf{r}}\left(k,l\right);{\mu _i},{\Sigma _i}\right)}}\end{equation}
Making substitutions in Equation 45 gives  
\begin{equation}\tag{46}{\cal P}\left({X_u}|{\bf{r}}\left(k,l\right)\right){\rm{\ }} = {{|{\Sigma _u}{|^{ - 0.5}}\exp\left[ { - 0.5{{\left({\bf{r}}\left(k,l\right) - {\mu _u}\right)}^T}\Sigma _u^{ - 1}\left({\bf{r}}\left(k,l\right) - {\mu _u}\right)} \right]} \over {\sum\nolimits_{i = 1}^{{N_{lvl}}} | \;{\Sigma _i}{|^{ - 0.5}}\exp\left[ { - 0.5{{\left({\bf{r}}\left(k,l\right) - {\mu _i}\right)}^T}\Sigma _i^{ - 1}\left({\bf{r}}\left(k,l\right) - {\mu _i}\right)} \right]}}\end{equation}
 
\begin{equation}\tag{47} = {{|{{\bf{f}}^T}{B_u}{\bf{f}} + \Lambda {|^{ - 0.5}}\exp\left[ { - 0.5{\bf{A}}_{kl,u}^T{\bf{f}}{{\left({{\bf{f}}^T}{B_u}{\bf{f}} + \Lambda \right)}^{ - 1}}{{\bf{f}}^T}{{\bf{A}}_{kl,u}}} \right]} \over {\sum\nolimits_{i = 1}^{{N_{lvl}}} | \,{{\bf{f}}^T}{B_i}{\bf{f}} + \Lambda {|^{ - 0.5}}\exp\left[ { - 0.5{\bf{A}}_{kl,i}^T{\bf{f}}{{\left({{\bf{f}}^T}{B_i}{\bf{f}} + \Lambda \right)}^{ - 1}}{{\bf{f}}^T}{{\bf{A}}_{kl,i}}} \right]}}\end{equation}
where Akl,u = sklsu. The gradient of the posterior probability can then be evaluated using the following relation with the gradient of the logarithm of the posterior probability  
\begin{equation}\tag{48}{\nabla _{{{\bf{f}}_q}}}{\cal P}\left({X_u}|{\bf{r}}\left(k,l\right)\right) = {\cal P}\left({X_u}|{\bf{r}}\left(k,l\right)\right)\left[{\nabla _{{{\bf{f}}_q}}}\log {\cal P}\left({X_u}|{\bf{r}}\left(k,l\right)\right)\right]\end{equation}
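Equation 48 is the standard log-derivative identity. A quick finite-difference check on a toy scalar posterior (a ratio of two Gaussians in a scalar variable, chosen purely for illustration) confirms it:

```python
import numpy as np

# Finite-difference check of Equation 48: dP/df = P(f) * d log P(f) / df.
def P(f):
    """Toy scalar posterior: ratio of two Gaussian likelihoods (illustrative)."""
    a = np.exp(-0.5 * f ** 2)
    b = np.exp(-0.5 * (f - 1.0) ** 2)
    return a / (a + b)

f0, eps = 0.3, 1e-6
dP = (P(f0 + eps) - P(f0 - eps)) / (2 * eps)                       # dP/df
dlogP = (np.log(P(f0 + eps)) - np.log(P(f0 - eps))) / (2 * eps)    # dlogP/df
print(dP, P(f0) * dlogP)    # the two should match to finite-difference error
```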
 
Taking the natural logarithm of the posterior yields  
\begin{equation}\tag{49}\log {\cal P}\left({X_u}|{\bf{r}}\left(k,l\right)\right) = - \log \sum\limits_{i = 1}^{{N_{lvl}}} {{{|{{\bf{f}}^T}{B_u}{\bf{f}} + \Lambda {|^{0.5}}} \over {|{{\bf{f}}^T}{B_i}{\bf{f}} + \Lambda {|^{0.5}}}}} \exp\bigg\lbrack{1 \over 2}\left({\bf{A}}_{kl,u}^T{\bf{f}}{\left({{\bf{f}}^T}{B_u}{\bf{f}} + \Lambda \right)^{ - 1}}{{\bf{f}}^T}{{\bf{A}}_{kl,u}} - {\bf{A}}_{kl,i}^T{\bf{f}}{\left({{\bf{f}}^T}{B_i}{\bf{f}} + \Lambda \right)^{ - 1}}{{\bf{f}}^T}{{\bf{A}}_{kl,i}}\right)\bigg\rbrack\end{equation}
Next, we define new variables to simplify this expression for the log posterior probability and the subsequent derivation of its gradient. Let each term in the summation in Equation 49 be  
\begin{equation}\tag{50}{Z_i}\left(u,k,l,{\bf{f}}\right) = {T_i}\left(u,k,l,{\bf{f}}\right)\exp\lbrack{1/2}{U_i}\left(u,k,l,{\bf{f}}\right)\rbrack\end{equation}
where \({T_i}(u,k,l,{\bf{f}}) = {{|{{\bf{f}}^T}{B_u}{\bf{f}} + \Lambda {|^{0.5}}} \over {|{{\bf{f}}^T}{B_i}{\bf{f}} + \Lambda {|^{0.5}}}}\) is the scale factor of each term in the summation in Equation 49 and \({U_i}(u,k,l,{\bf{f}}) = {\bf{A}}_{kl,u}^T{\bf{f}}{({{\bf{f}}^T}{B_u}{\bf{f}} + \Lambda )^{ - 1}}{{\bf{f}}^T}{{\bf{A}}_{kl,u}} - {\bf{A}}_{kl,i}^T{\bf{f}}{({{\bf{f}}^T}{B_i}{\bf{f}} + \Lambda )^{ - 1}}{{\bf{f}}^T}{{\bf{A}}_{kl,i}}\) is the exponentiated quantity in each term. Hence, substituting Equation 50 into Equation 49 yields the simplified expression for the log posterior  
\begin{equation}\tag{51}\log {\cal P}\left({X_u}|{\bf{r}}\left(k,l\right)\right) = - \log \sum\limits_{i = 1}^{{N_{lvl}}} {{Z_i}} \left(u,k,l,{\bf{f}}\right)\end{equation}
The gradient of the log posterior probability can therefore be expressed as  
\begin{equation}\tag{52}{\nabla _{\bf{f}}}\log {\cal P}\left({X_u}|{\bf{r}}\left(k,l\right)\right) = {\nabla _{\bf{f}}}\left( - \log \sum\limits_{i = 1}^{{N_{lvl}}} {{Z_i}} \left(u,k,l,{\bf{f}}\right)\right)\end{equation}
 
Applying the chain rule to the logarithm yields  
\begin{equation}\tag{53}{\nabla _{\bf{f}}}\log {\cal P}\left({X_u}|{\bf{r}}\left(k,l\right)\right) = - {{\sum\nolimits_{i = 1}^{{N_{lvl}}} {{\nabla _{\bf{f}}}} {Z_i}\left(u,k,l,{\bf{f}}\right)} \over {\sum\nolimits_{i = 1}^{{N_{lvl}}} {{Z_i}} \left(u,k,l,{\bf{f}}\right)}}\end{equation}
Expanding the numerator by substituting Equation 50 and applying the product rule of differentiation gives  
\begin{equation}\tag{54}{\nabla _{\bf{f}}}\log {\cal P}\left({X_u}|{\bf{r}}\left(k,l\right)\right) = - {1 \over {\sum\nolimits_{v = 1}^{{N_{lvl}}} {{Z_v}} \left(u,k,l,{\bf{f}}\right)}}\sum\limits_{i = 1}^{{N_{lvl}}} \bigg( \exp\lbrack{1/2}{U_i}\left(u,k,l,{\bf{f}}\right)\rbrack{\nabla _{\bf{f}}}{T_i}\left(u,k,l,{\bf{f}}\right) + {1 \over 2}{T_i}\left(u,k,l,{\bf{f}}\right)\exp\lbrack{1/2}{U_i}\left(u,k,l,{\bf{f}}\right)\rbrack{\nabla _{\bf{f}}}{U_i}\left(u,k,l,{\bf{f}}\right)\bigg)\end{equation}
The remaining terms to be evaluated are Display Formula
\({\nabla _{\bf{f}}}{T_i}(u,k,l,{\bf{f}})\)
andDisplay Formula
\({\nabla _{\bf{f}}}{U_i}(u,k,l,{\bf{f}})\)
.  
The expression for Display Formula
\({\nabla _{\bf{f}}}{T_i}(u,k,l,{\bf{f}})\)
is  
\begin{equation}\tag{55}{\nabla _{\bf{f}}}{T_i}\left(u,k,l,{\bf{f}}\right) = {\nabla _{\bf{f}}}{{|{{\bf{f}}^T}{B_u}{\bf{f}} + \Lambda {|^{0.5}}} \over {|{{\bf{f}}^T}{B_i}{\bf{f}} + \Lambda {|^{0.5}}}} = {{|{{\bf{f}}^T}{B_u}{\bf{f}} + \Lambda {|^{0.5}}|{{\bf{f}}^T}{B_i}{\bf{f}} + \Lambda {|^{0.5}}\left({B_u}{\bf{f}}{{\left({{\bf{f}}^T}{B_u}{\bf{f}} + \Lambda \right)}^{ - 1}} - {B_i}{\bf{f}}{{\left({{\bf{f}}^T}{B_i}{\bf{f}} + \Lambda \right)}^{ - 1}}\right)} \over {|{{\bf{f}}^T}{B_i}{\bf{f}} + \Lambda |}} = {{|{{\bf{f}}^T}{B_u}{\bf{f}} + \Lambda {|^{0.5}}} \over {|{{\bf{f}}^T}{B_i}{\bf{f}} + \Lambda {|^{0.5}}}}\left({B_u}{\bf{f}}{\left({{\bf{f}}^T}{B_u}{\bf{f}} + \Lambda \right)^{ - 1}} - {B_i}{\bf{f}}{\left({{\bf{f}}^T}{B_i}{\bf{f}} + \Lambda \right)^{ - 1}}\right)\end{equation}
The expression for \({\nabla _{\bf{f}}}{U_i}(u,k,l,{\bf{f}})\) is obtained by first rewriting \({U_i}(u,k,l,{\bf{f}})\) in terms of traces  
\begin{equation}\tag {56}\eqalign{ {U_i}\left( {u,k,l,{\bf{f}}} \right) &= {\rm{Tr}}\left( {{\bf{A}}_{kl,u}^T{\bf{f}}{{\left( {{{\bf{f}}^T}{B_u}{\bf{f}} + \Lambda } \right)}^{ - 1}}{{\bf{f}}^T}{{\bf{A}}_{kl,u}} - {\bf{A}}_{kl,i}^T{\bf{f}}{{\left( {{{\bf{f}}^T}{B_i}{\bf{f}} + \Lambda } \right)}^{ - 1}}{{\bf{f}}^T}{{\bf{A}}_{kl,i}}} \right) \cr&= {\rm{Tr}}\left( {{\bf{A}}_{kl,u}^T{\bf{f}}{{\left( {{{\bf{f}}^T}{B_u}{\bf{f}} + \Lambda } \right)}^{ - 1}}{{\bf{f}}^T}{{\bf{A}}_{kl,u}}} \right) - {\rm{Tr}}\left( {{\bf{A}}_{kl,i}^T{\bf{f}}{{\left( {{{\bf{f}}^T}{B_i}{\bf{f}} + \Lambda } \right)}^{ - 1}}{{\bf{f}}^T}{{\bf{A}}_{kl,i}}} \right) \cr&= {\rm{Tr}}\left( {{{\left( {{{\bf{f}}^T}{B_u}{\bf{f}} + \Lambda } \right)}^{ - 1}}{{\bf{f}}^T}{{\bf{A}}_{kl,u}}{\bf{A}}_{kl,u}^T{\bf{f}}} \right) - {\rm{Tr}}\left( {{{\left( {{{\bf{f}}^T}{B_i}{\bf{f}} + \Lambda } \right)}^{ - 1}}{{\bf{f}}^T}{{\bf{A}}_{kl,i}}{\bf{A}}_{kl,i}^T{\bf{f}}} \right) \cr&= {\rm{Tr}}\left( {{{\left( {{{\bf{f}}^T}{B_u}{\bf{f}} + \Lambda } \right)}^{ - 1}}{{\bf{f}}^T}{D_{kl,u}}{\bf{f}}} \right) - {\rm{Tr}}\left( {{{\left( {{{\bf{f}}^T}{B_i}{\bf{f}} + \Lambda } \right)}^{ - 1}}{{\bf{f}}^T}{D_{kl,i}}{\bf{f}}} \right) \cr {\nabla _{\bf{f}}}{U_i}\left( {u,k,l,{\bf{f}}} \right) &= {\nabla _{\bf{f}}}{\rm{Tr}}\left( {{{\left( {{{\bf{f}}^T}{B_u}{\bf{f}} + \Lambda } \right)}^{ - 1}}{{\bf{f}}^T}{D_{kl,u}}{\bf{f}}} \right) - {\nabla _{\bf{f}}}{\rm{Tr}}\left( {{{\left( {{{\bf{f}}^T}{B_i}{\bf{f}} + \Lambda } \right)}^{ - 1}}{{\bf{f}}^T}{D_{kl,i}}{\bf{f}}} \right) }\end{equation}
where \({D_{kl,u}} = {{\bf{A}}_{kl,u}}{\bf{A}}_{kl,u}^T\). The expression for the gradient of the trace in Equation 56 is obtained by using Equation 32. Thus,  
\begin{equation}\tag{57}{\nabla _{\bf{f}}}Tr({({{\bf{f}}^T}{B_u}{\bf{f}} + \Lambda )^{ - 1}}{{\bf{f}}^T}{D_{kl,u}}{\bf{f}}) = - 2{B_u}{\bf{f}}{({{\bf{f}}^T}{B_u}{\bf{f}} + \Lambda )^{ - 1}}{{\bf{f}}^T}{D_{kl,u}}{\bf{f}}{({{\bf{f}}^T}{B_u}{\bf{f}} + \Lambda )^{ - 1}} + 2{D_{kl,u}}{\bf{f}}{({{\bf{f}}^T}{B_u}{\bf{f}} + \Lambda )^{ - 1}}\end{equation}
 
The gradient \({\nabla _{\bf{f}}}{U_i}(u,k,l,{\bf{f}})\) is obtained by substituting Equation 57 into Equation 56. The gradient of \(\log {\cal P}({X_u}|{\bf{r}}(k,l))\) is obtained by substituting Equation 55 and Equation 56 into Equation 54. The gradient of the posterior probability is obtained by plugging Equation 54 into Equation 48. The gradient of the cost for each stimulus is obtained by plugging Equation 48 into Equation 44, and then plugging that result into Equation 43. 
Appendix C: AMA–Gauss gradient with L0 / KL-divergence cost function
The total cost for a set of filters is given by the average expected cost across all stimuli  
\begin{equation}\tag{58}\bar C = {1 \over N}\sum\limits_{k,l}^N {{{\Bbb E}_{{\bf{R}}(k,l)}}} \left[{C_{kl}}\right]\end{equation}
Given the 0,1 cost function, the cost associated with the filter responses to an arbitrary stimulus is given by \({C_{kl}} = 1 - {\cal P}\left({X_k}|{\bf{R}}\left(k,l\right)\right)\). Here, we instead use the negative log posterior probability of the correct category, which is monotonically related to the 0,1 cost and to the KL-divergence; we refer to this cost as the KL-cost.  
\begin{equation}\tag{59}{C_{kl}} = - \log {\cal P}\left({X_k}|{\bf{R}}\left(k,l\right)\right)\end{equation}
We approximate the expected cost associated with each stimulus with the cost given the mean response (Geisler et al., 2009). Thus, we have  
\begin{equation}\tag{60}{{\Bbb E}_{{\bf{R}}\left(k,l\right)}}\left[{C_{kl}}\right] = - \int_{ - \infty }^\infty {\log } {\cal P}\left({X_k}|{\bf{R}}\left(k,l\right)\right){\cal P}\left({\bf{R}}\left(k,l\right)|{{\bf{s}}_{kl}}\right)d{\bf{R}}\left(k,l\right)\end{equation}
 
\begin{equation}\tag{61} \cong - \log {\cal P}\left({X_k}|{\bf{r}}\left(k,l\right)\right)\end{equation}
Therefore, the total cost for a set of filters is given by  
\begin{equation}\tag{62}\bar C = - {1 \over N}\sum\limits_{k,l}^N {\log } {\cal P}\left({X_k}|{\bf{r}}\left(k,l\right)\right)\end{equation}
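To make the two costs concrete, the following sketch (illustrative NumPy; the posterior values are simulated placeholders, not derived from real filter responses) computes the average 0,1 cost and the average KL-cost from the posterior probability assigned to the correct latent variable value:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup: for each stimulus (k, l), the decoder returns a posterior over
# latent variable values; p_correct[k, l] is the posterior mass on the true
# value X_k. Here we simply simulate plausible posterior values.
p_correct = rng.uniform(0.2, 0.99, size=(5, 20))  # 5 values, 20 stimuli each
N = p_correct.size

# 0,1 cost: average of 1 - P(X_k | r)
cost_01 = np.mean(1.0 - p_correct)

# KL-cost (Equations 59-62): average negative log posterior of the true value
cost_kl = -np.mean(np.log(p_correct))

# Both costs shrink as posterior mass on the correct value grows; the KL-cost
# upper-bounds the 0,1 cost because -log(p) >= 1 - p for p in (0, 1]
assert cost_kl >= cost_01
```

Both costs are minimized by placing all posterior mass on the correct latent variable value; the KL-cost additionally penalizes confident errors much more heavily.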
The gradient of the total expected cost \(\bar C\) can then be written as  
\begin{equation}\tag{63}{\nabla _{\bf{f}}}\bar C = - {1 \over N}\sum\limits_{k,l}^N {{\nabla _{{{\bf{f}}_q}}}} \left[\log {\cal P}\left({X_k}|{\bf{r}}\left(k,l\right)\right)\right]\end{equation}
The full expression for the gradient of the expected cost \(\bar C\) is obtained by substituting the expression for \({\nabla _{{{\bf{f}}_q}}}[\log {\cal P}({X_k}|{\bf{r}}(k,l))]\) given by Equations 54, 55, and 56 in Appendix B.  
Appendix D: AMA vs. AMA–Gauss: A simulated example yielding discrepant results
Here, we show that AMA and AMA–Gauss are not guaranteed to give equivalent results, and therefore that the similarity of the results presented in the main text is not a foregone conclusion. We show that AMA learns the correct filters, and that AMA–Gauss does not, when the class-conditional response distributions are non-Gaussian. We simulate stimuli s with three stimulus dimensions from each of two categories X1 and X2. (For comparison, in the main text the speed and disparity stimuli had 256 and 64 stimulus dimensions, respectively.) The simulated stimuli are shown in Figure A1A–C. The first two dimensions of the simulated stimuli contain the information for discriminating the categories; the third stimulus dimension is useless. Specifically, the stimulus distributions are given by  
\begin{equation}\tag{64}{\cal P}\left({\bf{s}}|{X_1}\right) = {1 \over 2}{\cal N}\left({\bf{s}};{\bf{0}},{\Sigma _A}\right) + {1 \over 2}{\cal N}\left({\bf{s}};{\bf{0}},{\Sigma _B}\right)\end{equation}
 
\begin{equation}\tag{65}{\cal P}\left({\bf{s}}|{X_2}\right) = {1 \over 2}{\cal N}\left({\bf{s}};{\bf{0}},R{\Sigma _A}{R^T}\right) + {1 \over 2}{\cal N}\left({\bf{s}};{\bf{0}},R{\Sigma _B}{R^T}\right)\end{equation}
 
\begin{equation}{\rm{where\ }}\qquad {\Sigma _A} = \left[ {\matrix{ {{V_{large}}}&0&0 \cr 0&{{V_{small}}}&0 \cr 0&0&1 \cr } } \right],\quad {\Sigma _B} = \left[ {\matrix{ {{V_{small}}}&0&0 \cr 0&{{V_{large}}}&0 \cr 0&0&1 \cr } } \right]\end{equation}
and R is a 45° rotation matrix that operates on the first two dimensions. Stimuli in both categories are therefore distributed as mixtures of Gaussians with identical first- and second-order stimulus statistics (i.e., same mean and covariance). Thus, all information for discriminating the categories exists in higher-order statistical moments of the first two stimulus dimensions. Hence, because AMA–Gauss is sensitive only to class-conditional mean and covariance differences, it will be blind to the stimulus differences that define the categories.  
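The construction in Equations 64 and 65 can be sketched directly. The code below (illustrative NumPy; the particular variance values are arbitrary choices) samples both categories and confirms that their first- and second-order statistics match:

```python
import numpy as np

rng = np.random.default_rng(2)
V_large, V_small, n = 3.0, 0.1, 100_000

Sigma_A = np.diag([V_large, V_small, 1.0])
Sigma_B = np.diag([V_small, V_large, 1.0])

# 45-degree rotation acting on the first two stimulus dimensions only
c, s = np.cos(np.pi / 4), np.sin(np.pi / 4)
R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def sample_mixture(S1, S2, n):
    # Equal-weight mixture of two zero-mean Gaussians (Equations 64-65)
    z = rng.random(n) < 0.5
    out = np.empty((n, 3))
    out[z] = rng.multivariate_normal(np.zeros(3), S1, z.sum())
    out[~z] = rng.multivariate_normal(np.zeros(3), S2, (~z).sum())
    return out

s1 = sample_mixture(Sigma_A, Sigma_B, n)                      # category X1
s2 = sample_mixture(R @ Sigma_A @ R.T, R @ Sigma_B @ R.T, n)  # category X2

# Both categories share the same mean (zero) and covariance, because
# 0.5*(Sigma_A + Sigma_B) is isotropic in the first two dimensions and is
# therefore unchanged by the rotation R.
assert np.allclose(s1.mean(0), s2.mean(0), atol=0.05)
assert np.allclose(np.cov(s1.T), np.cov(s2.T), atol=0.1)
```

Only higher-order moments of the first two dimensions differ between the two samples, which is exactly the structure that AMA–Gauss cannot exploit.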
Figure A1
 
AMA vs. AMA–Gauss with simulated non-Gaussian class-conditional stimulus distributions. (A–C) Simulated stimuli from category 1 (red) and category 2 (blue). Each subplot plots two of the three stimulus dimensions against each other. All of the information for discriminating the categories is in the first two stimulus dimensions. AMA filters (black) and AMA–Gauss (gray) filters (which have the same dimensionality as the stimuli) are represented by arrows. AMA filters place all weight in the first two informative stimulus dimensions; that is, the two-dimensional subspace defined by the two AMA filters coincides with the most informative stimulus dimensions. AMA–Gauss filter weights are placed randomly across the three stimulus dimensions; that is, the two-dimensional subspace defined by the two AMA-Gauss filters will be random with respect to the most informative subspace. (D) Class-conditional AMA filter responses RAMA allow the categories to be discriminated. (E) Class-conditional AMA–Gauss filter responses RAMA–Gauss do not allow the categories to be discriminated.
AMA and AMA–Gauss were each tasked with learning two filters that discriminate the stimulus categories. AMA learns filters that enable the categories to be discriminated; its filters place all weight on the first two informative stimulus dimensions (Figure A1A, black arrows), and zero weight on the third uninformative stimulus dimension (Figure A1B, C). AMA–Gauss is blind to the stimulus statistics that enable the stimulus categories to be discriminated; its filters place their weights randomly, often putting substantial weight on the uninformative third stimulus dimension (Figure A1A–C, gray arrows). AMA–Gauss thus fails to learn filters that allow the categories to be well discriminated. 
Figure A1D and E show the AMA and AMA–Gauss class-conditional response distributions. The AMA filter responses capture the mixture distribution that defines each category; the AMA–Gauss filter responses do not. Thus, the results in the main text are not simply due to the fact that Gaussians often provide good generic approximations of distributions with a lot of probability mass in one place. 
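The blindness of a moment-based model to these categories can also be checked directly: log-likelihoods under moment-matched Gaussians are identical for the two categories, even though the true mixture log-likelihoods differ. A minimal sketch (illustrative NumPy, with the same arbitrary variance values as above):

```python
import numpy as np

rng = np.random.default_rng(4)

def gauss_logpdf(x, S):
    # Log density of a zero-mean multivariate Gaussian with covariance S
    Si = np.linalg.inv(S)
    return -0.5 * (x @ Si @ x + np.log(np.linalg.det(2 * np.pi * S)))

V_large, V_small = 3.0, 0.1
SA = np.diag([V_large, V_small, 1.0])
SB = np.diag([V_small, V_large, 1.0])
c, s = np.cos(np.pi / 4), np.sin(np.pi / 4)
R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

x = rng.standard_normal(3)  # an arbitrary test stimulus

# True mixture log-likelihoods (Equations 64-65) generally differ across
# categories for a given stimulus...
ll1 = np.logaddexp(gauss_logpdf(x, SA), gauss_logpdf(x, SB)) - np.log(2)
ll2 = np.logaddexp(gauss_logpdf(x, R @ SA @ R.T),
                   gauss_logpdf(x, R @ SB @ R.T)) - np.log(2)

# ...but moment-matched Gaussian log-likelihoods are identical, because both
# categories share the same mean and covariance
g1 = gauss_logpdf(x, 0.5 * (SA + SB))
g2 = gauss_logpdf(x, 0.5 * (R @ SA @ R.T + R @ SB @ R.T))
assert np.isclose(g1, g2)
```

A decoder restricted to class-conditional means and covariances therefore assigns equal likelihood to both categories for every stimulus, which is why AMA–Gauss cannot learn the discriminating filters.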
Appendix E: Connection between AMA–Gauss and GQM
Using Equation 28, the log-likelihood of latent variable Xu can be written as  
\begin{equation}\tag{66}l\left({X_u}\right) = - {1 \over 2}\left({{\bf{x}}^T}{\bf{f}}\Sigma _u^{ - 1}{{\bf{f}}^T}{\bf{x}} - 2\left(\mu _u^T\Sigma _u^{ - 1} - \eta _u^T\Sigma _u^{ - 1}\right){{\bf{f}}^T}{\bf{x}} + \mu _u^T\Sigma _u^{ - 1}{\mu _u} - \eta _u^T\Sigma _u^{ - 1}{\eta _u} + 2\eta _u^T\Sigma _u^{ - 1}{\mu _u}\right) + {\zeta _u}\end{equation}
where \({\zeta _u} = - {1 \over 2}\log |2\pi {\Sigma _u}|\) is a constant. The expected log-likelihood can then be written as  
\begin{equation}\tag{67}{{\Bbb E}_\eta }[l\left({X_u}\right)] = - {1 \over 2}\left({{\bf{x}}^T}{\bf{f}}\Sigma _u^{ - 1}{{\bf{f}}^T}{\bf{x}} - 2\mu _u^T\Sigma _u^{ - 1}{{\bf{f}}^T}{\bf{x}} + \mu _u^T\Sigma _u^{ - 1}{\mu _u} - Tr\left(\Sigma _u^{ - 1}\Lambda \right)\right) + {\zeta _u}\end{equation}
It is evident from Equation 67 that \({{\Bbb E}_\eta }[l({X_u})]\) is of the form \({{\bf{x}}^T}C{\bf{x}} + {{\bf{b}}^T}{\bf{x}} + a\) where  
\begin{equation}\tag{68}C = - {1 \over 2}{\bf{f}}\Sigma _u^{ - 1}{{\bf{f}}^T}\end{equation}
 
\begin{equation}\tag{69}{{\bf{b}}^T} = \mu _u^T\Sigma _u^{ - 1}{{\bf{f}}^T}\end{equation}
 
\begin{equation}\tag{70}{\rm{and}}\>\>\>a = - {1 \over 2}\mu _u^T\Sigma _u^{ - 1}{\mu _u} + {1 \over 2}Tr\left(\Sigma _u^{ - 1}\Lambda \right) + {\zeta _u}\end{equation}
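Equations 67 through 70 can be verified numerically. The sketch below (illustrative NumPy with arbitrary dimensions, filters, and moments) confirms that the expected log-likelihood in Equation 67 equals the quadratic form \({{\bf{x}}^T}C{\bf{x}} + {{\bf{b}}^T}{\bf{x}} + a\) with C, b, and a as defined in Equations 68 through 70:

```python
import numpy as np

rng = np.random.default_rng(3)
d, q = 8, 3  # stimulus dimension, number of filters

f = rng.standard_normal((d, q))          # filter matrix
mu = rng.standard_normal(q)              # class-conditional response mean
A = rng.standard_normal((q, q))
Sigma = A @ A.T + q * np.eye(q)          # class-conditional response covariance
Lam = np.diag(rng.uniform(0.5, 1.5, q))  # response noise covariance
Si = np.linalg.inv(Sigma)
zeta = -0.5 * np.log(np.linalg.det(2 * np.pi * Sigma))

x = rng.standard_normal(d)  # an arbitrary stimulus

# Expected log-likelihood, Equation 67
r = f.T @ x
ell = -0.5 * (r @ Si @ r - 2 * mu @ Si @ r + mu @ Si @ mu
              - np.trace(Si @ Lam)) + zeta

# Quadratic-form coefficients, Equations 68-70
C = -0.5 * f @ Si @ f.T
b = f @ Si @ mu
a = -0.5 * mu @ Si @ mu + 0.5 * np.trace(Si @ Lam) + zeta

# The two expressions agree for any stimulus x
assert np.isclose(ell, x @ C @ x + b @ x + a)
```

This is the formal link to the GQM: the expected log-likelihood of each latent variable value is exactly a quadratic function of the stimulus, with coefficients set by the filters and the class-conditional response moments.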
 
Figure 1
 
Linking scientific subfields. Perception science benefits when links are drawn between psychophysical and neuroscience studies of particular tasks, task-agnostic statistical procedures that fit models to data, and task-specific normative methods that determine which models are best. The current work develops formal links between the energy model for describing neural response, the generalized quadratic model (GQM) for fitting neural response, and AMA–Gauss for determining the neural response properties that best serve a particular task.
Figure 2
 
Computations of an energy model neuron, a GQM model neuron, and an AMA–Gauss likelihood neuron. All three have quadratic computations at their core. The energy model and the GQM describe the computations that neurons perform. AMA–Gauss prescribes the computations that neurons should perform to optimize performance in a specific task. (A) The standard energy model assumes two Gabor-shaped orthogonal subunit filters (receptive fields) fgbr to account for a neuron's response. The response of an energy model neuron RE is obtained by adding the squared responses of the filters. (B) The GQM fits multiple arbitrarily-shaped orthogonal subunit receptive fields f that best account for a neuron's response. The response of a GQM model neuron RGQM is obtained by pooling the squared (and linear, not shown) responses of the subunit filters via a weighted sum, and passing the sum through an output nonlinearity. (C) AMA–Gauss finds the optimal subunit filters fopt and quadratic pooling rules for a specific task. Unlike standard forms of the energy model and the GQM, AMA–Gauss incorporates contrast normalization and finds subunit filters that are not necessarily orthogonal. 
The response \(R_u^L\) of an AMA–Gauss likelihood neuron represents the likelihood of latent variable Xu. The likelihood is obtained by pooling the squared (and linear, not shown) subunit filter responses, indexed by i and j, via a weighted sum (Equation 20).
Figure 3
 
Accuracy maximization analysis. (A) Factors determining the optimal decoder used by AMA during filter learning. (B) Steps for finding optimal task-specific filters via AMA.
Figure 4
 
Relationship between conditional filter response distributions, likelihood, and posterior probability. Two hypothetical cases are considered, each with three latent variable values. (A) One-dimensional (i.e., single filter) Gaussian conditional response distributions, when information about the latent variable is carried only by the class-conditional mean; distribution means, but not variances, change with the latent variable. The blue distribution represents the response distribution to stimuli having latent variable value Xk. The red and green distributions represent response distributions to stimuli having different latent variable values \({X_u} \ne {X_k}\). The blue dot represents the likelihood \({\cal L}({X_u};{R_1}(k,l)) = {\cal N}({R_1}(k,l);{\mu _u},{\Sigma _u})\) that the observed noisy filter response R1(k, l) to stimulus sk,l was elicited by a stimulus having latent variable level Xu = Xk. Red and green dots represent the likelihoods that the response was elicited by a stimulus having latent variable \({X_u} \ne {X_k}\) (i.e., by a stimulus having the incorrect latent variable value). (B) Posterior probability over the latent variable given the noisy observed response in (A). The posterior probability of the correct latent variable value (in this case, Xk) is given by the likelihood of the correct latent variable value normalized by the sum of all likelihoods. Colored boxes surrounding entries in the inset equation indicate the likelihood of each latent variable. (C) Two-dimensional (i.e., two-filter) Gaussian response distributions. Each ellipse represents the joint filter responses to all stimuli having the same latent variable value. The second filter improves decoding performance by selecting for useful stimulus features that the first filter does not. The black dot near the center of the blue ellipse represents an observed noisy joint response R(k, l) to stimulus sk,l. 
The likelihood \({\cal L}({X_u};{\bf{R}}(k,l)) = {\cal N}({\bf{R}}(k,l);{\mu _u},{\Sigma _u})\) that the observed response was elicited by a stimulus having latent variable value Xu is obtained by evaluating the joint Gaussian at the noisy response; in this case, the product of the likelihoods represented by the blue dots on the single filter response distributions. (D–F) Same as A–C, but where information about the latent variable is carried by the class-conditional covariance instead of the mean; ellipse orientation, but not location, changes with the latent variable. AMA–Gauss finds the filters yielding conditional response distributions that are as different from each other as possible, given stimulus constraints.
Figure 5
 
Speed estimation task: filters, cost, compute time, and class-conditional response distributions. (A) AMA–Gauss and AMA filters for estimating speed (−8 to +8°/s) from natural image movies are nearly identical; ρ > 0.96 for all filters. (B) The cost for all the filters is identical for both models. (C) Compute time for 50 evaluations of the posterior probability distribution is linear in the training set size with AMA–Gauss, and quadratic with the full AMA model. (D) Same data as in (C) but on log–log axes. (E, F) Joint filter responses, conditioned on each level of the latent variable, are approximately Gaussian. Different colors indicate different speeds. Individual symbols represent responses to individual stimuli. Thin black curves show that the filter response distributions, marginalized over the latent variable \({\cal P}(R) = \sum\nolimits_{u = 1}^{{N_{lvl}}} {\cal P} (R|{X_u}){\cal P}({X_u})\), are heavier-tailed than Gaussians (see Results and Discussion).
Figure 6
 
Disparity estimation task: filters, cost, compute time, and class-conditional response distributions. (A) AMA–Gauss and AMA filters for estimating disparity from natural stereo-images (−15 to +15 arcmin). (B–F) Caption format same as Figure 5B–F.
Figure 7
 
Filter responses with and without contrast normalization. (A) Class-conditional filter response distributions \({\cal P}({R_t}|{X_u})\) to contrast-normalized stimuli for each individual filter and each level of the latent variable in the speed estimation task. For visualization, responses are transformed to Z-scores by subtracting off the mean and dividing by the standard deviation. Gaussian probability density is overlaid for purposes of comparison. (B) Kurtosis of the two-dimensional conditional response distributions from filters 1 and 2 (violet; also see Figure 5E) and filters 3 and 4 (green; also see Figure 5F) for all levels of the latent variable. A two-dimensional Gaussian has a kurtosis of 8.0. Kurtosis was estimated by fitting a multidimensional generalized Gaussian via maximum likelihood methods. (C, D) Same as A, B but without contrast normalization. (E–H) Same as (A–D), but for the task of disparity estimation.
Figure 8
 
Decoding performance with and without contrast normalization. (A) Contrast normalization decreases decoding cost for the speed estimation task. (B) Percentage increase in cost without contrast normalization. (C, D) Same as (A, B) but for the disparity estimation task. The same result holds for different cost functions (e.g., squared error) and larger numbers of filters. If eight filters are used in the disparity task, failing to contrast normalize can decrease performance by ∼40%.
Figure 9
 
Quadratic pooling weights for computing the likelihood. Weights on squared and sum-squared filter responses (wii(X) and wij(X)) as a function of the latent variable. Weights on the linear filter responses are all approximately zero and are not shown. (A) Weights for speed estimation task. (B) Weights for disparity estimation task. Weights on squared responses are at maximum magnitude when the variance of the corresponding filter responses is at minimum. Weights on sum-squared responses are at maximum magnitude for latent variables yielding maximum response covariance (see Figures 5E, F and 6E, F).
Figure 10
 
Likelihood functions for speed and disparity tasks. (A) Likelihood functions for four randomly chosen natural image movies, each having a true speed of 4°/s. Each likelihood function represents the population response of the set of likelihood neurons, arranged by their preferred speeds. To ease the visual comparison, the likelihood functions are normalized to a constant volume by the sum of the likelihoods. (B) Same as (A), but for movies with a true speed of 0°/s. (C, D) Same as (A, B) but for stereo-images with −7.5 arcmin and 0.0 arcmin of disparity, respectively. (E) Tuning curves of speed-tuned likelihood neurons. For speeds sufficiently different from zero, tuning curves are approximately log-Gaussian and increase in width with speed. For near-zero speeds, tuning curves are approximately Gaussian. Each curve represents the mean response (i.e., tuning curve) of a likelihood neuron having a different preferred speed, normalized to a common maximum. Gray areas indicate 68% confidence intervals. (F) Tuning curves of disparity-tuned likelihood neurons.
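The likelihood functions in Figure 10 can be sketched directly from the Gaussian model: each "likelihood neuron" evaluates the class-conditional Gaussian at the observed filter responses, and the population is normalized to unit sum as in the caption. The function name and array shapes below are illustrative assumptions.

```python
import numpy as np

def likelihood_function(R, mus, Sigmas):
    """Population response of a set of likelihood neurons to one stimulus.

    Given filter responses R and, for each candidate latent value X_k,
    Gaussian parameters (mus[k], Sigmas[k]) fit to the class-conditional
    responses, return the likelihood over candidates normalized to unit
    sum (the constant-volume normalization described in the caption).
    """
    logL = np.array([
        -0.5 * (R - m) @ np.linalg.inv(S) @ (R - m)
        - 0.5 * np.log(np.linalg.det(2 * np.pi * S))
        for m, S in zip(mus, Sigmas)
    ])
    L = np.exp(logL - logL.max())  # subtract max for numerical stability
    return L / L.sum()
```

Reading off one neuron's mean normalized response across many stimuli of each latent value would then trace out the tuning curves shown in panels E and F.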
Figure 11
 
Natural stimuli, artificial stimuli, and class-conditional responses. Many different retinal images are consistent with a given value of the task-relevant latent variable. These differences cause within-class (task-irrelevant) stimulus variation. Within-class stimulus variation is greater for natural stimuli than for typical artificial stimuli used in laboratory experiments. (A) Stimuli for speed estimation experiments. Two different example stimuli are shown for each stimulus type: natural stimuli (represented by cartoon line drawings), Gabor stimuli, and random-dot stimuli. Both example stimuli for each stimulus type drift at exactly the same speed, but create different retinal images. Natural stimuli cause more within-class retinal stimulus variation than artificial stimuli. (B) Same as (A), but for disparity. (C) Speed task: class-conditional responses to contrast-fixed 1.0 cpd drifting Gabors with random phase (speed task). Colors indicate different speeds. Ellipses represent filter responses to natural stimuli having the same speeds. (D) Disparity task: Class-conditional responses to contrast-fixed 1.5 cpd binocular Gabors with random phase. Class-conditional responses no longer have Gaussian structure, and instead have ring structure.
Figure A1
 
AMA vs. AMA–Gauss with simulated non-Gaussian class-conditional stimulus distributions. (A–C) Simulated stimuli from category 1 (red) and category 2 (blue). Each subplot plots two of the three stimulus dimensions against each other. All of the information for discriminating the categories is in the first two stimulus dimensions. AMA filters (black) and AMA–Gauss (gray) filters (which have the same dimensionality as the stimuli) are represented by arrows. AMA filters place all weight in the first two informative stimulus dimensions; that is, the two-dimensional subspace defined by the two AMA filters coincides with the most informative stimulus dimensions. AMA–Gauss filter weights are placed randomly across the three stimulus dimensions; that is, the two-dimensional subspace defined by the two AMA-Gauss filters will be random with respect to the most informative subspace. (D) Class-conditional AMA filter responses RAMA allow the categories to be discriminated. (E) Class-conditional AMA–Gauss filter responses RAMA–Gauss do not allow the categories to be discriminated.