A mathematical model and a possible neural mechanism are proposed to account for how fixational drift motion in the retina confers a benefit for the discrimination of high-acuity targets. We show that by simultaneously estimating object shape and eye motion, neurons in visual cortex can compute a higher quality representation of an object by averaging out non-uniformities in the retinal sampling lattice. The model proposes that this is accomplished by two separate populations of cortical neurons — one providing a representation of object shape and another representing eye position or motion — which are coupled through specific multiplicative connections. Combined with recent experimental findings, our model suggests that the visual system may utilize principles not unlike those used in computational imaging for achieving “super-resolution” via camera motion.

*mitigate* this blur, viewing the eye position drift as a hindrance.

*f*^{2} spatial power spectrum of natural images, when combined with the statistics of eye motion, results in a flattening of the power spectrum over the joint spatiotemporal frequency domain. They further show that, when this signal is sent through the temporal filtering properties of RGCs, high spatial frequency details are amplified and more global spatial structures such as contours could be detected from spike synchrony. Their theory is complementary to ours in that they address limitations imposed by postreceptoral mechanisms (e.g., the limited dynamic range and bandwidth of the optic nerve) and subsequent processes of feature extraction, assuming the image signal has been adequately sampled by the cones such that it can be treated as a continuous function of space and time, *I*(*x*, *y*, *t*). The focus of our work, by contrast, is to understand how spatial detail at the very highest spatial frequencies (50 cycles/deg) can be perceived and discriminated despite the fact that spatial information at these scales is compromised by the punctate nature of cone sampling, inhomogeneities in the retinal cone mosaic, and variability among the cones themselves. We also take into account the punctate encoding in time by RGCs: signals are conveyed to the brain not as continuous waveforms, but as sequences of spikes. We propose a computational mechanism for decoding images that have been sampled and temporally encoded in this way, and we quantitatively evaluate its performance, corroborating the psychophysical measurements of Ratnam et al.

*S*, given the incoming spikes, *R*, where the trajectory of the eye, *X*, is an unknown variable.

If both *X* and *R* were known, *S* could be easily estimated by accumulating evidence from the spikes after the motion is used to correct for the translation of the eye. Likewise, if both *S* and *R* were known, *X* could be estimated by finding the translation of the stimulus pattern, *S*, that provides the best spatial alignment with the spike patterns, *R*, across time. In the case where only *R* is known, *X* and *S* must be *jointly inferred*, because each variable is needed to estimate the other.
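The known-motion case has a simple computational reading: shift each frame's spikes back by the known eye displacement and accumulate. A minimal sketch (a hypothetical 1D ring "retina" with toy rates; the paper's model is 2D with a jittered cone lattice) illustrates it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1D "retina" on a ring: the pattern S drifts by a known shift X[t] each
# 1 ms frame; each cone emits Poisson spikes at a rate set by the pixel
# currently over it (baseline 10 Hz, maximum 100 Hz).
S_true = np.array([0., 0., 1., 1., 0., 1., 0., 0.])
T, dt, lam0, lam1 = 2000, 0.001, 10.0, 100.0
X = rng.integers(-2, 3, size=T)              # known eye displacement per frame

counts = np.zeros_like(S_true)
for t in range(T):
    rate = lam0 + (lam1 - lam0) * np.roll(S_true, X[t])
    R = rng.poisson(rate * dt)               # spikes emitted this frame
    counts += np.roll(R, -X[t])              # undo the known shift, accumulate

# Convert motion-corrected counts back into an estimate of S
S_hat = (counts / (T * dt) - lam0) / (lam1 - lam0)
```

Thresholding `S_hat` at 0.5 recovers the binary pattern; when the motion is unknown, this correction step is exactly what must be inferred jointly with *S*.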

*S* and *X*. The prior on the eye trajectory, *p*(*X*), is a diffusive random walk with a diffusion constant \(D_C^{infer}\). The prior on the stimulus pattern, *p*(*S*), is constructed by constraining *S* to be given by \(S=D\, A\), where *D* is a “dictionary matrix” whose columns are some elementary spatial patterns, and the vector *A* is a set of latent variables that specify how much of each pattern is present. The spatial structure in *S* can then be modeled with a simple, factorial prior over *A*, *p*(*A*) = ∏_{i} *p*(*A*_{i}). The relationships between *R*, *X*, *S*, and *A* are described by the probabilistic graphical model shown in Figure 1B. The joint distribution of the nodes in the graph, *N*, is *p*(*N*) = ∏_{i} *p*(*N*_{i}|*N*_{π(i)}), where π(*i*) denotes the parents of node *i* in the graph defining the model (parent-child relationships are denoted by arrows in the diagram). All quantities of interest are computed by marginalizing the joint distribution.

*p*(*A*|*R*), given by marginalizing over all possible eye trajectories,

\(p(A|R) \propto p(A) \sum _X p(R\,|\,X, S=DA)\, p(X).\)

The likelihood *p*(*R*|*X*, *S*) reflects the probabilistic (Poisson) model used in generating the spikes (Appendix Equation 11). The posterior *p*(*A*|*R*) assigns a probability to every possible stimulus pattern *S* = *DA* given the spikes *R* coming from the retina, taking into account all possible eye movement trajectories weighted by their probability. We use a series of approximations to derive a computationally tractable, causal, and online computation to estimate *A* (see the Appendix for details). The first is to consider only the most probable set of latent shape variables, \(\hat{A} = {\rm argmax}_A p(A|R)\). The second is to deal with the intractable sum over all possible eye trajectories by using an online approximation of the EM algorithm. The EM algorithm maximizes log *P*(*A*|*R*) in an iterative manner by alternating between two steps, one for estimating *X*, which comes from introducing a variational distribution *q*(*X*), and the other for estimating *A*. To make time explicit in *X* and *R*, we henceforth rewrite them as *X*_{0: T} = (*X*_{0}, *X*_{1}, … *X*_{T}) and *R*_{0: T} = (*R*_{0}, *R*_{1}, … *R*_{T}), where *T* is the total number of time steps in the simulation. *R*_{t} denotes the number of spikes emitted from each RGC in the time interval [*t*, *t* + Δ*t*]. Because *R*_{t} depends only on the current eye position, *X*_{t}, and the stimulus, *S*, we can derive a set of EM update equations as follows:

\begin{eqnarray} q_t(X_t) = p(X_t\,|\,R_{0:T}, S=DA^{\prime }) \quad \end{eqnarray}(2)

\begin{eqnarray} \hat{A} = {\rm argmax}_A \left[ \sum _t \left\langle \log p(R_t\,|\,X_t, S=DA) \right\rangle _t + \log p(A) \right] \quad \end{eqnarray}(3)

Equation 2 estimates the probability of the eye position at each time *t*, *X*_{t}, given the spikes *R*_{0: T} and the current estimate of the spatial pattern, *A*^{′}, while Equation 3 estimates *A* given the spikes *R*_{0: T} and the estimated eye positions *X*_{0: T}. The traditional EM algorithm repeatedly applies these equations for some number of iterations. For simplicity, *A* can be initialized to zero. Although these update equations are guaranteed to converge to a critical point of log *P*(*A*|*R*) when applied repeatedly (and initialized with *A* = 0), they are still non-causal (requiring spikes from the future to estimate quantities at the current time *t*), and Equation 3 is not amenable to online processing because it requires optimizing over a batch of quantities from *t* = 0 to *T*.

*t* is approximated by replacing it with a filtering estimate that only takes into account spikes up to time *t*:

\begin{eqnarray} q_t(X_t) \approx p(X_t\,|\,R_{0:t}, S=D\hat{A}_t) \quad \end{eqnarray}(5)

where \(\hat{A}_t\) is the current estimate of *A* given the spikes from 0 to *t* (computed via Equation 8). The steps involved in this calculation are shown graphically in Figure 1C. A particle filter with resampling (Doucet & Johansen, 2009) is used to represent and propagate *q*_{t}(*X*_{t}) from one timestep to the next (see Equations 48, 49).

The estimation of *A* in Equation 3 is also modified to be causal and online. First, we denote the negative expected log-likelihood of *A* at time *t* as

\(E_r^t(A) = -\left\langle \log p(R_t\,|\,X_t, S=DA) \right\rangle _t,\)

which measures how well *A* agrees with the position estimate and spikes at time *t*. A causal approximation to the update for *A* at time *t* may be obtained by considering the sum of these energies only up to time *t*, along with the log-prior:

*t* is replaced by a quadratic approximation, resulting in the following update for the next time step:

\begin{eqnarray} \hat{A}_{t+1} = {\rm argmin}_A \left[ \frac{1}{2}(A-\hat{A}_t)^T H_t (A-\hat{A}_t) + E_r^{t+1}(A) - \log p(A) \right] \quad \end{eqnarray}(8)

where the first term is the quadratic approximation of the accumulated energies, in which \(\hat{A}_t\) is the estimate of *A* given the spikes from 0 to *t* and \(H_t\) is the accumulated curvature (inverse covariance) at time *t*; the second term corresponds with the energy coming from the new set of incoming spikes at time *t* + 1; and the final term corresponds with the log prior on *A*. The quadratic approximation of \(E_r^t(A)\) corresponds with a Gaussian approximation in probability, and so as *H* grows over time, the uncertainty shrinks, meaning that this term has increasing influence in determining the optimal value of *A* over time. log *p*(*A*) is either the sum of absolute values of *A* (to encourage sparsity) or a quadratic function of *A*.
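The effect of the accumulated quadratic term can be seen in a stand-in problem. The sketch below replaces the Poisson spike energies with a linear-Gaussian toy (all names and values are ours, not the paper's), but exhibits the same recursion: past evidence is summarized by a curvature matrix *H* and a linear term, memory stays constant, and the summary's influence grows as *H* grows:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
A_true = rng.normal(size=d)       # the "pattern" to be estimated
H = 1e-3 * np.eye(d)              # accumulated curvature (inverse covariance)
B = np.zeros(d)                   # accumulated linear evidence term

for t in range(500):
    phi = rng.normal(size=d)                  # per-step measurement direction
    y = phi @ A_true + 0.1 * rng.normal()     # noisy scalar observation
    H += np.outer(phi, phi)                   # curvature grows: uncertainty shrinks
    B += y * phi
    A_hat = np.linalg.solve(H, B)             # current MAP estimate
```

Each update touches only *H* and *B*, never the raw history, which is the property the online approximation above is after.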

*L*_{1} loss term. The basic computations required to compute the gradient are specified in Equations 50–56 of the Appendix. Figure 1D graphically illustrates the computation arising from the gradient of the second term, \(E_r^{t+1}(A)\), which updates *A* according to each new set of incoming spikes. This results in a “dynamic routing” circuit (Olshausen, Anderson, & Van Essen, 1993), in which RGC spikes *R* are routed into different elements of the internal shape estimate *A* via another set of units representing the internal position estimate *X* that multiplicatively gate the RGCs.

First, the estimate of the eye position at time *t* is updated based on the current estimate of the stimulus pattern \(S^t=D\, A^t\) and the incoming spikes *R*_{t} (Equation 5). Second, the new estimate of the stimulus pattern (represented by the latent factors *A*) is generated by minimizing Equation 8, which takes into account the new spikes and the updated estimate of eye position. Third, the estimate of the uncertainty of the latent factors, *H*, is updated (Equation 9).

The diffusion constant of the prior on the eye trajectory, *P*(*X*), is set to \(20\, \text{arcmin}^2/s\), which matches that of the recorded eye motions. Although the subject in the experiment is asked to report which of four orientations the E is in, our task requires estimating the entire shape. The prior used to infer the shape, *P*(*S*), uses a simple dictionary of non-overlapping square blocks of size 0.8 arcmin × 0.8 arcmin, with no sparsity imposed.

*half* the distance between the cones (Macleod et al., 1992), the strokes of the E can fall between the cones. In other words, even a retina with uniformly tiled cones has spatially non-uniform sensitivity to the diffraction-limited stimuli in the experiments of Ratnam et al. (2017). It is remarkable that both the mathematical model and human subjects can recover the stimulus given the gaps and irregularities in sensitivity in the retinal cone lattice (Harmening, Tuten, Roorda, & Sincich, 2014).

*p*(*S*) to capture this structure. For this we turn to the sparse coding model of V1 (Olshausen & Field, 1997), which uses the generative model *S* = *DA*, where *D* is a dictionary of features learned from the statistics of natural images and *A* is a set of latent variables with a sparse prior *p*(*A*). The goal in this case is to infer the latent factors (or image features), *A*, rather than a pictorial description of the pattern, *S*, from the incoming spikes, *R*. The equations for inferring *A* given *S* are usually interpreted as describing the dynamics of a neural network where the elements of *A* correspond with the activations of cortical neurons that have “Gabor-like” receptive fields similar to neurons in V1 (given by the dictionary, *D*) (Rozell, Johnson, Baraniuk, & Olshausen, 2008). In this case, we infer *A* given only the spikes *R*, which change as patterns drift over the retina. The resulting equations (5, 8) can be interpreted as describing the interactions between two separate populations of neurons that work together to jointly infer the eye position *X* and the latent factors *A*. The neurons representing the latent factors *A* will appear to have dynamic, Gabor-like receptive fields that track features as they drift across the retina rather than remaining locked in retinotopic coordinates. Our experiments simulating this model on whitened natural scene patches (whitening the stimulus serves as an approximation to the center-surround receptive field structure of RGCs) demonstrate that the sparse prior improves the inference of spatial patterns drawn from natural images (Figure 4).

*X*, and another representing the stimulus pattern, *A*. We hypothesize that these two populations reside in area V1. The incoming spikes *R* would be carried by the LGN afferents innervating layer 4 of V1 (assuming the LGN to be a simple relay of RGCs). The neurons representing *A* would likely be those in layer 4, or possibly layers 2 and 3. The hypotheses about eye position *X* would be represented by a population of neurons corresponding to the particles supporting *q*(*X*). Such a scheme for neurally representing and updating probability distributions was proposed previously by Lee and Mumford (2003).

*A* and *X* are not computed independently from the input, but rather jointly through multiplicative interactions between the two populations. The neurons representing *X* essentially compute a cross-correlation between the spatial pattern of incoming spikes *R* and the current estimate of the pattern represented by *A* (Figure 1C). Conversely, the neurons representing *A* are computed (in part) by dynamically routing the incoming spikes *R* via multiplicative gating by the neurons representing *X* (Figure 1D).

*k*^{th} element of *A* is:

\begin{eqnarray} -\frac{\partial E_r^t(A)}{\partial A_k} = \sum _j \left\langle g_{j,k}(X_t)\, R_{t,j} \right\rangle _t - \sum _j \left\langle g_{j,k}(X_t)\, \lambda _{t,j} \right\rangle _t \Delta t \quad \end{eqnarray}(10)

where *R*_{t, j} is the number of spikes arriving from RGC *j* in the time interval [*t*, *t* + Δ*t*] and λ_{t, j} is the corresponding rate parameter of its Poisson distribution *p*(*R*_{t}|*X*_{t}, *S* = *DA*) (a full derivation is in Appendix Equations 50–56).

*g*_{j, k}(*x*) corresponds to a dynamically controlled connection strength between RGC *j* and latent factor *k* that is determined by the eye position estimate *X*, and 〈 · 〉_{t} denotes averaging with respect to *q*_{t}(*X*_{t}). The first term of Equation 10 corresponds with a multiplicative gating of the incoming spikes by the internal position estimate. The second term is a homeostatic correction that corresponds with the expected number of incoming spikes given the internal estimate of the spatial pattern. The precise mathematical form of *g*_{j, k}(*x*) is determined by the parameters of the spiking model and the receptive fields of the latent factors (Appendix Equation 56).
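The two terms of Equation 10 reduce to a weighted sum over particles, RGCs, and latent factors. A toy sketch (random stand-in values for *g*, the particle weights, and the rates; these are not the paper's derived quantities):

```python
import numpy as np

rng = np.random.default_rng(2)
P, nJ, nK = 10, 12, 4                       # particles, RGCs, latent factors
G = rng.random((P, nJ, nK))                 # g_{j,k}(X_p): toy gain values
W = rng.random(P); W /= W.sum()             # particle weights for q_t(X_t)
R = rng.poisson(1.0, size=nJ)               # spike counts this timestep
lam_dt = rng.random((P, nJ))                # expected counts under current estimate

# Gated spike drive minus the homeostatic expected-count correction,
# averaged over the position posterior (the two terms of Equation 10).
dA = np.einsum('p,pjk,pj->k', W, G, R[None, :] - lam_dt)

# The same computation written as explicit loops, for clarity
dA_loop = np.zeros(nK)
for p in range(P):
    for j in range(nJ):
        dA_loop += W[p] * G[p, j, :] * (R[j] - lam_dt[p, j])
```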

*Visual Neuroscience,* 20(2), 189–209.

*Proceedings of the National Academy of Sciences of the United States of America,* 84(17), 6297–6301.

*Journal of Vision,* 13(10), 22.

*Journal of Neuroscience,* 34(38), 12701–12715.

*SIAM Journal on Imaging Sciences,* 2(1), 183–202.

*Current Biology,* 27(9), 1268–1277.

*Proceedings of the National Academy of Sciences of the United States of America,* 107(45), 19525–19530.

*eLife,* 8, e40924.

*Journal of Comparative Neurology,* 226(4), 544–564.

*Philosophical Transactions of the Royal Society of London B: Biological Sciences,* 355(1404), 1685–1754.

*Handbook of Nonlinear Filtering,* 12, 656–704.

*Investigative Ophthalmology and Visual Science,* 48(7), 3283–3291.

*IEEE Transactions on Image Processing,* 13(10), 1327–1344.

*Vision Research,* 37(3), 257–265.

*Science,* 221(4616), 1193–1195.

*Journal of Neuroscience,* 34(16), 5667–5677.

*Proceedings: Biological Sciences,* 265(1394), 359–366.

*Journal of Neuroscience,* 25(42), 9669–9679.

*Nature Communications,* 11(1), 1–11.

*Current Biology,* 22(6), 510–514.

*JOSA A,* 20(7), 1434–1448.

*Neuron,* 81(1), 130–139.

*Vision Research,* 32(2), 347–363.

*Nature Communications,* 5(1), 4605, https://doi.org/10.1038/ncomms5605.

*Cerebral Cortex,* 22(2), 294–307.

*IEEE Signal Processing Magazine,* 13(6), 47–60.

*Investigative Ophthalmology and Visual Science,* 36(4), S691.

*Experimental Brain Research,* 83(1), 37–43.

*Vision Research,* 41(2), 173–186.

*Journal of Neuroscience,* 13(11), 4700–4719.

*Vision Research,* 37(23), 3311–3325.

*Advances in Neural Information Processing Systems,* 1311–1318.

*PLoS Biology,* 5(12), e331.

*Investigative Ophthalmology & Visual Science,* 54(8), 5836–5847.

*Journal of Vision,* 17(1), 30.

*Optics Express,* 10(9), 405–412.

*Neural Computation,* 20(10), 2526–2563.

*Trends in Cognitive Sciences,* 22(10), 883–895.

*Nature,* 447(7146), 852–855.

*Trends in Neurosciences,* 38(4), 195–206.

*Science Advances,* 2(9), e1600797.

*Nature Neuroscience,* 12(8), 967.

*Visual Neuroscience,* 18(2), 259–277.

*Proceedings of the 2010 Symposium on Eye-Tracking Research & Applications* (pp. 195–198).

*Visual Neuroscience,* 11(1), 111–118.

Indices: *t*: time step; *i*: pixel index; *j*: RGC index; *k*: latent factor; *p*: particle number.

- (1) *S* is the spatial pattern to be inferred, in a pixel representation. *S*_{i} denotes a particular pixel of the pattern. *S*_{i} is constrained to lie between *s*_{min} and *s*_{max} (( − 0.5, 0.5) for natural scenes, (0, 1) for the tumbling E). \(X^S_i\) denotes the center of pixel *i*. The pixels are placed in a grid with spacing *ds*. The simulated projected image of the pattern, *I*(*x*), is smoothed using Gaussian interpolation, with σ^{S} = 0.5 · *ds*. Thus, \(I(x)=\sum _i S_i N(x;\mu =X^S_i, \sigma = \sigma ^S)\), where *N* denotes a Gaussian.
- (2) *A* is the vector of latent factors that generate *S* through a dictionary, *D* (e.g., *S* = *DA*, where *S*, *A* are column vectors and *D* is a matrix). *A*_{k} denotes the *k*th latent factor.
- (3) *D* is the dictionary, where *D*_{k} is the *k*th dictionary element.
- (4) \(X_t^R\) (abbreviated as *X*_{t}) denotes the position of the retina relative to the cone lattice. *X* is used as an abbreviation of *X*_{t} for all *t*.
- (5) *R*_{t, j} denotes the number of spikes of RGC *j* in the time window [*t*, *t* + Δ*t*]. Δ*t* is the timestep of the simulation (taken to be 1 ms). *R* is used as an abbreviation for *R*_{t, j} for all *t* and *j*.
- (6) A jittered cone lattice with spacing *de* is constructed. Each cone is connected to one ON cell and one OFF cell. The *j*th RGC has a Gaussian receptive field \(N(x;\mu =X^E_j, \sigma =\sigma _j^E)\) with a full width at half maximum of 0.48 times the cone spacing (Macleod et al., 1992) (thus, σ^{E} = 0.203 · *de*).
- (7) *D*_{C} is the diffusion coefficient of the eye movements; λ_{0} = 10 Hz and λ_{1} = 100 Hz are the baseline and maximum firing rates of the neurons (Troy & Lee, 1994).

*g* is a gain factor that sets the maximum size of *c*_{j, t} to be 1 when *S* is the vector of 1s. The scaling ensures that \(c^{\prime \prime }_{j,t}\in [0,1]\) when *S*_{i} ∈ [*s*_{min}, *s*_{max}]. The inner product of the pattern projected onto the retina with the Gaussian receptive field of each cone, for an arbitrary displacement of the retina, is calculated as follows: let *T* denote the translation operator and *E*_{j} denote the receptive field of the *j*th neuron. Then \(\int {d\vec{x}}\, [S(\vec{x}) T_{X^R} E_j(\vec{x})]\)\(\, = \sum _i S_i \int d\vec{x}\, [N(\vec{x}; \vec{X}_i^S, \sigma _S^2) N(\vec{x}; \vec{X}_j^E+\vec{X}^R, \sigma _E^2)]\)\(\, = \sum _i S_i \cdot N(\vec{X}^S_i - \vec{X}^E_j - \vec{X}^R; 0, \sigma _S^2 + \sigma _E^2)\). Thus, the inner product can be written as ∑_{i} *S*_{i} *T*(*x*^{R})_{i, j}. The stimulus pattern is either a tumbling E or a natural scene patch.
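The product-of-Gaussians identity used in this derivation is easy to verify numerically in 1D (the 2D case factorizes; the parameter values below are arbitrary):

```python
import numpy as np

def gauss(x, mu, var):
    """Normalized Gaussian density with mean mu and variance var."""
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Check: integral of N(x; a, vS) * N(x; b, vE) dx == N(a - b; 0, vS + vE)
a, b, vS, vE = 0.3, -0.8, 0.25, 0.04
x = np.linspace(-10.0, 10.0, 200001)
dx = x[1] - x[0]
numeric = np.sum(gauss(x, a, vS) * gauss(x, b, vE)) * dx   # brute-force quadrature
closed = gauss(a - b, 0.0, vS + vE)                        # closed-form identity
```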

The likelihood *p*(*R*|*X*, *S*) (defined above) and the priors define the relationships between the random variables in the probabilistic graphical model shown in Figure 1.

*X*_{t} is a 2D vector, so for the overall displacement to have variance *D*_{C}Δ*t* per timestep, each individual component has variance (*D*_{C}/2)Δ*t*. Higher-order priors that seek to infer both the position and the velocity of the eye were investigated, but did not provide significant benefits for inference. However, such a direction could help to reformulate the model in terms of tracking the velocity of the eye instead of the absolute position.
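The factor of two can be checked by simulation (*D*_{C} = 20 arcmin²/s as in the text; the other values here are ours): giving each component variance (*D*_{C}/2)Δ*t* per step makes the mean squared 2D displacement grow as *D*_{C} · *t*.

```python
import numpy as np

rng = np.random.default_rng(3)
D_C = 20.0            # arcmin^2 / s, as in the simulations
dt = 0.001            # 1 ms timestep
T = 1000              # steps -> 1 s of drift
trials = 2000

# Each component gets variance (D_C / 2) * dt per step, so the 2D
# displacement satisfies E|X_t|^2 = D_C * t.
steps = rng.normal(0.0, np.sqrt(0.5 * D_C * dt), size=(trials, T, 2))
X = np.cumsum(steps, axis=1)                     # trajectories: (trials, T, 2)
msd = np.mean(np.sum(X[:, -1, :] ** 2, axis=1))  # mean squared displacement at 1 s
```

Here `msd` comes out close to 20 arcmin², i.e., *D*_{C} times one second.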

*D* is the identity matrix and there is a uniform prior on *A*.

The prior on *A* is chosen to be a combination of an *L*1 and an *L*2 prior. The *L*1 part of the prior is −log *p*(*A*) = β∑_{k}|*A*_{k}|. The *L*2 part of the prior is of the form 0.5 · (*A* − μ)^{T}Σ^{−1}(*A* − μ), where μ and Σ are the precomputed mean and covariance of the latent factors. This prior is implemented by setting \(\hat{A}_0=\mu\) and \(\hat{H}_0 = \Sigma ^{-1}\) in the initialization of the algorithm.

A penalty keeps each pixel value within [*S*_{min}, *S*_{max}]: −log *p*(*S*_{i}) = γ(Θ(*S*_{i} − *S*_{max}) + Θ( − (*S*_{i} − *S*_{min}))), where Θ(*x*) is *x* if *x* > 0 and zero otherwise, and γ is a parameter (chosen to be 10).

*N* is the length in pixels of the image patches. Sparse coding minimizes the objective function (*S* − *DA*)^{2} + *k*|*A*|, where *S* is the pattern in a pixel representation, *D* is a dictionary, and *A* is a set of sparse generative factors. *k* is a constant that trades off reconstruction error and sparsity (Olshausen & Field, 1997). The value of *k* is chosen so that the reconstruction error and the sparsity penalty are the same order of magnitude, which ensures that the minimization procedure seeks a trade-off between sparsity and reconstruction error. In jointly optimizing for *A* and *D*: for fixed *D*, FISTA is used to find the best *A*; then a gradient step of size ϵ is taken for *D* and the dictionary elements are normalized to have an L2 norm of one. The value of ϵ is annealed during learning. The dimensionality of the sparse code is three times the effective dimensionality of the data (computed by using PCA to find the number of components that account for 90% of the variance). Because convergence is usually poor for a fully connected dictionary, *D*, the dictionary elements are divided into 10 groups. The first group has dictionary elements whose values are constrained to be zero except in the top left 16 × 16 part of the pattern. The next set of dictionary elements has nonzero values in this 16 × 16 patch shifted by 8 pixels to the right. Doing all such shifts gives nine groups. The last group is fully connected.
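The inner step of this procedure, minimizing (*S* − *DA*)² + *k*|*A*| for fixed *D*, follows the standard FISTA iteration (Beck & Teboulle, 2009). The sketch below uses a toy random dictionary and our own step counts, not the paper's settings:

```python
import numpy as np

def fista_lasso(D, s, k, n_steps=300):
    """Minimize ||s - D a||^2 + k ||a||_1 via FISTA (Beck & Teboulle, 2009)."""
    L = 2.0 * np.linalg.norm(D, 2) ** 2      # Lipschitz constant of the smooth part
    a = np.zeros(D.shape[1])
    z, t = a.copy(), 1.0
    for _ in range(n_steps):
        g = 2.0 * D.T @ (D @ z - s)          # gradient of ||s - D z||^2
        w = z - g / L
        a_new = np.sign(w) * np.maximum(np.abs(w) - k / L, 0.0)  # soft threshold
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        z = a_new + ((t - 1.0) / t_new) * (a_new - a)            # momentum step
        a, t = a_new, t_new
    return a

rng = np.random.default_rng(4)
D = rng.normal(size=(20, 50))
D /= np.linalg.norm(D, axis=0)               # unit-norm dictionary elements
a_true = np.zeros(50); a_true[[3, 17, 41]] = [1.5, -2.0, 1.0]
s = D @ a_true                               # a 3-sparse synthetic pattern
a_hat = fista_lasso(D, s, k=0.01)
```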

For the *L*1 part of the sparse coding natural scenes prior, β is chosen through cross-validation in the full simulation. For the *L*2 part, the mean and covariance of the sparse codes of 10^{4} held-out patterns were computed in the sparse coding simulations (e.g., minimizing (*S* − *DA*)^{2} + *k*|*A*|).

The covariance of the data, *C*, can be written as *C* = *PVP*^{T}, where *P* is an orthonormal matrix and *V* is a diagonal matrix with non-negative entries. *P* is used as the dictionary, *D*. If μ is the mean of the data after converting to the PCA basis, then \(-\log p(A) = 0.5 * \sum _k (A_k-\mu _k)^2 V_{k, k}^{-1}\). This prior is implemented by setting \(\hat{A}_0=\mu\) and \(\hat{H}_0=V\) (see the definitions of \(\hat{A}\) and \(\hat{H}\) below). For the PCA basis, note that the whitening filter does not fully whiten the data because of the high-frequency cut-off in our whitening filter. The 20% of the principal components that contributed the smallest amount of variance were removed to improve the numerical stability of the inference algorithm (the prior takes the inverse of the variance associated with each of these components, which are very small numbers).

- (1) *q*_{t}(*X*_{t}) is the algorithm’s current estimate of the position of the eye at time *t*. This distribution is estimated as *q*_{t}(*X*_{t}) = ∑_{p} *W*_{t, p} · δ(*X*_{t}, *X*_{t, p}), where *X*_{t, p} is a collection of positions and *W*_{t, p} are the corresponding weights.
- (2) \(\hat{A}_t\) is a vector of size *N*_{l} (the number of latent factors) that represents the algorithm’s estimate of the underlying spatial pattern, represented in a latent space, after looking at spikes in the time interval [0, *t*].
- (3) \(\hat{H}_t\) is a matrix of size *N*_{l} by *N*_{l} that represents the inverse of the covariance associated with the estimate of *A*_{t} after looking at spikes in the time interval [0, *t*].

*q*: *N*_{p} = 20 particles (performance saturates at this *N*_{p}).

*p*(*X*_{t + 1}|*X*_{t}) is used for the proposal distribution (see Equations 48 and 49).

Θ(*x*) = *x* for *x* ≥ 0, and zero otherwise. The minimization is executed using the FISTA algorithm (Beck & Teboulle, 2009) with 320 gradient steps per timestep (often not necessary, but this ensures minimization). The Lipschitz constant is chosen using cross-validation. A neural interpretation emerges when writing out the FISTA equations for this minimization (SI, Section 2).

The incoming spikes, *R*_{t + 1}, are used to calculate the new state \((q, \hat{A},\hat{H})_{t+1}\).

If *N* denotes all the random variables of the graphical model, then the joint distribution is *p*(*N*) = ∏_{i} *p*(*N*_{i}|*N*_{π(i)}), where π(*i*) denotes the parents of node *i* in the graph defining the model. All other quantities of interest are computed by marginalizing the resulting distribution. Our approach is based on using the EM algorithm to approximate argmax_{A} *p*(*A*|*R*) and then expanding the resulting equations using a Gaussian approximation to the data terms. In an ideal Bayesian world, one would compute *p*(*A*|*R*), which assigns a probability to every possible set of latent factors given the observed spikes. Because this is intractable, the argmax is taken:

Marginalizing over *X* (for fixed *A*) involves numerically integrating over all timesteps of the simulation using a particle filter (which may not behave well numerically). Because the expression must be evaluated many times for each value of *A* during optimization, EM is used (Moon, 1996). Following the traditional EM recipe (which comes from introducing a variational distribution *q*), first initialize *A* → *A*^{′} and then alternate between two steps:

Only the marginals of *q*(*X*) at each time point are needed to evaluate this expression.

*q*_{t}(*X*_{t}) is the *X*_{t} marginal of *q*(*X*) (recall *X* = (*X*_{0}, … *X*_{T}), where *T* is the total number of timesteps in the simulation) and the sum over *X*_{t} weighted by *q*_{t}(*X*_{t}) is abbreviated as 〈 · 〉_{t}. According to the EM equations, the marginals of *q*(*X*) are

\(q_t(X_t) = \sum _{X_{-t}} p(X_{0:T}\,|\,R_{0:T}, A^{\prime }),\)

where *X*_{−t} denotes the set \(\lbrace X_{t^{\prime }} | t^{\prime }\ne t\rbrace\). Conditioned on a fixed *A*, the model is a hidden Markov model and the desired marginals are the smoothing estimate of the position. The smoothing estimate is replaced with the filtering estimate in order to get a causal position estimator (i.e., no spikes from the future are used to estimate the current position of the eye). A particle filter is used to estimate these marginals. Next, the optimization for *A* is modified to be causal. First, the summation over all time is broken up to only use the data up to a time *t*.

*q*_{t} is an estimate of *X*_{t} at time *t* and \(\hat{A}_t\) is the current estimate of the latent factors. This computation requires memory that grows linearly with *t*; we seek an online algorithm. The sum of data terms from *t*^{′} ∈ [0, *t*] is therefore modified such that all of the terms in the past are replaced by a Gaussian approximation, obtained by expanding each term to second order about the estimates available at that time point and noting that the leftover polynomial is linear in *A*.

*H*, *B*, and *C* do not depend on *A*. The polynomial expansion would be more accurate if the evidence terms were expanded about the most accurate value of *A* possible, but for the algorithm to be online, the best available estimate of *A* is used.

*C*_{t} does not need to be computed because we are optimizing with respect to *A*.

*B*_{t} could be computed recursively, but it is simpler to note that \(\hat{A}_t\) is a local minimum, so that \(B_t = -H_t \hat{A}_t\). To prevent the estimate of *A* from driving pixels out of bounds in the log-linear firing rate equation, a term is added to the cost function that penalizes the entries of the vector *S* = *DA* for going out of bounds. The quadratic prior is incorporated by initializing *H*_{0} and \(\hat{A}_0\) to be nonzero.

*q*_{t}(*X*_{t}) = *p*(*X*_{t}|*R*_{0: t}, *S* = *DA*), where *A* is a fixed estimate of the latent variables. Given a fixed *A*, the model is a hidden Markov model, whose probabilities can be estimated using sequential importance sampling. Following the tutorial of Doucet and Johansen (2009), suppose that one has a sequence of distributions (*t* = 0, … *T*): \(\pi (X_{0:t}) = \frac{\gamma (X_{0:t})}{Z_t}\), where π is normalized but γ is not. Sequential Monte Carlo (SMC) methods allow one to sample from these distributions iteratively, estimating each distribution using a collection of weighted samples: \(\pi (X_{0:t}) \approx \sum _p W_t^p\delta (X_{0:t}, X^p_{0:t})\). The weights are defined by introducing an auxiliary distribution that is easier to sample from, called an importance density or proposal distribution, *r*_{t}(*X*_{0: t}), chosen by the user. The weights are then calculated as \(w_t(X_{0:t}) = \frac{\gamma _t(X_{0:t})}{r_t(X_{0:t})}\). So that the computation per sampling step does not increase linearly in time, one typically uses a factorized importance density: *r*_{t}(*X*_{0: t}) = *r*_{t − 1}(*X*_{0: t − 1}) *r*_{t}(*X*_{t}|*X*_{t − 1}). In this article, the unnormalized distribution is γ(*X*_{0: t}) = *p*(*X*_{0: t}|*R*_{0: t}, *A*) and the normalizer is *Z*_{t} = *p*(*R*_{0: t}|*A*). The EM algorithm requires the estimation of the *X*_{t} marginal of π(*X*_{0: t}). We use the proposal distribution *r*_{t} = *p*(*X*_{t}|*X*_{t − 1}), which is a Gaussian. Given the SMC framework, there are a number of sampling techniques that can be used to estimate the sequence of distributions; this article uses sequential importance resampling, which achieves good performance and is easy to implement. Each E step is achieved by executing the following steps:

- (1) Sample according to the proposal distribution: \begin{eqnarray} X_{t}^p\sim r_t(X_t|X_{t-1}^p) = p(X_t|X_{t-1}^p) \quad \end{eqnarray}(48)
- (2) Compute the weights \begin{eqnarray} W_t(X_{0:t}^p) &\propto& W_{t-1}(X_{0:t-1}^p)\nonumber\\ &&\times\frac{p(R_t|S=DA, X_t^p) p(X_t^p|X_{t-1}^p)}{r_t(X_t^p|X_{t-1}^p)}\nonumber\\ & =& W_{t-1}(X_{0:t-1}^p)p(R_t|S=DA, X_t^p) \qquad \end{eqnarray}(49)
- (3) Resample if the effective sample size goes below a threshold (e.g., one-half of the number of particles). Resampling is done using systematic resampling, which takes *O*(|*p*|) steps.
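The three steps can be sketched end-to-end on a toy tracking problem: a 1D ring with a known pattern and made-up rates (the paper tracks 2D drift jointly with a pattern estimate). Particles are proposed from the transition prior, reweighted by the Poisson spike likelihood, and resampled when the effective sample size drops:

```python
import numpy as np

rng = np.random.default_rng(5)
n, T, dt = 8, 2000, 0.001                       # ring positions, 1 ms steps
lam0, lam1 = 10.0, 100.0                        # baseline / max rates (Hz)
S = np.array([0., 1., 1., 0., 1., 0., 0., 1.])  # pattern, known to the filter
P = 20                                          # number of particles

def transition(x):
    """Slow random walk on the ring: step +/-1, each with probability 0.02."""
    u = rng.random(np.shape(x))
    return (x + (u < 0.02) - (u > 0.98)) % n

def log_lik(R, x):
    lam = lam0 + (lam1 - lam0) * np.roll(S, x)       # rates if the eye is at x
    return np.sum(R * np.log(lam * dt) - lam * dt)   # Poisson log-likelihood

X_true, Xp, W = 0, np.zeros(P, dtype=int), np.ones(P) / P
correct = 0
for t in range(T):
    X_true = int(transition(X_true))
    R = rng.poisson((lam0 + (lam1 - lam0) * np.roll(S, X_true)) * dt)

    Xp = transition(Xp)                              # (1) propose from the prior
    W = W * np.exp([log_lik(R, x) for x in Xp])      # (2) reweight by the spikes
    W = W / W.sum()
    if 1.0 / np.sum(W ** 2) < P / 2:                 # (3) resample when ESS is low
        Xp = Xp[rng.choice(P, size=P, p=W)]
        W = np.ones(P) / P

    if t >= T // 2:                                  # score tracking, late half only
        map_x = np.bincount(Xp, weights=W, minlength=n).argmax()
        correct += int(map_x == X_true)

accuracy = correct / (T - T // 2)                    # chance level is 1/8
```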

At each time *t*, *W*_{t} is the weight associated with *X*_{0: t}. That is to say, at each step, each particle represents a full trajectory of the eye from 0 to *t*; thus, we get an approximation of π(*X*_{0: t}). Because the proposal distribution only looks at the most recent position, we need only store the current position associated with each particle instead of the entire trajectory.

*j*. This loss function is optimized using AdaDelta (other methods are also possible).

Given the true pattern, *S*, and an estimated pattern, *S*^{′} = *DA*^{′}, the SNR is computed as follows. First, the average difference between the estimated path and the true path is computed. This is used to shift the estimated pattern (call the result *S*^{″}). Then *S* · *S*/[(*S* − *S*^{″}) · (*S* − *S*^{″})] is computed, where each pattern is represented by a sum of Gaussians. For example, if \(\lbrace U,V\rbrace = \sum _i \lbrace U, V\rbrace _i N(x;x^{\lbrace u,v\rbrace }_i, \sigma )\), then \(U\cdot V = \int dx \sum _{i, i^{\prime }}U_iV_{i^{\prime }}N(x; x^u_i, \sigma ) N(x; x^v_{i^{\prime }}, \sigma )=\sum _{i, i^{\prime }}U_iV_{i^{\prime }}N(0;x^u_i-x^v_{i^{\prime }}, \sqrt{2}\sigma )\).
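This Gaussian-sum inner product can be written directly from the identity (a 1D toy with made-up patterns; the paper's patterns are 2D):

```python
import numpy as np

def gauss(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def gdot(U, xu, V, xv, sigma):
    """U . V = sum_{i,i'} U_i V_{i'} N(0; x^u_i - x^v_{i'}, sqrt(2)*sigma)."""
    diff = xu[:, None] - xv[None, :]
    return np.sum(U[:, None] * V[None, :] * gauss(diff, 0.0, np.sqrt(2) * sigma))

xs = np.arange(5.0)                         # pixel centers
S  = np.array([0., 1., 1., 0., 1.])         # true pattern
S2 = S + np.array([0., 0.1, -0.1, 0., 0.])  # slightly corrupted estimate
snr = gdot(S, xs, S, xs, 0.5) / gdot(S - S2, xs, S - S2, xs, 0.5)
```

A small estimation error yields a large SNR; the ratio falls toward 1 as the error pattern approaches the signal in magnitude.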

*s*_{min} = 0 and *s*_{max} = 1.

( · )_{k} denotes extracting the *k*th entry of a vector.

\(g_{j,k}(x)\) is the product (summed over pixels \(i\)) of the connection strength between pixel \(i\) and neuron \(j\), \(T_{i,j}(x)\), and the dictionary; the result is the connection strength between RGC \(j\) and latent factor \(k\). The equation for the derivative of the data term, \(E_t\), has two parts. In the first part, the average position-dependent gain factor is computed by averaging over the internal position estimate. This value modulates the impact of the spike \(R_j\) on latent factor \(k\). The second part is a decay term based on the expected number of incoming spikes given the current estimate of the latent factors. In particular, \(\lambda_{t,j}\,\Delta t\) is the number of spikes that the circuit expects to arrive in the interval \([t, t + \Delta t]\).
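The two-part gradient can be sketched numerically. This is a minimal illustration, not the paper's implementation: the log-linear rate \(\lambda_{t,j} = \exp\big(\sum_k g_{j,k}(x) A_k\big)\), the `g` callback, and the array shapes are assumptions made for the sketch.

```python
import numpy as np

def data_term_grad(A, R, g, positions, weights, dt):
    """Gradient of the Poisson data term E_t w.r.t. the latent factors A,
    averaging over the particle-based position estimate.

    g         : function position -> (J, K) gains g_{j,k}(x)
    R         : (J,) spike counts observed in [t, t + dt]
    positions : (P, 2) particle positions; weights : (P,) their weights
    Assumed log-linear model: lambda_j(x) = exp(sum_k g_{j,k}(x) A_k).
    """
    grad = np.zeros_like(A)
    for w, x in zip(weights, positions):
        G = g(x)                  # (J, K) position-dependent gains
        lam = np.exp(G @ A)       # (J,) expected firing rates
        # Part 1: each spike R_j drives factor k through the gain g_{j,k}.
        # Part 2: decay by the expected spike count lambda_j * dt.
        grad += w * (G.T @ R - dt * (G.T @ lam))
    return grad
```

At a stationary point the observed spike counts balance the expected counts \(\lambda_{t,j}\,\Delta t\), so the gradient vanishes.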

\(A\) is zero because the model is log-linear. As before, the derivative of \(\log \lambda\) is replaced to get an average over particles, with weights \(W_{p,t}\) associated with positions \(X_{t,p}\).

Substituting \(g\) with its definition gives an expression summed over the indices \(i^{\prime}, i\). Recalling the definition, \(T_{i,j}(x)\) is a Gaussian with a constant standard deviation across \(i\) and \(j\) and a mean \(X_j^E-X_i^S\) (\(j\) indexes the position of the cones and \(i\) indexes the position of the pixels of the pattern). This uncertainty is independent of the incoming measurements given the estimated position.

\(R\), and hidden states, \(X\) and \(A\). To map the problem into a hidden Markov model, consider the random variables \(A_t\) for \(t \in \{0, \dots, T\}\), where \(A_{t+1} = A_t\) and \(A_0 = A\). These quantities are related by an observation model \(p(R \mid X_t, A_t)\) and a state transition model \(p(A_t, X_t \mid A_{t-1}, X_{t-1})\). In principle, it is possible to do a change of variables from \(X_t, A_t\) to another set of hidden variables, which would result in a different neural representation (e.g., an unstabilized representation). In particular, suppose that the representation of the stimulus is in retinotopic coordinates. Define \(\bar{A}_t\) to be the latent factors representing the stimulus as it lands on the retina at time \(t\); for example, \(\bar{A}_t = \bar{T}_{X_t}A\), where \(\bar{T}\) is the translation operator that acts on the latent factors. Although this simplifies the observation model, it complicates the state transition model: we would need to update \(\bar{A}_{t+1}\) from \(\bar{A}_t\) and \(X_{t+1} - X_t\). This would require the circuit to know how to compute a translation in an arbitrary direction in the current encoding of the pattern (e.g., if \(T_X\) is the translation operator in pixel space, then the circuit would need to implement \(\bar{T}_X \approx D^{-1} T_X D\), a translation operator in the latent factor space). In experiments with such a model, we found it difficult to model that translation operator. More theoretical work on a translation operator that acts on a sparse code of a pattern could enable such a model.
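The latent-space translation operator \(\bar{T}_X \approx D^{-1} T_X D\) can be sketched directly. This is an illustrative construction, not the paper's circuit: the one-dimensional pixel space, the circular-shift \(T_X\), and the use of the pseudo-inverse in place of \(D^{-1}\) (dictionaries are generally not square) are simplifying assumptions.

```python
import numpy as np

def latent_translation(D, shift):
    """Approximate translation operator in latent-factor space,
    T_bar = pinv(D) @ T_X @ D, where T_X circularly shifts pixel space
    and D is the (pixels x factors) dictionary. The approximation is
    exact only when the shifted image stays in the span of D, which is
    one reason such operators are hard to realize for sparse codes.
    """
    n_pix = D.shape[0]
    # Permutation matrix implementing a circular shift by `shift` pixels.
    T = np.roll(np.eye(n_pix), shift, axis=0)
    return np.linalg.pinv(D) @ T @ D
```

For a complete (invertible) dictionary, translating the latent code and then decoding equals decoding and then shifting the image; for an overcomplete sparse code this identity generally fails, consistent with the difficulty noted above.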

\(\Delta X_t = X_{t+1} - X_t\). Then write out the Bayesian equations and use the conditional independences in the model: \(p(\Delta X_t \mid X_t) = p(\Delta X_t)\) and \(A_{t+1} = \bar{T}_{\Delta X_t} A_t\). Compared with before, the observation model is simpler: \(p(R_{t+1} \mid A_{t+1}, X_{t+1}) = p(R_{t+1} \mid A_{t+1})\), because the location of the object in retinotopic coordinates determines the spikes. However, the hidden state update equation for \(A_t\) is more complex and does not have a simple analytical form.

- (1) Discrete-time diffusion on a lattice (as in Burak et al., 2010): Diffusion happens on a rectangular lattice with lattice spacing \(a\). Each of the four possible steps happens with probability \(D_C \Delta t\). Thus, the ratio is \(\frac{4 D_C \Delta t\, a^2}{\Delta t} = 4 D_C a^2\). For the majority of their paper, \(a = \frac{1}{2}\) is used.
- (2) Continuous-time diffusion in continuous space (as in Kuang et al., 2012): Here, diffusion is modeled using the diffusion equation \(u_t = \frac{1}{2} D_C \nabla^2 u\). In this equation, the variance as a function of time is \(2 D_C \Delta t\), so the ratio is \(2 D_C\).
- (3) Discrete time in continuous space (this article): Position is updated as \(X_{t+1} = X_t + \sqrt{D_C \Delta t / 2}\,\epsilon\), where \(\epsilon\) is drawn from a 2-D standard normal distribution. Thus the expected squared difference is \(D_C \Delta t / 2\) for each component of the eye position, so the total expectation divided by \(\Delta t\) is \(D_C\).
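The convention of item (3) can be checked with a short simulation. This is an illustrative sketch; the sample count and parameter values are arbitrary choices.

```python
import numpy as np

def drift_step(X, D_C, dt, rng):
    """One discrete-time step of the continuous-space drift model (3):
    X_{t+1} = X_t + sqrt(D_C * dt / 2) * eps, eps ~ 2-D standard normal.
    Each component then has expected squared step D_C * dt / 2, so the
    total E[|X_{t+1} - X_t|^2] / dt equals D_C.
    """
    return X + np.sqrt(D_C * dt / 2.0) * rng.standard_normal(2)
```

Averaging the squared step over many draws and dividing by \(\Delta t\) recovers \(D_C\) to within sampling error.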

**Supplementary Movie S1. Reconstruction as a function of time for the EM decoder.** Each frame of the video shows the simulation results after a certain amount of time (same parameters as in Figure 2). (Top left) The stimulus moving relative to the retina over time. (Bottom left) The reconstructed pattern visualized in pixel space. (Middle) Exponential moving average of the spikes from the ON and OFF cells as a function of time. Although the OFF cells fire in the absence of a stimulus, such spiking does not convey additional information; only the spatial variations in the spiking convey useful information for the decoder. The spikes are the input to our decoding algorithm. (Right) Decoded eye position (blue) relative to the true eye position (green) as a function of time. The shaded regions show plus or minus 1 standard deviation of the estimate. Although the position is quickly known to relatively high certainty, the pattern needs a longer integration time before becoming sharp. This is because the edges of the stimulus are relatively sharp and many measurements (i.e., spikes) contribute to the estimate of position (a 2-D quantity), whereas the stimulus is a higher-dimensional quantity that must be inferred from the same number of observations.