**Human bias towards more recent events is a common and well-studied phenomenon. Recent studies in visual perception have shown that this recency bias persists even when past events contain no information about the future. Reasons for this suboptimal behavior are not well understood and the internal model that leads people to exhibit recency bias is unknown. Here we use a well-known orientation estimation task to frame the human recency bias in terms of incremental Bayesian inference. We show that the only Bayesian model capable of explaining the recency bias relies on a weighted mixture of past states. Furthermore, we suggest that this mixture model is a consequence of participants' failure to infer a model for data in visual short-term memory, and reflects the nature of the internal representations used in the task.**

The *identity model* is the simplest Bayesian incremental updating model (a Bayesian filter) that can plausibly represent the orientation estimation task. Bayesian filters (such as the Kalman filter; Kalman & Bucy, 1961) are widely used in explaining human behavior and have previously been proposed to explain temporal continuity effects in perception (Burr & Cicchini, 2014; Rao, 1999; Wolpert & Ghahramani, 2000).

*p*(*z*_{t}) at the bottom row is derived from the previously estimated posterior distribution.

The *natural prior* model assumes that participants modulate the identity prediction by taking into account the natural statistics of orientations in the environment. The predicted state *z*_{t} is equal to a mixture of the previous stimulus and a static natural prior. See Natural prior model in Methods for details.

The *mixture model* assumes that the participant's model of the environment is a mixture of multiple past states, so that more recent states contribute more than older ones. The two previous hypotheses both assume that the model of the environment *p*(*z*) is inferred based only on the previous latent state. By contrast, the human recency bias clearly extends beyond the previous state: it is greatest for the most recent state and decays for each state further into the past (Figure 1C). To model such a time-decaying recency bias over several past states, we modify the VMF so that its prior distribution reflects a time-decaying mixture of information from multiple previous trials. Figure 4 illustrates the evolution of the latent state *p*(*z*) in a mixture model over four trials. Importantly, such a mixture distribution is computed by a fixed sampling step (Kalm, 2017), which results in a computationally first-order Markovian model with the same number of parameters and model complexity as the natural prior model described above. See Mixture model in Methods for details.

*κ*_{Q} and *κ*_{L} are the latent state and measurement noise concentration terms, respectively. An example of a simple VMF is depicted in Figure 2B, where both the state transition and measurement models are identity functions and the state noise is zero, resulting in a model where the predicted state *z*_{t} is equal to the previously estimated posterior distribution.

We define the estimation error on trial *t* as the distance by which the posterior mean is shifted from the stimulus value *y*_{t} towards some previous stimulus value *y*_{t–n}. Such an estimation error represents the systematic shift in participants' responses, since the internal estimate of the perceived orientation (Equation 3) is not centered around the presented stimulus *y*_{t} (Figure 2). The value of the estimation error, as a distance between the posterior mean and the stimulus value, can be derived from the properties of the Von Mises product: the error always peaks in the second half of the *x*-axis (Figure 5; for a proof, see Von Mises filter properties in Supporting information). This property means that the VMF cannot, even in theory, yield a derivative-of-Gaussian (DoG)-like recency bias shape as observed with human participants (Figure 1B).

The measurement noise concentration parameter (*κ*_{L}) was chosen so as to produce just noticeable difference (JND) values matching the human data from Fischer and Whitney (2014) (average JND was 5.39°, hence *σ* = 3.8113° and *κ*_{L} = 0.0688). The concentration parameter for the state noise, *κ*_{Q}, was a free parameter optimized to minimize the distance between the simulated data and the average observed subjects' response (see Model fitting and parameter optimization in Supporting information).

In the natural prior model (NPM), the prior distribution *p*(*z*_{t}|*z*_{t–1}) on trial *t* is changed from a unimodal Von Mises to a bimodal nonparametric distribution. For this purpose we model the average observers' prior, as reported by Girshick et al. (2011), as a mixture of two Von Mises distributions with components peaking at the cardinal orientations (solid blue line in Figure 3C).

We used the same measurement noise concentration (*κ*_{L}) as in the VMF described above. Here we used an additional free parameter: the proportion of the previous posterior (*β*_{1}) in the prior distribution (Equation 6). Importantly, when *β*_{1} > 0.5 (Figure 6A) the previous posterior dominates the resulting prior, and hence the model starts to approximate the VMF. Similarly, as *β*_{1} approaches zero (Figure 6B), the natural prior component dominates the prior distribution. As in our previous simulations, we chose the free parameter values (*κ*_{Q}, *β*_{1}) to minimize the distance between the simulation and the behavioral results observed by Fischer and Whitney (2014). See Model fitting and parameter optimization in Supporting information for details.

The filters were implemented using the *libDirectional* toolbox for MATLAB (MathWorks, Natick, MA; Kurz, Gilitschenski, Pfaff, & Drude, 2015). Because in the one-dimensional circular state space of orientations the quality of the approximation depends only on the number of components, 10,000 Dirac components are sufficient to adequately approximate a distribution over a circular variable. For details on the implementation of the filter algorithms see Discrete circular filter with Dirac components in Supporting information.
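The Dirac-component approximation is straightforward to sketch outside of MATLAB. The Python fragment below is our own minimal analogue, not the *libDirectional* implementation: it approximates a circular density with equally spaced, weighted Dirac components and recovers its circular mean.

```python
import numpy as np

def von_mises_pdf(x, mu, kappa):
    # Von Mises density on the circle; np.i0 is the modified Bessel function I0
    return np.exp(kappa * np.cos(x - mu)) / (2 * np.pi * np.i0(kappa))

def dirac_approximation(pdf, n_components=10_000):
    # Equally spaced Dirac positions with weights proportional to the density
    positions = np.linspace(-np.pi, np.pi, n_components, endpoint=False)
    weights = pdf(positions)
    return positions, weights / weights.sum()

def circular_mean(positions, weights):
    # First circular moment of the weighted Dirac mixture
    return np.angle(np.sum(weights * np.exp(1j * positions)))
```

With 10,000 components the circular mean of a Von Mises density is recovered far more precisely than the behavioral effects under study require.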

The state transition function (*a*(·), Equation 1) predicts the next state based on a recency-weighted mixture of *m* past states, where *θ*_{m} is a mixing coefficient for the *m*-th past state. We can control the individual contribution of a past state *z*_{m} to the resulting mixture distribution by defining how the mixing coefficient *θ* decays over the past states: *β* is the rate of decrease of the mixing coefficient over the past *m* states and *α* is a normalizing constant. As a result, we have a decaying time window into the past *m* states, defined by the rate parameter *β*. The role of the *β* parameter is to control the decrease of the mixing coefficient *θ* over past states. Figure 7A illustrates the relationship between the *β* and *θ* parameters: the bigger the *β*, the faster the contribution of past states decreases, and the greater the proportion of the most recent states in the mixture distribution (Equation 7). As *β* approaches 1, the mixture begins to resemble *z*_{t–1} and to approximate the first-order VMF described above. As *β* approaches zero, all past states contribute equally to the mixture. Intuitively, *β* can be interpreted as the bias towards more recent states. Figure 7B illustrates the evolution of the latent state distribution *p*(*z*) when *β* = 0.5 and the mixing coefficient decays over the previous states.
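The role of *β* can be illustrated numerically. A minimal sketch, assuming a geometric decay *θ*_{m} ∝ (1 − *β*)^(m−1) with *α* as the normalization (our reading of the decay described above, which reproduces both limiting cases):

```python
import numpy as np

def mixing_coefficients(beta, m):
    # theta_m proportional to (1 - beta)^(m - 1), normalized to sum to one;
    # beta -> 1 puts all weight on the previous state, while beta -> 0
    # weights all m past states equally
    theta = (1.0 - beta) ** np.arange(m)
    return theta / theta.sum()
```

For *β* = 0.75 the first three to four coefficients carry almost all of the weight, matching the time window reported for the best-fitting mixture model below.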

The state transition function *a*(·) is a mixture function over the past *m* states (Equation 7). The proportion of the individual past states in the mixture, and therefore the effective extent of the window into the past, is controlled by the *β* parameter. As in the previous models, the measurement model is identity, and both the state and measurement noise (*κ*_{Q} and *κ*_{L}) are additive Von Mises.

We used the same measurement noise concentration (*κ*_{L}) as in the VMF and NPM simulations described above. The free parameters in the mixture model (MM) were the mixing coefficient hyperparameter *β* (Equation 8) and the state noise (*κ*_{Q}). As with previous simulations, the free parameters were chosen to minimize the distance between the simulated data and the average observed subjects' response (see Model fitting and parameter optimization in Supporting information).

_{Q}*k*by taking a random sample from the posterior distribution:

_{t}- Distribution of errors—we fit a Von Mises distribution to the simulated errors yielding mean (
*μ*) and concentration values (*κ*). We then calculated the similarity between our simulated error distribution and participants' average distribution by assessing the probability of simulated_{E}*μ*and*κ*given the distribution of participants bootstrapped_{E}Display Formula \(\bar \mu \) andDisplay Formula \(\overline {\kappa _E} \). - DoG recency bias curve—we fit the simulated errors with a first DoG curve (see Recency bias amplitude as measured by fitting the derivative of Gaussian in Supporting information for details), and as above, calculated the probability of the curve parameters arising from the distribution of participants' bootstrapped parameter distributions.
- DoG recency bias over past three trials—we calculated the amplitude of the DoG curve peak for stimuli presented one, two, and three trials back. We sought to replicate a positive but decaying recency bias over three previous stimuli.
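The Von Mises fit in the first criterion can be sketched with a standard moment-based estimator. The concentration approximation below is a textbook formula from the circular-statistics literature (Fisher's piecewise approximation to the maximum-likelihood estimate), not a procedure specified in the paper:

```python
import numpy as np

def fit_von_mises(errors):
    # Circular mean from the first trigonometric moment
    z = np.mean(np.exp(1j * np.asarray(errors)))
    mu = np.angle(z)
    r = np.abs(z)                      # mean resultant length
    # Piecewise approximation to the maximum-likelihood concentration
    if r < 0.53:
        kappa = 2 * r + r**3 + 5 * r**5 / 6
    elif r < 0.85:
        kappa = -0.4 + 1.39 * r + 0.43 / (1 - r)
    else:
        kappa = 1 / (r**3 - 4 * r**2 + 3 * r)
    return mu, kappa
```

Applied to synthetic Von Mises errors, the estimator recovers both the mean and the concentration with small bias.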

*p* = 0.026). Similarly, the maximum of the simulated recency bias was significantly displaced relative to the human data (ca. 20° for humans; ca. 45° for the VMF; Figure 8B, VMF). Furthermore, it can be shown that the VMF cannot, even in theory, have a maximum of the bias at less than *π*/4 radians (45°), which means it is incapable of replicating the DoG-like curve of the human recency bias (see Von Mises filter properties in Supplementary Figure S1). The VMF also could not replicate recency biases for stimuli presented two or three trials ago (Figure 8C, VMF). In sum, the VMF can simulate a recency bias, but one that is qualitatively different from the human bias and extends only one trial back.

*p* < 0.001). However, because the NPM's prior and posterior distributions are not Von Mises, it was able to capture the DoG-shaped curve of the recency bias (Figure 8B, NPM). The NPM was still not able to capture the amplitude of the recency bias or to extend it back more than one previous trial (Figure 8C, NPM). This was to be expected, since the NPM, like the VMF, predicts the next state based only on the previous one (first-order Markovian). In sum, the NPM was able to replicate the DoG-shaped recency bias curve, but only for stimuli one trial back. Furthermore, the error distributions simulated by the NPM were significantly different from the participants' average, with the variance of the response reduced approximately twofold.

*p* = 0.23); the recency bias followed the DoG-shaped curve (Figure 8B, MM); and a significant recency bias was evident over multiple past states (Figure 8C, MM). Importantly, the best-fitting *β* parameter value for the mixture model was *β* = 0.75, which effectively sets the time window for the mixture distribution at three to four past states (see Figure 7A, column 2, *β* = 0.7).

Both interpretations apply the state transition function *a*(·) to the past *m* states, where *q* is the state noise. According to the model mismatch explanation, the state transition function *a*(·) is the recency-weighted mixture function (Equation 7). Importantly, this assumes that all data from the past *m* states is potentially available for the state transition function *a*(·) to generate a prediction. Participants' suboptimal behavior is hence caused by applying the mixture model to data representing the past *m* states. However, exactly the same prediction would be generated if the state transition function *a*(·) performed no transform at all (i.e., were identity), but the data from the past *m* states were itself a recency-weighted mixture. The latter is a more realistic interpretation, since the former hypothesis assumes unlimited storage for past experiences. Similarly, the latter interpretation does not require any model selection at all (out of a possibly infinite set of models) and is hence the more parsimonious view.

We can formalize the estimation of the latent orientation (*z*_{t}) given the orientation of the presented Gabor (*y*_{t}) on trial *t* as Bayesian inference, where the posterior distribution *p*(*z*_{t}|*y*_{t}) represents a participant's estimate of the orientation on trial *t*, and their response can be thought of as a sample from the posterior distribution. Since orientation is a circular variable, we need to use a probability distribution wrapped on a circle to represent the variables of interest (here we use a Von Mises distribution; Figure 2A). We model the sensory evidence, or the likelihood distribution, as Von Mises noise around the value of the stimulus *y*_{t} on each trial. We use the same measurement noise concentration parameter *κ*_{L} for every trial and derive it from the participants' average JND, as measured in Experiment 3 of Fischer and Whitney (2014). See Measurement noise concentration parameter in Supporting information for details.

_{L}*t*observation

*y*corresponds to a latent variable

_{t}*z*(internal representation of orientation), which over time forms a Markov chain, giving rise to the latent state space model. Hence, the state of the latent variable

_{t}*z*is inferred at every time step

_{t}*t*using the Bayes theorem based on the previous state

*z*

_{t}_{–1}and current observation (sensory information)

*y*:

_{t}*z*is predicted by the state transition model and the likelihood of the observations

*y*are given by the measurement model:

*a*and

*h*) are arbitrary but known functions and

*q*and

_{t}*r*are additive noise, we can carry out Bayesian inference at every time step by first predicting the next state of the latent variable given the past state (Equation 9) and then updating this prediction when observable data is measured (Equation 10) to give us a Bayesian posterior distribution of the latent variable (see Bishop, 2006; Sarkka, 2013, for a detailed description of state space models).

_{t}*Bayesian filtering*. If the evolution of the state-space is linear and the noise is Gaussian, then the optimal probabilistic state space model is the Kalman filter (Kalman & Bucy, 1961). Kalman filters have been extensively used for problems where the future state of the environment can be predicted from just the previous few states with sufficient accuracy, such as in object tracking or phoneme recognition. An in-depth mathematical explanation of the Kalman filter, and how it is derived from the Bayesian estimation problem, can be found in any good book covering digital signal processing—for example, in Sarkka (2013). However, Gaussian distributions are not appropriate when the latent state space is wrapped on a circle. A circular analogue of the Kalman filter is the VMF.

We can relate the just noticeable difference (JND, *σ*_{P}) to the measurement concentration parameter *κ*_{L} as the JND relates to the standard deviation (*σ*) of the normal distribution: \( \sigma = \sigma_P/\sqrt{2} \) and \( \kappa_L = 1/\sigma^2 \). The average JND (*σ*_{P}) across subjects was 5.39° (Fischer & Whitney, 2014), so we used *σ* = 3.8113° and *κ*_{L} = 0.0688.
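The reported values are consistent with taking *σ* as the JND divided by √2 and *κ*_{L} as 1/*σ*² (with *σ* in degrees); a quick check:

```python
import math

jnd = 5.39                     # average JND in degrees (Fischer & Whitney, 2014)
sigma = jnd / math.sqrt(2)     # sd of the measurement noise implied by the JND
kappa_l = 1.0 / sigma ** 2     # concentration parameter of the measurement noise
```

This reproduces *σ* = 3.8113° and *κ*_{L} = 0.0688 to the reported precision.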

The product of two Von Mises distributions is proportional to a Von Mises distribution whose mean *μ* and concentration *κ* are, respectively:

\( \mu = \operatorname{atan2}(\kappa_1 \sin \mu_1 + \kappa_2 \sin \mu_2,\; \kappa_1 \cos \mu_1 + \kappa_2 \cos \mu_2) \)

\( \kappa = \sqrt{\kappa_1^2 + \kappa_2^2 + 2 \kappa_1 \kappa_2 \cos(\mu_1 - \mu_2)} \)
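These standard product formulas can be checked numerically; a direct Python transcription (function and variable names are ours):

```python
import numpy as np

def von_mises_product(mu1, kappa1, mu2, kappa2):
    # Mean and concentration of the renormalized product of two Von Mises
    # densities - the posterior of a Von Mises filter update step
    s = kappa1 * np.sin(mu1) + kappa2 * np.sin(mu2)
    c = kappa1 * np.cos(mu1) + kappa2 * np.cos(mu2)
    mu = np.arctan2(s, c)
    kappa = np.sqrt(kappa1**2 + kappa2**2
                    + 2 * kappa1 * kappa2 * np.cos(mu1 - mu2))
    return mu, kappa
```

Two sanity checks: coincident means simply add the concentrations, and orthogonal, equally concentrated components meet halfway.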

The estimation error on trial *t* is given as the distance by which the posterior mean is shifted from the stimulus value *y*_{t} towards some previous stimulus value *y*_{t–n}. Since the posterior in the VMF is a product of two individual Von Mises distributions (likelihood and prior), this distance can be computed analytically. Since *y*_{t} is the mean of the likelihood, and hence *μ*_{1} of the first Von Mises component, we can fix the prior mean (*μ*_{2} of the second Von Mises component) to zero.

The magnitude of the shift depends on (a) the distance between the prior and the stimulus (*x*-axis in Figure 5) and (b) the ratio of the likelihood to prior concentration (*y*-axis). One might expect the shift of the posterior mean, and hence the recency bias, to be greatest when the distance between the prior and the stimulus is at its maximum (*x*-axis maxima and minima in Figure 5B). However, as shown in Figure 5B for various values of the concentration ratio, when that distance reaches ±*π*/2 (on the *x*-axis in Figure 5B) the influence of the prior is zero and the posterior mean is equal to the stimulus. Instead, the extrema of the shift always lie between *π*/4 and *π*/2 (see Minima and maxima of the recency bias in Von Mises filter below for details). In other words, when Von Mises distributions are used for Bayesian inference, the recency bias always peaks more than halfway through the *x*-axis. By contrast, DoG curves fitted to participant data in Fischer and Whitney (2014) peak close to zero, within ±*π*/4 (Figure 1B). In sum, a DoG-shaped recency bias is not even theoretically possible if participants use Bayesian inference with Von Mises distributions for orientation estimation.
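The location of the peak can be checked numerically. The sketch below fixes the prior at zero, scans the likelihood mean over half of the full circle, and locates the maximal shift; for a prior-to-likelihood concentration ratio of 0.2 the peak falls at arccos(−0.2) ≈ 1.77 radians, past *π*/2 (past *π*/4 once the space is wrapped onto the ±*π*/2 orientation range):

```python
import numpy as np

def posterior_shift(delta, kappa_like, kappa_prior):
    # Shift of the posterior mean back towards a prior at 0 when the
    # likelihood (stimulus) is centred at angle delta
    mu = np.arctan2(kappa_like * np.sin(delta),
                    kappa_like * np.cos(delta) + kappa_prior)
    return delta - mu

deltas = np.linspace(0.0, np.pi, 10_001)
bias = posterior_shift(deltas, kappa_like=5.0, kappa_prior=1.0)
peak = deltas[np.argmax(bias)]    # location of the maximal shift
```

Setting the derivative of the shift to zero gives the peak at cos(Δ) = −κ_prior/κ_like, which is always beyond the halfway point for any positive concentration ratio.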

The Von Mises distribution is defined on the full circle [–*π*, *π*], while in the orientation estimation task stimulus values were wrapped on the top half of the circle [–*π*/2, *π*/2] (Fischer & Whitney, 2014). Equation 22 wrapped between ±*π*/2 gives ±*π*/4 as the new limits to the extrema.

*x* is the relative orientation of the previous trial, *α* is the amplitude of the curve peaks, *w* scales the curve width, and *c* is the constant \( \sqrt{2}/e^{-0.5} \). The constant ensures that the *α* parameter numerically matches the height of the positive peak of the curve, for ease of interpretation: the amplitude of serial dependence (*α*) is the number of radians that perceived orientation was pulled in the direction of the previous stimulus.
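Assuming the usual form of this parameterization, y = *xαwc*·e^(−(*wx*)²), the curve can be transcribed directly; the constant *c* = √2/e^(−0.5) is what makes the fitted amplitude equal the peak height:

```python
import numpy as np

C = np.sqrt(2.0) / np.exp(-0.5)      # scaling constant sqrt(2)/e^(-0.5)

def dog(x, alpha, w):
    # First derivative of a Gaussian: alpha equals the height of the
    # positive peak, w scales the width
    return x * alpha * w * C * np.exp(-(w * x) ** 2)
```

The positive peak sits at x = 1/(w√2) with height exactly *α*, which is why the amplitude is directly interpretable.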

We searched over *κ*_{Q} values from 48 to 480 (logarithmic step size) and *β* values from 0.2 to 0.9 (step size 0.05). We used the parameter values that yielded the lowest SSE with respect to the average participants' DoG curve (Supplementary Figure S1).

*π* radians, respectively.

We used MATLAB's *fminsearch* with 10^{8} iterations and an ordinary least squares cost function to estimate the values of the mixing coefficients (*β*_{1} = 0.0215, *β*_{2} = 0.0178) and Von Mises concentrations (*κ*_{1} = 0.8365, *κ*_{2} = 0.8728). The resulting mixture and its fit to the data from Girshick et al. (2011) are depicted in Figure 3C.
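Using those fitted values, the bimodal prior can be sketched as follows. We place the two components at 0 and *π* in the full-circle space and renormalize the two mixing coefficients to sum to one; both placements and the renormalization are simplifying assumptions on our part, since the full mixture equation is not reproduced here:

```python
import numpy as np

def von_mises_pdf(x, mu, kappa):
    return np.exp(kappa * np.cos(x - mu)) / (2 * np.pi * np.i0(kappa))

def natural_prior(x, beta=(0.0215, 0.0178), kappa=(0.8365, 0.8728)):
    # Two Von Mises components peaking at the cardinal orientations,
    # here mapped to 0 and pi; weights renormalized (our assumption)
    w = np.asarray(beta) / np.sum(beta)
    return (w[0] * von_mises_pdf(x, 0.0, kappa[0])
            + w[1] * von_mises_pdf(x, np.pi, kappa[1]))
```

The resulting density integrates to one and is higher at the cardinal peaks than at the obliques.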

The distribution is represented with *L* components and their Dirac positions, where *ω*_{j} are the weighting coefficients. From the Dirac mixture we recover the mean *μ* as the circular mean of the components and the corresponding concentration *κ*.

The latent variable *z* can be modeled as a mixture of its past states by using a mixture state transition function (see Berchtold & Raftery, 2002; Raftery, 1985, for details). This method considers the effect of each of the past *m* states separately. Specifically, the conditional probability distribution of the latent state is a weighted combination of the conditional distributions given each of the past *m* states, where *θ* is a mixing coefficient whose value declines over the *m* time steps as given by some decay function *ϕ* and the rate-of-decay parameter *β*. Here we use an exponentially decaying *ϕ*, so that:

\( \theta_m = \alpha (1 - \beta)^{m-1} \)

where *β* is the rate of decrease (0 ≤ *β* < 1) and *α* is a normalizing constant. Substituting *θ*_{m} into Equation 26 gives the recency-weighted mixture transition distribution.

We approximate the distribution over the latent state *z* with a fixed number of samples *L*. As a result, the proportion of samples assigned to a particular component of the mixture distribution (representing a past state) is determined by the mixing coefficient *θ*: *L* is the total number of samples, *θL* is rounded to the nearest integer, and *l* is the number of samples drawn from the *m*-th component of the mixture at any time step. If at every time step *t* we take a fixed proportion *β* from the existing mixture distribution and reassign those samples to represent the most recent component, then after *m* steps we end up with the same proportion of components as given by Equation 28. It follows that a mixture distribution of the past *m* states, with exponentially decaying proportions of the *m* components, can be approximated by a fixed sampling step at every time step *t*.
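The reassignment argument can be verified directly: repeatedly handing a fixed proportion *β* of the sample set to the newest state leaves the older components with geometrically decaying shares.

```python
import numpy as np

def component_proportions(beta, steps):
    # proportions[m] = share of samples representing the state m steps back
    # after repeatedly reassigning a fixed proportion beta to the newest state
    proportions = []
    for _ in range(steps):
        proportions = [beta] + [(1.0 - beta) * p for p in proportions]
    return np.array(proportions)
```

After any number of steps the shares follow *β*(1 − *β*)^(m−1), i.e., exponentially decaying mixing coefficients, without any explicit storage of the past states.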

*Psychological Review*, 96 (4), 703.

*Psychological Science*, 2 (6), 396–408.

*Advances in Applied Mathematics*, 12 (4), 428–454, https://doi.org/10.1016/0196-8858(91)90029-I.

*Statistical Science*, 17 (3), 328–356, https://doi.org/10.1214/ss/1042727943.

*Pattern recognition and machine learning* (pp. 605–652). Cambridge, UK: Cambridge University Press.

*Current Biology*, 24 (22), R1096–R1098, https://doi.org/10.1016/j.cub.2014.10.002.

*Proceedings of the National Academy of Sciences, USA*, 111 (21), 7867–7872, https://doi.org/10.1073/pnas.1402785111.

*Vision Research*, 96, 8–16, https://doi.org/10.1016/j.visres.2013.12.016.

*Nature Neuroscience*, 17 (5), 738–743, https://doi.org/10.1038/nn.3689.

*Trends in Cognitive Sciences*, 14 (3), 119–130, https://doi.org/10.1016/j.tics.2010.01.003.

*Current Biology*, 27 (4), 590–595.

*Nature*, 521 (7553), 452–459, https://doi.org/10.1038/nature14541.

*Nature Neuroscience*, 14 (7), 926–932, https://doi.org/10.1038/nn.2831.

*Trends in Cognitive Sciences*, 14 (8), 357–364, https://doi.org/10.1016/j.tics.2010.05.004. [PubMed]

*Journal of Vision*, 15 (4): 5, 1–12, https://doi.org/10.1167/15.4.5. [PubMed] [Article]

*arXiv*. Retrieved from http://arxiv.org/abs/1711.03038.

*Journal of Basic Engineering, Transactions of the ASME, Series D*, 83 (3), 95–108, https://doi.org/10.1115/1.3658902.

*Trends in Cognitive Sciences*, 21 (7), 493–497, https://doi.org/10.1016/j.tics.2017.04.011.

*American Control Conference (ACC)*, 5439–5445, https://doi.org/10.1109/ACC.2013.6580688.

*IEEE Aerospace and Electronic Systems Magazine*, 31 (3), 70–87, https://doi.org/10.1109/MAES.2016.150083.

*libDirectional*. Retrieved from https://github.com/libDirectional

*Current Biology*, 24 (21), 2569–2574, https://doi.org/10.1016/j.cub.2014.09.025.

*Scientific Reports*, 7 (1), 1971, https://doi.org/10.1038/s41598-017-02201-5.

*Proceedings Book of 40th International Symposium on Robotics*, (2), 283–288, https://doi.org/10.1016/j.robot.2010.08.001.

*PLOS Computational Biology*, 11 (1), e1004003, https://doi.org/10.1371/journal.pcbi.1004003.

*Journal of Vision*, 10 (11): 15, 1–11, https://doi.org/10.1167/10.11.15. [PubMed] [Article]

*Nature Neuroscience*, 16 (9), 1170–1178, https://doi.org/10.1038/nn.3495.

*Journal of the Royal Statistical Society, Series B (Methodological)*, 47 (3), 528–539.

*Vision Research*, 39 (11), 1963–1989, https://doi.org/10.1016/S0042-6989(98)00279-X. [PubMed]

*Bayesian filtering and smoothing*. Cambridge, UK: Cambridge University Press.

*Nature Neuroscience*, 3 (Suppl.), 1212–1217, https://doi.org/10.1038/81497. [PubMed]