Our expectation of an event such as a visual stimulus clearly depends on previous experience, but how the brain computes this expectation is currently not fully understood. Because expectation influences the time to respond to a stimulus, we arranged for the probability of a visual target to change suddenly and found that the time taken to make an eye movement to it then changed continuously, eventually stabilizing at a level reflecting the new probability. The time course of this change can be modeled by making a simple assumption: that the brain discounts old information about the probability of an event by a factor *λ*, relative to new information. The value of *λ* presumably represents a compromise between responding rapidly to genuine changes in the environment and not prematurely discarding information still of value. The model we propose may be implemented by a very simple neural circuit composed of only a few neurons.

^{−2} diffuse yellow light-emitting diodes (LEDs), subtending 0.26 × 0.53 deg (H × V), optically superimposed via a beam splitter onto a yellow background of 4 cd m^{−2}. Targets were spaced 6 deg to either side of a central fixation LED. The targets and the background were housed within a darkened hood to minimize visual distractions, and the environmental light level was controlled to minimize possible adaptational effects.

*promptness* data (i.e., the reciprocal of latency in seconds), as promptness values are normally distributed whereas raw latencies are not (Carpenter, 1981); as such, our average data represent harmonic means for latency. We excluded from our analysis extremely short (<60 ms) or long (>790 ms) latencies.
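This transformation is easy to sketch; the latency values and variable names below are illustrative, not taken from our data:

```python
import statistics

# Hypothetical saccadic latencies in seconds (illustrative values only).
latencies = [0.150, 0.180, 0.210, 0.165, 0.900, 0.050]

# Exclude extremely short (<60 ms) or long (>790 ms) latencies, as in the text.
kept = [t for t in latencies if 0.060 <= t <= 0.790]

# Promptness is reciprocal latency; promptness, not raw latency, is
# approximately normally distributed, so averaging is done on promptness.
promptness = [1.0 / t for t in kept]
mean_promptness = statistics.mean(promptness)

# Inverting the mean promptness yields the harmonic mean of the latencies.
harmonic_mean_latency = 1.0 / mean_promptness
```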

*p* = .1, only approximately 1 in 10 runs contribute to the averaged data for a given datum point). However, the effect of this increased noise is partly offset by the increased magnitude of the shift in latency that results from the logarithmic relationship between appearance probability and latency (Figure 1).

*p* = .5) and shortly after a probability shift (*p* = .1 or .9), with the results shown in Figure 3. Although a swivel is only modestly favored (log_{10} likelihood ratios [LR] = +1.4 and +0.1) compared with a simple lateral shift in the data sets, for both observers, the distributions are highly compatible with a swivel of the curves (Kolmogorov–Smirnov [KS], *p* = .85 and .94 for Observers A and B, respectively) and so provide no grounds for rejecting the LATER model. As the data set is derived from a cross section of many runs measured over many weeks, it is likely that they contain a component of variability primarily attributable to interrun variation, which will be unrelated to the swivel effect, and so act to dilute it. Although the data at the extremes of the reciprobit plots in Figure 3 appear to deviate from the nominal curve in some instances, as seen in previously reported recinormal plots (Carpenter & Williams, 1995; Leach & Carpenter, 2001; Reddi & Carpenter, 2000), these tails constitute only a small fraction (typically <5%) of the data. The KS test is relatively insensitive to such deviations in the extreme tails of a cumulative distribution (Press, Teukolsky, Vetterling, & Flannery, 1992).

If *p*_{n} is the true probability of a target's appearance in a particular location, say on the left, in trial *n*, then *L*_{n}(*x*), the logarithm of the likelihood function estimating *p*_{n} = *x*, is modified by an observation *E* to form an updated (posterior) estimate of log likelihood, *L*_{n + 1}(*x*). Then:

*L*_{n + 1}(*x*) = *L*_{n}(*x*) + *S*(*x*) + *κ*,   (Equation 1)

where *S*(*x*) = log(prob(*E* | *p*_{n} = *x*)) and *κ* is the arbitrary constant necessarily entailed in measures of likelihood (Edwards, 1972).

*p*, which we shall call *q*, is the most likely value of *x*, for which *L*(*x*) is a maximum. For a task where a target appears either on the left or the right, this maximum likelihood estimate is simply equal to the proportion of trials that have appeared on a given side, with equal weight given to all observations, whether old or recent (see Appendix). Unfortunately, this simple Bayesian model has the undesirable characteristic of becoming progressively insensitive to changes in the underlying probability *p* as the total number of target appearances increases. As the past history gets longer, a given trial will have an increasingly negligible effect on expectation because it is outweighed by observations that were made some time ago. Although such a characteristic is desirable when attempting to predict a fixed, underlying probability (Poggio, Rifkin, Mukherjee, & Niyogi, 2004), when a probability is subject to change, as occurs in most real-world environments, it makes the system very slow to respond. Given the results in Figure 2, we can also rule out even the case where the Bayesian history commences from the first trial in a run: the predicted change in latency is still too slow to explain the experimental data (Figure 2, dashed lines).
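The sluggishness of this equal-weighting estimate can be seen in a small simulation; the parameters below are invented for illustration, not our experimental values:

```python
import random

random.seed(1)

# Underlying appearance probability steps from .5 to .9 after 500 trials.
p_true = [0.5] * 500 + [0.9] * 100

lefts = 0
estimates = []
for n, p in enumerate(p_true, start=1):
    lefts += random.random() < p   # Z_n = 1 if the target appears on the left
    estimates.append(lefts / n)    # maximum likelihood estimate q_n = m_n / n

# 100 trials after the step, the estimate is still far below .9, because the
# 500 older observations outweigh the newer ones.
```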

*λ*, so that its weighting is (1 − *λ*). We call *λ* the Lethean factor, after the river Lethe, whose waters induced forgetfulness when imbibed (Griffin, 1998). This Lethean factor determines how quickly a system adapts to change, with a small *λ* corresponding to a conservative system that is not easily swayed by new information. Then:

*L*_{n + 1}(*x*) = (1 − *λ*) *L*_{n}(*x*) + *S*(*x*) + *κ*.   (Equation 2)

*q*, the most likely value of *x*, reflects the current expectation of the target probability *p* and should be logarithmically related to the median saccadic latency (Carpenter & Williams, 1995). Our model suggests that the effects of stimulus history should decay exponentially the further in the past a trial occurred. While it would be possible to use other weighting schemes, the exponential form allows stimulus history effects to be calculated incrementally (Sutton, 1988). We shall see later that this is quite useful for reducing the computational complexity of Equation 2.

*R*^{2} = .91), which was indistinguishable from that obtained in Figure 1. Results for Observer B were more variable, with the slope being −37.9 ms/log unit (95% CI = −41.1 to −34.7; *R*^{2} = 1.0) prior to the main experiment and −15.9 ms/log unit (95% CI = −34.4 to 2.60; *R*^{2} = .71) afterward: The slope (−26.4) obtained in Figure 1 is very close to the midpoint (−26.9) between these two pre- and postexperiment values.

*λ* using the *F* ratio between the best fit of the data and one with a specified *λ*. To confirm that an exponential change in promptness improved the fit, we used a corrected Akaike's Information Criterion (AICc) to compare our model with a linear change between the initial promptness (average of Trials −19 through 0) and the asymptotic promptness after the probability change (average of Trials 71 through 80). In all but one case, the log_{10} LR values (Table 1) are positive, indicating that our exponential model is preferred, with many comparisons showing strong (log_{10} LR > 2) support.
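As a sketch of this comparison, an AICc difference between two models can be converted to a log_{10} LR; the function name and AICc values below are hypothetical, for illustration only:

```python
import math

def log10_lr_from_aicc(aicc_alternative, aicc_preferred):
    # The relative likelihood of two models is approximately
    # exp((AICc_alternative - AICc_preferred) / 2); take log10 of that.
    return (aicc_alternative - aicc_preferred) / (2 * math.log(10))

# Hypothetical AICc values for the linear and exponential models.
lr = log10_lr_from_aicc(110.0, 100.0)   # ≈ 2.17: "strong" support (> 2)
```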

| Saccade direction | Probability | Observer A: Trials | Observer A: *λ* | Observer A: 95% CI (*λ*) | Observer A: log_{10} LR (AICc) | Observer B: Trials | Observer B: *λ* | Observer B: 95% CI (*λ*) | Observer B: log_{10} LR (AICc) |
|---|---|---|---|---|---|---|---|---|---|
| Left | .10 | 128 | .11 | .18–.08 | 8.6 | 33 | .14 | .34–.06 | 3.2 |
| Left | .33 | 97 | .02 | .03–.01 | 0.47 | 36 | .05 | 1.00–.02 | 0.11 |
| Left | .67 | 124 | .06 | .10–.03 | 5.9 | 31 | .11 | 1.00–.01 | −0.01 |
| Left | .90 | 140 | .08 | .09–.06 | 27 | 37 | .03 | .05–.02 | 2.4 |
| Right | .10 | 140 | .06 | .08–.05 | 6.3 | 37 | .06 | .11–.04 | 1.5 |
| Right | .33 | 124 | .03 | .04–.02 | 0.54 | 31 | .41 | 1.00–.04 | 1.7 |
| Right | .67 | 97 | .01 | .04–.00 | 0.69 | 36 | .17 | 1.00–.04 | 1.2 |
| Right | .90 | 128 | .62 | 1.00–.25 | 15 | 33 | .26 | 1.00–.12 | 13 |

*λ* to an acceptable level of confidence, although the inescapable noisiness of some of the data sets, relative to the shift in mean latency, makes a confident estimation more difficult. Because there appears to be no systematic change in *λ* as the underlying appearance probability changes, we calculated a weighted average of the best fitting values for both left- and right-going saccades for each observer (Figure 4); overall, we find a value for *λ* that is approximately .05. We used the average values given in Figure 4 to produce the solid curves shown in Figure 2, and it is clear that our forgetful model is better able to capture the time course of expectation development than is the nonforgetful model (dashed lines).

*σ*, Carpenter, 1999; Figure 5, lower panel), variability across runs increases markedly (lower panel, triangles) as the average latency changes continuously, as predicted. Unfortunately, this characteristic signature becomes lost when trial-by-trial saccadic variability is set to realistic levels (lower panel, filled and unfilled circles); hence, the fact that variability does not appear to increase in our data (Figure 2) cannot be said to rule out the possibility of step changes in latency on individual runs.

*λ*. We examined only the most extreme shifts in appearance probability (.5 to .9 and .5 to .1), as these would be expected to show the largest changes in expectation and so most reliably anchor our curve fits. Fits containing fewer than 4 points could not be analyzed by the goodness-of-fit measures below and so were excluded from the analysis. The distribution of step-change positions for each observer is given in Figure 6, and the median values for *λ* were .080 and .078 for Observer A (final probability of .9 and .1, respectively) and .073 and .10 for Observer B. We then summed the log_{10} LR, calculated from the AICc, for each run to gauge the support for each model. For shifts to a high appearance probability, the step model was favored (summed log_{10} LR = −32.1 and −7.6, from 140 and 37 runs, for Observers A and B, respectively), whereas the forgetful model was favored for shifts to low appearance probabilities (summed log_{10} LR = 25.4 and 7.6, from 120 and 30 runs, respectively). Overall, the results fail to show compelling support for either model. As expected, when the results of the individual steps are averaged, a smoothed trajectory results, and this can be seen in Figure 2 (red lines).

*λ* given in Table 1, however, and may reflect initial non-monotonicities biasing our curve fits. It should be noted that the curves given in Figure 7 may not asymptote exactly to zero, as saccades that are widely separated will be differently influenced by both fatigue and slow drifts in saccadic latency, the sum of which may not be easily predicted. In addition, the non-Gaussian distribution of saccadic latencies means that differences between two distributions will contain a small offset, on average.

*L*(*x*) to be neurally encoded, presumably using a coordinated population of neurons that each code for a discrete value of *x*. Fortunately, the updating procedure embodied in Equation 2 can be expressed in a different way that suggests an implementation that is far less costly in terms of the number of neurons required. It is not difficult to show (see Appendix) that the updating rule for *q* is given by:

*q*_{n} = (1 − *λ*) *q*_{n − 1} + *λZ*_{n},   (Equation 3)

where *Z*_{n} is 1 when, in the *n*th trial, the target is on the left and 0 otherwise. The obvious advantage of this simplified expression is that it avoids the search for a maximum value of *L*_{n + 1}(*x*) over all possible *x* required by Equations 1 and 2. A similar equation has been used by Cho et al. (2002) to produce an exponential decay in the effects of stimulus history. Our rule is capable of very simple implementation by a very small number of actual neurons. For instance, if the decay of previous information predicted by the equation is time dependent rather than event dependent, this behavior could even be displayed by a single neuron (Figure 8). Here, we make explicit what has so far been only implicit: that *q* is necessarily a *conditional* probability, contingent on the experimental circumstances, of which the most relevant is that a trial has just started. Behavior of this kind is seen in the superior colliculus, where there are units that increase their resting activity when a response is imminent (Dorris & Munoz, 1998); thus, it does not seem unreasonable to suppose that this is because of an afferent pathway stimulated by the beginning of a trial, *T*. If this pathway synapses with a response neuron *Z* that fires maximally in relation to a particular response, then its resting activity at the start of a trial will depend on the strength of the synapse from *T* to *Z*. If this strength (representing *q*) behaves in a quasi-Hebbian manner, increasing when *Z* is paired with *T* but declining exponentially by a constant fraction *λ* of its strength when it is not, then Equation 3 will again be obeyed.
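A minimal sketch of this updating rule in code (the function name, starting value, and run of trials are ours, for illustration):

```python
def update_expectation(q, z, lam):
    """Equation 3: q_n = (1 - lam) * q_{n-1} + lam * Z_n, with Z_n in {0, 1}."""
    return (1 - lam) * q + lam * z

q = 0.5                       # initial expectation that the target is on the left
for z in [1, 1, 1, 0, 1]:     # a short run, mostly leftward targets
    q = update_expectation(q, z, lam=0.05)
# q has crept above .5, with recent trials weighted most heavily
```

Because the rule is incremental, only the single number *q* needs to be carried between trials, which is what makes the few-neuron implementation plausible.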

*λ* that we have observed. Do they represent an optimum of some kind? Intuitively, it seems clear that an environment in which circumstances are frequently changing will favor a larger value of *λ*, but when they change only seldom, *λ* should be smaller. More formally, we might postulate a stochastic environment in which the underlying probability *p* of some event undergoes unpredictable stepwise changes, the probability of such a change at any particular moment being constant, such that transitions occur with an average frequency *f*. We may then ask what value of *λ* minimizes the average discrepancy between the true value of *p* and our current estimate of it, *q*. Unfortunately, this question does not appear to have an analytical solution; we therefore tackled the problem by Monte Carlo simulation using a customized computer program that used a linear congruential random number generator with a period of 2^{24} − 1.

*p* could take one of two values, *p*_{0} or *p*_{1}, with transitions between these values occurring randomly with an average frequency *f*. We updated *q* at each trial according to Equation 3 and calculated the standard deviation *σ* of the prediction error (*q* − *p*), over the entire run, as a measure of overall performance. To find an optimum value for *λ* for a particular set of parameters (*p*_{0}, *p*_{1}, *f*), we performed runs for values of *λ* from .0 through 1.0 in .01 steps and then took the value for *λ* that returned the lowest standard deviation *σ*. The results of these simulations are shown in Figures 9 and 10 and show that for a given (*p*_{0}, *p*_{1}), the optimum *λ* had a monotonically increasing relation to *f*; in other words, the more frequently the underlying probability changed, the larger *λ* had to be to achieve the best performance. This relationship between optimum *λ* and *f* is shown in Figure 10A: Its shape appears roughly constant (for convenience, we model it with a cube root function), with a scaling factor *k* that chiefly depends on the size of the step in probability (*p*_{0} − *p*_{1}): The larger the changes in probability, the larger *λ* needs to be for optimum performance. This relationship is shown in Figure 10B: *k* is an accelerating function of (*p*_{0} − *p*_{1}), which can be approximated quite well with an exponential. We attach no particular meaning to this empirical fit but simply note that it economically describes the functional relationship we observe. Finally, we simulated a (more realistic) situation in which *p*_{0} and *p*_{1} are not fixed but are themselves randomly determined at each transition, with a uniform distribution between *p* = 0 and *p* = 1. As might have been anticipated, this resulted in very similar behavior to the case when (*p*_{0} − *p*_{1}) = .5 (Figure 10A) and suggests that in the absence of any other information, we might expect to find *λ* having values between 0 and .18; as it happens, our observed values (Figure 4) do indeed lie in the middle of this range.
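This simulation can be sketched as follows, using Python's standard generator rather than the linear congruential generator of the original program; the parameters, trial counts, and function names are illustrative:

```python
import random

def rms_error(lam, p0=0.1, p1=0.9, f=0.01, trials=20000, seed=0):
    """RMS discrepancy between the true p and the estimate q for a given lam."""
    rng = random.Random(seed)
    p, q, sq_err = p0, 0.5, 0.0
    for _ in range(trials):
        if rng.random() < f:              # unpredictable stepwise change in p
            p = p1 if p == p0 else p0
        z = rng.random() < p              # Z_n for this trial
        q = (1 - lam) * q + lam * z       # Equation 3
        sq_err += (q - p) ** 2
    return (sq_err / trials) ** 0.5

# Scan lam from .01 to .99 and keep the value with the smallest error.
best_lam = min((i / 100 for i in range(1, 100)), key=rms_error)
```

Small *λ* leaves the estimate lagging after each transition; large *λ* makes it noisy between transitions, so an intermediate value minimizes the error.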

*λ*, we would not expect to see this in our empirical data. As neither subject knew what probability change would occur on any particular run, the magnitude of the change could only be estimated some time *after* the probability shift occurred, when adjusting *λ* would be of little, if any, use. Consistent with this, we found no significant differences between best fitting values for *λ* in Table 1 for low (.33 and .67) and high (.1 and .9) magnitude probability shifts (*t* test: *p* = .21 and .53 for Observers A and B, respectively).

*S*_{0} that encodes prior probability, as found experimentally (Figure 2). In addition, our model predicts small fluctuations in *S*_{0} on a trial-to-trial basis even when the probability of appearance is constant (Figure 7). Such trial-to-trial variation has previously been thought to arise from a deliberate randomization process in the LATER model (Carpenter, 1999). Although this is predominantly the case, our results show that a proportion of this variation can be attributed to fluctuations in the estimate of prior probability that are dependent upon the short-term pattern of the target's appearance (Figure 6). This proportion is small, however (a few milliseconds, Figure 7, vs. the near 100 ms of random trial-to-trial variation in latency, Figure 3), and thus, it is appropriate to assume that *S*_{0} is essentially constant when appearance probability is fixed. However, in the ever-changing visual environment of the real world, the appearance probabilities of visual targets are dynamic, and our simple method for updating prior probability allows the LATER model to be applied in these situations.

*falsely* registering that a change in appearance probability had taken place, though, given that the true appearance probability is fixed throughout. The curves would then give some indication of the likelihood that a false registration *n* trials in the past had been maintained and so was still influencing the current trial. As step models with multiple, fine steps could approximate our forgetful model arbitrarily closely, the two may be very difficult to distinguish. Precise decision rules for how such step models might operate would need to be developed before detailed analyses could be attempted, however.

*x* at various sequence times *t* followed by a final outcome *z*. Central to the model is a vector of modifiable weights, *w*, which is used in combination with the appropriate observation to determine predictions *P*_{1} through *P*_{m} for what *z* will eventually be. The TD learning method states that the error signal indicating that a modification in weight *w*_{t} is required is calculated not from the difference between the instantaneous prediction and the final outcome (i.e., *z* − *P*_{t}), as occurs in supervised learning models, but rather from the difference between successive predictions, *P*_{t + 1} − *P*_{t}. The amount of change in *w* depends not only upon the magnitude of this error signal but also upon a sum of partial derivative calculations for how changing *w* will influence the various predictions *P*. For most TD learning algorithms, the components in this sum are weighted exponentially, such that past predictions have less influence. By running the model over numerous sequences leading to an outcome *z*, the TD model can "learn" the appropriate weighting vector *w* to best predict future events when a new sequence is presented.
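As a toy sketch of this scheme (our own construction, not any specific published TD model), the successive-prediction error and the exponentially decaying credit for past predictions can be written as:

```python
# Toy TD update: each sequence position t has its own weight, so the
# prediction is P_t = w[t], and dP_t/dw[t] = 1. The error P_{t+1} - P_t
# (with the final "prediction" being the outcome z itself) is credited to
# earlier positions with exponentially decaying weight.
def td_update(w, z, alpha=0.1, decay=0.5):
    preds = w + [z]                  # P_1..P_m followed by the outcome z
    elig = [0.0] * len(w)            # exponentially weighted credit per position
    for t in range(len(w)):
        elig = [e * decay for e in elig]
        elig[t] = 1.0                # current prediction gets full credit
        delta = preds[t + 1] - preds[t]
        for i, e in enumerate(elig):
            w[i] += alpha * delta * e
    return w

w = td_update([0.0, 0.0], z=1.0)     # both weights move toward the outcome
```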

*z* (in our situation, the true underlying appearance probability) to modify predictions. Despite these similarities, there are important differences. Firstly, the defining feature of TD models is high-pass filtering of the outcome: synaptic weighting changes in relation to changes in the output, equivalent to taking the difference between the actual and predicted response at any moment (Sutton & Barto, 1981). This is not a feature of our model. Secondly, the exponential decay in our model is incorporated to allow it to follow changes in the underlying parameter being estimated, in a fashion similar to that of Sugrue et al. (2004). In contrast, the decay in TD models is not designed to account for variations in *z* but rather to reduce computational complexity, with the decay made infinitely short and the weighting vector *w* driven by the last observation in the limiting case (Seymour et al., 2004; Sutton & Barto, 1981). Indeed, it is not clear that the inclusion of an exponential decay in TD models confers any benefit when the underlying parameter to be predicted is constantly changing. Finally, TD models are primarily designed to learn *across* multiple sequences to predict a certain outcome *z* (Schultz, Dayan, & Montague, 1997; Seymour et al., 2004; Sutton, 1988; Sutton & Barto, 1981). In contrast, our model is designed to track underlying patterns *within* a sequence; there is no ultimate outcome *z*, and in our experiment, the patterns in one sequence offer no predictive value (as far as TD models are concerned) for subsequent sequences.

*λ* in our model is ultimately found to be variable under different experimental circumstances, as our simulations suggest might be beneficial, TD models may provide useful insights into how the brain establishes what optimal value *λ* should take.

*λ* varies systematically with intervening time, so we cannot directly address this question. Some clues are available from work investigating "repetition" effects, as measured by either saccadic latency or pre-target firing in the superior colliculus, which suggests that decay either does not occur or is very small when the time between saccades is increased (Gore, Dorris, & Munoz, 2002). However, the values of *λ* obtained in our study predict only a small decay in pre-target firing between successive saccades, which might not be readily detectable.

Let *Z*_{n} be the outcome of the *n*th trial (1 if the target is on the left, 0 otherwise). Let *p* be the probability that *Z* = 1; hence, *E*(*Z*) = *p*, where *E* is the expectation function. Let *L*_{n}(*x*) be the log likelihood function after trial *n* for different possible values *x* of *p*. Let *S*_{n}(*x*) be the support afforded to the hypothesis (*p* = *x*) by trial *n*, that is, log(prob(*Z*_{n} | *p* = *x*)).

*n*th trial, applying Bayes' law:

*L*_{n} = *L*_{n − 1} + *S*_{n} + κ,

where *S*_{n} = *Z*_{n} log(*x*) + (1 − *Z*_{n}) log(1 − *x*) and κ is the arbitrary constant necessarily entailed in measures of likelihood (Edwards, 1972).

*L*_{n} = *L*_{0} + *m*_{n} log(*x*) + (*n* − *m*_{n}) log(1 − *x*) + κ, where *m*_{n} is the total number of leftward responses seen by the *n*th trial and κ is once again an arbitrary constant.

*L*_{0}, then differentiating: d*L*_{n}/d*x* = *m*_{n}/*x* − (*n* − *m*_{n})/(1 − *x*) = *m*_{n}/*x* + *m*_{n}/(1 − *x*) − *n*/(1 − *x*).

*p* is *q*_{n}, then *L*_{n}(*x*) should be maximal for *q*_{n} = *x*, and d*L*_{n}/d*x* = 0.

*m*_{n}(1/*q*_{n} + 1/(1 − *q*_{n})) − *n*/(1 − *q*_{n}) = 0; rearranging, *q*_{n} = *m*_{n}/*n*.

*p* is simply the observed proportion of leftward trials.

*n*th trial,

*L*_{n} = β*L*_{n − 1} + *S*_{n} + κ,

where *S*_{n} = *Z*_{n} log(*x*) + (1 − *Z*_{n}) log(1 − *x*) and κ is, as before, an arbitrary constant.

*E*(*S*_{n} + κ) = *p* log(*x*) + (1 − *p*) log(1 − *x*) + κ = Φ(*x*, *p*), say.

*E*(*L*_{n}) = β^{n}*L*_{0} + Φ(*x*, *p*)(1 + β + β^{2} + … + β^{n − 1}) = β^{n}*L*_{0} + Φ(*x*, *p*)(1 − β^{n})/(1 − β). As *n* → ∞, this tends to Φ(*x*, *p*)/(1 − β).

*n* = 0 to probability *p*.

*L*_{0} = Φ(*x*, θ)/(1 − β).

*L*_{n} = [β^{n}Φ(*x*, θ) + (1 − β^{n})Φ(*x*, *p*)]/(1 − β), and d*L*_{n}/d*x* = [β^{n}(θ/*x* − (1 − θ)/(1 − *x*)) + (1 − β^{n})(*p*/*x* − (1 − *p*)/(1 − *x*))]/(1 − β).

(*x* = *q*_{n}) if β^{n}(θ/*q*_{n} − (1 − θ)/(1 − *q*_{n})) = −(1 − β^{n})(*p*/*q*_{n} − (1 − *p*)/(1 − *q*_{n})).

Rearranging, [β^{n}θ + (1 − β^{n})*p*]/*q*_{n} = [β^{n}(1 − θ) + (1 − β^{n})(1 − *p*)]/(1 − *q*_{n}), which simplifies to *q*_{n} = β^{n}θ + (1 − β^{n})*p*; hence, β^{n}θ = *q*_{n} − (1 − β^{n})*p*.

*q*_{n − 1} = β^{n − 1}θ + (1 − β^{n − 1})*p*, β^{n − 1}θ = *q*_{n − 1} − (1 − β^{n − 1})*p*; multiplying both sides by β, β^{n}θ = β*q*_{n − 1} − (β − β^{n})*p*.

Substituting, *q*_{n} = β*q*_{n − 1} + (1 − β)*p*.

Writing λ = 1 − β, *q*_{n} = (1 − λ)*q*_{n − 1} + λ*p*.

Since *p* = E(*Z*), this is equivalent to the updating rule:

*q*_{n} = (1 − λ)*q*_{n − 1} + λ*Z*_{n}. QED
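As a numerical check of this derivation (with illustrative values of our choosing), iterating the recursion with *Z*_{n} replaced by its expectation *p* reproduces the closed form *q*_{n} = β^{n}θ + (1 − β^{n})*p*:

```python
lam = 0.05
beta = 1 - lam
theta, p = 0.2, 0.7   # old asymptotic estimate and new true probability
n = 50

# Iterate the updating rule in expectation, starting from q_0 = theta.
q = theta
for _ in range(n):
    q = (1 - lam) * q + lam * p

# Closed form derived above: q_n = beta^n * theta + (1 - beta^n) * p.
closed_form = beta**n * theta + (1 - beta**n) * p
```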