Research Article  |   July 2006
Changes in expectation consequent on experience, modeled by a simple, forgetful neural circuit
Andrew J. Anderson, R. H. S. Carpenter
Journal of Vision July 2006, Vol.6, 5. doi:https://doi.org/10.1167/6.8.5
Abstract

Our expectation of an event such as a visual stimulus clearly depends on previous experience, but how the brain computes this expectation is currently not fully understood. Because expectation influences the time to respond to a stimulus, we arranged for the probability of a visual target to suddenly change and found that the time taken to make an eye movement to it then changed continuously, eventually stabilizing at a level reflecting the new probability. The time course of this change can be modeled by making a simple assumption: that the brain discounts old information about the probability of an event by a factor λ, relative to new information. The value of λ presumably represents a compromise between responding rapidly to genuine changes in the environment and not prematurely discarding information still of value. The model we propose may be implemented by a very simple neural circuit composed of only a few neurons.

Introduction
Saccadic latency—the delay between the sudden presentation of a visual target and the start of an eye movement to look at the target—is much longer than the time needed for low-level structures to calculate the neural signal for moving the eyes to a new location. Rather, it reflects the time for higher level structures to decide whether to look at a particular target in preference to others (Carpenter, 1988). Because of this, the study of saccadic latency provides a useful quantitative tool for studying how the brain makes decisions. 
Under most conditions, latency distributions are consistent with the LATER (Linear Approach to Threshold with Ergodic Rate) model of decision making, in which a decision signal reporting log likelihood rises linearly from an initial level until a threshold criterion is reached. The initial level represents an estimate of the log prior probability that a target requiring a response is present (Carpenter & Williams, 1995), the mean rate of rise reflects the rate of arrival of information about the target (Reddi, Asrress, & Carpenter, 2003), and the threshold criterion is influenced by the urgency with which a response is required (Reddi & Carpenter, 2000). All of these factors have characteristic effects on saccadic latency. In addition, some populations of neurons in the frontal eye fields of monkeys demonstrate rise-to-threshold behavior before saccadic eye movements, which is consistent with the LATER model (Hanes & Schall, 1996; Schall & Hanes, 1993) and can be dissociated from developing motor responses (Gold & Shadlen, 2003). 
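To make the LATER account concrete, the sketch below simulates latencies as the time for a linearly rising signal to travel from a starting level, set by log prior probability, to a fixed threshold. The threshold, mean rate of rise, and rate variability are illustrative values chosen for the sketch, not parameters taken from this study.

```python
import numpy as np

def later_latencies(n_trials, prior_p, theta=1.0, mu=5.0, sigma=1.0, rng=None):
    """Minimal LATER sketch: latency = (theta - S0) / r, with the rate of rise r drawn
    afresh on each trial from a normal distribution (the source of latency scatter).
    S0 is taken proportional to log prior probability; units are arbitrary."""
    rng = np.random.default_rng() if rng is None else rng
    s0 = np.log(prior_p)                 # starting level: higher prior -> closer to threshold
    r = rng.normal(mu, sigma, n_trials)  # trial-to-trial variability in rate of rise
    r = r[r > 0]                         # discard the rare non-positive rates for simplicity
    return (theta - s0) / r              # time to reach threshold

# Raising the prior probability shortens the median latency, as in the step task
for p in (0.1, 0.5, 0.9):
    print(f"p = {p:.1f}: median latency = {np.median(later_latencies(10000, p)):.2f} (a.u.)")
```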
It is not clear how the brain computes an estimate of the prior probability, or expectation, that a target will appear, whose logarithm forms the initial starting level for the LATER decision signal. This computation is evidently a form of learning, using previous experience about how often the target appears to formulate a prediction of what will happen in future. However, it must also involve forgetting, or old experience would have an undue influence when appearance probability has actually altered. Understanding how changes of expectation occur may therefore provide useful insights into how the brain accumulates and discards information. Previous experiments (Carpenter & Williams, 1995), however, have investigated fixed probability levels for target appearance, measuring responses only after the subject has adapted fully to each new level. Because of this, they provide little insight into how the brain computes prior probability. 
In this study, we abruptly changed the probability of a target appearing and examined the resulting alteration in saccadic latencies. By studying the time course of this latency alteration and applying the established logarithmic relationship between latency and prior probability (Carpenter & Williams, 1995), we can hope to learn something of how the brain determines prior probability from a sequence of events. 
Materials and methods
Experimental setup and procedure
Our experiment consisted of a saccadic step task, as used previously (Carpenter & Williams, 1995), in which the subject makes a saccade to a target appearing randomly to the left or to the right of a central fixation light. At the instant the target appears, the fixation light is extinguished, giving the characteristic impression that the fixation light has “stepped” to either the left or the right. Both targets and the fixation point were 460 cd m⁻² diffuse yellow light-emitting diodes (LEDs), subtending 0.26 × 0.53 deg (H × V), optically superimposed via a beam splitter onto a yellow background of 4 cd m⁻². Targets were spaced 6 deg to either side of a central fixation LED. The targets and the background were housed within a darkened hood to minimize visual distractions, and the environmental light level was controlled to minimize possible adaptational effects. 
We collected data in runs of 200 sequentially presented targets. At the beginning of each run, the probability of a target appearing on the left was .5, but at a random point between 70 and 120 presentations, it abruptly changed to another value: .1, .33, .67, or .9. The subject did not know which probability change to expect, and on approximately one fifth of the runs, the probability did not change—it remained at .5. The position of the subject's right eye was measured using an infrared oculometer (Carpenter, 1988). A computer read the output of the oculometer and automatically determined saccadic reaction time in 10-ms bins (Carpenter, 1994), although an observer reviewed all records and deleted any misclassifications caused by blinking or head movements. 
In each trial, a target appeared after a random time (foreperiod) uniformly distributed between 0.5 and 1.5 s after the onset of the fixation LED. Although latency can be influenced by a uniform distribution of foreperiods (Luce, 1986), decreasing as the foreperiod increases, such effects are identical for all appearance probabilities we investigated and so do not influence our principal finding, of a gradual change in average latency subsequent to a change in appearance probability. In addition, our use of a uniform distribution of foreperiods allows our data to be directly compared with previous investigations of saccadic latency and appearance probability (Carpenter & Williams, 1995; Reddi & Carpenter, 2000). 
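As a concrete illustration of the trial structure just described, the following sketch generates the target sequence and foreperiods for one run; the function name and defaults are ours, introduced only for illustration.

```python
import numpy as np

def generate_run(final_p_left, n_trials=200, rng=None):
    """One simulated run: the left-target probability starts at .5 and switches to
    final_p_left at a random presentation between 70 and 120; foreperiods are drawn
    uniformly from 0.5-1.5 s, as in the Methods."""
    rng = np.random.default_rng() if rng is None else rng
    change_at = rng.integers(70, 121)            # presentation at which the probability changes
    p_left = np.where(np.arange(n_trials) < change_at, 0.5, final_p_left)
    side = np.where(rng.random(n_trials) < p_left, "L", "R")
    foreperiod = rng.uniform(0.5, 1.5, n_trials) # seconds between fixation onset and target step
    return side, foreperiod, change_at
```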
Subjects
As saccadic latencies show considerable random fluctuations from trial to trial (Carpenter, 1999; Leach & Carpenter, 2001), a very large number of repeated measures is required from each subject to assess the comparatively small effects arising from changes in the appearance probability of the target; the principal observer (Observer A, aged 31) performed 596 runs over 22 weeks (a total of more than 100,000 saccades), with a supplementary observer (Observer B, aged 20) performing 168 runs over 9 weeks. Unfortunately, the need for such vast data sets precludes running our experiment on a large number of subjects. However, because the primary aim of our study was to document the presence of a particular effect rather than to quantify its magnitude throughout the population, the need for large sample sizes is reduced (Anderson & Vingrys, 2001). 
The presence of random fluctuations in latency does however have a beneficial effect in that it acts as a source of signal dither (Smith, 1999), thereby allowing us to measure changes in average latency that are smaller than our nominal measurement bin width of 10 ms. Because of this relatively large dither (∼100 ms), binning data creates no net bias on averaged data, and so our results are free from any systematic effects due to quantization errors. 
Data analysis
Our principal aim was to examine the change in saccadic latency resulting from an abrupt change in the target's probability of appearing. We averaged data over multiple runs to reduce the effects of random fluctuations in latency from trial to trial (Carpenter, 1999; Leach & Carpenter, 2001). As the abrupt change in probability occurred at a random position in a run, however, we first aligned individual runs so that the change in probability occurred at the same position, denoted by the number zero (see Figure 2, abscissa); this approach is similar to that used in conventional reverse correlation studies (Anderson, 2004). Averages were based on promptness data (i.e., the reciprocal of latency in seconds), as promptness values are normally distributed whereas raw latencies are not (Carpenter, 1981); as such, our average data represent harmonic means for latency. We excluded from our analysis extremely short (<60 ms) or long (>790 ms) latencies. 
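The two bookkeeping steps described above, aligning runs on the change point and averaging in the promptness domain after trimming outliers, might be coded as follows (a sketch; the helper names are ours).

```python
import numpy as np

def harmonic_mean_latency(latencies_ms):
    """Average in the promptness (1/latency) domain after excluding latencies
    shorter than 60 ms or longer than 790 ms; returns the harmonic mean in ms."""
    lat = np.asarray(latencies_ms, dtype=float)
    lat = lat[(lat >= 60) & (lat <= 790)]
    return 1.0 / np.mean(1.0 / lat)

def align_on_change(runs, change_points):
    """Re-index each run's latencies so that the probability change sits at position 0,
    allowing presentations to be averaged across runs at the same relative position."""
    aligned = {}
    for run, change in zip(runs, change_points):
        for i, lat in enumerate(run):
            aligned.setdefault(i - change, []).append(lat)
    return {pos: harmonic_mean_latency(lats) for pos, lats in sorted(aligned.items())}
```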
Results
Measuring changes in expectation
Our primary experiment involved abruptly altering the appearance probability of a target and measuring the subsequent change in saccadic latency. Figure 1 shows the average latency just prior to the appearance probability change (filled circles) and at approximately 75 presentations after the probability change (unfilled circles), as a function of the final probability. Just before the shift in probability, mean saccadic latencies are essentially constant, reflecting the .5 target probability used to begin each experimental run. In contrast, after some 75 presentations at the new probability, mean saccadic latency has settled down at a new value that is linearly related to the log of the new probability (Carpenter & Williams, 1995). In between, there is a continuous change in mean saccadic latency that approaches an asymptotic value (Figure 2). A similar change was seen in right-going saccades (not shown). When there is a change to a lower probability (uppermost data sets), there is increased scatter that reflects the necessarily decreased number of samples (when p = .1, only approximately 1 in 10 runs contribute to the averaged data for a given datum point). However, the effect of this increased noise is partly offset by the increased magnitude of the shift in latency that results from the logarithmic relationship between appearance probability and latency (Figure 1). 
Figure 1
 
Average saccadic latency over 10 trials prior to the probability change (filled symbols) and for the 71st through the 80th trial after the probability change (unfilled symbols), as a function of the final probability. The upper and lower data sets are for left-going saccades of Observers A (circles) and B (squares), respectively. Error bars show the standard error of the harmonic mean. Weighted (1/variance) linear regression slopes for the unfilled circles were −32.8 ms/log unit (95% CI = −24.9 to −40.7; R2 = .98) for Observer A and −26.4 ms/log unit (95% CI = −15.2 to −37.6; R2 = .95) for Observer B, using the SPSS™ software package (SPSS Inc., Chicago, IL).
Figure 2
 
Change in average saccadic latency as a result of changing target probability from an initial value of .5 (abscissa <0) to .1 (top data set), .33, .67, or .9 (bottom data set). The solid lines show the latency expected from the forgetful model using the average values for λ given in Figure 4, with the initial and final asymptotic latencies estimated from the average of the promptness values (see Materials and methods) in the figure from −19 through 0 and 71 through 80, respectively. For comparison, the dashed lines give the fit for a non-forgetful maximum likelihood model ( Equation 1) for final probabilities of .9 and .1, assuming an average of 95 trials prior to the probability change. The left and right panels represent the data for Observers A and B, respectively. The red line represents the average latency expected from the best-fitting step functions reported in Figure 6, from 0 through 70.
Based on LATER, plots of cumulative probability on a probit scale versus reciprocal time (reciprobit plots) are expected to be linear (Carpenter, 1981; Carpenter & Williams, 1995; Reddi & Carpenter, 2000), with alterations in prior probability causing these lines to “swivel” about a fixed point corresponding to infinite time (Carpenter & Williams, 1995). To confirm that our data were consistent with LATER, we determined reciprobit plots for the period just prior to a probability shift (i.e., when p = .5) and shortly after a probability shift (p = .1 or .9), with the results shown in Figure 3. Although a swivel is only modestly favored (log10 likelihood ratios [LR] = +1.4 and +0.1) compared with a simple lateral shift in the data sets, for both observers the distributions are highly compatible with a swivel of the curves (Kolmogorov–Smirnov [KS], p = .85 and .94 for Observers A and B, respectively) and so provide no grounds for rejecting the LATER model. As the data set is derived from a cross section of many runs measured over many weeks, it is likely to contain a component of variability primarily attributable to interrun variation, which will be unrelated to the swivel effect and so will act to dilute it. Although the data at the extremes of the reciprobit plots in Figure 3 appear to deviate from the nominal curve in some instances, as seen in previously reported recinormal plots (Carpenter & Williams, 1995; Leach & Carpenter, 2001; Reddi & Carpenter, 2000), these tails constitute only a small fraction (typically <5%) of the data. The KS test is relatively insensitive to such deviations in the extreme tails of a cumulative distribution (Press, Teukolsky, Vetterling, & Flannery, 1992). 
Figure 3
 
Reciprobit plots of latency for three target appearance probabilities. The data for p = .5 were pooled from Presentations −9 through 0 in runs having a final appearance probability of .1 and .9. Data for p = .1 and .9 were pooled from Presentations 30 through 39 of the same runs. Straight lines are maximum likelihood fits (KS, p = .85 and 0.93 for Observers A and B, respectively) constrained to have a common intercept when latency is infinite, which is the characteristic pattern that occurs when an observer's prior probability shifts. Such fitting was favored (log10 LR = +0.6 and +0.04 for Observers A and B, respectively) over a maximum likelihood fit of a model in which the slope was fixed but the intercept varied.
Given our results in Figure 2, we can investigate how prior probability might be calculated. If prior probability were determined solely from the results of the latest trial, then average latency would change abruptly in the same way that the appearance probability changed. Clearly, this is not so, which suggests that prior probability reflects some extended history of previous events that is continuously updated through new information from the latest trial. In the following section, we will examine a simple model of how the brain might perform this updating. 
Modeling the time course of expectation development
As Bayesian likelihood theory successfully predicts other aspects of decision making (Carpenter & Williams, 1995; Reddi et al., 2003) and can be implemented using plausible neural circuits (Gold & Shadlen, 2001; Rao, 2004), it may be that the brain simply calculates the most likely estimate of probability given the target's previous pattern of appearance. Indeed, if such a calculation could suitably describe our data, it would obviate the need to search for more complex models. If p_n is the true probability of a target's appearance in a particular location, say on the left, in trial n, then L_n(x), the logarithm of the likelihood function estimating p_n = x, is modified by an observation E to form an updated (posterior) estimate of log likelihood, L_{n+1}(x). Then:

L_{n+1}(x) = L_n(x) + S(x) + κ,  (1)

where S(x) = log(prob(E | p_n = x)) and κ is the arbitrary constant necessarily entailed in measures of likelihood (Edwards, 1972). 
The current expectation of the target probability p, which we shall call q, is the most likely value of x, for which L(x) is a maximum. For a task where a target appears either on the left or the right, this maximum likelihood estimate is simply equal to the proportion of trials that have appeared on a given side, with equal weight given to all observations, whether old or recent (see Appendix). Unfortunately, this simple Bayesian model has the undesirable characteristic of becoming progressively insensitive to changes in the underlying probability p as the total number of target appearances increases. As the past history gets longer, a given trial will have an increasingly negligible effect on expectation because it is outweighed by observations that were made some time ago. Although such a characteristic is desirable when attempting to predict a fixed, underlying probability (Poggio, Rifkin, Mukherjee, & Niyogi, 2004), when a probability is subject to change, as occurs in most real-world environments, it makes the system very slow to respond. Given the results in Figure 2, we can also rule out even the case where the Bayesian history commences from the first trial in a run: the predicted change in latency is still too slow to explain the experimental data (Figure 2, dashed lines). 
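The sluggishness of the non-forgetful estimate is easy to demonstrate numerically: the maximum likelihood estimate is just the running proportion of leftward targets, so the weight of the pre-change history persists. A minimal sketch, with illustrative trial numbers:

```python
import numpy as np

rng = np.random.default_rng(1)

# 95 trials at p = .5 followed by 105 trials at p = .9 (illustrative numbers)
z = np.concatenate([rng.random(95) < 0.5, rng.random(105) < 0.9]).astype(float)

# Non-forgetful maximum likelihood estimate (Equation 1): the running proportion of
# leftward targets, with every past trial weighted equally
q_ml = np.cumsum(z) / np.arange(1, len(z) + 1)

# The estimate creeps toward .9 only slowly after the change at trial 95, because the
# 95 earlier trials carry just as much weight as the new ones
print(q_ml[94], q_ml[144], q_ml[-1])
```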
This problem can be overcome by a simple modification of Bayesian updating, in which previous information is progressively “forgotten” by discounting it by a factor λ, so that its weighting is (1 − λ). We call λ the Lethean factor, after the river Lethe, whose waters induced forgetfulness when imbibed (Griffin, 1998). This Lethean factor determines how quickly a system adapts to change, with a small λ corresponding to a conservative system that is not easily swayed by new information. Then: 
L_{n+1}(x) = (1 − λ) L_n(x) + S(x) + κ.  (2)
 
As previously, q, the most likely value of x, reflects the current expectation of the target probability p and should be logarithmically related to the median saccadic latency (Carpenter & Williams, 1995). Our model suggests that the effects of stimulus history should decay exponentially the further in the past a trial occurred. While it would be possible to use other weighting schemes, the exponential form allows stimulus history effects to be calculated incrementally (Sutton, 1988). We shall see later that this is quite useful for reducing the computational complexity of Equation 2. 
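A short numerical check, assuming the notation above, shows that maximizing the discounted log likelihood of Equation 2 over a grid of candidate probabilities gives essentially the same estimate as the incremental update derived later as Equation 3; the constant κ is omitted since it does not affect the location of the maximum.

```python
import numpy as np

rng = np.random.default_rng(2)
lam = 0.05
z = (rng.random(300) < 0.7).astype(float)   # targets appear on the left with p = .7

x = np.linspace(0.001, 0.999, 999)          # grid of candidate probabilities
L = np.zeros_like(x)                        # discounted log likelihood (Equation 2, without κ)
q = 0.5                                     # incremental estimate (Equation 3)
q_grid, q_inc = [], []

for zn in z:
    L = (1 - lam) * L + zn * np.log(x) + (1 - zn) * np.log(1 - x)
    q_grid.append(x[np.argmax(L)])          # most likely value of x under Equation 2
    q = (1 - lam) * q + lam * zn            # forgetful running estimate
    q_inc.append(q)

# After the initial transient the two estimates track each other to within a few hundredths
print(max(abs(a - b) for a, b in zip(q_grid[50:], q_inc[50:])))
```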
We fitted this model to the data presented in Figure 2, assuming that a subject's change in expectation was effectively complete by 70 presentations after the probability change, so that we could fix the final asymptote of the curve to the average promptness for the next 10 presentations (71 to 80). Our justification for this assumption was a separate experiment, described below, which measured the asymptotic saccadic latency for a given probability. We performed five sequential runs of 200 saccades at each of the five probability levels described in this article, in random order. The initial two runs (400 saccades) at each probability were treated as learning runs and so were discarded, and a weighted least squares fit, as used in Figure 1, was applied to the remaining data. The slope of the latency versus log probability plot for Observer A was −27.9 ms/log unit (95% confidence interval [CI] = −43.7 to −12.0; R2 = .91), which was indistinguishable from that obtained in Figure 1. Results for Observer B were more variable, with the slope being −37.9 ms/log unit (95% CI = −41.1 to −34.7; R2 = 1.0) prior to the main experiment and −15.9 ms/log unit (95% CI = −34.4 to 2.60; R2 = .71) afterward: The slope (−26.4) obtained in Figure 1 is very close to the midpoint (−26.9) between these two pre- and postexperiment values. 
For the fitted functions in Figure 2, we determined the initial promptness, corresponding to an underlying appearance probability of .5, from the average of the promptness values from Trials −19 through 0. We assumed that all model parameters had reached their asymptotic values prior to the probability change. We fitted curves to the data from Presentation 0 through 70, using a least squares technique and assuming a logarithmic relationship between expectation and average latency, and calculated CIs for λ using the F ratio between the best fit of the data and one with a specified λ. To confirm that an exponential change in promptness improved the fit, we used a corrected Akaike's Information Criterion (AICc) to compare our model to a linear change between the initial promptness (average of Trials −19 through 0) and the asymptotic promptness after the probability change (average of Trials 71 through 80). In all but one case, the log10 LRs (Table 1) are positive, indicating that our exponential model is preferred, with many comparisons showing strong (log10 LR > 2) support. 
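The paragraph above compresses several fitting choices; the sketch below shows the general shape of such a fit in simplified form. It fits in the latency domain, takes the asymptotes directly from the first and last points, fits λ by grid search, and uses an AICc difference as an approximate log10 likelihood ratio; the actual analysis used promptness averages, asymptotes from Trials −19 through 0 and 71 through 80, and F-ratio confidence intervals.

```python
import numpy as np

def aicc(sse, n, k):
    """Corrected AIC from a least-squares fit; the error variance counts as one extra parameter."""
    k = k + 1
    return n * np.log(sse / n) + 2 * k + 2 * k * (k + 1) / (n - k - 1)

def fit_lethean(mean_latency, p_final, lam_grid=np.arange(0.005, 1.0, 0.005)):
    """Fit λ to averaged latencies for Presentations 0..70 after a change from p = .5
    to p_final, assuming latency is linear in the log of the current expectation q."""
    n = len(mean_latency)
    trials = np.arange(n)
    lat_init, lat_final = mean_latency[0], mean_latency[-1]   # crude asymptotes for the sketch
    scale = (lat_init - lat_final) / (np.log(0.5) - np.log(p_final))

    best_lam, best_sse = None, np.inf
    for lam in lam_grid:
        q = p_final + (0.5 - p_final) * (1 - lam) ** trials   # expected course of the estimate
        pred = lat_final + scale * (np.log(q) - np.log(p_final))
        sse = np.sum((mean_latency - pred) ** 2)
        if sse < best_sse:
            best_lam, best_sse = lam, sse

    # Compare with a straight-line change between the two asymptotes (no free parameters)
    sse_linear = np.sum((mean_latency - np.linspace(lat_init, lat_final, n)) ** 2)
    log10_lr = (aicc(sse_linear, n, 0) - aicc(best_sse, n, 1)) / (2 * np.log(10))
    return best_lam, log10_lr
```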
Table 1
 
Fitted parameters: log10 LR (AICc) = log10 likelihood ratio, using a corrected Akaike's Information Criterion.
Saccade direction | Probability | Observer A: Trials | λ | 95% CI (λ) | log10 LR (AICc) | Observer B: Trials | λ | 95% CI (λ) | log10 LR (AICc)
Left | .10 | 128 | .11 | .18–.08 | 8.6 | 33 | .14 | .34–.06 | 3.2
Left | .33 | 97 | .02 | .03–.01 | 0.47 | 36 | .05 | 1.00–.02 | 0.11
Left | .67 | 124 | .06 | .10–.03 | 5.9 | 31 | .11 | 1.00–.01 | −0.01
Left | .90 | 140 | .08 | .09–.06 | 27 | 37 | .03 | .05–.02 | 2.4
Right | .10 | 140 | .06 | .08–.05 | 6.3 | 37 | .06 | .11–.04 | 1.5
Right | .33 | 124 | .03 | .04–.02 | 0.54 | 31 | .41 | 1.00–.04 | 1.7
Right | .67 | 97 | .01 | .04–.00 | 0.69 | 36 | .17 | 1.00–.04 | 1.2
Right | .90 | 128 | .62 | 1.00–.25 | 15 | 33 | .26 | 1.00–.12 | 13
Table 1 suggests that in most cases, we can estimate λ to an acceptable level of confidence, although the inescapable noisiness of some of the data sets, relative to the shift in mean latency, makes a confident estimation more difficult. Because there appears to be no systematic change in λ as the underlying appearance probability changes, we calculated a weighted average of the best fitting values for both left- and right-going saccades for each observer ( Figure 4); overall, we find a value for λ that is approximately .05. We used the average values given in Figure 4 to produce the solid curves shown in Figure 2, and it is clear that our forgetful model is better able to capture the time course of expectation development than is the nonforgetful model (dashed lines). 
Figure 4
 
Weighted means for λ for left- and right-going saccades in two observers (A and B), based on the best-fit values given in Table 1. Weightings were the reciprocal of the width of the 95% CIs. The error bars show weighted standard error of the mean.
It is possible to formulate other, more complex, models to describe our data, including changing expectation only when unusual stimulus appearance patterns occur, for example, an extended sequence of targets all appearing on the left. The gradual change in mean latency seen in Figure 2 could then result from averaging multiple abrupt, steplike changes in latency occurring at various presentations on individual runs, as could occur in a hidden Markov model (Jüttner & Wolf, 1994; Rabiner, 1989). Such discrepancies between individual and average curves have been demonstrated for conventional conditioned learning paradigms, where individual learning curves were steplike but group curves showed a gradual change (Gallistel, Fairhurst, & Balsam, 2004). If this were the case in our data, we would expect the variability of the data to increase over the highly sloped portion of our averaged function, relative to the asymptotic regions. We investigated this possibility using a Monte Carlo simulation of an observer whose expectation did not change continuously but altered in a stepwise fashion from .5 to .9 at a single point 1 to 70 presentations after the true probability change. We modeled 100 runs, with the proportion of runs remaining at an expectation of .5 dropping exponentially with each presentation following the true underlying probability change. We offer no theoretical reason for such an exponential trajectory, although it appears that this type of trajectory is required to give a continuous change in average latency (Figure 5, upper panel, unfilled circles) that is similar to our experimental results (filled circles). If the intrinsic random variability of saccadic latency is low (as quantified by σ; Carpenter, 1999; Figure 5, lower panel), variability across runs increases markedly (lower panel, triangles) as the average latency changes continuously, as predicted. Unfortunately, this characteristic signature becomes lost when trial-by-trial saccadic variability is set to realistic levels (lower panel, filled and unfilled circles); hence, the fact that variability does not appear to increase in our data (Figure 2) cannot be said to rule out the possibility of step changes in latency on individual runs. 
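The step-change simulation just described can be sketched as follows. For brevity the sketch replaces the full LATER-based latency generation with additive Gaussian latency noise, and the baseline latency and noise level are illustrative values; the slope is the Figure 1 value for Observer A.

```python
import numpy as np

rng = np.random.default_rng(3)
n_runs, n_post = 100, 80
slope = -32.8     # ms per log10 unit of expectation (Figure 1, Observer A)
base = 180.0      # illustrative baseline latency at q = .5, in ms
noise_sd = 25.0   # illustrative trial-to-trial latency scatter, in ms

# Each run's expectation steps from .5 to .9 at a single presentation; the probability
# of not yet having stepped decays exponentially with each presentation after the change
switch = rng.geometric(p=0.05, size=n_runs)
q = np.where(np.arange(n_post)[None, :] < switch[:, None], 0.5, 0.9)

latency = base + slope * (np.log10(q) - np.log10(0.5)) + rng.normal(0, noise_sd, q.shape)

mean_latency = latency.mean(axis=0)  # smooth, roughly exponential average despite the steps
sd_latency = latency.std(axis=0)     # the step signature in the SD is masked once the noise is realistic
```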
Figure 5
 
Average latency (upper panel) and standard deviation for Observer A (left-going saccades; final probability, .9) and a Monte Carlo model where expectation changes in a step-like manner between .5 and .9 at a point 1 to 70 presentations after the true probability change. We modeled 100 runs, with the proportion of runs remaining at an expectation of .5 dropping exponentially for each presentation. The slope of the log probability versus latency function was taken from Observer A's results in Figure 1. The parameter σ gives the variability in the rise-to-threshold rate of the LATER decision unit, as described in Carpenter and Williams (1995).
As a further investigation, we compared fits based on our forgetful model with those from a step change in expectation, using the data of individual runs. Each model had a single free parameter: for the step model, this was the presentation (1 to 70 presentations after the true change in probability) at which expectation changed from .5 to the final appearance probability, and for the forgetful model, this was the magnitude of λ. We examined only the most extreme shifts in appearance probability (.5 to .9 and .5 to .1), as these would be expected to show the largest changes in expectation and so most reliably anchor our curve fits. Fits containing fewer than 4 points could not be analyzed by the goodness-of-fit measures described below and so were excluded from the analysis. The distribution of step-change positions for each observer is given in Figure 6, and the median values for λ were .080 and .078 for Observer A (final probability of .9 and .1, respectively) and .073 and .10 for Observer B. We then summed the log10 LR, calculated from the AICc, for each run to gauge the support for each model. For shifts to a high appearance probability, the step model was favored (summed log10 LR = −32.1 and −7.6, from 140 and 37 runs, for Observers A and B, respectively), whereas the forgetful model was favored for shifts to low appearance probabilities (summed log10 LR = 25.4 and 7.6, from 120 and 30 runs, respectively). Overall, the results fail to show compelling support for either model. As expected, when the results of the individual steps are averaged, a smoothed trajectory results, and this can be seen in Figure 2 (red lines). 
Figure 6
 
Cumulative distribution of the location of the best-fitting step-change in expectation (left-going saccades; final probability, .9) for Observer A ( n = 123 runs) and Observer B ( n = 34 runs). The Monte Carlo data are from those presented in Figure 5A. Runs where the best fit was not unique, that is, two positions gave the same minimum SSQ, were excluded from the analysis. Similar cumulative distributions could not be determined for when the final probability was .1 as the sparseness of the data meant that only a very low number of runs gave unique best fits (6 and 2 runs for Observers A and B, respectively).
The simple step model can, however, be distinguished from our model by considering the data prior to the change in appearance probability, that is, when the appearance probability was .5. While the step model predicts that expectation, and therefore average latency, remains constant during such times, the forgetful model predicts small changes in trial-by-trial latencies as the posterior estimate of probability is continually updated. We performed a separate analysis on those portions of our data where the appearance probability was .5 to determine whether such trial-by-trial changes in latency occurred even when the appearance probability remained fixed. 
The results are given in Figure 7, which plots the average difference in saccadic latency between two saccades of the same direction, separated by the number of trials (of any direction) as given on the abscissa. These data show that the influence of a previous trial on saccadic latency decays approximately exponentially the further away in the sequence history the previous trial is, in contrast to the flat function predicted by the step model. Therefore, the variation in latency on a trial-by-trial basis is inconsistent with that predicted from a single-step model. Such an exponential decay is the behavior expected from our forgetful model, however, and is consistent with neurophysiological evidence for an exponential decay in the influence of stimulus history in the frontal eye fields for simple “pop-out” visual search tasks (Bichot & Schall, 2002) and in the parietal cortex for more complex tasks involving maximizing reward (Sugrue, Corrado, & Newsome, 2004). Unfortunately, it is difficult to isolate the effects of expectation using trial-by-trial analyses as there are short-term effects that can alter saccadic latency in a way opposite to that expected from the probability of appearance alone (Bichot & Schall, 2002; Carpenter, 2001; Cho et al., 2002; Dorris, Taylor, Klein, & Munoz, 1999; Fecteau & Munoz, 2003; Klein, 2000). Such effects may be responsible for the nonmonotonic irregularities seen at the beginning of our curves. For this reason, we believe that our analysis of average latency (Figure 2) provides a better estimate of the dynamics of expectation development, as various short-term effects should largely cancel. The magnitudes of the effect seen in our experimental curves in Figure 7 are roughly similar to those generated from a Monte Carlo simulation of left-going saccades, which assumes our forgetful model (Figure 7, diamonds). The time courses are significantly shorter than would be expected from our best fitting values for λ given in Table 1, however, and may reflect initial nonmonotonicities biasing our curve fits. It should be noted that the curves given in Figure 7 may not asymptote exactly to zero as saccades that are widely separated will be differently influenced by both fatigue and slow drifts in saccadic latency, the sum of which may not be easily predicted. In addition, the non-Gaussian distribution of saccadic latencies means that differences between two distributions will contain a small offset, on average. 
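One plausible way to compute the quantity plotted in Figure 7, under our reading of the caption, is sketched below: for each separation, average the latency difference (later minus earlier) over all pairs of same-direction saccades that far apart.

```python
import numpy as np

def history_influence(latency_ms, direction, max_sep=20):
    """Average latency difference between pairs of same-direction saccades separated by
    'sep' trials of any direction; negative values mean the later saccade was faster."""
    latency_ms = np.asarray(latency_ms, dtype=float)
    direction = np.asarray(direction)
    result = np.full(max_sep + 1, np.nan)
    for sep in range(1, max_sep + 1):
        same = direction[sep:] == direction[:-sep]        # pairs with matching direction
        diffs = latency_ms[sep:][same] - latency_ms[:-sep][same]
        if diffs.size:
            result[sep] = diffs.mean()
    return result
```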
Figure 7
 
The average difference in saccadic latency between two saccades of the same direction, separated by the number of trials (in any direction) as given on the abscissa. Only those trials up to and including the moment of the underlying change in appearance probability were analyzed, with the first 20 trials in each run being discarded; for this reason, the appearance probability was always .5. For runs where the probability remained at .5 throughout the entire run, data were analyzed up to a random endpoint between 70 and 120 trials, thereby giving endpoints consistent with those runs in which the probability did change. Data shown are for the left (filled circles) and right (unfilled circles) saccades of Observer A, with error bars giving one standard error. A negative deflection means that a previous saccade speeded up the subsequent saccade. The smooth curves show a least squares exponential function, with decay coefficients of 4.7 and 4.1 for left- and right-going saccades, respectively (95% CI = 1.6 to 7.7 and 1.1 to 7.1), and magnitudes of −2.3 and −2.6 (95% CI = −2.9 to −1.6 and −3.6 to −1.6). Data from Observer B are represented by the dashed line, showing the average from both left- and right-going saccades. Yellow triangles are left-going saccadic results from a Monte Carlo simulation of an observer whose log probability versus latency function was taken from Observer A's results in Figure 1 and whose rise-to-threshold variability σ was .55, as used in Figure 5. One million runs of 40 saccades were generated with the final saccade always being to the left, and differences were always taken relative to this final saccade. We used Equation 3 to calculate expectation, where λ = .068 (the average of the left-going values for Observer A, Table 1).
Deriving a simplified method for updating expectation
To implement the model implied by Equation 2 in a literal fashion would require a likelihood distribution L(x) to be neurally encoded, presumably using a coordinated population of neurons that each code for a discrete value of x. Fortunately, the updating procedure embodied in Equation 2 can be expressed in a different way that suggests an implementation that is far less costly in terms of the number of neurons required. It is not difficult to show (see Appendix) that the updating rule for q is given by:  
q_n = (1 − λ) q_{n−1} + λ Z_{n−1},  (3)
where the outcome Z_n is 1 when, in the nth trial, the target is on the left and 0 otherwise. The obvious advantage of this simplified expression is that it avoids the search for a maximum value of L_{n+1}(x) over all possible x required by Equations 1 and 2. A similar equation has been used by Cho et al. (2002) to produce an exponential decay in the effects of stimulus history. Our rule is capable of very simple implementation by a very small number of actual neurons. For instance, if the decay of previous information predicted by the equation is time dependent rather than event dependent, this behavior could even be displayed by a single neuron (Figure 8). Here, we make explicit what has so far been only implicit, that q is necessarily a conditional probability, contingent on the experimental circumstances, of which the most relevant is that a trial has just started. Behavior of this kind is seen in the superior colliculus, where there are units that increase their resting activity when a response is imminent (Dorris & Munoz, 1998); thus, it does not seem unreasonable to suppose that this is because of an afferent pathway stimulated by the beginning of a trial, T. If this pathway synapses with a response neuron Z that fires maximally in relation to a particular response, then its resting activity at the start of a trial will depend on the strength of the synapse from T to Z. If this strength (representing q) behaves in a quasi-Hebbian manner, increasing when Z is paired with T but declining exponentially, by a constant fraction λ of its strength, when it is not, then Equation 3 will again be obeyed. 
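A toy realization of this single-synapse idea, under the assumptions of Figure 8, is sketched below; the class name and its methods are ours, introduced only for illustration.

```python
class HebbianExpectation:
    """Sketch of the Figure 8 circuit: the synaptic weight q from the trial-onset
    afferent T to the response unit Z is strengthened toward 1 when the trial ends
    with Z active, and otherwise decays by the fraction λ, reproducing Equation 3."""

    def __init__(self, lam=0.05, q0=0.5):
        self.lam = lam   # Lethean factor
        self.q = q0      # current estimate of the conditional probability q(Z|T)

    def trial(self, z):
        """z = 1 if the target appeared on this unit's side in the current trial, else 0."""
        prior = self.q                                 # resting activity at trial onset reflects q
        self.q = (1 - self.lam) * self.q + self.lam * z
        return prior

# Example: a long run of leftward targets drives q toward 1 with a time constant of roughly 1/λ trials
unit = HebbianExpectation(lam=0.05)
for _ in range(50):
    unit.trial(1)
print(round(unit.q, 3))
```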
Figure 8
 
A simple neural implementation of the model. T is an afferent neuron encoding the occurrence of a trial; Z is a neuron associated only with a particular response. The strength of the Hebbian synapse q_n from T to Z represents the estimated conditional probability q(Z|T) and is appropriately increased or decreased from trial to trial according to whether the activity of T is or is not associated with activity of Z.
Optimal values of λ for different stochastic environments
Given the general desirability of a mechanism for discounting older information, it is natural to go further and ask what the implications might be of the actual values of λ that we have observed. Do they represent an optimum of some kind? Intuitively, it seems clear that an environment in which circumstances are frequently changing will favor a larger value of λ, but when they change only seldom, λ should be smaller. More formally, we might postulate a stochastic environment in which the underlying probability p of some event undergoes unpredictable stepwise changes, the probability of such a change at any particular moment being constant, such that transitions occur with an average frequency f. We may then ask what value of λ minimizes the average discrepancy between the true value of p and our current estimate of it, q. Unfortunately, this question does not appear to have an analytical solution; we therefore tackled the problem by Monte Carlo simulation using a customized computer program that used a linear congruential random number generator with a period of 2^24 − 1. 
A single simulated run consisted of 100,000 sequential trials; for each trial, the underlying probability p could take one of two values, p_0 or p_1, with transitions between these values occurring randomly with an average frequency f. We updated q at each trial according to Equation 3 and calculated the standard deviation σ of the prediction error (q − p), over the entire run, as a measure of overall performance. To find an optimum value for λ for a particular set of parameters (p_0, p_1, f), we performed runs for values of λ from .0 through 1.0 in .01 steps and then took the value for λ that returned the lowest standard deviation σ. The results of these simulations are shown in Figures 9 and 10 and show that for a given (p_0, p_1), the optimum λ had a monotonically increasing relation to f; in other words, the more frequently the underlying probability changed, the larger λ had to be to achieve the best performance. This relationship between optimum λ and f is shown in Figure 10A: Its shape appears roughly constant (for convenience, we model it with a cube root function), with a scaling factor k that chiefly depends on the size of the step in probability (p_0 − p_1): The larger the changes in probability, the larger λ needs to be for optimum performance. This relationship is shown in Figure 10B: k is an accelerating function of (p_0 − p_1), which can be approximated quite well with an exponential. We attach no particular meaning to this empirical fit, however, but simply note that it economically describes the functional relationship we observe. Finally, we simulated a (more realistic) situation in which p_0 and p_1 are not fixed but are themselves randomly determined at each transition, with a uniform distribution between p = 0 and p = 1. As might have been anticipated, this resulted in very similar behavior to the case when (p_0 − p_1) = .5 (Figure 10A) and suggests that in the absence of any other information, we might expect to find λ having values between 0 and .18; as it happens, our observed values (Figure 4) do indeed lie in the middle of this range. 
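The simulation just described might be sketched as follows; a shorter run length and numpy's default generator stand in for the original 100,000-trial runs and the linear congruential generator.

```python
import numpy as np

def tracking_error(lam, p0, p1, f, n_trials=20_000, rng=None):
    """Standard deviation of the prediction error (q - p) when the underlying probability
    p flips between p0 and p1 with average frequency f per trial and the estimate q
    follows Equation 3 with Lethean factor lam."""
    rng = np.random.default_rng() if rng is None else rng
    flips = rng.random(n_trials) < f
    p = np.where(np.cumsum(flips) % 2 == 0, p0, p1)      # current underlying probability
    z = (rng.random(n_trials) < p).astype(float)
    q, err = 0.5, np.empty(n_trials)
    for n in range(n_trials):
        err[n] = q - p[n]
        q = (1 - lam) * q + lam * z[n]
    return err.std()

def optimal_lambda(p0, p1, f, lam_grid=np.arange(0.0, 1.001, 0.01)):
    """Grid search for the λ that minimizes the tracking error, as in Figures 9 and 10."""
    return lam_grid[int(np.argmin([tracking_error(lam, p0, p1, f) for lam in lam_grid]))]

# e.g. optimal_lambda(0.3, 0.7, 0.01) is larger than optimal_lambda(0.3, 0.7, 0.001)
```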
Figure 9
 
Error in predicting expectation as a function of the Lethean factor, λ, for our simple model of expectation development. We quantified error as the standard deviation of the differences between the true appearance probability, p, and our current estimation of it, q. The left panel shows data from a transition between p = .3 and p = .7, with a frequency f of 0, .001, .002, .005, .01, .03, .07, or .1 (lower curve through to upper curve). The right panel shows a similar analysis but for a transition between p = .1 and p = .9.
Figure 10
 
(A) Relationship between the frequency of change, f, in the underlying appearance probability, p, and the value of the Lethean factor, λ′, that minimizes the prediction error in tracking these changes. Each data set is approximated by a function of the form λ′ = k·f^(1/3) (solid lines), where k is a vertical scaling factor. (B) Relationship between the scaling factor, k, and the absolute magnitude of the change between two underlying appearance probabilities, Δp (= p_0 − p_1). The solid line shows a best fitting exponential function.
Despite our simulations showing a systematic relationship between the magnitude of the probability change and λ, we would not expect to see this in our empirical data. As neither subject knew what probability change would occur on any particular run, the magnitude of the change could only be estimated some time after the probability shift occurred, when adjusting λ would be of little, if any, use. Consistent with this, we found no significant differences between best fitting values for λ in Table 1 for low (.33 and .67) and high (.1 and .9) magnitude probability shifts (t test: p = .21 and .53 for Observers A and B, respectively). 
Discussion
Our study suggests that changes in expectation are well described by a simple model that calculates expectation from a weighted combination of old and new information. When a target's probability of appearing abruptly changes, this model predicts a smooth change in the LATER parameter S_0 that encodes prior probability, as found experimentally (Figure 2). In addition, our model predicts small fluctuations in S_0 on a trial-to-trial basis even when the probability of appearance is constant (Figure 7). Such trial-to-trial variation has previously been thought to arise from a deliberate randomization process in the LATER model (Carpenter, 1999). Although this is predominantly the case, our results show that a proportion of this variation can be attributed to fluctuations in the estimate of prior probability that are dependent upon the short-term pattern of the target's appearance (Figure 6). This proportion is small, however (a few milliseconds, Figure 7, vs. the near 100 ms of random trial-to-trial variation in latency, Figure 3), and thus, it is appropriate to assume that S_0 is essentially constant when appearance probability is fixed. However, in the ever-changing visual environment of the real world, the appearance probabilities of visual targets are dynamic, and our simple method for updating prior probability allows the LATER model to be applied in these situations. 
Our model for prior probability, as given in Equations 2 and 3, is in some ways functionally similar to a classic Kalman filter (Kalman, 1960), which attempts to predict an underlying state through a weighted combination of prior estimates and newly arriving information and, in turn, shares similarities with the more generalized, and typically more complex, hidden Markov models (Rabiner, 1989). Such hidden Markov models share the exponential trajectories characteristic of our observations and our simple model designed to explain them, and are also capable of more sophisticated learning. However, this sophistication is achieved at the cost of requiring very many more neurons for their implementation, rising as the square of the number of neurons used to code all potential probabilities. Furthermore, the recursive techniques used to predict the stimulus appearance probability that underlies a series of events require that these events be discretely held “in storage” over a period of many trials, adding further to the number of neurons required. Finally, probability in such models is coded as an ensemble property, by the spatial pattern of activity, whereas in the brain, it appears to be coded by firing frequency itself (Basso & Wurtz, 1997). The main virtue of our model is its simplicity and the ease with which its exponential decay in the effect of stimulus history can be implemented by biologically plausible means (Figure 8). Similar exponential decays in stimulus history have been posited for a wide range of situations, ranging from comparatively simple reaction time analyses (Cho et al., 2002) through more complex tasks of reward maximization (Sugrue et al., 2004) and even up to how animals might develop evolutionarily stable behaviors (Harley, 1981). Such decays also form a part of temporal difference (TD) models for conditioning experiments (Niv, Duff, & Dayan, 2005; Seymour et al., 2004; Sutton, 1988; Sutton & Barto, 1981) discussed below. To distinguish definitively between a simple exponential model of forgetting, such as ours, and more generalized hidden Markov models would require the use of more complex sequences of stimuli than we used, as well as experiments of even greater laboriousness. The continuous change demonstrated in our analysis of trial-by-trial latency (Figure 7) data does, however, argue against hidden Markov models in which probability is encoded in a very small number of discrete steps. The data in Figure 7 could be generated by a model that can produce multiple, fine step changes in probability if it were falsely registering that a change in appearance probability had taken place, though, given that the true appearance probability is fixed throughout. The curves would then give some indication of the likelihood that a false registration n trials in the past had been maintained and so was still influencing the current trial. As multiple, fine step models could approximate our forgetful model with increasing closeness, it may be very difficult to distinguish them from our forgetful model. Precise decision rules for how such step models might operate would need to be developed before detailed analyses could be attempted, however. 
TD models for learning
Our forgetful model shares a number of common features with TD models for neural learning, and so it is worthwhile to examine the structure of the latter more closely. TD models (Sutton & Barto, 1981) are designed to produce a sequential adjustment of a weighting parameter used in forming a prediction of the eventual outcome from a dynamically evolving system. Using Sutton's (1988) notation, we have a series of observations x at various sequence times t followed by a final outcome z, that is, 
x_1, x_2, x_3, …, x_m, z.
 
We also have a weighting vector w, which is used in combination with the appropriate observation to determine predictions P_1 through P_m for what z will eventually be. The TD learning method states that the error signal indicating that a modification in weight w_t is required is calculated not from the difference between the instantaneous prediction and the final outcome (i.e., z − P_t), as occurs in supervised learning models, but rather from the difference between successive predictions, P_{t+1} − P_t. The amount of change in w depends not only upon the magnitude of this error signal but also upon a sum of partial derivative calculations for how changing w will influence the various predictions P. For most TD learning algorithms, the components in this sum are weighted exponentially, such that past predictions have less influence. By running the model over numerous sequences leading to an outcome z, the TD model can “learn” the appropriate weighting vector w to best predict future events when a new sequence is presented. 
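For readers unfamiliar with this scheme, a minimal online sketch of Sutton's (1988) TD(λ) for linear predictions is given below; note that the trace-decay parameter here is Sutton's λ and is distinct from the Lethean factor of our model.

```python
import numpy as np

def td_lambda_pass(w, xs, z, alpha=0.1, trace_decay=0.9):
    """One online pass over a sequence of observation vectors x_1..x_m followed by an
    outcome z, with linear predictions P_t = w·x_t. Weight changes are driven by the
    difference between successive predictions, accumulated through an exponentially
    decaying eligibility trace of past observations."""
    w = np.array(w, dtype=float)
    e = np.zeros_like(w)                                   # eligibility trace
    for t in range(len(xs)):
        p_t = w @ xs[t]
        p_next = z if t == len(xs) - 1 else w @ xs[t + 1]  # the last prediction is checked against z
        e = trace_decay * e + xs[t]                        # exponentially weighted history of x
        w += alpha * (p_next - p_t) * e                    # TD error times eligibility
    return w
```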
Other shared features include the presence of an exponential weighting of past predictions and the ability to update estimates incrementally. In addition, neither model needs to know the actual outcome z (in our situation, the true underlying appearance probability) to modify predictions. Despite these similarities, there are important differences, however. Firstly, the defining feature of TD models is high-pass filtering of the outcome: synaptic weighting changes in relation to changes in the output, equivalent to taking the difference between the actual and predicted response at any moment (Sutton & Barto, 1981). This is not a feature of our model. Secondly, the exponential decay in our model is incorporated to allow it to follow changes in the underlying parameter being estimated, in a fashion similar to that done by Sugrue et al. (2004). In contrast, the decay in TD models is not designed to account for variations in z but rather to reduce computational complexity, with the decay made infinitely short and the weighting vector w driven by the last observation in the limiting case (Seymour et al., 2004; Sutton & Barto, 1981). Indeed, it is not clear that the inclusion of an exponential decay in TD models incurs any benefits when the underlying parameter to be predicted is constantly changing. Finally, TD models are primarily designed to learn across multiple sequences to predict a certain outcome z (Schultz, Dayan, & Montague, 1997; Seymour et al., 2004; Sutton, 1988; Sutton & Barto, 1981). In contrast, our model is designed to track underlying patterns within a sequence; there is no ultimate outcome z, and in our experiment, the patterns in one sequence offer no predictive value (as far as TD models are concerned) to subsequent sequences. 
The differences between models should not be overemphasized, however, and it is likely that both make assumptions that are only approximations of both how biological systems truly act and how dynamic systems typically evolve. If λ in our model is ultimately found to be variable under different experimental circumstances, as our simulations suggest might be beneficial, TD models may provide useful insights into how the brain establishes what optimal value λ should take. 
Potential neural mechanisms for our model
We can speculate about the neural mechanisms that might give rise to our simple, forgetful model. The activity of certain saccade-related visuomotor neurons in the superior colliculus of nonhuman primates reflects the log of the prior probability that a target will appear (Basso & Wurtz, 1997), showing increased firing rates prior to the onset of a target when targets have appeared in the same location on previous trials (Dorris & Munoz, 1998; Fecteau & Munoz, 2003), decreasing the time taken for a neural signal to rise to threshold and thereby shortening saccadic latency. Dorris, Paré, and Munoz (2000) have shown that this activity can alter on a trial-by-trial basis, as predicted by our model, and that such alterations correlate with the changes in latency seen when particular sequences of saccades occur. One implementation of our model predicts that a proportion of the pre-target neural signal is combined with information from the current trial, such that a new pre-target neural signal is formed. As only a proportion of the original pre-target signal is used, the influence of old information that encodes the stimulus history progressively decays. Alternatively, the activity of a suitably arranged Hebbian synapse would also behave similarly (Figure 8). One way to distinguish between these two potential implementations of our model would be to investigate whether the influence of the stimulus history decays primarily because of the time interval between successive presentations or whether it reflects a time-invariant process that reduces the influence of prior information on each trial. As our study did not systematically manipulate the average time between sequentially presented targets, we do not know whether λ varies systematically with intervening time, so we cannot directly address this question. Some clues are available from work that has investigated “repetition” effects, as measured by either saccadic latency or the pre-target firing in the superior colliculus, that suggests that decay either does not occur or is very small when the time between saccades is increased (Gore, Dorris, & Munoz, 2002). However, the values of λ obtained in our study predict only a small decay in pre-target firing between successive saccades, which might not be readily detectable. 
It is worth considering whether the effects on latency we describe here are purely motor in origin or whether they reflect the actions of higher, cognitive processes. Dorris et al. (2000) suggested that the sequential dependences they observed in superior colliculus activity could reflect an automatic change in motor responses but that cognitive influences were also plausible, if strategically unjustified. Clearly, such sequential dependences cannot depend on motor responses alone: if they did, the influence of each saccade made to a peripheral target would be cancelled by the equal and opposite saccade directing the eye back to fixation, and changes in latency based on the previous pattern of target appearance could not develop. Therefore, the site responsible for our effects must be modifiable by higher functions so that saccades to a peripheral target are treated differently from those that restore central fixation. The superficial and intermediate layers of the superior colliculus are one suitable candidate, as they receive inputs from a variety of cortical areas (Carpenter, 1988), and these inputs may be responsible for the modulation of presaccadic collicular activity by target selection (Glimcher & Sparks, 1992) and voluntary shifts in attention (Kustov & Robinson, 1996). It has therefore been suggested that one role of the superior colliculus is to integrate sensory and cognitive signals and form a final decision about whether to initiate a saccade (Dorris & Munoz, 1998), although the distinction between developing motor responses and processes such as expectation and attention is not always obvious (Dorris & Munoz, 1998; Gold & Shadlen, 2000). However, clear effects of stimulus history unrelated to the motor response have been demonstrated in frontal eye field neurons during visual search tasks (Bichot & Schall, 2002; McPeek & Keller, 2001), and these effects also appear to follow the exponential trajectory given by our model. It is therefore possible that the colliculus is not primarily involved in calculating expectation but simply reflects the results of calculations performed elsewhere in the nervous system. Irrespective of where these calculations occur, recent evidence suggests that saccadic latencies in even quite complex cognitive tasks, such as developing matching strategies that maximize rewards, can be modeled using a simple exponential decay in the influence of stimulus history (Sugrue et al., 2004), so it seems reasonable to suggest that our model of expectation development may find applications beyond the comparatively simple task reported here. 
Appendix: Derivation of Equation 3
Let \(Z_n\) be the outcome of the \(n\)th trial (1 if the target is on the left, 0 otherwise). Let \(p\) be the probability that \(Z = 1\); hence, \(E(Z) = p\), where \(E\) is the expectation function. Let \(L_n(x)\) be the log likelihood function after trial \(n\) for different possible values \(x\) of \(p\). Let \(S_n(x)\) be the support afforded to the hypothesis \(p = x\) by trial \(n\), that is, \(\log(\mathrm{prob}(Z_n \mid p = x))\). 
1. No forgetting 
After the \(n\)th trial, applying Bayes' law: \(L_n = L_{n-1} + S_n + \kappa\), where \(S_n = Z_n \log(x) + (1 - Z_n)\log(1 - x)\) and \(\kappa\) is the arbitrary constant necessarily entailed in measures of likelihood (Edwards, 1972). 
Thus, \(L_n = L_0 + m_n \log(x) + (n - m_n)\log(1 - x) + \kappa\), where \(m_n\) is the total number of leftward responses seen by the \(n\)th trial and \(\kappa\) is once again an arbitrary constant. 
If we assume a uniform prior distribution \(L_0\), then differentiating: \(\mathrm{d}L_n/\mathrm{d}x = m_n/x - (n - m_n)/(1 - x) = m_n/x + m_n/(1 - x) - n/(1 - x)\). 
If the best estimate of \(p\) is \(q_n\), then \(L_n(x)\) should be maximal for \(x = q_n\), where \(\mathrm{d}L_n/\mathrm{d}x = 0\). 
Thus, \(m_n(1/q_n + 1/(1 - q_n)) - n/(1 - q_n) = 0\); rearranging, \(q_n = m_n/n\). 
In other words, the maximum-likelihood estimate of \(p\) is simply the observed proportion of leftward trials. 
2. Forgetting 
Let λ be the Lethean factor; for convenience, let β = (1 − λ). 
Then, after the \(n\)th trial, \(L_n = \beta L_{n-1} + S_n + \kappa\), where \(S_n = Z_n \log(x) + (1 - Z_n)\log(1 - x)\) and \(\kappa\) is, as before, an arbitrary constant. 
Now, \(E(S_n + \kappa) = p \log(x) + (1 - p)\log(1 - x) + \kappa\), which we write as \(\Phi(x, p)\). 
Hence, \(E(L_n) = \beta^n L_0 + \Phi(x, p)\,(1 + \beta + \beta^2 + \dots + \beta^{n-1}) = \beta^n L_0 + \Phi(x, p)\,(1 - \beta^n)/(1 - \beta)\). 
As \(n \to \infty\), this tends to \(\Phi(x, p)/(1 - \beta)\). 
Now, suppose we start in this asymptotic state, fully adapted to probability \(\theta\), which then changes at trial \(n = 0\) to probability \(p\). 
Then, \(L_0 = \Phi(x, \theta)/(1 - \beta)\). 
Thus, \(L_n = [\beta^n \Phi(x, \theta) + (1 - \beta^n)\,\Phi(x, p)]/(1 - \beta)\), and \(\mathrm{d}L_n/\mathrm{d}x = [\beta^n(\theta/x - (1 - \theta)/(1 - x)) + (1 - \beta^n)(p/x - (1 - p)/(1 - x))]/(1 - \beta)\). 
This will be zero (when \(x = q_n\)) if \(\beta^n(\theta/q_n - (1 - \theta)/(1 - q_n)) = -(1 - \beta^n)(p/q_n - (1 - p)/(1 - q_n))\). 
Rearranging, \([\beta^n \theta + (1 - \beta^n)p]/q_n = [\beta^n(1 - \theta) + (1 - \beta^n)(1 - p)]/(1 - q_n)\), which simplifies to \(q_n = \beta^n \theta + (1 - \beta^n)p\); hence, \(\beta^n \theta = q_n - (1 - \beta^n)p\). 
Now, because \(q_{n-1} = \beta^{n-1}\theta + (1 - \beta^{n-1})p\), we have \(\beta^{n-1}\theta = q_{n-1} - (1 - \beta^{n-1})p\); multiplying both sides by \(\beta\), \(\beta^n \theta = \beta q_{n-1} - (\beta - \beta^n)p\). 
Eliminating \(\theta\) and rearranging: \(q_n = \beta q_{n-1} + (1 - \beta)p\). 
Substituting for \(\beta\), \(q_n = (1 - \lambda)q_{n-1} + \lambda p\). 
Because \(p = E(Z)\), this is equivalent to the updating rule \(q_n = (1 - \lambda)q_{n-1} + \lambda Z_n\). QED 
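For completeness, a minimal numerical sketch of the updating rule is given below (Python; the run lengths, seed, and the choice of λ = .07, a value of the same order as those reported in Table 1, are illustrative only):

```python
import numpy as np

def forgetful_expectation(z, lam, q0=0.5):
    """Apply q_n = (1 - lam) * q_{n-1} + lam * Z_n to a sequence of
    binary outcomes z and return the running estimate q."""
    q = np.empty(len(z))
    prev = q0
    for n, zn in enumerate(z):
        prev = (1.0 - lam) * prev + lam * zn
        q[n] = prev
    return q

rng = np.random.default_rng(1)
# 100 trials at p = .5, then 100 trials at p = .9
z = np.concatenate([rng.random(100) < 0.5, rng.random(100) < 0.9]).astype(float)
q = forgetful_expectation(z, lam=0.07)
# q drifts from about .5 toward .9 over roughly 1/lam (about 14) trials
```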
Acknowledgment
This work was supported in part by an ARC Project Grant (DP0663915) to A.J.A. and R.H.S.C. We have been greatly aided in our thinking through discussion with Quentin Huys of the Gatsby Institute. 
Commercial relationships: none. 
Corresponding author: R. H. S. Carpenter. 
Email: rhsc1@cam.ac.uk. 
Address: The Physiological Laboratory, University of Cambridge, Downing Street, Cambridge CB2 3EG, UK. 
References
Anderson, A. J. (2004). Eye movements: Viewing the window of opportunity. Current Biology, 14, R951–R952.
Anderson, A. J., & Vingrys, A. J. (2001). Small samples: Does size matter? Investigative Ophthalmology & Visual Science, 42, 1411–1413.
Basso, M. A., & Wurtz, R. H. (1997). Modulation of neuronal activity by target uncertainty. Nature, 389, 66–69.
Bichot, N. P., & Schall, J. D. (2002). Priming in macaque frontal cortex during popout visual search: Feature-based facilitation and location-based inhibition of return. The Journal of Neuroscience, 22, 4675–4685.
Carpenter, R. H. S. (1981). Oculomotor procrastination. In D. F. Fisher, R. A. Monty, & J. W. Senders (Eds.), Eye movements: Cognition and visual perception (pp. 237–246). Hillsdale, NJ: Lawrence Erlbaum Associates.
Carpenter, R. H. S. (1988). Movements of the Eyes. London: Pion.
Carpenter, R. H. S. (1994). SPIC: A PC-based system for rapid measurements of saccadic responses. Journal of Physiology, 480, 4P.
Carpenter, R. H. S. (1999). A neural mechanism that randomises behaviour. Journal of Consciousness Studies, 6, 13–22.
Carpenter, R. H. S. (2001). Express saccades: Is bimodality a result of the order of stimulus presentation? Vision Research, 41, 1145–1151.
Carpenter, R. H. S., & Williams, M. L. L. (1995). Neural computation of log likelihood in control of saccadic eye movements. Nature, 377, 59–62.
Cho, R. Y., Nystrom, L. E., Brown, E. T., Jones, A. D., Braver, T. S., & Holmes, P. J. (2002). Mechanisms underlying dependencies of performance on stimulus history in a two-alternative forced-choice task. Cognitive, Affective & Behavioral Neuroscience, 2, 283–299.
Dorris, M. C., & Munoz, D. P. (1998). Saccadic probability influences motor preparation signals and time to saccadic initiation. The Journal of Neuroscience, 18, 7015–7026.
Dorris, M. C., Paré, M., & Munoz, D. P. (2000). Immediate neural plasticity shapes motor performance. The Journal of Neuroscience, 20.
Dorris, M. C., Taylor, T. L., Klein, R. M., & Munoz, D. P. (1999). Influence of previous visual stimulus or saccade on saccadic reaction times in monkey. Journal of Neurophysiology, 81, 2429–2436.
Edwards, A. W. F. (1972). Likelihood. Cambridge: Cambridge University Press.
Fecteau, J. H., & Munoz, D. P. (2003). Exploring the consequences of the previous trial. Nature Reviews Neuroscience, 4, 435–443.
Gallistel, C. R., Fairhurst, S., & Balsam, P. (2004). The learning curve: Implications of a quantitative analysis. Proceedings of the National Academy of Sciences of the United States of America, 101, 13124–13131.
Glimcher, P. W., & Sparks, D. L. (1992). Movement selection in advance of action in the superior colliculus. Nature, 355, 542–545.
Gold, J. I., & Shadlen, M. N. (2000). Representation of a perceptual decision in developing oculomotor commands. Nature, 404, 390–394.
Gold, J. I., & Shadlen, M. N. (2001). Neural computations that underlie decisions about sensory stimuli. Trends in Cognitive Sciences, 5, 10–16.
Gold, J. I., & Shadlen, M. N. (2003). The influence of behavioral context on the representation of a perceptual decision in developing oculomotor commands. The Journal of Neuroscience, 23, 632–651.
Gore, J. L., Dorris, M. C., & Munoz, D. P. (2002). Time course of a repetition effect on saccadic reaction time in non-human primates. Archives Italiennes de Biologie, 140, 203–210.
Virgil (1998). The Aeneid.
Hanes, D. P., & Schall, J. D. (1996). Neural control of voluntary movement initiation. Science, 274, 427–430.
Harley, C. B. (1981). Learning the evolutionarily stable strategy. Journal of Theoretical Biology, 89, 611–633.
Jüttner, M., & Wolf, W. (1994). Stimulus sequence effects on human express saccades described by a Markov model. Biological Cybernetics, 70, 247–253.
Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Transactions of the ASME—Journal of Basic Engineering, 82(Series D), 35–45.
Klein, R. M. (2000). Inhibition of return. Trends in Cognitive Sciences, 4, 138–147.
Kustov, A. A., & Robinson, D. L. (1996). Shared neural control of attentional shifts and eye movements. Nature, 384, 74–77.
Leach, J. C., & Carpenter, R. H. (2001). Saccadic choice with asynchronous targets: Evidence for independent randomisation. Vision Research, 41, 3437–3445.
Luce, R. D. (1986). Response Times. Oxford: Oxford University Press.
McPeek, R. M., & Keller, E. L. (2001). Short-term priming, concurrent processing, and saccade curvature during a target selection task in the monkey. Vision Research, 41, 785–800.
Niv, Y., Duff, M. O., & Dayan, P. (2005). Dopamine, uncertainty and TD learning. Behavioural and Brain Functions, 1, 6.
Poggio, T., Rifkin, R., Mukherjee, S., & Niyogi, P. (2004). General conditions for predictivity in learning theory. Nature, 428, 419–422.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). Numerical recipes in C: The art of scientific computing. Cambridge: Cambridge University Press.
Rabiner, L. R. (1989). A tutorial on Hidden Markov Models and selected applications in speech recognition. Proceedings of the IEEE, 77, 257–286.
Rao, R. P. (2004). Bayesian computation in recurrent neural circuits. Neural Computation, 16, 1–38.
Reddi, B. A., Asrress, K. N., & Carpenter, R. H. (2003). Accuracy, information, and response time in a saccadic decision task. Journal of Neurophysiology, 90, 3538–3546.
Reddi, B. A., & Carpenter, R. H. (2000). The influence of urgency on decision time. Nature Neuroscience, 3, 827–830.
Schall, J. D., & Hanes, D. P. (1993). Neural basis of saccade target selection in frontal eye field during visual search. Nature, 366, 467–469.
Schultz, W., Dayan, P., & Montague, P. R. (1997). A neural substrate of prediction and reward. Science, 275, 1593–1599.
Seymour, B., O'Doherty, J. P., Dayan, P., Koltzenburg, M., Jones, A. K., & Dolan, R. J. (2004). Temporal difference models describe higher-order learning in humans. Nature, 429, 664–667.
Smith, S. W. (1999). The scientist and engineer's guide to digital signal processing. San Diego, CA: California Technical Publishing.
Sugrue, L. P., Corrado, G. S., & Newsome, W. T. (2004). Matching behavior and the representation of value in the parietal cortex. Science, 304, 1782–1787.
Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9–44.
Sutton, R. S., & Barto, A. G. (1981). Toward a modern theory of adaptive networks: Expectation and prediction. Psychological Review, 88, 135–170.
Figure 1
 
Average saccadic latency over the 10 trials prior to the probability change (filled symbols) and for the 71st through the 80th trial after the probability change (unfilled symbols), as a function of the final probability. The upper and lower data sets are for left-going saccades of Observers A (circles) and B (squares), respectively. Error bars show the standard error of the harmonic mean. Weighted (1/variance) linear regression slopes for the unfilled symbols were −32.8 ms/log unit (95% CI = −24.9 to −40.7; R2 = .98) for Observer A and −26.4 ms/log unit (95% CI = −15.2 to −37.6; R2 = .95) for Observer B, computed using the SPSS™ software package (SPSS Inc., Chicago, IL).
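For readers wishing to reproduce this kind of fit, a minimal sketch of a 1/variance-weighted straight-line fit is given below (Python; the numerical values are placeholders chosen only to show the calling pattern and are not our data, and our own regressions were performed in SPSS as noted in the caption):

```python
import numpy as np

def weighted_linear_fit(x, y, variance):
    """Least-squares line a + b*x with weights w_i = 1 / variance_i,
    i.e. minimising sum_i w_i * (y_i - a - b*x_i)**2."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    w = 1.0 / np.asarray(variance, float)
    X = np.column_stack([np.ones_like(x), x])      # design matrix [1, x]
    WX = X * w[:, None]
    a, b = np.linalg.solve(X.T @ WX, WX.T @ y)     # weighted normal equations
    return a, b                                    # intercept, slope

# Placeholder numbers only (latency in ms against log10 probability):
log_p = np.log10([0.1, 0.33, 0.5, 0.67, 0.9])
latency = [230.0, 218.0, 210.0, 204.0, 196.0]
variance = [25.0, 16.0, 10.0, 12.0, 20.0]
intercept, slope_ms_per_log_unit = weighted_linear_fit(log_p, latency, variance)
```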
Figure 2
 
Change in average saccadic latency as a result of changing target probability from an initial value of .5 (abscissa < 0) to .1 (top data set), .33, .67, or .9 (bottom data set). The solid lines show the latency expected from the forgetful model using the average values for λ given in Figure 4, with the initial and final asymptotic latencies estimated from the average of the promptness values (see Materials and methods) in the figure from −19 through 0 and 71 through 80, respectively. For comparison, the dashed lines give the fit for a non-forgetful maximum likelihood model (Equation 1) for final probabilities of .9 and .1, assuming an average of 95 trials prior to the probability change. The left and right panels represent the data for Observers A and B, respectively. The red line represents the average latency expected from the best-fitting step functions reported in Figure 6, from 0 through 70.
Figure 3
 
Reciprobit plots of latency for three target appearance probabilities. The data for p = .5 were pooled from Presentations −9 through 0 in runs having a final appearance probability of .1 and .9. Data for p = .1 and .9 were pooled from Presentations 30 through 39 of the same runs. Straight lines are maximum likelihood fits (KS, p = .85 and .93 for Observers A and B, respectively) constrained to have a common intercept when latency is infinite, which is the characteristic pattern that occurs when an observer's prior probability shifts. Such fitting was favored (log10 LR = +0.6 and +0.04 for Observers A and B, respectively) over a maximum likelihood fit of a model in which the slope was fixed but the intercept varied.
Figure 4
 
Weighted means for λ for left- and right-going saccades in two observers (A and B), based on the best-fit values given in Table 1. Weightings were the reciprocal of the width of the 95% CIs. The error bars show weighted standard error of the mean.
Figure 5
 
Average latency (upper panel) and standard deviation (lower panel) for Observer A (left-going saccades; final probability, .9) and for a Monte Carlo model in which expectation changes in a step-like manner from .5 to .9 at a point 1 to 70 presentations after the true probability change. We modeled 100 runs, with the proportion of runs remaining at an expectation of .5 dropping exponentially with each presentation. The slope of the log probability versus latency function was taken from Observer A's results in Figure 1. The parameter σ gives the variability in the rise-to-threshold rate of the LATER decision unit, as described in Carpenter and Williams (1995).
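A minimal sketch of this kind of Monte Carlo is given below (Python; the 208-ms baseline latency, the additive-noise stand-in for a full LATER unit, and the 5% per-presentation switching rate are illustrative assumptions, with only the −32.8 ms/log unit slope taken from Figure 1):

```python
import numpy as np

rng = np.random.default_rng(2)

def mean_latency(expectation, slope=-32.8, latency_at_half=208.0):
    """Mean latency (ms) moves along a line in log10(probability), using
    the Figure 1 slope; the 208-ms value at p = .5 is a placeholder."""
    return latency_at_half + slope * (np.log10(expectation) - np.log10(0.5))

def step_change_run(n_trials=70, switch_rate=0.05, noise_sd=15.0):
    """One simulated run: expectation stays at .5 and jumps to .9 at a
    geometrically distributed presentation, so the proportion of runs
    still at .5 declines exponentially across presentations."""
    step_at = rng.geometric(switch_rate)                # 1, 2, 3, ...
    expectation = np.where(np.arange(1, n_trials + 1) < step_at, 0.5, 0.9)
    return mean_latency(expectation) + rng.normal(0.0, noise_sd, n_trials)

runs = np.stack([step_change_run() for _ in range(100)])
average_latency = runs.mean(axis=0)   # smooth despite each run stepping abruptly
latency_sd = runs.std(axis=0)         # transiently elevated around the change
```

In this sketch, although every simulated run changes expectation in a single step, the across-run average declines smoothly, while the across-run standard deviation is transiently inflated while the runs remain mixed between the two states.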
Figure 6
 
Cumulative distribution of the location of the best-fitting step-change in expectation (left-going saccades; final probability, .9) for Observer A (n = 123 runs) and Observer B (n = 34 runs). The Monte Carlo data are from those presented in Figure 5A. Runs in which the best fit was not unique, that is, where two positions gave the same minimum SSQ, were excluded from the analysis. Similar cumulative distributions could not be determined when the final probability was .1, as the sparseness of the data meant that very few runs gave unique best fits (6 and 2 runs for Observers A and B, respectively).
Figure 7
 
The average difference in saccadic latency between two saccades of the same direction, separated by the number of trials (in any direction) given on the abscissa. Only those trials up to and including the moment of the underlying change in appearance probability were analyzed, with the first 20 trials in each run being discarded; for this reason, the appearance probability was always .5. For runs where the probability remained at .5 throughout the entire run, data were analyzed up to a random endpoint between 70 and 120 trials, thereby giving endpoints consistent with those runs in which the probability did change. Data shown are for the left-going (filled circles) and right-going (unfilled circles) saccades of Observer A, with error bars giving one standard error. A negative deflection means that a previous saccade speeded up the subsequent saccade. The smooth curves show least-squares exponential functions, with decay coefficients of 4.7 and 4.1 for left- and right-going saccades, respectively (95% CI = 1.6 to 7.7 and 1.1 to 7.1), and magnitudes of −2.3 and −2.6 (95% CI = −2.9 to −1.6 and −3.6 to −1.6). Data from Observer B are represented by the dashed line, which shows the average of both left- and right-going saccades. Yellow triangles are left-going saccadic results from a Monte Carlo simulation of an observer whose log probability versus latency function was taken from Observer A's results in Figure 1 and whose rise-to-threshold variability σ was .55, as used in Figure 5. One million runs of 40 saccades were generated, with the final saccade always being to the left, and differences were always taken relative to this final saccade. We used Equation 3 to calculate expectation, with λ = .068 (the average of the left-going values for Observer A, Table 1).
Figure 8
 
A simple neural implementation of the model. T is an afferent neuron encoding the occurrence of a trial; Z is a neuron associated only with a particular response. The strength of the Hebbian synapse q_n from T to Z represents the estimated conditional probability q(Z|T) and is appropriately increased or decreased from trial to trial according to whether activity of T is or is not associated with activity of Z.
Figure 9
 
Error in predicting expectation as a function of the Lethean factor, λ, for our simple model of expectation development. We quantified error as the standard deviation of the differences between the true appearance probability, p, and our current estimate of it, q. The left panel shows data for a transition between p = .3 and p = .7, occurring with a frequency f of 0, .001, .002, .005, .01, .03, .07, or .1 (lower curve through to upper curve). The right panel shows a similar analysis for a transition between p = .1 and p = .9.
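This error measure can be reproduced with a short simulation (Python; the run length, random seed, and the sampled λ values are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)

def prediction_error(lam, p_pair=(0.3, 0.7), f=0.01, n_trials=200_000):
    """Let the true appearance probability p switch between the two
    values in p_pair with frequency f per trial, track it with
    q <- (1 - lam) * q + lam * Z, and return the standard deviation of
    (p - q), the error measure used in this figure."""
    p, q = p_pair[0], 0.5
    err = np.empty(n_trials)
    for n in range(n_trials):
        if rng.random() < f:                       # occasional change in p
            p = p_pair[1] if p == p_pair[0] else p_pair[0]
        z = 1.0 if rng.random() < p else 0.0       # outcome of this trial
        q = (1.0 - lam) * q + lam * z              # forgetful update
        err[n] = p - q
    return err.std()

# Sweeping lam traces out a U-shaped error curve when f > 0: small lam
# lags behind changes in p, large lam is dominated by trial-to-trial noise.
errors = {lam: prediction_error(lam) for lam in (0.01, 0.03, 0.1, 0.3)}
```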
Figure 10
 
(A) Relationship between the frequency of change, f, in the underlying appearance probability, p, and the value of the Lethean factor, λ′, that minimizes the prediction error in tracking these changes. Each data set is approximated by a function of the form λ′ = kf^(1/3) (solid lines), where k is a vertical scaling factor. (B) Relationship between the scaling factor, k, and the absolute magnitude of the change between the two underlying appearance probabilities, Δp (= p0 − p1). The solid line shows a best-fitting exponential function.
Table 1
 
Fitted parameters: log10 LR (AICc) = log10 likelihood ratio, using a corrected Akaike's Information Criterion.
Saccade direction   Probability   Observer A                                        Observer B
                                  Trials   λ     95% CI (λ)   log10 LR (AICc)       Trials   λ     95% CI (λ)   log10 LR (AICc)
Left                .10           128      .11   .18–.08      8.6                   33       .14   .34–.06      3.2
                    .33            97      .02   .03–.01      0.47                  36       .05   1.00–.02     0.11
                    .67           124      .06   .10–.03      5.9                   31       .11   1.00–.01     −0.01
                    .90           140      .08   .09–.06      27                    37       .03   .05–.02      2.4
Right               .10           140      .06   .08–.05      6.3                   37       .06   .11–.04      1.5
                    .33           124      .03   .04–.02      0.54                  31       .41   1.00–.04     1.7
                    .67            97      .01   .04–.00      0.69                  36       .17   1.00–.04     1.2
                    .90           128      .62   1.00–.25     15                    33       .26   1.00–.12     13