Expectations broadly influence our experience of the world. However, the process by which they are acquired and then shape our sensory experiences is not well understood. Here, we examined whether expectations of simple stimulus features can be developed implicitly through a fast statistical learning procedure. We found that participants quickly and automatically developed expectations for the most frequently presented directions of motion and that this altered their perception of new motion directions, inducing attractive biases in the perceived direction as well as visual hallucinations in the absence of a stimulus. Further, the biases in motion direction estimation that we observed were well explained by a model that accounted for participants' behavior using a Bayesian strategy, combining a learned prior of the stimulus statistics (the expectation) with their sensory evidence (the actual stimulus) in a probabilistically optimal manner. Our results demonstrate that stimulus expectations are rapidly learned and can powerfully influence perception of simple visual features.

^{2} at 100 Hz refresh rate) moving coherently at a speed of 9°/sec within a circular annulus, with minimum and maximum diameter of 2.2° and 7°, respectively. The background luminance of the display was set to 5.2 cd/m^{2}.

^{2}) was presented for 400 ms. With the fixation point still onscreen, the motion stimulus was then presented, along with a red bar that projected out from the fixation point (the initial angle of the bar was randomized on each trial; Figure 1). The bar was located entirely within the center of the annulus containing the moving dots (length 1.1°, width 0.03°, luminance 3.4 cd/m^{2}). Participants indicated the direction of motion by orienting the red bar with a mouse, clicking the mouse button when they had made their estimate (estimation task). The display cleared when the participant clicked the mouse or 3000 ms had elapsed, whichever came first. On trials where no motion stimulus was presented, the red bar still appeared and participants were required to estimate the perceived direction of motion as normal. Participants were instructed to fixate on the central point throughout this period. Participants' reaction time in the estimation task determined how long the stimulus was presented. On average, this was 1978 ± 85 ms (standard error of the mean; see Supplementary Figure 7 for a plot of reaction time versus presented motion direction). After the estimation task had finished, there was a 200-ms delay before a vertical white line was presented at the center of the screen, with text to either side (reading “NO DOTS” and “DOTS,” respectively). Participants moved a cursor to the right or left of this line to indicate whether they had or had not seen a motion stimulus (detection task) and clicked the mouse button to confirm their choice. The cursor flashed green or red for a correct or incorrect detection response, respectively. The screen was then cleared and there was a 400-ms blank period before the beginning of the next trial.

^{2} above the 5.2 cd/m^{2} background. For each session, there were 250 trials at zero contrast and 100 trials at high contrast. Contrasts of other stimuli were determined using 4/1 and 2/1 staircases on detection performance (García-Pérez, 1998). For each session, there were 135 trials with the 2/1 staircase and 365 trials with the 4/1 staircase.

*a*) · *V*(*μ*, *κ*) + *a*/2*π*, where *a* is the proportion of trials where the participant makes random estimates, and *V*(*μ*, *κ*) is a von Mises (circular normal) distribution with mean *μ* and width determined by 1/*κ*, given by: *V*(*μ*, *κ*) = exp(*κ* cos(*θ* − *μ*))/(2*πI*_{0}(*κ*)). Parameters were chosen by maximizing the likelihood of generating the data from the distribution. Participants' estimation mean and standard deviation were taken as the circular mean and standard deviation of the von Mises distribution, *V*(*μ*, *κ*). The average biases obtained using this method were qualitatively similar to those obtained by simply averaging the estimated direction over trials, while the variances were significantly smaller and more consistent across participants and motion directions when the parametric fits were used. Therefore, in all of the following analyses, we used this parametric method to quantify performance in the estimation task.
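
The parametric fit described above can be illustrated with a short sketch. This is not the authors' code: it assumes the fitted density is the usual lapse mixture (1 − *a*) · *V*(*μ*, *κ*) + *a*/2*π* with all angles in radians, and the function names (`bessel_i0`, `von_mises`, `mixture_pdf`, `log_likelihood`) are hypothetical.

```python
import math

def bessel_i0(x, terms=80):
    """Modified Bessel function I0(x) from its power series (moderate x only)."""
    total, term = 1.0, 1.0
    for k in range(1, terms):
        term *= (x / (2.0 * k)) ** 2   # ratio of successive series terms
        total += term
    return total

def von_mises(theta, mu, kappa):
    """V(mu, kappa) evaluated at theta: exp(kappa*cos(theta - mu)) / (2*pi*I0(kappa))."""
    return math.exp(kappa * math.cos(theta - mu)) / (2.0 * math.pi * bessel_i0(kappa))

def mixture_pdf(theta, mu, kappa, a):
    """Assumed mixture: von Mises responses plus a fraction `a` of uniform random guesses."""
    return (1.0 - a) * von_mises(theta, mu, kappa) + a / (2.0 * math.pi)

def log_likelihood(estimates, mu, kappa, a):
    """Log-likelihood of a list of estimates (radians) under the mixture."""
    return sum(math.log(mixture_pdf(t, mu, kappa, a)) for t in estimates)
```

Maximizing `log_likelihood` over `mu`, `kappa`, and `a` with any general-purpose optimizer would then give the fitted mean and width for one participant and motion direction.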

*p* = 0.11 and *p* = 0.41, respectively, four-way within-subjects ANOVA). Therefore, we collapsed data across the two experimental sessions.

^{2} and 0.054 ± 0.001 cd/m^{2}, respectively; similar to the average luminance difference between the two levels (0.052 ± 0.004 cd/m^{2}). Further, there was no significant difference between the luminance levels achieved for the two staircases (*p* = 0.23, three-way within-subjects ANOVA). This was reflected in the estimation data: there was no significant difference between participants' estimation standard deviations for the two staircased contrast levels (*p* = 0.12, four-way within-subjects ANOVA). Therefore, we collapsed data across these contrast levels for all of the analyses described in the main text. Later, we looked at the effect of contrast level on participants' behavior by separating participants' responses at different luminance levels, depending on their detection performance at those levels. Details of this procedure are described in the Supplementary materials.

*p* = 0.87, four-way within-subjects ANOVA). There was also no significant three-way interaction between motion direction, experimental session, and detection response (*p* = 0.81, four-way within-subjects ANOVA). Therefore, we collapsed data across experimental sessions for analysis of the participants' responses when no stimulus was present.

*p* < 0.001, three-way within-subjects ANOVA; Figure 3a, gray). We quantified the probability ratio that participants made estimates that were close to the most frequently presented motion directions, relative to other directions, by multiplying the probability that they estimated within 8° of these motion directions by the total number of 16° bins (*p*_{rel} = *p*(*θ*_{est} = ±32(8)°)). This probability ratio would be equal to 1 if participants were equally likely to estimate within 8° of ±32° as they were to estimate within other 16° bins. We found that the median value of *p*_{rel} was significantly greater than 1, indicating that participants were strongly biased to report motion in the most frequently presented directions when no stimulus was presented (median(*p*_{rel}) = 2.7; *p* = 0.005, signed rank test, comparing *p*_{rel} to 1; Figure 3b).
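
The probability ratio *p*_{rel} can be sketched as follows. The text does not spell out the normalization, so this example makes two labeled assumptions: estimates may fall anywhere on the full 360° circle, and the probability is averaged over the two 16° windows (around +32° and −32°) so that uniform responding gives a ratio of 1. The function names are hypothetical.

```python
def circular_distance_deg(a, b):
    """Smallest absolute angular difference between two directions, in degrees."""
    return abs((a - b + 180.0) % 360.0 - 180.0)

def p_rel(estimates_deg, targets=(-32.0, 32.0), half_width=8.0):
    """Ratio of the probability of estimating near the targets to the chance level.

    Assumption: the per-window probability (averaged over the target windows)
    is multiplied by the number of 16-degree bins on the full circle.
    """
    n_bins = 360.0 / (2.0 * half_width)          # 22.5 bins of 16 degrees
    near = sum(
        1 for e in estimates_deg
        if any(circular_distance_deg(e, t) <= half_width for t in targets)
    )
    p_near = near / len(estimates_deg)
    return p_near / len(targets) * n_bins
```

Under these assumptions a value above 1 means that estimates cluster around ±32° more often than a uniform spread of responses would.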

*p* = 0.12, three-way within-subjects ANOVA; Figure 3a, red). Further, for these trials, participants were not significantly more likely to estimate close to the most frequently presented motion directions than other motion directions (median(*p*_{rel}) = 1.28; *p* = 0.13, signed rank test, comparing *p*_{rel} to 1; Figure 3b). Indeed, they were significantly more likely to report motion in the most frequently presented motion directions when they also reported detecting a stimulus compared to when they did not (*p* = 0.012, signed rank test, comparing the values of *p*_{rel} obtained for trials where participants either did or did not report seeing a stimulus in the detection task; Figure 3b).

*p* < 0.001, three-way within-subjects ANOVA; Figure 3a, blue), and they were biased to report motion in the two most frequently presented directions (median(*p*_{rel}) = 1.71; *p* < 0.001, signed rank test, comparing *p*_{rel} to 1). However, the size of this bias was reduced compared to the case when we looked only at trials where participants detected stimuli (*p* = 0.027, signed rank test, comparing the values of *p*_{rel} obtained for all trials with those for trials where participants reported seeing a stimulus in the detection task).

*p*_{rel}) = 2.11; *p* = 0.026, signed rank test, comparing *p*_{rel} to 1).

*p* = 0.008, signed rank test, comparing *p*_{rel} to 1 after 200 trials; see Supplementary Figure 3), indicating rapid learning of motion direction expectations.

*p* < 0.001, three-way within-subjects ANOVA).

*p* = 0.005 and *p* = 0.001, respectively, signed rank test). This indicates that on average, participants were biased to estimate stimuli as moving in directions that were closer to the most frequently presented motion directions (±32°) than they actually were.

*p* < 0.001, three-way within-subjects ANOVA).

*p* < 0.001, signed rank test; Figure 6b). Overall, there was a significant effect of motion direction on the fraction detected (*p* = 0.002, three-way within-subjects ANOVA).

*p* < 0.001, signed rank test; Supplementary Figure 7). Overall, there was a significant effect of motion direction on participants' reaction time (*p* = 0.003, three-way within-subjects ANOVA).

*θ*_{obs}. We parameterize the probability of observing the stimulus to be moving in a direction *θ*_{obs} by a von Mises (circular normal) distribution centered on the actual stimulus direction and with width determined by 1/*κ*_{l}: *p*_{l}(*θ*_{obs} ∣ *θ*) = *V*(*θ*, *κ*_{l}).

*θ*_{perc}) that is based entirely on their sensory observation so that *θ*_{perc} = *θ*_{obs}. However, on a certain proportion of trials, when participants are uncertain about whether a stimulus was present or not, they resort to their “expectations” by making a perceptual estimate that is sampled from a learned distribution, *p*_{exp}(*θ*). For simplicity, we parameterize this distribution as the sum of two circular normal distributions, each with width determined by 1/*κ*_{exp}, and centered on motion directions −*θ*_{exp} and *θ*_{exp}, respectively:

*α*, where participants make estimates that are completely random. Thus, the estimation response *θ*_{est} is related to the perceptual estimate *θ*_{perc} via the equation

*a*(*θ*) determines the proportion of trials that participants sampled from the “expected” distribution, *p*_{exp}(*θ*). For this model, free parameters that were fitted to the estimation data for each participant were the center and width of participants' “expected” distributions (determined by *θ*_{exp} and 1/*κ*_{exp}, respectively), the width of their sensory likelihood (determined by 1/*κ*_{l}), the fraction of trials where they made estimates by sampling from their “expected” distribution (*a*(*θ*)), the magnitude of the “motor” noise in their responses (determined by 1/*κ*_{m}), and the fraction of trials where they made estimations that were completely random (*α*).
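
The response-strategy model described above can be read as a generative process for a single trial. The sketch below is one such reading, not the authors' implementation. It assumes that the random-response check comes first, that the two modes of *p*_{exp}(*θ*) are equally weighted, and that motor noise is von Mises noise added to the perceptual estimate. The function and argument names are hypothetical, and all angles are in radians.

```python
import math
import random

def wrap(angle):
    """Wrap an angle to the interval (-pi, pi]."""
    return math.atan2(math.sin(angle), math.cos(angle))

def sample_add1_estimate(theta, a_theta, alpha, theta_exp,
                         kappa_exp, kappa_l, kappa_m, rng):
    """Draw one estimation response for a stimulus moving in direction theta."""
    # With probability alpha the response is completely random.
    if rng.random() < alpha:
        return rng.uniform(-math.pi, math.pi)
    if rng.random() < a_theta:
        # Sample the percept from the "expected" distribution:
        # one of two von Mises modes centred on +theta_exp or -theta_exp.
        centre = theta_exp if rng.random() < 0.5 else -theta_exp
        theta_perc = rng.vonmisesvariate(centre, kappa_exp)
    else:
        # Otherwise the percept equals the noisy sensory observation.
        theta_perc = rng.vonmisesvariate(theta, kappa_l)
    # Motor noise is added when the response is made (assumption).
    return wrap(rng.vonmisesvariate(theta_perc, kappa_m))
```

Simulating many trials with different values of `a_theta` shows how the proportion of "expected" samples shifts the response distribution toward ±*θ*_{exp}.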

*p*_{exp}(*θ*) was divided into two parts:

*θ*_{obs} or sampled from a learned distribution of expected motion directions. However, instead of sampling from a single distribution of expected motion directions, *p*_{exp}(*θ*), participants could now make estimates that were sampled either from the distributions *p*_{anticlockwise}(*θ*) or *p*_{clockwise}(*θ*), with a probability that was dependent on the actual stimulus motion direction. For example, on a single trial, a participant might be aware that the stimulus was moving “anticlockwise from center” and thus would be more likely to make an estimate that was sampled from the distribution *p*_{anticlockwise}(*θ*) than from *p*_{clockwise}(*θ*).

*a*(*θ*) and *b*(*θ*) were additional free parameters that determined the proportion of trials where participants sampled from each distribution.

*κ*_{exp}” set to zero.

*θ*_{obs}), with a probability *p*_{l}(*θ*_{obs} ∣ *θ*) = *V*(*θ*, *κ*_{l}). From Bayes' rule, the posterior probability that the stimulus is moving in a particular direction *θ*, given a sensory observation *θ*_{obs}, is obtained by multiplying the likelihood function (*p*_{l}(*θ*_{obs} ∣ *θ*)) with the prior probability (*p*_{prior}(*θ*)):

*p*_{prior}(*θ*), directly, we hypothesized that they learned an approximation of this distribution, denoted *p*_{exp}(*θ*). In our model, this “learned prior” was parameterized similarly to *p*_{prior}(*θ*) in ADD1 (see Equation 2).

*θ*_{exp}, by choosing the mean of the posterior distribution so that:

*Z* is a normalization constant. An alternative choice would be for the perceptual estimate to be given by the maximum of the posterior distribution. For our work, both methods gave qualitatively identical results.
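
The posterior-mean estimate can be illustrated numerically. The sketch below is a minimal illustration under stated assumptions, not the authors' code: the learned prior is taken to be two equally weighted von Mises modes at ±*θ*_{exp}, the posterior is evaluated on a grid, and the normalization constant is omitted because it cancels in the circular mean. The names `width_to_kappa` and `bayes_percept`, and the widths used in the usage note, are hypothetical.

```python
import math

def width_to_kappa(width_deg):
    """Convert a width in degrees to a concentration, taking width = kappa^(-1/2)."""
    return 1.0 / math.radians(width_deg) ** 2

def bayes_percept(theta_obs, theta_exp, kappa_exp, kappa_l, n_grid=720):
    """Circular mean of the posterior (likelihood x learned prior), in radians."""
    s = c = 0.0
    for i in range(n_grid):
        theta = -math.pi + 2.0 * math.pi * i / n_grid
        # Unnormalized bimodal prior centred on +theta_exp and -theta_exp.
        prior = (math.exp(kappa_exp * math.cos(theta - theta_exp))
                 + math.exp(kappa_exp * math.cos(theta + theta_exp)))
        # Unnormalized von Mises likelihood of the observation given theta.
        like = math.exp(kappa_l * math.cos(theta_obs - theta))
        w = prior * like
        s += w * math.sin(theta)
        c += w * math.cos(theta)
    return math.atan2(s, c)
```

For example, with an assumed prior width of 10° and likelihood width of 15°, `bayes_percept(math.radians(48), math.radians(32), width_to_kappa(10), width_to_kappa(15))` returns a direction between 32° and 48°, which is the attractive bias toward the expected direction.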

*θ*_{exp} and 1/*κ*_{exp}, respectively), the width of their sensory likelihood (determined by 1/*κ*_{l}), the magnitude of the “motor” noise in their responses (determined by 1/*κ*_{m}), and the fraction of trials where they made estimations that were completely random (*α*). We included two variants of the Bayesian model: “BAYES_var,” where the width of the likelihood function was allowed to vary with the stimulus motion direction, and “BAYES,” where it was held constant.

*κ*_{l} ∼ 0). Therefore, for all models, the distribution of estimations should be given by Equation 3, with the substitution, *θ*_{exp} = *θ*. We used this equation to fit participants' estimation distributions at high contrast (by maximizing the log probability of getting the observed data; see later), thus allowing us to approximate the “motor noise” (determined by 1/*κ*_{m}) for each participant.

*M*, we were able to calculate the probability of making an estimate *θ*_{est} given a stimulus moving in a direction *θ* (*p*(*θ*_{est} ∣ *θ*; *M*)). Assuming that participants' responses on each trial were independent, this allowed us to calculate the likelihood of generating our experimental data “*D*” from the particular model and parameter set *M*. We then chose model parameters to fit the data for each participant by maximizing the log of the likelihood function:

*θ*_{i} and *θ*_{i,data} represent the presented motion direction and the estimation response on the *i*th trial, respectively. We found the maximum of the likelihood function using a simplex algorithm (the Matlab function “fminsearch”). We were concerned that for some participants our model fits might converge to local rather than global maxima. To reduce this possibility, we ran the model fits with a range of initial values for *κ*_{l} and *κ*_{exp} (*κ*_{l}^{−1/2} and *κ*_{exp}^{−1/2} were varied independently in 2° increments, between 1° and 21°), selecting the model fit that produced the highest value for the log-likelihood. The results obtained were also found to be robust to changes in all of the other initial parameter values.
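
The multi-start procedure can be sketched as follows. This is a simplified illustration rather than the authors' fitting code: the model has a single free width, the response distribution is a plain von Mises centered on the presented direction, and a basic step-halving hill climb stands in for the simplex search of “fminsearch”. All function names are hypothetical.

```python
import math
import random

def log_i0(x):
    """log of the modified Bessel function I0, usable for large x."""
    if x < 20.0:
        total, term = 1.0, 1.0
        for k in range(1, 100):
            term *= (x / (2.0 * k)) ** 2
            total += term
        return math.log(total)
    # Leading terms of the large-x asymptotic expansion of I0.
    series = 1.0 + 1.0 / (8.0 * x) + 9.0 / (128.0 * x * x)
    return x - 0.5 * math.log(2.0 * math.pi * x) + math.log(series)

def neg_log_likelihood(width_deg, presented, responses):
    """Negative sum over trials of log p(response_i | presented_i; width)."""
    kappa = 1.0 / math.radians(width_deg) ** 2      # width = kappa^(-1/2)
    log_norm = math.log(2.0 * math.pi) + log_i0(kappa)
    return -sum(kappa * math.cos(r - t) - log_norm
                for t, r in zip(presented, responses))

def hill_climb(f, x0, step=1.0, lo=0.5, hi=60.0, tol=1e-3):
    """Local minimizer: move to the better neighbour, else halve the step."""
    x, fx = x0, f(x0)
    while step > tol:
        candidates = [c for c in (x - step, x + step) if lo <= c <= hi]
        best = min(candidates, key=f, default=None)
        if best is not None and f(best) < fx:
            x, fx = best, f(best)
        else:
            step /= 2.0
    return x, fx

def multi_start_fit(presented, responses):
    """Run the local search from widths of 1, 3, ..., 21 degrees; keep the best."""
    f = lambda w: neg_log_likelihood(w, presented, responses)
    fits = [hill_climb(f, float(start)) for start in range(1, 22, 2)]
    return min(fits, key=lambda fit: fit[1])
```

Keeping the fit with the lowest negative log-likelihood across starting points is the same selection rule as choosing the highest log-likelihood, and it reduces the chance of reporting a local optimum.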

*κ*_{m} (as this was obtained from the high contrast responses, not the low contrast responses that were the principal area of investigation), ADD1 and ADD2 required 9 and 14 free parameters, respectively: *κ*_{l}, *θ*_{exp}, *κ*_{exp}, and *α*, plus 5 values for *a*(*θ*), and for ADD2, another 5 values for *b*(*θ*) (one for each presented motion direction). ADD1_mode and ADD2_mode required 8 and 13 free parameters, respectively (one fewer than ADD1 and ADD2, respectively, as *κ*_{exp} was no longer a free parameter). BAYES required only four free parameters (*κ*_{l}, *θ*_{exp}, *κ*_{exp}, and *α*). BAYES_var required eight free parameters (including a value for *κ*_{l} for each presented motion direction).

*L* is the likelihood of generating the experimental data from the model, *k* is the number of parameters in the model, and *n* is the number of data points available. In general, given two estimated models, the model with the lower value of BIC is the one to be preferred (Schwarz, 1978). The first term of this expression accounts for the error between the data and the model predictions, while the second term represents a penalty for including too much complexity in the model.
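
The model comparison can be sketched in a few lines. The example assumes the standard Schwarz form, BIC = −2 ln *L* + *k* ln *n*, which matches the two terms described above. The function name and the log-likelihood and trial counts used in the usage note are illustrative, not values from the study.

```python
import math

def bic(log_likelihood, k, n):
    """Bayesian information criterion: -2 ln L + k ln n (lower is preferred)."""
    return -2.0 * log_likelihood + k * math.log(n)
```

With an equal log-likelihood of, say, −1000 over 500 trials, `bic(-1000.0, 4, 500)` for a four-parameter model such as BAYES is lower than `bic(-1000.0, 9, 500)` for a nine-parameter model such as ADD1, so the simpler model would be preferred unless the extra parameters improve the fit enough to offset the penalty.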

*p* = 0.002, *p* < 0.001, *p* = 0.003, *p* = 0.005, and *p* < 0.001, respectively; signed rank test). Thus, while a small minority of participants were not best fitted by the BAYES model (two participants exhibited a lower BIC value with the ADD1 model, two participants exhibited a lower BIC value with the ADD1_mode model, and two participants exhibited a lower BIC value with the ADD2_mode model), this model provided the best description of the data for the majority of participants.

*p* < 0.001, signed rank test), while there was no significant difference between the BAYES and ADD2_mode models.

*p*_{exp}(*θ*), to be symmetrical around 0°. Thus, even in the extreme case where all responses are sampled from this distribution, there would only be an attractive bias toward the central motion direction.

*p* = 0.91 and *p* = 0.34, respectively, for comparisons of the ADD1_mode and ADD2_mode models with the BAYES model; signed rank test).

*κ*_{l}, *θ*_{exp}, *κ*_{exp}, and *α*) were held constant across presented motion directions, in order for the “response bias” models (ADD1, ADD2, ADD1_mode, and ADD2_mode) to fit the data, additional free parameters were required (*a*(*θ*) and *b*(*θ*)), which had to be varied between different presented motion directions. Thus, for the ADD1 and ADD2 models to be valid, participants would have had to alter their response strategy, varying the proportion of trials where they sampled from their “expected” probability distributions, depending on the direction of the presented stimulus. In addition, the ADD1_mode and ADD2_mode models assumed that when participants were unsure about the presented motion direction, they made a perceptual estimate of motion direction that was exactly the same on each trial. This seems unrealistic: in reality there would be some trial-to-trial variation in the expected motion direction.

*R*^{2} value of 0.71. The behavior of individual participants was also well predicted by the model: the fits for participants' zero stimulus estimation distributions had a positive *R*^{2} value for 8 out of 12 of them. For these participants, the median *R*^{2} value was 0.65 (0.46, 0.83; 25th and 75th percentiles). The fact that the majority of participants' behavior in the absence of a stimulus could be predicted, based solely on their estimation responses in the presence of a stimulus, provides strong evidence in favor of the Bayesian model put forward here.
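
The goodness-of-fit measure above can take negative values, which is why a "positive" *R*^{2} is informative. The sketch below assumes the common definition *R*^{2} = 1 − SS_res/SS_tot, which the text does not state explicitly, and the function name is hypothetical.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Equals 1 for a perfect prediction, 0 when the prediction is no better
    than the mean of the observations, and is negative when it is worse.
    """
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot
```

Here `observed` would be the binned zero-stimulus estimation distribution for one participant and `predicted` the corresponding model prediction.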

*p* = 0.71, five-way within-subjects ANOVA).