Existing studies of sensory integration demonstrate how the reliabilities of perceptual cues or features influence perceptual decisions. However, these studies tell us little about the influence of feature reliability on visual learning. In this article, we study the implications of feature reliability for perceptual learning in the context of binary classification tasks. We find that finite sets of training data (i.e., the stimuli and corresponding class labels used on training trials) contain different information about a learner's parameters associated with reliable versus unreliable features. In particular, the statistical information provided by a finite number of training trials strongly constrains the set of possible parameter values associated with unreliable features, but only weakly constrains the parameter values associated with reliable features. Analyses of human subjects' performances reveal that subjects were sensitive to this statistical information. Additional analyses examine why subjects were sub-optimal visual learners.

*p*(*scene property* ∣ *feature value*). If this distribution has a small variance, then the feature provides highly precise or diagnostic information about the scene property and, thus, is regarded as a reliable feature. In contrast, if this distribution has a large variance, then the feature provides imprecise information about the scene property and is regarded as an unreliable feature.

*p*(*curvature* ∣ *stereo cue*) has a small variance) and, thus, is reliable, but the haptic cue provides imprecise information (i.e., *p*(*curvature* ∣ *haptic cue*) has a large variance) and, thus, is unreliable. In this case, the model will form its estimate of curvature as a weighted average of the estimate based on the visual cue and the estimate based on the haptic cue. Because the stereo cue is more reliable, the curvature estimate based on this cue will be assigned a large weight. In contrast, the haptic cue is less reliable, meaning that the curvature estimate based on it will be assigned a small weight.

*A* or class *B*. Auditory feedback indicates the correctness of the learner's decision.

*X*_{1} is an unreliable indicator of class membership, whereas *X*_{2} is a reliable indicator.

*S* = *w*_{1}*x*_{1} + *w*_{2}*x*_{2}, where *x*_{1} and *x*_{2} are the current stimulus values of the features *X*_{1} and *X*_{2}, respectively, and *w*_{1} and *w*_{2} are the learner's weights or parameters. If the sum *S* is positive, the learner is likely to decide that the stimulus belongs to class *A*; otherwise, the learner is likely to decide that the stimulus belongs to class *B*.
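As a minimal sketch of this decision rule (function names are illustrative, not from the original study):

```python
def weighted_sum(w, x):
    # S = w1*x1 + w2*x2, generalized to any number of features
    return sum(w_i * x_i for w_i, x_i in zip(w, x))

def classify(w, x):
    # a positive sum favors class A; otherwise class B
    return "A" if weighted_sum(w, x) > 0 else "B"
```

With weights such as [0.0, 2.0], the unreliable feature is ignored and the decision is driven entirely by the reliable feature.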

*w*_{1} and *w*_{2}. For us, an important question is: How much information does the training data (i.e., the 600 stimuli and their corresponding class labels, which were presented on the training trials) provide about good values of the parameters? To address this question, we examine the probability distributions of the parameters given the training data, *p*(*w*_{1} ∣ {*data*}) and *p*(*w*_{2} ∣ {*data*}).

*p*(*w*_{1} ∣ {*data*}) and *p*(*w*_{2} ∣ {*data*}) for the classification task illustrated in the left panel. For parameter *w*_{1}, the parameter associated with the unreliable feature *X*_{1}, the distribution is centered at zero and has a small variance. In other words, the training data indicate with high certainty that the value of this parameter should be zero. For parameter *w*_{2}, the parameter associated with the reliable feature *X*_{2}, the distribution is centered at a positive value and has a large variance. That is, the data indicate that feature *X*_{2} should be positively weighted, but there is significant uncertainty as to the exact value to which *w*_{2} should be set. Thus, according to the distributions in Figure 2, the training data provide very different statistical information about the parameters associated with reliable versus unreliable features.

*p*(*w*_{i} ∣ {*data*}) for all weights *w*_{i}, where {*data*} refers to the finite set of visual stimuli and their corresponding class labels used on training trials) indicates with high precision that an unreliable feature is unreliable. In contrast, the information provided by the data indicates with low precision the exact relevance of a reliable feature.

*A* is 0.7, but still judged the stimulus as belonging to class *B* on an experimental trial. If so, this would suggest that the subject engaged in “exploration”, a strategy that can be useful in many learning situations (Bellman, 1956; Sutton & Barto, 1998).

^{T} and [1 −1]^{T}. As illustrated in the leftmost column of Figure 3, three versions of the task were created differing in their covariance matrices. In all versions, the covariance matrices for classes *A* and *B* were identical, diagonal matrices. The covariance structures were isotropic in the first version, meaning that stimulus features *X*_{1} and *X*_{2} had equal variances (*σ*_{X1}^{2} = *σ*_{X2}^{2} = 1). Because of the placement of the mean vectors, and because these variances were equal, the two stimulus features were equally reliable indicators of class membership. The variance of *X*_{1} was relatively large and the variance of *X*_{2} was small in the second version (*σ*_{X1}^{2} = 25, *σ*_{X2}^{2} = 1). Consequently, *X*_{1} was an unreliable indicator of class membership, whereas *X*_{2} was reliable. In the final version, the variance of *X*_{1} was small and the variance of *X*_{2} was large (*σ*_{X1}^{2} = 1, *σ*_{X2}^{2} = 25), meaning that *X*_{1} was a reliable feature, but *X*_{2} was unreliable.

*A* (one minus this value is the probability that a stimulus belongs to class *B*). Let [*x*_{1} *x*_{2}]^{T} denote a stimulus, where *x*_{1} and *x*_{2} are the stimulus values for features *X*_{1} and *X*_{2}, respectively. Let *y* = 1 denote that the stimulus belongs to class *A*, and *y* = 0 denote that the stimulus belongs to class *B*. The logistic regressor works as follows. It first calculates a weighted sum, denoted *S*, of the stimulus feature values: *S* = Σ_{i} *w*_{i}*x*_{i}, where {*w*_{i}} is the set of parameters of the regressor. It then uses this weighted sum and the logistic function to calculate the probability that the stimulus belongs to class *A*: *p*(*y* = 1 ∣ stimulus) = 1 / (1 + *e*^{−S}).
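The logistic computation can be sketched in a few lines (illustrative function name):

```python
import math

def logistic_prob(w, x):
    # p(y = 1 | stimulus) = 1 / (1 + exp(-S)), with S = sum_i w_i * x_i
    S = sum(w_i * x_i for w_i, x_i in zip(w, x))
    return 1.0 / (1.0 + math.exp(-S))
```

Note that *p*(*y* = 0 ∣ stimulus) is simply one minus this value.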

[*w*_{1} *w*_{2}]^{T} of a logistic regressor. The maximum likelihood model is referred to as the ML model with infinite data. For each task version, its parameters were set to values that maximized the likelihood of a fictional data set containing an infinite number of data items: *w*_{i} = (*μ*_{i}^{A} − *μ*_{i}^{B}) / *σ*_{i}^{2}, where *μ*_{i}^{A} and *μ*_{i}^{B} are the values of feature *X*_{i} for the prototypes of classes *A* and *B*, respectively, and *σ*_{i}^{2} is the variance of feature *X*_{i} (Bishop, 2006).
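This formula is straightforward to compute; a sketch follows (the mean vectors used below are illustrative stand-ins, not necessarily the exact experimental values):

```python
def ml_infinite_weights(mu_A, mu_B, var):
    # w_i = (mu_i^A - mu_i^B) / sigma_i^2 for each feature i
    return [(a - b) / v for a, b, v in zip(mu_A, mu_B, var)]
```

For example, if the class means differ only on the second feature and the first feature has variance 25, the unreliable feature receives a weight of exactly zero.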

*p*(*y* = 1 ∣ stimulus) and *p*(*y* = 0 ∣ stimulus). The model used a vague prior distribution on each parameter, *p*(*w*_{i}) ∼ *N*(0, 100^{2}). A single chain was run, and 100,000 samples were collected. The first 10,000 samples were discarded as burn-in. After examining the autocorrelation function of the samples, the chain was then thinned to every 10th sample to reduce correlations among nearby samples. Thus, the results for the Bayesian model were based on 9,000 samples.^{1}

*w*_{1} and *w*_{2}, respectively. The point estimates of the parameter values for the ML model with infinite data are given by the red dashed lines. The distributions are the posterior marginal distributions calculated by the Bayesian model.

*p*({*data*} ∣ *w*_{1}, *w*_{2}), for each version of the task. For the first task version (left graph), in which stimulus features *X*_{1} and *X*_{2} are equally reliable, contours of equal likelihood are diagonally oriented ellipses. For the second task version (middle graph), in which *X*_{1} was an unreliable feature and *X*_{2} was reliable, the likelihood function in the local region near its peak is relatively steep along dimension *w*_{1} and flat along dimension *w*_{2}. In other words, the likelihood changes quickly as the value of *w*_{1} is perturbed. However, it changes slowly as the value of *w*_{2} is perturbed. For the final task version (right graph), in which *X*_{1} was a reliable feature and *X*_{2} was unreliable, the likelihood changes slowly along *w*_{1} and quickly along *w*_{2}.
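This steep-versus-flat pattern can be checked numerically. The sketch below uses illustrative class means and variances (not the exact experimental values) and hypothetical function names; it evaluates the logistic log-likelihood at the optimal weights and at perturbed weights:

```python
import math
import random

def log_sigmoid(s):
    # numerically stable log(1 / (1 + exp(-s)))
    if s >= 0:
        return -math.log1p(math.exp(-s))
    return s - math.log1p(math.exp(s))

def log_likelihood(w, data):
    # data: list of ((x1, x2), y) pairs; logistic log-likelihood of weights w
    ll = 0.0
    for x, y in data:
        s = sum(w_i * x_i for w_i, x_i in zip(w, x))
        ll += log_sigmoid(s) if y == 1 else log_sigmoid(-s)
    return ll

def simulate(n, rng):
    # illustrative task: class means (1, 1) and (1, -1);
    # X1 is unreliable (SD 5), X2 is reliable (SD 1)
    data = []
    for _ in range(n):
        y = 1 if rng.random() < 0.5 else 0
        x = (rng.gauss(1.0, 5.0), rng.gauss(1.0 if y == 1 else -1.0, 1.0))
        data.append((x, y))
    return data
```

Perturbing the weight on the unreliable feature lowers the log-likelihood far more than the same perturbation of the weight on the reliable feature, matching the contour plots described above.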

*A* were randomly set to either 1.0 or −1.0. The coefficients for class *B* were the negatives of the coefficients for class *A*. In addition, a matrix *K* was added to each prototype, where *K* consisted of the background luminance plus an arbitrary image constructed in the null space of the basis feature set (the addition of this arbitrary matrix prevented the prototypes from appearing as contrast-reversed versions of the same image). In summary, a prototype was computed using the following equation: prototype = Σ_{i} *c*_{i}*F*_{i} + *K*, where *F*_{i} is basis feature *i* and *c*_{i} is its corresponding linear coefficient.

{*c*_{i}} defining the prototype for that class. This was done by adding noise to each coefficient: the exemplar's value for feature *i* was set to *c*_{i} + *ɛ*_{i}, where *ɛ*_{i} is a random sample from a normal distribution with mean zero and variance *σ*_{i}^{2}. This variance is referred to as a feature's noise variance. Importantly, each feature had its own noise variance, and the magnitude of this variance determined the reliability of the feature. Features with small noise variances tended to have coefficient values near one of the class prototypes. Therefore, these features were highly diagnostic of whether an exemplar belonged to class *A* or *B*. In contrast, features with large noise variances tended to have coefficient values far from the class prototypes. These features were less diagnostic of an exemplar's class membership. To avoid outliers, if a feature's coefficient value was more than two standard deviations from the corresponding value for the prototype, then this value was discarded and a new value was sampled. Consequently, the exemplars from the two classes were linearly separable.
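The sampling scheme can be sketched as follows (illustrative function name; `random.Random` is used for reproducibility):

```python
import random

def sample_exemplar(prototype_coeffs, noise_sds, rng):
    # exemplar coefficient = prototype coefficient + eps, eps ~ N(0, sigma_i^2);
    # values more than 2 SDs from the prototype are discarded and resampled
    coeffs = []
    for c, sd in zip(prototype_coeffs, noise_sds):
        while True:
            value = c + rng.gauss(0.0, sd)
            if abs(value - c) <= 2.0 * sd:
                coeffs.append(value)
                break
    return coeffs
```

The 2-SD rejection step is what guarantees that exemplars from the two classes remain linearly separable.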

*A* and *B*. Subjects were instructed to decide which of the two prototypes had appeared in the test stimulus and responded by pressing the key corresponding to the selected prototype. Subjects received immediate auditory feedback after every trial indicating the correctness of their response. In addition, after every 15 trials, a printed message appeared on the screen indicating their (percent correct) performance on the previous 15 trials.

*σ*^{2} = 1). The remaining features served as unreliable features and were assigned a large noise variance (*σ*^{2} = 25). In Task 2, the roles of the two sets of features were swapped such that the reliable features were made unreliable, and the unreliable features were made reliable.

*A* or *B*, as opposed to the subject's responses or estimates of the correct class labels (the latter is considered in the next section).

*ML*_{IO}^{∞}, the parameters were set to values that maximized the likelihood function based on a fictional data set containing an infinite number of data items. As described in the Two-dimensional binary classification task section, parameter *w*_{i} was set using the equation *w*_{i} = (*μ*_{i}^{A} − *μ*_{i}^{B}) / *σ*_{i}^{2}, where *μ*_{i}^{A} and *μ*_{i}^{B} are the values of feature *X*_{i} for the prototypes of classes *A* and *B*, respectively, and *σ*_{i}^{2} is the variance of feature *X*_{i} (Bishop, 2006).

*BM*_{IO}, used finite data sets based on the subject's experimental trials. Recall that the experiment contained two tasks in which the sets of reliable and unreliable features were swapped between tasks. The trials devoted to each task were divided into 6 blocks of 600 trials each. A data item used when estimating *BM*_{IO}'s parameter values consisted of the representation of a test stimulus displayed on an experimental trial along with a class label for that stimulus. A stimulus was encoded by its representation in the space of visual basis features (i.e., the 20 linear coefficients used to construct the stimulus). The class label was set in a stochastic manner using the probabilities from the ML model with infinite data (i.e., the true posterior probabilities *p*(*y* = 1 ∣ stimulus) and *p*(*y* = 0 ∣ stimulus)). *BM*_{IO} used the set of data items associated with a single block of trials. Thus, it was simulated 12 times, once for each experimental block. On each simulation, the model inferred the joint distribution of its parameters using a Markov chain Monte Carlo sampling method (see 1). Because the two classes of data items in a data set were linearly separable in the space defined by the visual basis features, there are many different logistic regressors that could be fit to a data set. That is, the data did not provide a strong constraint on the model's distributions of parameters. As a result, the sampling procedure of a model with a vague prior distribution [e.g., *p*(*w*_{i}) ∼ *N*(0, 100^{2})] often did not converge within a reasonable number of iterations. We therefore used a prior distribution on each parameter with a small variance [*p*(*w*_{i}) ∼ *N*(0, 2)].^{2} Three Markov chains were run, and 100,000 samples were collected from each chain (see 1 for details on how the chains were initialized). The Gelman–Rubin scale reduction factor was used to diagnose convergence (Gelman, 1996).^{3} Based on this factor, the initial 10,000 samples from the first chain were discarded as burn-in. To reduce correlations among nearby samples, this chain was then thinned to every 10th sample. Thus, the posterior joint distributions of *BM*_{IO} were based on 9,000 samples.

*ML*_{IO}^{∞} are given by the red dashed lines. The distributions are the posterior marginal distributions calculated by *BM*_{IO}.

*BM*_{IO}'s parameters across all experimental blocks. The black lines correspond to parameters associated with reliable features in Task 1 (unreliable in Task 2), and the red lines correspond to parameters associated with unreliable features in Task 1 (reliable in Task 2). It seems that there are enough trials within a single block for *BM*_{IO} to learn the reliabilities of the features.

*A* or *B*. The most interesting result is that the posterior marginal distributions of the model's parameters had small variances for parameters associated with unreliable features, and large variances for parameters associated with reliable features. In other words, the information in the training data constrains the values of parameters associated with unreliable features with high precision but constrains the values of parameters associated with reliable features with low precision. We next report the results of a Bayesian model fit to the subject's experimental data. That is, this model is estimated from the subject's response, or estimate of the class label, on each experimental trial.

*BM*_{subj}, that used finite data sets based on the subject's trials in an experimental block. A data item consisted of the representation of a test stimulus displayed on a trial along with the subject's response or estimate of the correct class label for that stimulus. The model used a vague prior distribution [*p*(*w*_{i}) ∼ *N*(0, 100^{2})]. Three Markov chains were run, and 100,000 samples were collected from each chain (see 1 for further details). The Gelman–Rubin scale reduction factor was used to diagnose convergence (Gelman, 1996). Based on this factor, the first 10,000 samples from the first chain were discarded as burn-in. After examining the autocorrelation functions for the samples, the first chain was then thinned to every 10th sample to reduce correlations among nearby samples. The remaining samples were used to estimate the posterior joint distribution of *BM*_{subj}'s parameters.

*BM*_{subj}'s performances (black dots and lines; a dot indicates the mean and error bars denote one standard deviation around the mean) on each experimental block. The distribution of *BM*_{subj}'s performances on a block was obtained by sampling from its joint distribution of parameters. Clearly, *BM*_{subj} provides a good fit to the subject's performances.

*BM*_{subj} and the point estimates of the ideal observer *ML*_{IO}^{∞}. Define the “normalized dot product” to be the quantity (*w*_{subj} · *w*_{IO}) / (∥*w*_{subj}∥ ∥*w*_{IO}∥), where *w*_{subj} is a sample of parameter values drawn from the joint distribution of parameters for *BM*_{subj} and *w*_{IO} is the vector of parameter point estimates of *ML*_{IO}^{∞}. This quantity is analogous to a correlation coefficient (Michel & Jacobs, 2008, referred to the square of this quantity as “template efficiency”). It is near one when *w*_{subj} and *w*_{IO} are similar, and near zero when *w*_{subj} and *w*_{IO} are unrelated. Figure 9 shows the median normalized dot product (error bars show the 25th and 75th percentiles of the distribution of normalized dot products) at each experimental block. The black points and line show the data based on the ideal observer *ML*_{IO}^{∞} for Task 1 of the experiment, whereas the red points and line are based on the ideal observer for Task 2. Clearly, the parameter values of *BM*_{subj} are closer to the optimal point estimates based on Task 1's stimulus noise structure during the first half of the experiment. They are closer to the optimal estimates based on Task 2's noise structure during the second half of the experiment.
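Given its described behavior (near one for similar parameter vectors, near zero for unrelated ones), the normalized dot product is presumably the cosine of the angle between the two parameter vectors; a sketch under that assumption:

```python
import math

def normalized_dot(w_subj, w_io):
    # (w_subj . w_io) / (||w_subj|| * ||w_io||)
    dot = sum(a * b for a, b in zip(w_subj, w_io))
    norm_subj = math.sqrt(sum(a * a for a in w_subj))
    norm_io = math.sqrt(sum(b * b for b in w_io))
    return dot / (norm_subj * norm_io)
```

Note that this measure is invariant to the overall magnitude of either vector; it compares only their directions.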

*BM*_{subj} across all experimental blocks. Black lines correspond to parameters associated with reliable features in Task 1 (unreliable in Task 2), and red lines correspond to parameters associated with unreliable features in Task 1 (reliable in Task 2). Although there is considerable noise in the mean data, the overall trend is as expected; the black lines in the left graph tend to be at larger values in the first half of the experiment, and the red lines are at larger values in the second half. Importantly, the standard deviations are larger for parameters associated with reliable features, and smaller for parameters associated with unreliable features.

*BM*_{subj}. The graphs on the left and right are based on the trials in blocks 6 and 12, the final blocks for Tasks 1 and 2, respectively. The red lines show the parameter point estimates from *ML*_{IO}^{∞}, the ideal observer with infinite data described above (the red lines in Figures 6 and 11 are identical although the scales of the graphs are different).

*BM*_{subj} and *BM*_{IO}, the Bayesian models trained with the subject's responses and with the true posterior probabilities over class labels, respectively. Recall that *BM*_{IO}'s parameter distributions associated with unreliable features have small variances, and its distributions associated with reliable features have large variances. Above, we reasoned that this outcome follows from the nature of the constraints imposed by the training data. If people are sensitive to these constraints, then models that are fit to human subjects' responses will show similar behaviors. The results of *BM*_{subj} displayed in Figure 11 verify that this is indeed the case. The distributions of *BM*_{subj}, like those of *BM*_{IO}, have significantly larger variances for parameters associated with reliable features.

*BM*_{subj}'s distributions are smaller than those of *BM*_{IO}. This can be explained by the fact that the set of stimuli that the subject labeled as class *A* and the set that he or she labeled as class *B* overlapped (in the space defined by the visual basis features), whereas the true classes did not. As a consequence, the training data for *BM*_{subj} placed strong constraints on *BM*_{subj}'s possible parameter values. The constraints placed by the training data for *BM*_{IO} were comparatively weaker.

*BM*_{subj} is illustrated in Figure 11.

*BM*_{subj}'s parameters typically have expected values with correct signs. On both blocks 6 and 12, the expected values of 8 of the 10 parameters associated with reliable features have the same signs as the optimal point estimates of the ideal observer *ML*_{IO}^{∞}. However, these values are much smaller (in magnitude) than the optimal point estimates. This result is surprising because the (percent correct) performance of *BM*_{subj} would be significantly improved if its parameter distributions were located at larger values.^{4} There are at least two possible explanations for this outcome (see Eckstein, Abbey, Pham, & Shimozaki, 2004; Jacobs, 2009, for other discussions of sub-optimal visual learning).

*BM*_{subj}'s posterior marginal parameter distributions are located at small values.

*S* = Σ_{i} *w*_{i}*x*_{i}, is mapped to the probability that the subject judged a stimulus as belonging to class *A* (*y* = 1) using a modified logistic function: *p*(*y* = 1 ∣ stimulus) = 1 / (1 + *e*^{−S/β}) (the original logistic function is recovered by setting *β* = 1). In this new model, the parameter *β* is analogous to a variance parameter. If *β* is a small value (e.g., *β* = 0.1), then the model will tend to believe that a stimulus belongs to class *A* with a probability of either 1 or 0 (intermediate probabilities will be rare). In this case, the model is essentially deterministic, and the model is said to “exploit” its current knowledge. If *β* is a large value (e.g., *β* = 10), the model will tend to believe that a stimulus belongs to class *A* with an intermediate probability (extreme probabilities near 1 or 0 will be rare). It will appear to be at least partially random. For example, if the model believes that the probability that a stimulus belongs to class *A* is 0.6, then it will judge the stimulus as belonging to class *A* with a probability of 0.6 and will judge the stimulus as belonging to class *B* with a probability of 0.4. In this case, the model is said to “explore”. In the field of machine learning, there is much discussion of the advantages and disadvantages of exploration and exploitation. Exploration is often thought to be useful when a learner has incomplete knowledge of its environment or when an environment is non-stationary (Bellman, 1956; Sutton & Barto, 1998; note that the exploitation/exploration trade-off is closely related to a sub-optimal decision-making strategy known as “probability matching” [e.g., Newell, Lagnado, & Shanks, 2007]).

{*w*_{i}} and the parameter *β* in the modified logistic function. Consider a version of this new model in which the expected parameter values are relatively large in magnitude; in fact, suppose they are roughly equal to the optimal point estimates of the ideal observer *ML*_{IO}^{∞}. However, the parameter *β* in the new model is set to a moderately large value, meaning that the model is moderately random. This new model would show the same (percent correct) performance as the original model *BM*_{subj} (and as was shown by the subject). However, it leads to different implications about the subject's behavior. According to the original model, the subject was sub-optimal because he or she under-estimated the information carried by each reliable feature about a stimulus category. Based on the new model, the subject properly estimated the information carried by each feature, but the subject's performance was sub-optimal because he or she did not exploit this knowledge but rather engaged in exploratory behavior. Future research will need to design experiments that distinguish the predictions of these two models.

*p*(*w*_{i} ∣ {*data*}) for all parameters *w*_{i}, where {*data*} refers to the finite set of visual stimuli and their corresponding class labels used on training trials) strongly constrains the set of possible parameter values associated with unreliable features but only weakly constrains the possible parameter values associated with reliable features.

*a priori* that a parameter is unlikely to have a large value, this information can be incorporated by placing an appropriately chosen prior distribution (one that has a small mass over large values) over that parameter. The use of prior information makes inference more robust and less variable by constraining the set of possible values that parameters can take.

*i*th data item consist of a vector of covariate variables, denoted *x*_{i}, and a scalar response variable, denoted *y*_{i}. In addition, let *z*_{i} denote a latent variable such that *z*_{i} = *x*_{i}^{T}*w* + *ɛ*_{i}, where *ɛ*_{i} is a sample from a standard logistic distribution. The response variable *y*_{i} is related to the latent variable *z*_{i} by the following equation: *y*_{i} = 1 if *z*_{i} > 0, and *y*_{i} = 0 otherwise.

*N*(*μ*, *σ*^{2}*I*), with mean vector *μ* and covariance matrix *σ*^{2}*I*. (In this case, it is difficult to construct an efficient Gibbs sampler because the full conditional distribution of the weights has a standard form only if *ɛ*_{i} is distributed according to a Gaussian distribution.) H&H solved this problem by introducing an additional latent variable, denoted *λ*_{i}, and by making the noise variable dependent on this new latent variable as follows: *ɛ*_{i} ∣ *λ*_{i} ∼ *N*(0, *λ*_{i}), with *λ*_{i} = (2*ψ*_{i})^{2} and *ψ*_{i} ∼ *KS*, where *KS* is the Kolmogorov–Smirnov distribution. Importantly, the conditional distribution of *ɛ*_{i} given *λ*_{i} is Gaussian, whereas the marginal distribution of *ɛ*_{i} is logistic (Andrews & Mallows, 1974).

*Logistic*(*x*_{i}^{T}*w*, 1, *y*_{i}) is a truncated logistic distribution with mean *x*_{i}^{T}*w* whose truncation side depends on *y*_{i}: if *y*_{i} = 1, the distribution is truncated below 0; otherwise, it is truncated above 0. In these equations, *X* is a matrix whose *i*th row is the covariate variable *x*_{i}, and *z* and *λ* denote the collections of latent variables {*z*_{i}} and {*λ*_{i}}, respectively. H&H used a rejection sampling method to sample from the conditional distribution of *λ*_{i} because this distribution does not have a standard form.

*BM*_{IO}) produced a single chain of 100,000 samples. The variables {*λ*_{i}} were initialized to 1, and the variables {*z*_{i}} were initialized to values sampled from a truncated logistic distribution with mean parameter 0 and scale parameter 1 (the side of truncation depended on *y*_{i}). The first 10,000 samples of the chain were discarded as burn-in, and the remaining samples were then thinned to every 10th sample.
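One standard way to draw such truncated samples is inverse-CDF sampling; a sketch follows (illustrative function name), assuming truncation below 0 when *y*_{i} = 1 and above 0 otherwise:

```python
import math
import random

def sample_truncated_logistic(mean, scale, y, rng):
    # logistic CDF: F(z) = 1 / (1 + exp(-(z - mean) / scale))
    F0 = 1.0 / (1.0 + math.exp(mean / scale))  # CDF evaluated at 0
    if y == 1:
        u = F0 + rng.random() * (1.0 - F0)     # uniform over [F(0), 1): yields z > 0
    else:
        u = rng.random() * F0                  # uniform over [0, F(0)): yields z < 0
    # invert the CDF: z = mean + scale * logit(u)
    return mean + scale * math.log(u / (1.0 - u))
```

Because the logistic CDF has a closed-form inverse, no rejection step is needed for this part of the sampler.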

*BM*_{IO} and *BM*_{subj} each produced three chains of 100,000 samples for each experimental block. In Chain 1, the variables {*λ*_{i}} were initialized to 1, and the variables {*z*_{i}} were initialized to values sampled from a truncated logistic distribution with mean parameter 0 and scale parameter 1. In Chain 2, the variables {*λ*_{i}} were initialized to values sampled from a uniform distribution on the interval [0.5, 1.5], and the variables {*z*_{i}} were initialized to values sampled from a truncated logistic distribution whose mean was sampled from a uniform distribution on the interval [0, 1] and whose scale was set to 5. Chain 3 was initialized in the same manner as Chain 2. Relative to Chain 2, however, it reversed the update order of the variables {*z*_{i}} and {*λ*_{i}}. The first 10,000 samples of Chain 1 were discarded as burn-in, and the remaining samples were thinned to every 10th sample.

^{1}In our research, we also considered models containing lapse parameters (Wichmann & Hill, 2001). These models are useful when subjects' responses seem to be random (stimulus-independent) guesses on significant numbers of trials. However, we found that the subjects in Michel and Jacobs (2008) had small lapse rates, and thus, we omit models with lapse parameters from this article.

^{2}For a binary classification task with linearly separable classes, a maximum likelihood estimator of a logistic regressor's weights is not well defined because the likelihood function can always be increased by increasing the magnitudes of the weights. To circumvent this problem, practitioners typically seek weights that maximize the likelihood function and are not too large in magnitude (so-called maximum penalized likelihood estimation). In a Bayesian setting, this corresponds to placing a relatively restrictive prior distribution on the logistic weights.

^{3}Roughly, the Gelman–Rubin scale reduction factor is a mathematical tool designed to detect when multiple chains, each initialized in its own way, are showing similar statistical properties, meaning that the chains have converged to the same distribution. The time period prior to convergence is referred to as “burn-in”, and the chains' samples during burn-in are discarded.

^{4}The subject's performance (and, thus, *BM*_{subj}'s performance) was sub-optimal. To better understand why, we did the following. We fit a logistic regressor to the subject's responses using maximum likelihood estimation. It could be that the vector of parameter estimates is too small in magnitude, points in the wrong direction, or both. We scaled the magnitude of this vector, maintaining its direction, and measured the performance of a logistic regressor whose parameter values were set to this scaled vector. By increasing the magnitude of the vector, a logistic regressor could increase its performance from about 77% correct to 83% correct on block 6, and from 83% correct to 90% correct on block 12. The remaining error is due to the fact that this vector points in the wrong direction.