People routinely perform multiple visual judgments in the real world, yet intermixing tasks or task variants during training can damage or even prevent learning. This paper explores why. We challenged theories of visual perceptual learning focused on plastic retuning of low-level retinotopic cortical representations by placing different task variants in different retinal locations, and tested theories of perceptual learning through reweighting (changes in readout) by varying task similarity. Discriminating different (but equivalent) and similar orientations in separate retinal locations interfered with learning, whereas training either with identical orientations or with sufficiently different ones in different locations released rapid learning. This location crosstalk during learning renders it unlikely that the primary substrate of learning is retuning in early retinotopic visual areas; instead, learning likely involves reweighting from location-independent representations to a decision. We developed an Integrated Reweighting Theory (IRT), which has both V1-like location-specific representations and higher-level (V4/IT or higher) location-invariant representations, and which learns by reweighting the readout to decision, to predict the order of learning rates in the different conditions. With suitable parameters, this model successfully fit the behavioral data, as well as some microstructure of learning performance in a new trial-by-trial analysis.

The *All* condition intermixed training of four different reference angles spaced in orientation, one per location. This is a perfect setup for substantial roving interference in learning—if learning in different locations interacts—because opposite responses are required for every set of adjacent Gabor stimuli in the representation of orientation (see Figure 2b). Two other groups intermixed training of two tasks, each occurring in two locations, with either more similar reference angles (*Near*) or quite dissimilar reference angles (*Far*). Finally, a no-roving condition (*Single*) trained the same reference in all four locations. (See Methods, Simulation Methods, for details.)

*All* (*n* = 4) and *Near* (*n* = 4) conditions, compared to the easier *Far* (*n* = 1) and *Single* (*n* = 1) conditions. (As a consequence, if anything, the results may slightly underestimate the learning-rate differences between groups.) Each observer who completed the study performed 7,680 trials over eight training sessions (960 per session), or 368,640 trials over all observers.

*f* = 1.33 cpd, SD of the Gaussian envelope σ = 0.5 degrees, maximum contrast *c*, and *l* is the mid-gray background luminance. Each external noise image, newly generated for each trial, was composed of 2 × 2 pixel noise elements with contrasts randomly chosen from a Gaussian distribution with mean 0 and SD 0.25. External noise images and signal Gabor images (NNSNN) were displayed sequentially at the frame rate (see Procedure). The 64 × 64 pixel images subtended 3° × 3° of visual angle, located at 5.67° eccentricity, at a viewing distance of 72 cm. Stimuli were generated in MATLAB with PsychToolbox on a Macintosh G4 computer using the internal 10-bit video card, with a refresh rate of 67 Hz and a resolution of 640 × 480 pixels, and displayed on a 19-in. ViewSonic color monitor in pseudo-monochrome. A lookup table, estimated by a visual calibration procedure (Lu & Dosher, 2013) and validated by photometric measurement, linearized the luminance range into 127 levels from 1 cd/m² to 67 cd/m²; the mid-gray background luminance was 34 cd/m². The observer's head was stabilized using a chin rest.

In the *All* condition, each of the four locations used a different reference angle (i.e., −67.5°, −22.5°, 22.5°, or 67.5° from vertical for the lower left, upper left, upper right, and lower right positions). In the *Near* condition, two closer reference angles were used in two diagonal positions (e.g., −22.5° and 22.5°). In the *Far* condition, two dissimilar reference angles were used in two diagonal positions (e.g., −67.5° and 22.5°). In the *Single* condition, the same reference angle occurred in all locations. Zero and high external noise test conditions were intermixed. There were 8 sessions of 960 trials each. Adaptive methods (see below) were used to measure contrast thresholds at 75% correct for each location and external noise condition separately (120 trials each within a session).

where *n* is the trial number, *X_n* is the stimulus contrast in trial *n*, *Z_n* = 0 or 1 is the response accuracy in trial *n*, *X_{n+1}* is the contrast for the next trial, and *s* is the step size pre-chosen at the beginning of the trial sequence. From the third trial on, the sequence is “accelerated”: \({X_{n + 1}} = {X_n} - \frac{s}{{2 + {m_{shift}}}}( {{Z_n} - \phi } )\), where *m_shift* is the number of shifts in response category (from correct response to incorrect response and vice versa). In our application, the method was modified such that while *m_shift* = 0, the increased contrast on an error is capped at 0.125*s*. See Treutwein (1995) for a discussion of this adaptive method, and Lu and Dosher (2013) for an analysis of the convergence properties and guidelines for step sizes and starting values.
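The accelerated update rule above can be sketched in Python (a minimal illustration; the function name and parameter defaults are ours, not from the original MATLAB implementation):

```python
def asa_step(x_n, z_n, s, m_shift, phi=0.75):
    """One accelerated stochastic approximation update (Kesten, 1958).

    x_n: stimulus contrast on trial n
    z_n: response accuracy on trial n (1 = correct, 0 = error)
    s: pre-chosen step size
    m_shift: number of shifts in response category so far
    phi: target accuracy tracked by the staircase
    """
    # Negative after a correct response, positive after an error.
    delta = -(s / (2.0 + m_shift)) * (z_n - phi)
    # Modification used here: while m_shift == 0, cap the
    # error-driven contrast increase at 0.125 * s.
    if m_shift == 0 and delta > 0:
        delta = min(delta, 0.125 * s)
    return x_n + delta
```

After a correct response the contrast drops by *s*(1 − ϕ)/(2 + *m_shift*); after an error it rises by *s*ϕ/(2 + *m_shift*), so the sequence converges on the contrast yielding roughly 75% correct.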

*C*(*t*) = λ(*t* + 1)^{−β} + α, with initial threshold λ + α, asymptotic threshold α, learning rate β, and training block *t*. The curves for the four roving conditions were tested for significant differences with a lattice of nested *F*-tests, each of which compares a restricted model to a fuller model of which it is a proper subset. For example, if roving conditions actually differ in learning rate (or any other parameter), constraining the model system to equate that parameter will significantly reduce the quality of fit. The proportion of variance accounted for by a model is \(r^2 = 1 - \frac{\sum {{( {x_{pred} - x_{obs}} )}^2}}{\sum {{( {x_{obs} - \bar x} )}^2}}\), where the sums run over the *N* observations and \(\bar x\) is the mean of the observed values.

*F*-tests for nested models compared the fit of the fuller and reduced models: \(F( {d{f_1},d{f_2}} ) = \;\frac{{( {r_{full}^2 - r_{reduced}^2} )/d{f_1}}}{{( {1 - r_{full}^2} )/d{f_2}}}\), where \(df_1 = k_{full} - k_{reduced}\) and \(df_2 = N - k_{full} - 1\). The *k*'s are the numbers of model parameters. The *F*-test computes the ratio of the improvement in error variance per additional parameter in the *fuller* model to the (average) error variance per degree of freedom.
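The nested-model comparison can be written directly from the r² values (a sketch; the example numbers in the usage note are hypothetical, not values from the paper):

```python
def nested_f_test(r2_full, r2_reduced, k_full, k_reduced, n_obs):
    """F statistic for comparing a reduced model nested within a fuller model.

    df1 = k_full - k_reduced (parameters freed by the fuller model)
    df2 = n_obs - k_full - 1 (error degrees of freedom)
    """
    df1 = k_full - k_reduced
    df2 = n_obs - k_full - 1
    f = ((r2_full - r2_reduced) / df1) / ((1.0 - r2_full) / df2)
    return f, df1, df2
```

For example, with hypothetical fits r² = 0.95 for a 12-parameter model and r² = 0.90 for a nested 9-parameter model over 54 observations, this gives *F*(3, 41) ≈ 13.67.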

*h_f* = 1 octave, and the bandwidth of the orientation tuning was set at *h_θ* = 30° (half-amplitude full bandwidth), based on estimated cellular tuning bandwidths in primary visual cortex. The location-invariant representations were more broadly tuned, with bandwidths 1.6 times those of the location-specific units, and they also had more internal noise (Dosher et al., 2013). The descriptions of the representation, decision, and learning modules are similar to those in earlier treatments (Liu et al., 2015; Petrov et al., 2005).

The representation module computes the activations *A*(θ, *f*) of the orientation- and frequency-selective representation units, whether location-specific or location-invariant, in response to the stimulus image(s); this measures the normalized spectral energy in those channels. Sets of retinotopic phase-sensitive maps *S*(*x*, *y*, θ, *f*, ϕ) are applied to the input image *I*(*x*, *y*): *S*(*x*, *y*, θ, *f*, ϕ) = [*RF*_{θ,f,ϕ}(*x*, *y*) ⊗ *I*(*x*, *y*)], for spatial frequency *f*, orientation θ, and spatial phase ϕ. The input (stimulus) image *I*(*x*, *y*) is convolved with the filter for each spatial-frequency/orientation unit by fast Fourier transform, followed by half-squaring rectification, to produce phase-sensitive activation maps analogous to “simple cells.” These are pooled over spatial phase, \(E( {x,y,\theta ,f} ) = \mathop \sum \limits_\phi ^{} S( {x,y,\theta ,f,\phi } ) + {\varepsilon _1}\), and subjected to inhibitory normalization (Heeger, 1992): \(C( {x,y,\theta ,f} ) = \frac{{aE( {x,y,\theta ,f} )}}{{k + N( f )}}\). The noise ε₁ is Gaussian-distributed internal noise with mean 0 and standard deviation σ₁. The normalization pool *N*(*f*) is independent of orientation and only modestly tuned for spatial frequency, as suggested by the physiology. The parameter *a* is a scaling factor, and *k* is the saturation constant that prevents division by zero at very low contrasts. Because the observer judges orientation in this task, activations were pooled over spatial phase and then spatially pooled with a Gaussian kernel *W_r*(*x*, *y*) around the target Gabor. Another Gaussian-distributed noise with mean 0 and SD σ₂ introduces a second source of stochastic variability: \(A^{\prime}( {\theta ,f} ) = \mathop \sum \limits_{x,y}^{} {W_r}( {x,y} )C( {x,y,\theta ,f} ) + {\varepsilon _2}\). Finally, the activations of the representation units are limited to a range by a nonlinear function with gain parameter γ: \(A( {\theta ,f} ) = \begin{cases} \frac{{1 - {e^{ - \gamma A^{\prime}}}}}{{1 + {e^{ - \gamma A^{\prime}}}}}{A_{max}}, & {\rm{if}}\;A^{\prime} \ge 0\\ 0, & {\rm{otherwise}} \end{cases}\).
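The last two stages (divisive normalization and the range-limiting nonlinearity) can be sketched as scalar Python functions (noise terms omitted; the parameter values shown are illustrative, not the fitted ones):

```python
import math

def normalize(e, pool, a=1.0, k=0.1):
    """Divisive (inhibitory) normalization: C = a*E / (k + N(f))."""
    return a * e / (k + pool)

def range_limit(a_prime, gamma=2.0, a_max=1.0):
    """Sigmoidal range-limiting nonlinearity applied to the pooled activation A'.

    Returns 0 for negative inputs and saturates at a_max for large inputs.
    """
    if a_prime < 0:
        return 0.0
    return (1.0 - math.exp(-gamma * a_prime)) / (1.0 + math.exp(-gamma * a_prime)) * a_max
```

The saturation constant *k* keeps the normalized response finite at very low contrasts, and the sigmoid keeps unit activations within [0, *A_max*].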

where the *w_i* are the current weights on the representation units, *b* is a bias term integrated with weight *w_b*, and ε_d (Gaussian, mean 0, SD σ_d) is decision noise. A sigmoidal function with gain γ transforms it into an “early” post-synaptic decision activation *o*′: \(o^{\prime} = G( u ) = \frac{{1 - {e^{ - \gamma u}}}}{{1 + {e^{ - \gamma u}}}}{A_{max}}\), with negative and positive values mapping to CCW and CW responses, respectively.

Feedback (*F* = ±1), when available, moves the “late” post-synaptic activation in the decision unit *o* toward the correct response: *o* = *G*(*u* + *w_f F*). If the feedback weight *w_f* is high, the activation of the decision unit approaches the correct positive or negative maximum (±*A_max* = ±1). In the absence of feedback (*F* = 0), learning operates on the early decision activation (*o* = *o*′), which is often intermediate. Learning occurs by changing the connection strengths *w_i* from sensory representation units *i* to the decision unit on each trial. The weight changes depend on the activation at the pre-synaptic connection, *A_i*(θ, *f*); the post-synaptic activation relative to its long-term average, (*o* − \(\bar{o}\)); the distance of the weight from the minimum or maximum saturation value (*w_min* or *w_max*); and the (system) learning rate η. The change in weight is Δ*w_i* = (*w_i* − *w_min*)[δ_i]_− + (*w_max* − *w_i*)[δ_i]_+, where δ_i = η*A_i*(θ, *f*)(*o* − \(\bar{o}\)), and the time-weighted average of the post-synaptic activation is \(\bar{o}\)(*t* + 1) = ρ*o*(*t*) + (1 − ρ)\(\bar{o}\)(*t*). This Hebbian learning rule is augmented both by feedback (when it occurs in the behavioral experiment) and by information in the bias control unit *b* that contributes to the early decision activation. The bias is a time-weighted average of responses *r*(*t*), weighted exponentially with time constant ρ = 0.02 (about 50 trials): *r*(*t* + 1) = ρ*R*(*t*) + (1 − ρ)*r*(*t*). The bias serves as a counter to deviations from 50/50 response histories (assuming symmetric experimental designs). Bias control tracks the observer's responses, while feedback tracks the external teaching signal.
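One trial of this augmented Hebbian rule can be sketched in Python (a toy two-unit example; decision noise and the bias unit are omitted, and all parameter values are illustrative):

```python
import math

def G(u, gamma=5.0, a_max=1.0):
    """Sigmoidal activation function of the decision unit."""
    return (1.0 - math.exp(-gamma * u)) / (1.0 + math.exp(-gamma * u)) * a_max

def learn_trial(weights, acts, o_bar, feedback, eta=0.05, w_f=1.0,
                w_min=-1.0, w_max=1.0, rho=0.02):
    """One trial of augmented Hebbian reweighting.

    weights: current weights w_i; acts: pre-synaptic activations A_i
    o_bar: running average of post-synaptic activation
    feedback: +1/-1 when present, 0 when absent
    """
    u = sum(w * a for w, a in zip(weights, acts))  # early decision input
    o = G(u + w_f * feedback)                      # late post-synaptic activation
    new_w = []
    for w, a in zip(weights, acts):
        delta = eta * a * (o - o_bar)
        if delta >= 0:                             # soft bound at w_max
            w = w + (w_max - w) * delta
        else:                                      # soft bound at w_min
            w = w + (w - w_min) * delta
        new_w.append(w)
    o_bar = rho * o + (1.0 - rho) * o_bar          # update long-term average
    return new_w, o_bar
```

The soft bounds shrink the effective step as a weight approaches *w_min* or *w_max*, so weights stay in range without hard clipping.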

The free parameters were the internal additive noise (σ₁) and internal multiplicative noise (σ₂) for the location-specific and the location-independent representations, a decision noise (σ_d), and the learning rate (η). A scaling factor (*a*) matched the initial performance of the different randomly assigned observer groups, and could differ slightly between groups if necessary.

The quality of fit was indexed by the *r*² between the mean contrast thresholds from the simulation and the average contrast thresholds in the experiment. The fit was also assessed using Kendall's τ, a statistic that measures the consistency of the ordinal predictions between the model and the data. In this application, Kendall's τ was lower than *r*² because some conditions led to similar predicted outcomes and so could easily trade ordinal positions in the data (e.g., the Far and Single groups).
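Kendall's τ counts agreements in ordering between predicted and observed values; a minimal tau-a version (ignoring tie corrections, which may differ from the variant actually used) is:

```python
def sign(v):
    return (v > 0) - (v < 0)

def kendalls_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant pairs) / (n choose 2)."""
    n = len(xs)
    s = sum(sign(xs[i] - xs[j]) * sign(ys[i] - ys[j])
            for i in range(n) for j in range(i + 1, n))
    return 2.0 * s / (n * (n - 1))
```

A single swapped pair lowers τ well below 1 even when r² stays high, which is why near-tied predictions (like those for Far and Single) depress τ relative to r².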

Differences among the four training conditions (*All*, *Near*, *Far*, and *Single*) were tested using analysis of variance on the contrast thresholds, with external noise (zero and high) and training block as within-observer factors, roving group as a between-observer factor, and observers as the random factor (α = 0.05). Higher contrast was, of course, required to accurately judge the stimuli embedded in external noise (*F*(1, 44) = 418.44; *p* < 0.0001; \(\eta _p^2\) = 0.905). Learning reduced contrast thresholds over training sessions (*F*(7, 308) = 52.85; *p* < 0.0001; \(\eta _p^2\) = 0.5457). Of central importance for this study, the four training groups showed different amounts of learning (*F*(3, 44) = 5.025; *p* < 0.005; \(\eta _p^2\) = 0.255), with large differences in learning emerging after several sessions. There was also an interaction between external noise and training group (*F*(3, 44) = 2.750; *p* < 0.05; \(\eta _p^2\) = 0.158), and among training condition, external noise, and block (*F*(21, 308) = 1.48; *p* ≈ 0.08; \(\eta _p^2\) = 0.092). The methods of Masson (2011) were used to compute Bayes information criterion probabilities, p_BIC(H₁|D) (essentially a transformation of \(( {1 - \eta _p^2} )\)): p_BIC(H₁|D) > 0.999 for the effects of training (roving) group, blocks of training, external noise, and the group by external noise interaction; p_BIC(H₁|D) < 0.001 for the interaction among training condition, external noise, and session.

Pairwise comparisons were significant (*p* < 0.0001, except *All* versus *Near*, *p* < 0.01), although the *Far* and *Single* conditions were statistically equivalent (n.s.; Bonferroni-corrected α = 0.008). See Appendix A, 1 for equivalent results of analyses carried out separately in the zero and high external noise conditions. As described later, the IRT reweighting model predicts this same order of learning rates: *All* < *Near* < *Far* ≈ *Single*, from least to most.

*C*(*t*) = λ(*t* + 1)^{−β} + α, where *C*(*t*) is the threshold in session *t*, λ + α is the initial threshold, α is the asymptotic value late in training, and β is the learning rate. Power functions provide a good description of average contrast-threshold learning functions (Dosher & Lu, 2007). In this case, the training sessions started at *t* = 1 because the thresholds reflect session-end performance, and we tested the equality of pre-training thresholds at *t* = 0 in additional nested model tests (see Appendix A, 1).
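Because the model is linear in λ and α once β is fixed, the power function can be fit with a simple grid-plus-least-squares scheme (our own sketch, not the fitting code used in the paper):

```python
def fit_power(ts, cs):
    """Fit C(t) = lam * (t + 1) ** (-beta) + alpha to thresholds cs at sessions ts.

    Scans beta on a grid; at each beta, lam and alpha have a closed-form
    ordinary-least-squares solution because the model is linear in them.
    """
    n = len(ts)
    best = None
    for i in range(10, 201):                 # beta grid: 0.10 .. 2.00
        beta = i / 100.0
        us = [(t + 1) ** (-beta) for t in ts]
        u_bar = sum(us) / n
        c_bar = sum(cs) / n
        den = sum((u - u_bar) ** 2 for u in us)
        lam = sum((u - u_bar) * (c - c_bar) for u, c in zip(us, cs)) / den
        alpha = c_bar - lam * u_bar
        sse = sum((c - (lam * u + alpha)) ** 2 for u, c in zip(us, cs))
        if best is None or sse < best[0]:
            best = (sse, lam, beta, alpha)
    return best[1], best[2], best[3]         # lam, beta, alpha
```

A full lattice analysis would fit the four conditions jointly with shared or separate λ, β, and α, and compare nested variants with the F-test from the Methods.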

*r*² = 0.9411) and βs = 0.5538, 0.7936, 1.3242, and 1.2836 (λ = 0.8979 and α = 0.3262) for high external noise (*r*² = 0.9554), listed for *All*, *Near*, *Far*, and *Single*, from slower to faster. A lattice of subcase models and nested significance tests rejected more complicated models (see the discussion in Appendix A, 2, and Tables A.1 and A.2). The SDs of the estimated parameters, computed using bootstrap methods, are listed in Table A.3. The parameter SDs are relatively large, reflecting slight threshold-level differences between observer groups and parameter correlations (added variance from parameter correlations was partially discounted in the SDs of normalized rates). Despite this, the ordinal consistency of the four rates across bootstrap samples, which is perhaps more meaningful, was very high. For example, in zero noise, β_All was slower than β_Single, β_Far, and β_Near in 998, 949, and 786 fits, respectively, out of 1,000 fits to resampled data sets; in high noise, β_All was slower than β_Single in 1,000 fits, slower than β_Far in 1,000 fits, and slower than β_Near in 950 of 1,000 fits to resampled data sets (ordinal statistics are also listed in Table A.3). Consistent with the ANOVA tests, in high noise β_Far was slower than β_Single in only 469 of 1,000 fits—the two are not significantly different from each other.
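The ordinal-consistency logic of the bootstrap can be illustrated with a small sketch (synthetic per-observer rates, not the real data; the paper's procedure refit the full power functions to each resampled data set, whereas this toy version simply resamples rate estimates):

```python
import random

def prop_slower(rates_a, rates_b, n_boot=1000, seed=1):
    """Proportion of bootstrap resamples in which group A's mean learning
    rate is slower (smaller) than group B's."""
    rng = random.Random(seed)
    count = 0
    for _ in range(n_boot):
        a = [rng.choice(rates_a) for _ in rates_a]   # resample with replacement
        b = [rng.choice(rates_b) for _ in rates_b]
        if sum(a) / len(a) < sum(b) / len(b):
            count += 1
    return count / n_boot

# Hypothetical per-observer rates for two well-separated groups:
all_rates = [0.52, 0.58, 0.49, 0.61, 0.55, 0.57]
single_rates = [1.25, 1.31, 1.22, 1.36, 1.28, 1.30]
```

With well-separated groups the proportion approaches 1.0; heavily overlapping groups (like Far and Single in high noise) land near 0.5.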

(*p* > 0.05), and steadily increased throughout the session (\(\eta _p^2\) = 0.015 for trials 40 to 80, p_BIC(H₁|D) < 0.05, whereas \(\eta _p^2\) = 0.221 for the last 40 trials, p_BIC(H₁|D) > 0.999); differences in contrast thresholds between groups emerged as early as 200 to 300 trials of training in the first session (*p* values < 0.05, uncorrected for multiple tests; see Appendix A, 3 for a more detailed analysis at each noise level). Additionally, the contrast thresholds of a subset of observers in the *All* group deteriorated in the last few sessions, which we believe may reflect a loss of motivation in this more challenging roving condition.

*All* < *Near* < *Far* ≈ *Single*, least to most. These different learning rates predicted by the model are induced solely through the differential training in the different roving conditions.

Each stimulus in the *All* condition that maps to a “clockwise” response has two stimuli adjacent in orientation (rotation) space that map to the competing “counter-clockwise” response, leading to substantial interference; each stimulus in the *Near* condition has a competitive response mapping with a stimulus on one side; the stimuli in the *Far* condition are widely separated in orientation space and so have no nearby stimuli with competing responses; and in the *Single* condition the same stimuli and responses are trained in all locations. Although the focus here is on the weights from the location-invariant representations to decision, because the experiment trains different task variants in different locations, task roving in a single location would lead to interference in the weights from both the location-invariant and the location-specific levels of representation. These intuitions were validated by computing (nearly) optimal weights for the four training groups by simulating a very large number of training trials in zero noise. Despite different speeds of weight development during the early and middle stages of training, with corresponding predicted differences in contrast thresholds, the four conditions nearly converge after very extensive training in zero external noise (this would require an amount of training far beyond the thousands of trials completed by each observer in the behavioral experiment).

The model's parameters include the internal noise σ₁, additive noise σ₂, decision noise σ_d, scaling factor *a*, the weight on feedback *w_f*, and the model learning rate η. Spatial-frequency and orientation bandwidths of the sensory representations were selected a priori based on the physiology, and some nonlinearity parameters were set from prior model applications, with the orientation and spatial-frequency bandwidths of the location-invariant representations slightly broader than those of the location-specific representations. The location-invariant internal noises were set at twice the location-specific internal noises, based on prior applications of the model. With the model constrained by physiology and prior applications, there were 6 parameters (of 20 total) free to vary to optimize the fit to the 64 data points (average contrast thresholds in 8 sessions × 4 groups × 2 external noise levels). Table 1 shows the best-fit parameter values.

The model correctly predicts the order of learning rates for the *All*, *Near*, *Far*, and *Single* groups. The differences in zero and high external noise thresholds also emerge naturally from the same parameter values. These best-fit model predictions provided a good quantitative fit to the behavioral contrast threshold data (Kendall's τ = 0.870, *p* < 0.000001; *r*² = 0.919, *p* < 0.000001). We also examined a more complicated model that allowed small differences in internal noise parameters between groups (i.e., small differences between the groups of observers), which slightly but nonsignificantly improved the fit to the data (Kendall's τ = 0.883; *r*² = 0.938) (see Appendix B, 1 and Figure B.1 for the graph of the fit, and Table B.1 for details and significance tests); but even allowing slight level differences between the randomly assigned groups of observers, it is the intrinsic differences in learning experiences that control the learning rates. In sum, the learning-rate differences between the training groups emerge organically from the model based on the intermixture of trial experiences, and they are qualitatively and quantitatively consistent with the behavioral data in the experiment.

The weights on units tuned to the relevant orientations at the near-matched spatial frequency (*sf* = 1.4 cpd, cycles per degree) increased or decreased during training, as appropriate, resulting in an increased range of weights (Figure 4b). The weights on units tuned to relevant orientations in other spatial-frequency channels (e.g., *sf* = 2.8 cpd) were relatively unchanged. In the *All* condition, changes in the weights connecting the location-invariant representations to decision “fight” each other from one trial to the next, because adjacent orientation stimuli require opposite correct responses in other locations. This, in turn, forces learning into the location-specific representation weights in this condition. In the *Near* condition, fewer weights on location-invariant representations conflict, and only for the most similar stimuli, whereas in the *Far* condition the stimuli are sufficiently dissimilar that the weights for the two tasks can be nearly independent. In the *Single* condition, the weights on the most relevant location-invariant representation units show the largest increases, due to the consistent training in all locations. The full sets of weight changes in the best-fitting IRT model are shown in Appendix B, 2, and in Figure B.2. In short, training intermixed tasks with similar stimuli that require different responses sets the conditions for catastrophic interference, a common property of neural network learning models. Furthermore, in this experiment the interference occurs specifically in the weights on the higher-level location-invariant representations.

Improvement begins earliest in the *Single* and *Far* conditions, especially in zero noise. The *Near* condition starts to improve next, somewhere during the second session and clearly by the third session. The *All* condition is even more delayed in showing improvements. (See Appendix C for a detailed description of the analysis and results, and contrast error bands for all trials.)

(*p* < 0.00001; proportion of variance accounted for *r*² = 0.929; *p* < 0.00001). Note that these model predictions used the parameter values estimated from the fits to the session contrast thresholds (i.e., they involved no additional optimization to fit the trial-by-trial data), except for an added lapse rate (i.e., a rate of guessing trials) that increased within each session (from 0.0 to 0.2) to capture within-session deterioration. The same IRT model without the within-session lapse rate provided a reasonable, but somewhat worse, fit to the data (*F*(1, 7658) = 2372.9; *p* < 0.0001; rank-order correlation Kendall's τ = 0.835; *p* < 0.00001; *r*² = 0.907; *p* < 0.00001). Although these models differed significantly (given the very large number of data points and thus degrees of freedom), our choice of the model with an increasing lapse rate was also based on visible systematic errors of prediction without it, especially at the end of later sessions. A full discussion of trial-by-trial analyses, fits of the model to trial-by-trial data, increasing internal noise as an alternative to the lapse rate, and a discussion of adaptive methods all appear in Appendix C.

*Single* to *All*).

*Vision Research*, 49, 2591–2598.

*Nature*, 415, 790–793.

*Perception & Psychophysics*, 67, 691–698.

*Vision Research*, 27, 953–965.

*Neuroscience*, 165, 436–444.

*Vision Research*, 46, 4071–4074.

*Behavior Research Methods*, doi:10.3758/s13428-019-01301-6

*Journal of Vision*, 14, 1.

*Nature Neuroscience*, 4, 519–525.

*Nature Human Behaviour*, 1, Article 0035.

*Learning & Perception*, 1, 37–58.

*Vision Research*, 39, 3197–3221.

*Proceedings of the National Academy of Sciences*, 95, 13988–13993.

*Proceedings of the National Academy of Sciences*, 110, 13678–13683.

*Psychological Science*, 18, 531–539.

*Annual Review of Vision Science*, 3, 343–363.

*Cognitive Science*, 11, 23–63.

*Psychonomic Bulletin & Review*, 7, 185–207.

*Visual Neuroscience*, 9, 181–197.

*Vision Research*, 61, 95–99.

*Vision Research*, 61, 25–32.

*Vision Research*, 50, 1928–1940.

*Journal of Vision*, 9(3), 1.

*Proceedings of the National Academy of Sciences*, 88, 4966–4970.

*The Annals of Mathematical Statistics*, 21, 41–59.

*Nature Neuroscience*, 8, 1497–1499.

*Nature Neuroscience*, 11, 505–513.

*Journal of Vision*, 11, 992. (Abstract)

*Journal of Vision*, 15(10):10, 10–21.

*Psychological Review*, 115, 44.

*Visual psychophysics: From laboratory to theory*. MIT Press, Cambridge, MA.

*Behavior Research Methods*, 43, 679–690.

*Psychology of Learning and Motivation*, 24, 109–165.

*Behavioral Neuroscience*, 122, 267–272.

*Nature Neuroscience*, 6, 697–698.

*Vision Research*, 46, 3223–3233.

*Psychological Science*, 21, 1058–1063.

*Journal of Vision*, 8, 5.

*Psychological Review*, 112, 715.

*Vision Research*, 46, 3177–3197.

*Journal of Neuroscience*, 26, 6589–6602.

*The Annals of Mathematical Statistics*, 22, 400–407.

*Journal of Vision*, 3, 173. (Abstract)

*Trends in Cognitive Sciences*, 9, 329–334.

*Proceedings of the National Academy of Sciences of the United States of America*, 102, 14895–14900.

*Nature*, 412, 549–553.

*Vision Research*, 51, 585–599.

*Vision Research*, 152, 17–39.

*Journal of Vision*, 15, 17.

*Vision Research*, 49, 1420–1427.

*Learning & Perception*, 1, 155–163.

*Vision Research*, 35, 2503–2522.

*Journal of Neuroscience*, 36, 2238–2246.

*Current Biology*, 18, 1922–1926.

*Journal of Neuroscience*, 24, 1617–1626.

*Journal of Vision*, 4, 4.

*PLoS Biology*, 6, e197.

*Vision Research*, 99, 93–98.

*Journal of Vision*, 19, 9.

*Network: Computation in Neural Systems*, 14, 233–247.

The dependent measure in each training condition (*All*, *Near*, *Far*, and *Single*) was the threshold contrast required to achieve 75% correct, from the adaptive staircase procedure. Learning is measured by the reduction in thresholds as a function of training. The contrast threshold at the end of each session was estimated by the average contrast over the last 30 trials of the adaptive staircase measured in each condition. (Measuring learning at the scale of large blocks or sessions is typical in the perceptual learning literature; trial-by-trial data are considered below.) The main analysis of variance on the contrast threshold data was described in the main text: 8 sessions for 12 observers in each of the four training conditions, in zero and in high external noise (i.e., training session and external noise were within-subjects factors and roving or intermixture group was a between-subjects factor), or 8 × 2 × 12 = 192 values per training condition. Contrast thresholds were, of course, higher in high external noise. Training improved (reduced) the contrast thresholds, and the rate of improvement differed among the four training conditions.

_BIC(H₁|D) in both high external noise and zero external noise were > 0.999 for training block and training (roving) group, and < 0.001 for the interaction.

*All* versus *Near*, *p* < .01), and *Far* versus *Single* (n.s.); in the zero external noise data, all condition differences were significant (*p* < .001, except *All* versus *Near*, *p* < .05; *Far* versus *Single*, *p* < .02; and *Near* versus *Far*, *p* < .08).

The four training conditions (*All*, *Near*, *Far*, and *Single* groups) differed significantly, as measured by fitted power-function learning curves. The power function equation is *C*(*t*) = λ(*t* + 1)^{−β} + α, where *C*(*t*) is the threshold in session *t*, λ + α is the initial threshold, α is the asymptotic threshold late in training, and β is the learning rate (Dosher & Lu, 2007). For these data, the initial threshold estimates performance prior to training at *t* = 0; training counts start at *t* = 1 because behavioral performance was measured near the end of the corresponding sessions. (See also Appendix C for a trial-by-trial analysis.) The four conditions were fit simultaneously by models using different numbers of free parameters, in a partial lattice of systems of equations, using the MATLAB routine *fminsearch*. The null model (1λ - 1β - 1α) assumes that the different training-mixture conditions were the same, while the fully saturated model (4λ - 4β - 4α) fits each curve independently with no common parameter values, with a variety of models in between. The high external noise and the zero external noise data were fit separately, because the descriptive parameters of threshold learning functions depend on the external noise. Nested sub-models, in which one model is a special restricted case of another, can be tested for significant differences using an *F*-test (see Methods for a description of the *F*-test for nested models).

Tables A.1 and A.2 list the *r*² values for the different nested models, and significance values, separately for the high external noise and the zero external noise data. Power-function learning curves with different learning *rates* for each training (roving) condition, but equivalent initial and final asymptotic performance levels (1λ - 4β - 1α), provided a significantly better fit than a single learning curve (1λ - 1β - 1α) (*F*(3, 41) = 138.789, *p* ≪ .00001 in high external noise; *F*(3, 41) = 58.370, *p* ≪ .00001 in zero external noise; see the tables for *r*² values)—demonstrating that the four training groups differed significantly in learning rate. Although similar, the *Far* and *Single* conditions differed slightly: the 1λ - 4β - 1α model was better than a 1λ - 3β - 1α model in zero noise, *F*(1, 41) = 16.183, *p* < .001, but the difference was not significant in high noise, *F*(1, 41) = 0.273, *p* > 0.1.

The differences in *r*² with and without the added λ parameters are relatively small and may reflect overfitting unconstrained by data at time *t* = 0; moreover, some of the less-constrained parameter estimates are implausible, again suggesting overfitting.

β_Far and β_Single in high external noise.

_BIC(H₁|D) increased from 0.0011 to 1.000; for high noise, \(\eta _p^2\) increased from 0.0029 to 0.241, and p_BIC(H₁|D) increased from 0.0005 to 1.000; both p_BIC(H₁|D) > 0.99 from about the first 50 trials in each location, or 200 trials total). This equivalence was tested with the session-level data using the power-function fits (above), which fit the data very well with models assuming equal initial performance before learning, although differences were already seen by the end of the first session at the first data point. We carried out collateral tests of the equivalence of the four groups at the beginning of training—the equivalence of the initial state before learning—from the trial-by-trial data (see Appendix C).

The best-fitting model provided a good quantitative account of the data (*r*² = 0.919). The parameters for this simulation are listed in Table 1, and the model fits and summaries of the weight changes are shown in Figure 4 in the main text, with 6 of 20 parameters free to vary to fit the 64 data points: normalization constant *k*, scaling factor *a*, location-specific internal noise σ₁ and location-invariant internal noise σ₂, decision noise σ_d, and learning rate η. The location-invariant internal noises were set at twice the location-specific internal noises, and the other parameters were set from prior applications of the model.

The more complicated model variant fit the data slightly better (*r*² = 0.938). The parameters for this simulated model variant are listed in Table B.1, and the corresponding fit to the data is shown in Figure B.1. This model has 15 varying parameters (4 scaling factors *a*, 4 σ₁'s, 4 σ₂'s, *k*, σ_d, and η) of 29 total parameters. Although this model seems to fit the data somewhat better by eye, the quality of fit was not significantly better (*r*² = 0.938 versus *r*² = 0.919) after accounting for the increase in the number of free parameters (*F*(9, 34) = 1.158, *p* = 0.352). Even assuming that there were slight differences between the randomly assigned groups of observers, the added parameters primarily adjust the initial level of performance, and it is still the intrinsic differences in learning experiences that largely account for the different rates of perceptual learning.

Figure B.2 shows the *initial* and *final* weights (blue and red, respectively) for units tuned to all the different orientations at a near-matched spatial frequency (*sf* = 1.4 cpd, or cycles per degree) and at a less relevant spatial frequency (*sf* = 2.8 cpd), for the location-specific and location-invariant representations, for the fit listed in the main text. Examination of these patterns shows that learning produces larger positive weights on orientation units best tuned to the “clockwise” stimulus orientations, and larger negative weights on orientation units best tuned to the “counter-clockwise” stimulus orientations. The magnitudes of weight change (learning) from the location-specific representations to decision are about the same in the four training conditions. In contrast, the magnitudes of weight change from the location-invariant representations to decision differ markedly across the four training conditions; these weights make the most substantial contribution to predicting the condition differences.

*All* < *Near* < *Far* ≈ *Single*.

1st, 2nd, 3rd, … trials within each adaptive staircase, averaged over the separate staircases in the four retinal locations and over observers, separately for the four groups in the two external noise conditions. Vertical dashed lines mark the boundaries between sessions, which were carried out on different days. Horizontal dashed lines in the proportion-correct graphs are ±1.6σ (binomial) from the target accuracy of 75% correct of the adaptive algorithm. Error bars are shown in Figure 5 only for mid-session values, for visibility. Figure C.1 shows bootstrapped error regions for every trial of the trial-by-trial average contrasts.
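The ±1.6σ binomial band around the tracked accuracy follows directly from the binomial standard error; the number of trials averaged per plotted point is an assumption here:

```python
import math

def binomial_band(p=0.75, n=48, z=1.6):
    """z-sigma binomial band around a proportion p estimated from n trials."""
    sigma = math.sqrt(p * (1.0 - p) / n)
    return p - z * sigma, p + z * sigma
```

For example, with a hypothetical n = 48 trials per point, the band is 0.75 ± 0.10.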

The *accelerated stochastic approximation* algorithm (Kesten, 1958) tracks a target accuracy ϕ, here 75% correct (see the Methods for details). Like most other adaptive procedures, such as up/down staircases (Lu & Dosher, 2013), this algorithm either reduces the size of the up and down steps (here, in contrast) over trials within a run or session, using big steps early for range finding and smaller steps later to converge with more precision, or alternatively focuses test conditions in regions determined early in the sequence.

*c*₀ (above, below, and approximately at the “true” threshold). Figure C.2 shows the predicted patterns in the average stimulus contrast as a function of trial within a session that can emerge with the accelerated approximation algorithm for (a) no change in threshold and a flat lapse rate (stationary performance); (b) decreasing threshold, flat lapse rate; (c) no change in threshold, increasing lapse rate; and (d) decreasing threshold and increasing lapse rate. Two of these patterns are typical of those actually seen in the human behavioral data in earlier (d) and later (c) sessions, respectively.
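One way to see why an increasing lapse rate drives the tracked contrast upward (pattern c): on a lapse trial the observer guesses at 50% correct, so the staircase must find a contrast at which the lapse-free psychometric function runs above the nominal target. A small sketch (our framing, not from the appendix):

```python
def required_base_accuracy(target=0.75, lapse=0.0):
    """Accuracy the lapse-free psychometric function must reach so that
    (1 - lapse) * p + lapse * 0.5 equals the staircase target."""
    return (target - 0.5 * lapse) / (1.0 - lapse)
```

With a lapse rate of 0.2, the staircase tracks the contrast giving 81.25% correct on non-lapse trials, a higher contrast than with no lapses even when the underlying threshold is unchanged.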

The IRT model without the within-session lapse rate fit somewhat worse (*r*² = 0.907, Kendall's τ = 0.835). There was a deviation between the model and the data due to slower-than-expected learning in the *All* condition in the first and second sessions, which is not predicted by the model with or without lapse rates. We examined models that incorporated different lapse rates for the different task-mixture conditions (e.g., larger lapse rates for the *All* condition and smaller for the *Single* condition). While these models visually improved the fits slightly, they did not improve the *r*² enough to outweigh the increase in parameters in *F*-tests.