**Human brains are finite, and thus have bounded capacity. An efficient strategy for a capacity-limited agent is to continuously adapt by dynamically reallocating capacity in a task-dependent manner. Here we study this strategy in the context of visual working memory (VWM). People use their VWM stores to remember visual information over seconds or minutes. However, their memory performances are often error-prone, presumably due to VWM capacity limits. We hypothesize that people attempt to be flexible and robust by strategically reallocating their limited VWM capacity based on two factors: (a) the statistical regularities (e.g., stimulus feature means and variances) of the to-be-remembered items, and (b) the requirements of the task that they are attempting to perform. The latter specifies, for example, which types of errors are costly versus irrelevant for task performance. These hypotheses are formalized within a normative computational modeling framework based on rate-distortion theory, an extension of conventional Bayesian approaches that uses information theory to study rate-limited (or capacity-limited) processes. Using images of plants that are naturalistic and precisely controlled, we carried out two sets of experiments. Experiment 1 found that when a stimulus dimension (the widths of plants' leaves) was assigned a distribution, subjects adapted their VWM performances based on this distribution. Experiment 2 found that when one stimulus dimension (e.g., leaf width) was relevant for distinguishing plant categories but another dimension (leaf angle) was irrelevant, subjects' responses in a memory task became relatively more sensitive to the relevant stimulus dimension. Together, these results illustrate the task-dependent robustness of VWM, thereby highlighting the dependence of memory on learning.**

An ideal memory system would be *efficient*, in the formal sense of making optimal use of its available capacity. This would require a system that is adapted to the statistical regularities of the to-be-remembered items, and also to the importance of storing different dimensions more or less accurately. In other words, it would require a system that has successfully addressed the “bit allocation” problem.


*statistical learning* in human visual perception (Fiser & Aslin, 2001, 2002a, 2002b; Orbán, Fiser, Aslin, & Lengyel, 2008). For example, Orbán et al. (2008) used scenes containing novel shapes arranged in a grid, where the shapes were drawn from a finite set. The arrangements of shapes contained “chunks,” where a chunk was a group of shapes that often appeared together in a particular spatial configuration. The authors demonstrated that subjects implicitly learned these chunks, and that their learning was best accounted for by a hierarchical Bayesian model that inferred the most likely chunks given the arrangements of objects observed. We propose that, like visual perception, VWM may be similarly adaptive, if it relies on general statistical learning mechanisms. Thus VWM could quickly tune itself based on the statistical properties of the visual environment.

A channel takes as input a signal *x* drawn from some probability distribution *p*(*x*), and produces a possibly different output signal. Optimal use of the channel requires knowledge of the input distribution *p*(*x*). In conventional Bayesian models, knowledge of the sensory distribution *p*(*x*) is referred to as the observer's “prior” distribution, and this distribution sets the background knowledge or “probabilistic context” for processing sensory signals. In rate-distortion theory, knowledge about *p*(*x*) plays the same role. When applied to the study of VWM, rate-distortion theory predicts that memory performances for a given visual item will differ across contexts, such as when the item is sampled from a uniform distribution versus a normal distribution across possible items (assuming the memory system has been given time to learn a particular distribution). The idea that neural systems adapt to their afferent signal statistics has received extensive empirical support in sensory neuroscience, where it is known as the efficient coding hypothesis (Barlow, 1961).

*channel capacity* to achieve the desired level of performance. When channel capacity is lower than the statistical complexity of the information source (measured by its information entropy; Cover & Thomas, 1991; MacKay, 2003), errors must occur in the course of information transmission. While capacity limits are a major topic of study in VWM (for a review, see Ma, Husain, & Bays, 2014), information theory contributes a principled measure of capacity for such limited perceptual systems.
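The relationship between source complexity and entropy is easy to make concrete. Below is a minimal Python sketch (the 100-value stimulus grid and the Gaussian parameters are illustrative assumptions, not the experiment's design): a narrower source distribution has lower entropy, so at a fixed channel capacity it can be transmitted with less error.

```python
import math

def entropy_bits(p):
    """Shannon entropy (in bits) of a discrete probability distribution."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

n = 100
uniform = [1.0 / n] * n  # uniform source over 100 stimulus values

# Discretized Gaussian source (mean 50, sd 10): mass concentrates near the mean
weights = [math.exp(-0.5 * ((i - 50) / 10) ** 2) for i in range(n)]
total = sum(weights)
gaussian = [w / total for w in weights]

print(entropy_bits(uniform))   # log2(100) ≈ 6.64 bits
print(entropy_bits(gaussian))  # lower: the narrower source is simpler to encode
```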

*cost function*. While limits in capacity necessitate errors in information processing, an optimal system should seek to minimize the cost of perceptual error. The cost function ℒ(*x*, *y*) specifies the cost of remembering a signal *x* as a possibly different signal *y*.^{2} For an optimally efficient channel, the cost function is defined by the behavioral task that the organism seeks to perform. Note that the organism's implicit task may not necessarily agree perfectly with the experimenter-defined task.
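The constrained optimization that defines the optimal channel for a given source distribution, cost function, and capacity can be solved numerically with algorithms in the Blahut–Arimoto family (Blahut, 1972); the authors' own implementation used the RateDistortion package for R (see the footnotes). The sketch below is an illustration of that family of algorithms in Python, not the paper's code; the stimulus grid, the squared-error cost, and the trade-off parameter `beta` (swept to trace out the rate-distortion curve) are assumptions:

```python
import numpy as np

def blahut_arimoto(p_x, cost, beta, n_iter=200):
    """Compute an optimal channel q(y|x) trading off information rate
    against expected cost. Larger beta favors lower cost at a higher rate."""
    n = len(p_x)
    q_y = np.full(n, 1.0 / n)                    # output marginal
    for _ in range(n_iter):
        # Optimal channel given the current output marginal
        q_y_given_x = q_y[None, :] * np.exp(-beta * cost)
        q_y_given_x /= q_y_given_x.sum(axis=1, keepdims=True)
        q_y = p_x @ q_y_given_x                  # induced output marginal
    # Mutual information (bits per item) and expected cost of the channel
    joint = p_x[:, None] * q_y_given_x
    rate = np.sum(joint * np.log2(q_y_given_x / q_y[None, :] + 1e-300))
    distortion = np.sum(joint * cost)
    return q_y_given_x, rate, distortion

x = np.arange(50, dtype=float)
p_x = np.full(50, 1.0 / 50)                      # uniform source
cost = (x[:, None] - x[None, :]) ** 2            # squared-error cost
channel, rate, dist = blahut_arimoto(p_x, cost, beta=0.01)
print(rate, dist)                                # one point on the rate-distortion curve
```

Sweeping `beta` from small to large traces the full trade-off between information rate and expected distortion for this source.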

Figure 3 illustrates these ideas for a memory channel with input *x* and output *y*, given a stimulus distribution *p*(*x*). In Figure 3a, a stimulus *x*_{0} (indicated by the vertical line) is sampled from a probability distribution *p*(*x*), shown in blue. The stimulus *x*_{0} is stored in a capacity-limited VWM. Rate-distortion theory predicts that the precision of the distribution of VWM representations depends on the stimulus distribution *p*(*x*), the capacity of the channel 𝒞, and the cost function that is minimized, ℒ(*x*, *y*). In Experiment 1, the stimulus distribution is modeled as a normal distribution with parameters *μ* and *σ*, and we infer *μ* and *σ* from subjects' responses. Since the channel capacity of VWM is not known in advance, it is estimated from the experimental data as a free parameter in the model. In Experiment 1, the cost function, which specifies the cost of each possible memory error, was fixed in advance rather than estimated from the data. Given a stimulus *x*, the model generates predictions for the distribution of possible memory errors. The probability of a change on each trial is then computed via Bayes' theorem, with prior probability of change *p*(“change”) = *p*_{change}. To allow for noise in responses, we assume that subjects exhibit probability matching (Vulkan, 2000). That is, the probability that a subject responds “different” (or “change”) on a given trial is equal to the computed probability that a change has occurred (see Keshvari, van den Berg, & Ma, 2013, for a similar application of this idea in modeling VWM). A schematic of the VWM model and the decision rule is shown in Figure 5. A derivation of the Bayesian decision rule is provided in Appendix B.
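As a rough sketch of this decision rule (not the authors' implementation, which was written in R): assume Gaussian encoding noise with standard deviation `sigma_e` and an illustrative Gaussian probe distribution (`probe_sd` is a hypothetical parameter). The posterior probability of a change follows from Bayes' theorem, and probability matching turns that posterior directly into a response probability.

```python
import math
import random

def p_change_given_xy(x, y, p_change, sigma_e, probe_sd=15.0):
    """Posterior probability of a change, given studied stimulus x and probe y.

    Memory is modeled as x plus Gaussian encoding noise (sd sigma_e).
    On no-change trials the probe equals x; on change trials the probe is
    drawn from a Gaussian around x with sd probe_sd (illustrative choice).
    """
    def normal_pdf(v, mu, sd):
        return math.exp(-0.5 * ((v - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))
    # Likelihood of the probe under each hypothesis; the memory uncertainty
    # is folded into the comparison (variances add on change trials).
    like_same = normal_pdf(y, x, sigma_e)
    like_change = normal_pdf(y, x, math.sqrt(sigma_e**2 + probe_sd**2))
    return (p_change * like_change /
            (p_change * like_change + (1 - p_change) * like_same))

def respond(x, y, p_change=0.5, sigma_e=4.46):
    """Probability matching: respond 'change' with probability equal to the posterior."""
    return "change" if random.random() < p_change_given_xy(x, y, p_change, sigma_e) else "same"

# A large change is almost always flagged; an identical probe usually is not.
print(p_change_given_xy(50, 80, 0.5, 4.46))  # close to 1
print(p_change_given_xy(50, 50, 0.5, 4.46))  # well below 0.5
```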

The model's free parameters are the channel capacity, *p*_{change}, *μ*, and *σ*. We fit the model to the aggregated data from all subjects in each condition (220 trials × 81 subjects = 17,820 trials). However, we constrained the model by fixing *p*_{change} and capacity across conditions, as we did not expect these two parameters to be condition-dependent (see below). Model parameters were estimated by means of maximum likelihood estimation.


*p* < 0.01 for each pairwise comparison between performance in the uniform condition and performance in a normal condition, with hypothesis testing conducted via bootstrapping). This outcome is broadly consistent with the predictions of rate-distortion theory, and does not support conventional Bayesian approaches, which predict that memory performance should be largely insensitive to changes in the widths of stimulus distributions (see the discussion above, Figure 2, and Appendix A). A potential problem with the left graph of Figure 6 is that it suggests that performance differences between conditions existed at the start of the experiment. This artifact arises because we smoothed the data across trials to prepare the graph. The graph on the right shows the raw (i.e., non-smoothed) data for the first 50 trials. It indicates that performance differences did not exist at the start of the experiment; rather, they arose through learning relatively early during training.
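A bootstrap test of this kind can be sketched as follows (a generic two-sample resampling test on 0/1 trial outcomes under the null of no difference; the toy data and the pooled-resampling scheme are illustrative, not the paper's exact procedure):

```python
import random

def bootstrap_p_value(correct_a, correct_b, n_boot=2000, seed=1):
    """Approximate two-sided p value for a difference in proportion correct.

    correct_a, correct_b: lists of 0/1 trial outcomes from two conditions.
    Resamples from the pooled outcomes to simulate the null distribution.
    """
    rng = random.Random(seed)
    observed = abs(sum(correct_a) / len(correct_a) - sum(correct_b) / len(correct_b))
    pooled = correct_a + correct_b
    n_a = len(correct_a)
    extreme = 0
    for _ in range(n_boot):
        resample = [rng.choice(pooled) for _ in range(len(pooled))]
        diff = abs(sum(resample[:n_a]) / n_a -
                   sum(resample[n_a:]) / (len(pooled) - n_a))
        if diff >= observed:
            extreme += 1
    return extreme / n_boot

# Toy data: 80% vs. 65% correct over 200 trials each yields a small p value
a = [1] * 160 + [0] * 40
b = [1] * 130 + [0] * 70
print(bootstrap_p_value(a, b))
```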

Two parameters were shared across conditions (capacity and *p*_{change}), while the remaining parameters were allowed to vary by condition (*μ* and *σ*). We report values using only subjects' last 100 trials. (We found that a trial window of about 100 trials was necessary for stable estimates.)

The mean *μ* and standard deviation *σ* estimates are sensible, and follow the expected qualitative trend. In the uniform condition, *σ* was higher than in all others, indicating that subjects in this condition learned a higher-variance stimulus distribution. In the normal conditions with means of 30, 50, and 75 (standard deviations set to 10 in all cases), *μ* was about 30, 50, and 60, respectively, while *σ* was always about 20. We find these outcomes to be impressive. With relatively little exposure, subjects in each condition learned roughly the correct stimulus distribution and used this information when making VWM judgments. Taken as a whole, these results support our rate-distortion theory framework because they strongly suggest that subjects adapted their VWMs based on the stimulus distributions in order to improve memory performances.

*not* adaptation to statistical context, performance in the subsampled trials should be identical to performance in the corresponding normal conditions.

*p* < 1 × 10^{−10}). These results are general in the sense that they held for 50 replications, where each replication used different subsampled subsets.

*all* feature values of observed items. Unfortunately, people are not always able to do so. Because people's perceptual and memory systems have information-processing limits, they cannot simultaneously perceive all features of all objects in an environment and, even if they could, they could not remember and process all this information.

*n* = 50), the widths of a plant's leaves determined the plant's category membership (leaf angle was an irrelevant feature dimension). For instance, plants with narrow leaves may have been members of taxiforma alpha, whereas plants with wide leaves were members of taxiforma beta. For the remaining subjects (*n* = 51), the angles of a plant's leaves determined the plant's category membership (leaf width was an irrelevant dimension). On each of 64 categorization trials, a plant, referred to as the target, was randomly selected (with the constraint that the target could not lie on the category boundary). The subject viewed a fixation cross (displayed for 500 ms) and then an image of the target (displayed for 1000 ms). Then the subject judged the target's category. A feedback message indicated whether the subject's response was correct.

*fixed effects*) were modeled using an intercept and the independent variables delta-leaf-width, delta-leaf-angle, and condition, as well as interactions between delta-leaf-width and condition and between delta-leaf-angle and condition. The intercept and the coefficients on delta-leaf-width and delta-leaf-angle were allowed to vary by subject (*random effects*). Model comparison indicated that the interaction terms significantly improved the fit, χ^{2}(2) = 9.46, *p* = 0.009, and χ^{2}(3) = 9.8242, *p* = 0.02. The overall performances of the two groups did not differ significantly (*p* = 0.55). Consistent with our hypothesis, these results indicate that subjects allocated more VWM capacity toward remembering the feature values of objects when those features were relevant for determining objects' category membership. In other words, subjects dynamically reallocated their limited VWM capacity based on the categorical structure of the to-be-remembered items.

*p*[*x*] = 1/*N* for all stimuli *x*). Furthermore, we assumed that subjects adopted a perfect model of the stimulus statistics; hence the prior used by the model matched the experiment's uniform stimulus distribution. For each stimulus *x*, the cost function specifies the cost of each possible memory error, that is, of remembering *x* as a different stimulus. The model's free parameters are the channel capacity, *p*_{change}, and the 20 parameters characterizing the cost function. We fit the model to the aggregated data from all subjects in each condition (a minimum of 2,200 trials). Model parameters were inferred using maximum likelihood estimation.

The estimated *p*_{change} was also nearly identical between conditions: 0.21 in the leaf-angle relevant condition, and 0.20 in the leaf-width relevant condition. Hence, according to the model, performance in the two conditions differed only in terms of the implicit cost function for memory error.

*x* and *y*.^{3}

Detection performance is plotted as a function of the change in leaf angle (*x*-axis) and leaf width (*y*-axis). Intuitively, large changes are very likely to be detected. More interesting is that the “just noticeable difference” in stimulus space differs between the two conditions (compare left and right panels). As hypothesized, smaller changes to leaf angle are more noticeable when subjects are trained to categorize plants based on their leaf angle. This adaptation is accompanied by a decrease in sensitivity to changes in leaf width when compared to the data from the leaf-width relevant condition.

We computed subjects' sensitivity (*d*′) to each level of stimulus change. As predicted by rate-distortion theory, increases in sensitivity to changes in leaf angle were accompanied by decreases in sensitivity to changes in leaf width.
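Sensitivity (*d*′) for a change-detection design is computed from hit and false-alarm rates in the standard way, *d*′ = *z*(H) − *z*(FA) (see Macmillan & Creelman's *Detection theory*). A minimal Python sketch with hypothetical rates (the specific values are illustrative, not the experiment's data):

```python
from statistics import NormalDist

def d_prime(hit_rate, false_alarm_rate):
    """Sensitivity index from signal detection theory: z(H) - z(FA).

    Assumes rates strictly between 0 and 1; in practice, rates of 0 or 1
    are first adjusted (e.g., with a log-linear correction).
    """
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(false_alarm_rate)

# Hypothetical example: same false-alarm rate, different hit rates for
# changes along a trained versus an untrained stimulus dimension.
print(d_prime(0.85, 0.20))  # ≈ 1.88
print(d_prime(0.65, 0.20))  # ≈ 1.23
```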

*r*:1 of that distance, where the parameter *r* is estimated by the model. In addition, the channel capacity and the prior probability of change, *p*_{change}, were estimated from the data. The log of the aspect ratio parameter was compared across conditions using a *t* test. A log transform was applied because ratios are linear on a logarithmic scale (i.e., following a log transformation, the difference between a ratio of 1/2:1 and 1:1 equals the difference between a ratio of 2:1 and 1:1). The results of this comparison indicate that the aspect ratio was significantly larger in the leaf-width relevant condition, *t*(99) = 2.15, *p* = 0.034. In other words, on a subject-by-subject basis, leaf widths were psychologically more distinct in the condition where subjects were trained to categorize based on leaf width. Estimates of subjects' capacity and *p*_{change} were highly similar to estimates based on the aggregated data, and did not differ between conditions (mean capacity = 4.02 and 4.11 bits in the leaf-angle and leaf-width relevant conditions, respectively; mean *p*_{change} = 0.24 and 0.22).
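The rationale for the log transform can be verified directly: on a log scale, a halving and a doubling of the aspect ratio are equidistant from 1:1, whereas on the raw scale they are not.

```python
import math

# Log scale: log(2) and log(1/2) are symmetric about log(1) = 0,
# so a t test on log(r) treats halvings and doublings symmetrically.
assert math.isclose(math.log(2) - math.log(1), -(math.log(0.5) - math.log(1)))

# Raw scale: the same two ratios are not symmetric about 1.
print(2 - 1, 1 - 0.5)  # 1 0.5
```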


*Neuroscience*, 139 (1), 201–208.

*Journal of Experimental Psychology: General*, 144 (4), 744.

*Sensory communication*(pp. 217–234). Cambridge, MA: MIT Press.

*Journal of Statistical Software*, 67, 1–48.

*Journal of Neuroscience*, 34, 3632–3645.

*Journal of Vision*, 11 (10): 6, 1–15, https://doi.org/10.1167/11.10.6.

*Science*, 321, 851–854.

*Neuropsychologia*, 49, 1622–1631.

*IEEE Transactions on Information Theory*, 18 (4), 460–473.

*Psychological Science*, 22 (3), 384–392.

*Journal of Experimental Psychology: General*, 138 (4), 487–502.

*Journal of Vision*, 11 (5): 4, 1–34, https://doi.org/10.1167/11.5.4.

*Proceedings of the National Academy of Sciences, USA*, 113, 7459–7464.

*Psychological Review*, 120 (1), 85.

*Trends in Cognitive Sciences*, 10 (4), 159–166.

*Neuropsychologia*, 49 (6), 1407–1409.

*Psychological Science*, 28 (1), 12–22.

*Elements of information theory*. New York, NY: Wiley.

*Topics in Cognitive Science*, 2 (2), 189–201.

*Journal of Experimental Psychology: General*, 143, 548–565.

*Extending the linear model with R*. Boca Raton, FL: CRC Press.

*Psychological Science*, 12 (6), 499–504.

*Journal of Experimental Psychology: Learning, Memory, and Cognition*, 28 (3), 458.

*Proceedings of the National Academy of Sciences*, 99 (24), 15822–15826.

*Journal of Vision*, 10 (12): 27, 1–11, https://doi.org/10.1167/10.12.27.

*Science*, 349 (6245), 273–278.

*Vector quantization and signal compression*. Norwell, MA: Kluwer Academic Publishers.

*Journal of Experimental Psychology: General*, 123 (2), 178.

*Journal of Neuroscience*, 31, 8502–8511.

*Behavior Research Methods*, 48, 829–842.

*Memory & Cognition*, 39 (3), 412–432.

*Journal of Experimental Psychology: General*, 129 (2), 220.

*PLoS Computational Biology*, 9 (2), e1002927.

*Psychonomic Bulletin & Review*, 20 (2), 228–242.

*Perception as Bayesian inference*. Cambridge, UK: Cambridge University Press.

*Modeling psychophysical data in R*. New York, NY: Springer.

*Cognition*, 115 (1), 147–153.

*Nature Neuroscience*, 17 (3), 347–356.

*Information theory, inference, and learning algorithms*. Cambridge, UK: Cambridge University Press.

*Detection theory: A user's guide*. New York, NY: Psychology Press.

*Generalized linear models*. London: Chapman & Hall.

*PLoS One*, 6 (12), e29296.

*Annual Review of Psychology*, 32, 89–115.

*Psychological Review*, 63 (2), 81–97.

*Journal of Vision*, 12 (11): 26, 1–17, https://doi.org/10.1167/12.11.26.

*Journal of Experimental Psychology: Learning, Memory, and Cognition*, 13 (1), 87–108.

*Proceedings of the National Academy of Sciences*, 105 (7), 2745–2750.

*Psychological Review*, 120 (2), 297.

*Attention, Perception, and Psychophysics*, 76, 2158–2170.

*Annual Review of Psychology*, 52 (1), 629–651.

*Cognitive Psychology*, 88, 1–21.

*Formal approaches in categorization*. Cambridge, UK: Cambridge University Press.

*R: A language and environment for statistical computing*[Computer software manual]. Vienna, Austria. Retrieved from https://www.R-project.org

*Nature Neuroscience*, 8, 1647–1650.

*Attention, Perception, & Psychophysics*, 72 (4), 1097–1109.

*Experimental Brain Research*, 126 (3), 289–306.

*The mathematical theory of communication*. Urbana, IL: University of Illinois Press.

*Science*, 237, 1317–1323.

*Journal of Vision*, 17 (12): 9, 1–19, https://doi.org/10.1167/17.12.9.

*Psychological Science*, 3 (3), 150–161.

*Journal of Vision*, 15 (3): 2, 1–27, https://doi.org/10.1167/15.3.2.

*Cognition*, 152, 181–198.

*Psychological Review*, 119 (4), 807–830.

*Categories and concepts*. Cambridge, MA: Harvard University Press.

*Journal of Vision*, 16 (3): 32, 1–12, https://doi.org/10.1167/16.3.32.

*Nature Neuroscience*, 5 (11), 1226–1235.

*Vision Research*, 44 (6), 541–556.

*Journal of Economic Surveys*, 14, 101–118.

*Journal of Experimental Psychology: General*, 131 (1), 48–64.

^{1}Here, “bit allocation” specifically refers to changes in the pattern of memory errors resulting from adapting VWM to the current task. How someone decides to “allocate” their capacity is synonymous with how they decide to distribute their errors, given what they have observed and given limits on their ability to store visual information with perfect fidelity. These decisions are crucial for maximizing performance. A mathematical description of this process is given in the following sections; in particular, the error distribution for memory is given by the channel distribution.

^{2}The rate-distortion model was implemented using the “RateDistortion” library (Sims, 2016) of the R statistical computing environment. This library contains algorithms for efficiently solving the constrained optimization problem described by Equation 1. A tutorial introduction to this library and its use in modeling human perception is provided in Sims (2016). Model parameters were obtained by maximum likelihood estimation, using L-BFGS or Nelder-Mead optimization implemented within R (via “optim”). Complete code for the model is available from the third author's website.

^{3}A reader may note that, in each condition, subjects remembered aspects of the task-irrelevant stimulus dimension, contrary to the predictions of rate-distortion theory. It should be kept in mind that subjects received only a small amount of training (approximately 20–30 min). With additional training, we would expect subjects to show additional adaptation.

^{4}It is not our intention to claim that visual attention and visual memory are one and the same. Rather, we believe that the terms “attention” and “memory” are vaguely defined in the cognitive science literature, each potentially covering many phenomena and mechanisms. Because the terms are so vague, they overlap. For instance, the results of many experiments can be accounted for by stating that subjects “allocated more attentional capacity to stimulus dimension *A* than *B*” or, equally validly, by stating that subjects “allocated more VWM capacity to stimulus dimension *A* than *B*.” Which description is used in an article often depends on the personal biases of the article's authors. Additional work (Chun, 2011) has addressed this issue by working towards an organizing taxonomic framework for differentiating between these two mechanisms.

Consider a sensory signal *x*, sampled from a Gaussian distribution with arbitrary mean and variance, such that *x* ∼ Normal(*μ*, *σ*^{2}). An abstract visual memory task is modeled in which an observer is shown a sample from this distribution and tasked with remembering it as accurately as possible. The task is nontrivial because the observer's memory (and hence response) is related only probabilistically to the sensory signal according to a conditional probability distribution. Memory error is penalized by the squared error, which serves as the *loss function* for the observer. Performance then depends on the expected distortion, *D*, and the variance of the prior over the signal distribution, *σ*^{2}.

*x*. The memory representation is related to the sensory signal according to a Gaussian channel: the representation equals *x* plus Gaussian encoding noise with standard deviation *σ*_{e}. Given a memory representation of *x*, a Bayes-optimal observer should produce an estimate of *x* which minimizes the expected loss. For the squared error loss function, this expected loss as a function of the estimate is minimized by the posterior mean of the signal given the memory representation. The corresponding distribution of responses, expressed in terms of the observed sensory signal, is a Dirac delta centered on this estimate, where *δ*(·) indicates the Dirac delta function. Lastly, the expected cost for this observer is obtained by evaluating the expression in Equation 4, with the distribution of optimal estimates given above.

A capacity-limited channel cannot exceed a fixed information rate, *R*. Subject to this bound, the goal of the observer is to minimize expected cost, defined as above in terms of the mean squared error. The optimal channel achieves the minimum information rate (*R*) necessary to achieve a specified level of performance (*D*). For a Gaussian signal distribution, this channel can be implemented by adding Gaussian noise with standard deviation

*σ*_{e} = *σ* / √(2^{2*R*} − 1).

This equation defines the “effective encoding noise” for a channel with information rate *R*. We can also solve for *R* and obtain the information rate required for the channel to reach a specified level of distortion *D*:

*R* = (1/2) log_{2}(*σ*^{2} / *D*).

The two models were matched at *σ* = 10 (the value used in Experiment 1). This was achieved by ensuring that both models have the same effective encoding noise. As can be seen in the figure, increases to the standard deviation of the stimulus distribution lead to divergent predictions between the two models. For the rate-distortion model, memory performance declines monotonically with increases to the width of the stimulus distribution. In contrast, the Bayesian observer's performance is relatively invariant to such changes.
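These closed-form expressions are easy to check numerically. A short Python sketch reproducing the worked example from the text (*σ* = 20, *R* = 2.2 bits) and illustrating why, at fixed capacity, the rate-distortion model predicts worse performance for wider stimulus distributions:

```python
import math

def distortion(rate_bits, sigma):
    """Minimum mean squared error for a Gaussian source at a given rate:
    D(R) = sigma^2 * 2^(-2R)."""
    return sigma**2 * 2 ** (-2 * rate_bits)

def effective_encoding_noise(rate_bits, sigma):
    """Encoding-noise sd of the equivalent Gaussian channel:
    sigma_e = sigma / sqrt(2^(2R) - 1)."""
    return sigma / math.sqrt(2 ** (2 * rate_bits) - 1)

# At a fixed rate, widening the stimulus distribution increases the error
for sigma in (10, 20, 40):
    print(sigma, distortion(2.2, sigma))

# The worked example from the text: sigma = 20, R = 2.2 bits
print(round(effective_encoding_noise(2.2, 20), 2))  # 4.46
```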

We fit two versions of the Bayesian model: one in which *σ*_{e} and *p*_{change} were shared across experimental conditions, and one in which only *p*_{change} was shared across conditions. The decision rule and the remaining modeling choices were the same as in the rate-distortion model. We made three predictions: (a) when *σ*_{e} is shared across conditions, the likelihood should be lower than that of the rate-distortion model; (b) when *σ*_{e} is shared across conditions, the model should predict only a small decrement in overall percent correct responses in the uniform condition compared to the Gaussian conditions; and (c) when *σ*_{e} is allowed to vary by condition, it should be highest in the uniform condition.

For example, if *σ* = 20 and *R* = 2.2 bits, then *σ*_{e} = 4.46. Also note that the value for *p*_{change} in both versions is very close to that of the rate-distortion model.

The version of the Bayesian model with a shared *σ*_{e} predicted only a small decrement in the overall proportion of correct responses in the uniform condition, whereas both the rate-distortion model and the other version of the Bayesian model better matched the subjects' proportion-correct scores. For each of these three models, we simulated responses by sampling from *p*(*C* | *x*, *y*). These results illustrate that the Bayesian model fails to account for subjects' worse performance in the uniform condition (see Table 2).

We also found the estimates of *μ* and *σ* in some conditions of the Bayesian models to be less intuitive than their counterparts in the rate-distortion model. For example, in both versions of the model, the estimate of *σ* was large in the mean-75 condition, and the estimate of *μ* was low, relative to what we found in the rate-distortion model.

*C* is a binary random variable indicating whether or not the current trial is a change trial. The variables *x* and *y* denote the studied stimulus and the probe stimulus, respectively. A probabilistic graphical model relating these variables is shown in Figure 10.

The prior probability of a change trial, *p*(*C* = 1), is treated as a parameter in the model, *p*_{change}. The distribution of the probe given the memory stimulus and change trial status is: *y* = *x* when *C* = 0, and *y* drawn from *f*(*x*), the distribution over probes given target *x*, when *C* = 1.

*f*(*x*) was set to the true (experiment-defined) probability of a probe given a target. (We found qualitatively similar results for several other choices of *f*[*x*].) The prior distribution over stimuli, *p*(*x*), was modeled as a normal distribution in Experiment 1, with parameters *μ* and *σ*, normalized over the space of possible stimulus values. In Experiment 2, the prior distribution was uniform, *p*(*x*) = 1/*N*. Lastly, the conditional distribution over memory representations,