**Visual working memory (VWM) is a highly limited storage system. A basic consequence of this fact is that visual memories cannot perfectly encode or represent the veridical structure of the world. However, in natural tasks, some memory errors might be more costly than others. This raises the intriguing possibility that the nature of memory error reflects the costs of committing different kinds of errors. Many existing theories assume that visual memories are noise-corrupted versions of afferent perceptual signals. However, this additive noise assumption oversimplifies the problem. Implicit in the behavioral phenomena of visual working memory is the concept of a loss function: a mathematical entity that describes the relative cost to the organism of making different types of memory errors. An optimally efficient memory system is one that minimizes the expected loss according to a particular loss function, while subject to a constraint on memory capacity. This paper describes a novel theoretical framework for characterizing visual working memory in terms of its implicit loss function. Using inverse decision theory, the empirical loss function is estimated from the results of a standard delayed recall visual memory experiment. These results are compared to the predicted behavior of a visual working memory system that is optimally efficient for a previously identified natural task, gaze correction following saccadic error. Finally, the approach is compared to alternative models of visual working memory, and shown to offer a superior account of the empirical data across a range of experimental datasets.**

^{1}Intuitively, increasing the number of items stored concurrently leaves less capacity available to code each item, and increasing the statistical complexity of visual features requires a greater capacity in order to maintain the same level of memory precision.

*efficient* information storage system. According to this hypothesis, visual working memory is limited in capacity yet efficient, in the sense of making the most of its limited resources (Orhan et al., 2014). This same principle has been highly productive in sensory neuroscience, where it forms the basis of the *efficient-coding hypothesis* (Barlow, 1961; Geisler, 2008; Simoncelli & Olshausen, 2001). In the present paper, I explore a corresponding efficient memory hypothesis. In particular, if visual working memory is efficient, then by definition it must be efficient according to some particular loss function. In previous work (Sims et al., 2012), the assumption was made that the brain attempts to minimize a quadratic loss function in visual working memory (minimizing the squared error between actual and remembered visual features). This assumption simplified the mathematical development of the model, but was not motivated by any theoretical consideration. Thus, the empirical loss function remains an open question.

If *x* indicates a particular visual feature (such as a spatial position or orientation), and *y* indicates the recalled value for the feature, then a loss function is a mathematical function that assigns a cost to the outcome where *x* is remembered as *y*: *ρ*(*x*, *y*) → [0, ∞). Simple choices for the loss function are linear, *ρ*(*x*, *y*) = |*y* − *x*|, or quadratic functions, *ρ*(*x*, *y*) = (*y* − *x*)^{2}. Further, the present paper will restrict its attention to symmetric difference loss functions, such that *ρ*(*x*, *y*) = *f*(*z*), where *z* = |*y* − *x*| is the absolute memory error.

^{2}Note that the mathematical framework in general is not limited by this restriction, and examining asymmetric loss functions represents an interesting avenue for further exploration.

*p*(*x*) describes the statistical distribution of visual features, and *q*(*y*|*x*) gives the conditional probability distribution for memory; that is, the probability of recalling a visual feature *x* as the value *y*. This approach closely follows a large body of previous work that has defined visual perception in the framework of Bayesian decision theory (Ma, 2012). In subsequent use, I will refer to the distribution *q*(*y*|*x*) as an information channel or *memory channel*.

*y* = *x* for all stimuli *x*). However, when the feature *x* is continuously distributed, this goal is unachievable, even in principle (this is a fundamental result of information theory; Shannon & Weaver, 1949). Mathematically, if *p*(*x*) and *q*(*y*|*x*) define an information source and information channel, then the average amount of information transmitted by this channel is given by the mutual information,

*x* after observing the channel output. When the logarithm is taken base 2, this quantity is measured in units of bits. The maximum rate of information transmission, across all possible distributions *p*(*x*), defines the *capacity* of a channel, *C*. For a fixed information source, the channel can be measured via Equation 2 to transmit at an *information rate* *R* ≤ *C*.

*q**) is given by

where *L*_{ρ}(*q*) and *I*(*q*) refer to the expected loss and mutual information associated with the channel *q*(*y*|*x*), given by Equations 1 and 2, respectively. This equation states that an optimal memory channel is one that minimizes the expected loss according to a particular loss function, subject to the constraint that the amount of memory that it can store or transmit is at or below a specified limit. This equation is also the basis for the mathematical field of rate–distortion theory (Berger, 1971), which concerns the design and analysis of optimal but lossy information channels.
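For reference, the three quantities named in this passage have standard textbook forms for discrete distributions. The block below writes them out as a sketch; it is an assumption that these match the paper's Equations 1 to 3 in notation, and the marginal *q*(*y*) is a symbol introduced here for convenience.

```latex
\begin{align}
  % Expected loss of a memory channel q(y|x) under loss function rho
  L_{\rho}(q) &= \sum_{x} \sum_{y} p(x)\, q(y \mid x)\, \rho(x, y) \\
  % Mutual information in bits, with q(y) = \sum_x p(x) q(y|x)
  I(q) &= \sum_{x} \sum_{y} p(x)\, q(y \mid x)\,
          \log_{2} \frac{q(y \mid x)}{q(y)} \\
  % Optimal channel: minimum expected loss under a capacity limit C
  q^{*} &= \operatorname*{arg\,min}_{q} \; L_{\rho}(q)
          \quad \text{subject to} \quad I(q) \le C
\end{align}
```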

*p*(*x*) is Gaussian, and the loss function is assumed to be quadratic. However, a general solution is needed if the goal is to estimate the empirical loss function, rather than assume a particular function.

*p*(*x*) and *q*(*y*|*x*) are discrete rather than continuous probability distributions. While any convex optimization algorithm will work in principle,^{3} algorithms have been constructed that are particularly efficient for this application. One elegant algorithm (Blahut, 1972) can be used to efficiently solve for the optimal memory channel for a given information source, loss function, and constraint on information rate. This approach is illustrated in Figure 1.

*distortion*) according to a specified loss function. Decreasing expected loss (moving to the left along the *x*-axis) requires a corresponding increase in the rate of information transmission by the channel, as illustrated by the curve. If an upper bound is placed on channel capacity (illustrated by the horizontal line and shaded region), then an optimally efficient memory system is defined by the intersection of these two lines, illustrated by the plot marker in Figure 1. Blahut (1972, figure 3) derived an iterative algorithm for computing the rate–distortion curve for arbitrary discrete loss functions. This algorithm can be used to search for the point along the rate–distortion curve that satisfies Equation 3. In order to apply this algorithm to typical visual working memory experiments, it is only necessary to discretize the stimuli and responses with a suitably small bin size.
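The iterative procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration of the standard Blahut-style fixed-point iteration, not the author's code: the function names, the Lagrange multiplier `slope`, the bisection search over it, and the 60-bin circular example are all assumptions made for the sketch.

```python
import numpy as np

def blahut(p_x, loss, slope, n_iter=200):
    """Fixed-point iteration for the channel q(y|x) at one trade-off slope."""
    q_y = np.full(loss.shape[1], 1.0 / loss.shape[1])  # initial output marginal
    kernel = np.exp(-slope * loss)
    for _ in range(n_iter):
        q_yx = kernel * q_y                        # unnormalized q(y|x)
        q_yx /= q_yx.sum(axis=1, keepdims=True)    # each row sums to 1
        q_y = p_x @ q_yx                           # update output marginal
    return q_yx

def rate_bits(p_x, q_yx):
    """Mutual information between input and output, in bits."""
    q_y = p_x @ q_yx
    joint = p_x[:, None] * q_yx
    mask = joint > 0
    ratio = q_yx[mask] / np.broadcast_to(q_y, q_yx.shape)[mask]
    return float(np.sum(joint[mask] * np.log2(ratio)))

def expected_loss(p_x, q_yx, loss):
    """Average loss incurred by the channel."""
    return float(np.sum(p_x[:, None] * q_yx * loss))

def channel_for_rate(p_x, loss, target_bits, lo=1e-3, hi=1e3, n_steps=40):
    """Bisect on the slope until the channel transmits at the target rate."""
    for _ in range(n_steps):
        mid = np.sqrt(lo * hi)                     # geometric midpoint
        if rate_bits(p_x, blahut(p_x, loss, mid)) < target_bits:
            lo = mid
        else:
            hi = mid
    return blahut(p_x, loss, np.sqrt(lo * hi))

# Hypothetical example: a circular feature discretized into 60 bins,
# a uniform source, and a linear loss normalized to [0, 1].
n = 60
theta = np.linspace(-np.pi, np.pi, n, endpoint=False)
z = np.abs(np.angle(np.exp(1j * (theta[None, :] - theta[:, None]))))
p_x = np.full(n, 1.0 / n)
linear_loss = z / np.pi
q = channel_for_rate(p_x, linear_loss, target_bits=1.0)
```

Each row of `q` is the predicted response distribution for one stimulus bin; with a finer discretization the same code approximates a continuous error distribution more closely.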

*ρ*(*z*) ∼ (1 − cos[*z*]); linear, *ρ*(*z*) ∼ *z*; a step function; and quadratic, *ρ*(*z*) ∼ *z*^{2}. Note that only relative cost matters, so the predictions are invariant to multiplying the loss function by a constant. Each loss function was therefore normalized to the range [0, 1]. I then computed the optimal memory channel for each of these loss functions, assuming two different constraints on memory capacity, either 1 or 3 bits. The results of this analysis are shown in Figure 2, which plots each loss function (left column), along with the predicted memory error distribution (the probability distribution for the quantity *y* − *x*).

*ρ* and constraint on memory capacity *R*, is given by the optimal memory channel, *q**(*y*|*x*; *ρ*, *R*). By searching through the space of possible loss functions, one can determine the function that maximizes the likelihood of the observed data.

*w* indicates the bin width. With a likelihood function defined, it is possible to recover the loss function by maximum likelihood estimation or Bayesian inference. To reduce the complexity of the inference process, one can specify a parameterized family of loss functions, *ρ*(*z*; *θ⃗*), where *θ⃗* are the parameters. An ideal candidate should be flexible (able to capture a wide range of different loss functions), while having a small number of parameters (to facilitate the inference process). In this paper, I adopted the following parametric family of loss functions:

*μ* determines the error magnitude *z* at which the loss reaches half of the maximum value, while *β* controls the steepness of the function around the point *μ*. Figure 3 illustrates a number of different loss functions constructed from this family by varying the parameters *μ* and *β*. By varying the parameters, it is also possible to exactly meet or closely approximate each of the loss functions shown in Figure 2. With a parametric loss function defined, it is straightforward to estimate the parameters of this function, along with memory capacity, via maximum likelihood estimation. Appendix A reports a parameter recovery analysis, in which artificial datasets are generated, and the model-fitting procedure is examined to determine how well it is able to recover the parameters used to generate the data.
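To make the maximum likelihood step concrete, the sketch below scores binned data under a discrete memory channel. It assumes that the likelihood of one response is the channel probability of its bin divided by the bin width, which converts a bin probability into a density; the paper's own likelihood equation did not survive in this copy, so this form, the function name `log_likelihood`, and its arguments are assumptions rather than the author's definition.

```python
import numpy as np

def log_likelihood(q_yx, x_bins, y_bins, bin_width):
    """Summed log density of observed (stimulus, response) bin pairs.

    q_yx      : (n_x, n_y) array, q_yx[i, j] = q(y_j | x_i)
    x_bins    : stimulus bin index for each trial
    y_bins    : response bin index for each trial
    bin_width : width of one response bin (assumed equal for all bins)
    """
    probs = q_yx[np.asarray(x_bins), np.asarray(y_bins)]
    return float(np.sum(np.log(probs / bin_width)))

# Hypothetical two-bin channel and two trials, for illustration only.
toy_channel = np.array([[0.50, 0.50],
                        [0.25, 0.75]])
toy_ll = log_likelihood(toy_channel, x_bins=[0, 1], y_bins=[1, 1], bin_width=1.0)
```

In a full fit, an optimizer would propose values of the loss parameters and the capacity, compute the corresponding optimal channel, and keep the setting that maximizes this quantity.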

*F*(1, 11) = 80.31, *p* < 0.001, while also having lower variance, *F*(1, 11) = 51.31, *p* < 0.001. In addition, there were subtle differences between the error distributions for the single feature and conjunction conditions. To illustrate these differences, Figure 5b plots the difference in frequency of error between the single feature and conjunction conditions (conjunction – single feature). Compared to the single feature condition, the conjunction condition exhibits a relative decrease in small errors, with a corresponding increase in the “shoulders” of the error distribution. In terms of summary statistics (Figure 6), these changes correspond to both an increase in variance, *F*(1, 11) = 12.83, *p* < 0.001, and a decrease in kurtosis in the conjunction conditions, *F*(1, 11) = 38.71, *p* < 0.001, compared to the single feature conditions.

*n* is the number of parameters in the model, and log(*L*) is the maximum log-likelihood value for the model. The model with the lowest relative AIC score is the preferred explanation for the data, and differences in AIC score can be interpreted in terms of the relative strength of alternative explanations. Burnham and Anderson (2004) suggest as a rough guideline that models with a difference in AIC value (ΔAIC) ≤ 2 have substantial support or evidence, models with 4 ≤ ΔAIC ≤ 7 have limited support, and models with ΔAIC ≥ 10 have “essentially no support” compared to the preferred model. As an additional check on the robustness and consistency of the model comparison, the 16 models were also evaluated using two-fold cross validation, by successively fitting each model to one half of the data via maximum likelihood and examining the log-likelihood of the held-out data. The results of these analyses are provided in Table 1 and illustrated in Figure 7.

| Model | Factor A | Factor B | Factor C | Factor D | Number of model parameters | ΔAIC | CV log-likelihood | Model rank (AIC) | Model rank (CV) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | N | N | N | N | 12 | 2.74 | −848.69 | 2 | 2 |
| 2 | N | N | N | Y | 8 | 0.00 | −846.12 | 1 | 1 |
| 3 | N | N | Y | N | 8 | 9.31 | −850.31 | 4 | 5 |
| 4 | N | N | Y | Y | 6 | 7.75 | −849.39 | 3 | 3 |
| 5 | N | Y | N | N | 10 | 16.13 | −850.72 | 6 | 6 |
| 6 | N | Y | N | Y | 6 | 13.29 | −849.52 | 5 | 4 |
| 7 | N | Y | Y | N | 6 | 23.36 | −854.10 | 8 | 8 |
| 8 | N | Y | Y | Y | 4 | 21.78 | −853.17 | 7 | 7 |
| 9 | Y | N | N | N | 10 | 76.06 | −882.39 | 10 | 10 |
| 10 | Y | N | N | Y | 6 | 73.49 | −881.12 | 9 | 9 |
| 11 | Y | N | Y | N | 6 | 88.37 | −887.72 | 14 | 14 |
| 12 | Y | N | Y | Y | 4 | 86.72 | −886.66 | 12 | 12 |
| 13 | Y | Y | N | N | 9 | 87.22 | −887.30 | 13 | 13 |
| 14 | Y | Y | N | Y | 5 | 85.26 | −885.89 | 11 | 11 |
| 15 | Y | Y | Y | N | 5 | 100.39 | −892.88 | 16 | 16 |
| 16 | Y | Y | Y | Y | 3 | 98.79 | −891.84 | 15 | 15 |
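The AIC comparison reported in Table 1 can be reproduced from per-model parameter counts and maximum log-likelihoods using the standard definition AIC = 2*n* − 2 log(*L*). The sketch below is illustrative only: the helper names and the two example models with their numbers are hypothetical and are not values from the paper.

```python
def aic(n_params, log_likelihood):
    """Akaike information criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood

def delta_aic(scores):
    """AIC of each model relative to the best (lowest-AIC) model."""
    best = min(scores.values())
    return {name: score - best for name, score in scores.items()}

# Hypothetical models: (number of parameters, maximum log-likelihood).
example_models = {"full": (12, -840.0), "reduced": (8, -842.5)}
example_scores = {name: aic(n, ll) for name, (n, ll) in example_models.items()}
example_deltas = delta_aic(example_scores)
```

Here the reduced model gives up 2.5 log-likelihood units but saves four parameters, so it ends up with the lower AIC; the preferred model always has ΔAIC = 0.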

| Memory load | Visual feature | *R*, bits | *μ* | *β* |
|---|---|---|---|---|
| Single feature | Color | 1.42 (0.27) | 0.85 (0.12) | 1.79 (0.22) |
| Single feature | Orientation | 2.20 (0.63) | 0.68 (0.09) | 1.98 (0.30) |
| Conjunction | Color | 1.13 (0.38) | 0.85 (0.12) | 1.79 (0.22) |
| Conjunction | Orientation | 1.86 (0.66) | 0.68 (0.09) | 1.98 (0.30) |

*F*(1, 11) = 40.65, *p* < 0.001, and lower for the conjunction condition compared to the single feature condition, *F*(1, 11) = 40.65, *p* < 0.001. Memory capacity decreased by 20% for color and 15% for orientation in the conjunction condition compared to the single feature condition. A simple model of visual memory, according to which a single memory capacity is evenly shared across all encoded objects, might predict that capacity should decrease by half in the conjunction condition (as the number of attended features is doubled). However, this prediction is complicated by several factors. First, perceptual noise and response noise contribute to response variability, but these are independent of memory load. Hence, capacity estimates in the single feature condition may underestimate total memory capacity. In addition, it is possible that subjects sometimes encoded both the color and orientation of the objects in the single feature condition. For both reasons, the total capacity of visual memory is likely higher than observed in the single feature condition. The obtained results do, however, rule out strict independence between visual working memory for distinct features. Encoding both the color and orientation of a visual feature decreases the memory precision with which either can be recalled.

*μ* significantly differed between color and orientation, *F*(1, 11) = 17.23, *p* < 0.002, while there were no significant differences in the *β* parameter.

*F*(1, 11) = 0.11, *p* = 0.074, *ns*]. Hence, model comparison based on AIC scores and comparison of the parameters in the unconstrained model lead to the same conclusion.

Let *x* indicate the true feature value for the target, and *y* refer to the memory representation of *x*. The feature value for the distractor is indicated by *x*_{d}. Under these circumstances, an identification error will occur when the angular difference between *y* and *x*_{d} is less than the difference between *y* and *x*. If the orientation of the distractor is independent of the orientation of the target, the cost function (i.e., the probability of error) for this task can be explicitly derived: *ρ*(*z*) = *z*/*π*. This states that the probability of making an identification error increases linearly with the magnitude of memory error (given by *z*). The corresponding optimal visual memory channel for minimizing this cost function is shown in the second row, right column of Figure 2. As previously noted, this distribution exhibits a sharper peak and heavier tails than a von Mises distribution matched in variance, a property that is in qualitative agreement with human visual working memory performance (Bays, 2014; van den Berg et al., 2012). Equation 7 yields the task-defined loss function when there is a single distractor item. This loss function can be extended in a straightforward manner to handle the case when there are multiple distractors. Appendix B provides analytical expressions for the relevant loss functions for target identification with up to four distractors.
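The linear form *ρ*(*z*) = *z*/*π* for a single distractor can be checked numerically. The sketch below is a Monte Carlo illustration under stated assumptions: features are angles in radians on a full circle, the distractor is drawn uniformly and independently of the target, and the helper names `circ_dist` and `identification_error_rate` are introduced here for illustration.

```python
import numpy as np

def circ_dist(a, b):
    """Absolute angular difference between two angles, in [0, pi]."""
    return np.abs(np.angle(np.exp(1j * (a - b))))

def identification_error_rate(z, n_trials=200_000, seed=0):
    """Fraction of trials where the distractor is closer to the memory than the target.

    z is the absolute memory error |y - x| in radians.
    """
    rng = np.random.default_rng(seed)
    x = 0.0                      # target value (arbitrary by symmetry)
    y = x + z                    # remembered value, off by z
    x_d = rng.uniform(-np.pi, np.pi, size=n_trials)  # independent distractor
    return float(np.mean(circ_dist(y, x_d) < circ_dist(y, x)))

# Compare simulated error rates with the analytic prediction z / pi.
comparison = {z: (identification_error_rate(z), z / np.pi) for z in (0.5, 1.5, 2.5)}
```

The simulated rate tracks *z*/*π* because a uniformly placed distractor falls within angular distance *z* of the remembered value with probability 2*z*/(2*π*).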

*set size effect*. Information theory naturally predicts this qualitative pattern, as illustrated in Figure 2. In fact, the only assumption necessary to explain a decrease in memory precision is that, as more visual features are encoded in memory, less capacity is available to encode each one. An information-theoretic model has previously been shown to offer a close quantitative fit to human performance (Sims et al., 2012). However, this previous work assumed a particular loss function (minimizing the squared error in memory). In addition, several alternative explanations for the set size effect have also been proposed.

*variable precision* model (van den Berg et al., 2012), visual working memory is a doubly stochastic process. Each memory item is recalled as a sample from a von Mises distribution, but the precision of this recall distribution is itself a stochastic variable (modeled as a gamma distribution). This variability in encoding precision leads to a memory error distribution that deviates from a von Mises distribution, even while individual items are von Mises distributed.

^{4} Each dataset consists of the results from a standard delayed estimation visual memory task, with set size varying between one and eight items. Here, I examine how a decision-theoretic model compares to the VP-P model. I focus on seven of the 10 available datasets, listed in the left column of Table 3. Five of the experiments examine visual memory for color values uniformly sampled from a color wheel, while two of the experiments examine visual memory for orientation. The remaining three datasets analyzed by van den Berg et al. (2014) are not considered in the present paper since they employ visual features distributed in the range 0°–180° rather than 0°–360°; fitting these datasets is straightforward but requires specifying a modified family of loss functions. The seven experiments differ by numerous factors, such as the visual eccentricity and stimulus presentation time; complete methodological details can be found in the references listed in Table 3.

| Reference | Feature | Set sizes | Model ΔAIC: DT | Model ΔAIC: VP-P |
|---|---|---|---|---|
| Wilken & Ma, 2004 | Color | 1, 2, 4, 8 | 0.00 | +14.28 |
| Zhang & Luck, 2008 | Color | 1, 2, 3, 6 | 0.00 | +5.04 |
| Bays, Catalao, & Husain, 2009 | Color | 1, 2, 4, 6 | 0.00 | +0.56 |
| Anderson, Vogel, & Awh, 2011 | Orientation | 1–4, 6, 8 | +1.13 | 0.00 |
| Anderson & Awh, 2012 | Orientation | 1–4, 6, 8 | 0.00 | +0.24 |
| van den Berg et al., 2012 | Color (scrolling) | 1–8 | 0.00 | +5.75 |
| van den Berg et al., 2012 | Color (wheel) | 1–8 | 0.00 | +6.45 |
| Mean ΔAIC (all experiments) | | | 0.0 | +4.45 |

*K* + 2 free parameters, where *K* indicates the number of set size conditions, and the remaining two parameters (*μ* and *β*) characterize the loss function. Notably, the decision-theoretic model does not assume any variability in model parameters, such as trial-to-trial variability in capacity, the number of items encoded, the parameters of the loss function, or additive response noise. Rather, it is assumed that the variability in observed responses is entirely due to the rational minimization of expected loss for a given capacity constraint and loss function. Model parameters were fit separately to the data from each participant by maximum likelihood estimation. The data were discretized into 1,000 bins before fitting the model.

*R*_{total}. The red curves in Figure 11a plot *R*_{total}/*k*, where *k* indicates set size. This curve is the predicted drop-off in visual memory capacity according to a power law with exponent = −1. As can be seen, the theoretical prediction resembles the estimated capacity for large set sizes fairly closely, but substantially overestimates capacity in the single item conditions. In other words, given subjects' performance in the larger set size conditions, they should have performed better than they did in remembering a single item. This discrepancy may be explained simply by incorporating sensory noise and response noise into the model (as these factors will have the largest impact, relatively speaking, in the small set size conditions). However, this hypothesis remains to be tested in future research.

*some* loss function for visual working memory: a mathematical entity that quantifies the relative costs of making different kinds of memory errors. This function need not be explicitly represented in the brain, but rather is defined implicitly by the pattern of memory errors that the brain does and does not make.

*identifiability problem* raised by Anderson (1990).

*any* pattern of data. In other words, is the current model falsifiable? The present analysis considered only loss functions that are plausible a priori, for example, excluding loss functions in which it is preferable to make larger errors compared to smaller errors. Hence, in literal terms, the model is not capable of reproducing arbitrary error distributions. However, to give a better answer to the question of falsifiability, it is necessary to be clear about the claims that the model does and does not make. In particular, the current model does not make assumptions about neural mechanisms, or contradict existing implementation-level theories. Rather, the model can be understood in terms of a weak claim and a stronger claim regarding how costs influence visual working memory.

*some* loss function. This claim is probably not falsifiable, but as discussed above, this doesn't negate the utility or descriptive validity of the approach. The stronger claim is that the implicit loss function of visual working memory is shaped by the costs of memory error in natural tasks. Experiment 2 represents the first empirical test of this claim, but there is substantial room for future work. One important avenue for future research is developing techniques to measure the loss function of visual working memory based directly on measured performance in biologically relevant tasks, rather than separately estimating a loss function from a delayed estimation task. The finding that the measured loss function for visual working memory is substantially suboptimal in a biologically important task that a person is performing, *as they are performing it*, would constitute strong evidence against the framework.

*IEEE Transactions on Automatic Control*, 19 (6), 716–723.

*Trends in Cognitive Sciences*, 18 (11), 562–565.

*Psychological Science*, 15 (2), 106–111.

*The adaptive character of thought*. Hillsdale, NJ: Erlbaum.

*Attention, Perception, & Psychophysics*, 74, 891–910.

*Journal of Neuroscience*, 31 (3), 1128–1138.

*Sensory communication* (pp. 217–234). Cambridge, MA: MIT Press.

*Journal of Neuroscience*, 27 (26), 6984–6994.

*Journal of Neuroscience*, 34 (10), 3632–3645.

*Journal of Vision*, 9 (10): 7, 1–11, http://www.journalofvision.org/content/9/10/7, doi:10.1167/9.10.7.

*Science*, 321, 851–854.

*Vision and visual dysfunction: Vol. 8. Eye movements* (pp. 93–137). London, UK: Macmillan.

*Rate distortion theory: A mathematical basis for data compression*. Englewood Cliffs, NJ: Prentice-Hall.

*IEEE Transactions on Information Theory*, 18 (4), 460–473.

*Journal of Experimental Psychology: General*, 138 (4), 487–502.

*Journal of Vision*, 11 (5): 4, 1–34, http://www.journalofvision.org/content/11/5/4, doi:10.1167/11.5.4.

*Psychological Review*, 120 (1), 85.

*Journal of Vision*, 7 (5): 6, 1–12, http://www.journalofvision.org/content/7/5/6, doi:10.1167/7.5.6.

*Journal of Vision*, 9 (1): 24, 1–19, http://www.journalofvision.org/content/9/2/24, doi:10.1167/9.1.24.

*Sociological Methods & Research*, 33 (2), 261–304.

*Nature Reviews Neuroscience*, 13, 51–62.

*Trends in Cognitive Sciences*, 3 (2), 57–65.

*Journal of Experimental Psychology: Human Perception and Performance*, 35 (1), 94–107.

*Nature Neuroscience*, 8 (12), 1684–1689.

*The Journal of Experimental Biology*, 205, 3717–3727.

*Bayesian brain: Probabilistic approaches to neural coding*. Cambridge, MA: MIT Press.

*Journal of Mathematical Psychology*, 45, 497–542.

*Statistical analysis of circular data*. Cambridge, UK: Cambridge University Press.

*The Journal of Neuroscience*, 5 (7), 1688–1703.

*Journal of Vision*, 10 (12): 27, 1–11, http://www.journalofvision.org/content/10/12/27, doi:10.1167/10.12.27.

*Nature Communications*, 3: 1229, 1–8, doi:10.1038/ncomms2237.

*Trends in Cognitive Sciences*, 17 (3), 134–141.

*Annual Review of Psychology*, 59, 167–192.

*Cognitive Psychology*, 38, 129–166.

*Journal of Neuroscience*, 31, 8502–8511.

*Signal detection theory and psychophysics*. Los Altos, CA: Peninsula Publishing.

*Current Directions in Psychological Science*, 21 (4), 263–268.

*2010 IEEE Information Theory Workshop (ITW)*, 181–185.

*Nature*, 394, 780–784.

*Journal of Vision*, 3 (1): 6, 49–63, http://www.journalofvision.org/content/3/1/6, doi:10.1167/3.1.6.

*Memory & Cognition*, 39, 412–432.

*Journal of Experimental Psychology: General*, 137 (1), 163–181.

*Journal of Neuroscience*, 32 (6), 2182–2190.

*PLoS Computational Biology*, 9 (2), e1002927.

*Journal of Neuroscience*, 31 (4), 1219–1237.

*Science*, 318, 606–610.

*Proceedings of the National Academy of Sciences, USA*, 101 (26), 9839–9842.

*Journal of Vision*, 7 (6): 4, 1–15, http://www.journalofvision.org/content/7/6/4, doi:10.1167/7.6.4.

*Nature Neuroscience*, 1 (1), 36–41.

*Journal of Neuroscience*, 27 (35), 9354–9368.

*Nature*, 390 (6657), 279–281.

*Trends in Cognitive Sciences*, 17 (8), 391–400.

*Trends in Cognitive Sciences*, 16 (10), 511–518.

*Nature Neuroscience*, 17 (3), 347–356.

*Vision Research*, 50 (23), 2362–2374.

*Psychological Science*, 24 (12), 2351–2360.

*Vision*. San Francisco, CA: Freeman.

*Journal of Vision*, 13 (2): 21, 1–13, http://www.journalofvision.org/content/13/2/21, doi:10.1167/13.2.21.

*Journal of Experimental Psychology: Learning, Memory, and Cognition*, 39 (3), 760.

*Perception & Psychophysics*, 64 (7), 1055–1067.

*Psychological Review*, 120 (2), 297.

*Current Directions in Psychological Science*, 23 (3), 164–170.

*Journal of Experimental Psychology: Human Perception and Performance*, 16 (2), 332–350.

*Annual Review of Psychology*, 53, 245–277.

*The mathematical theory of communication*. Champaign, IL: University of Illinois Press.

*Annual Review of Neuroscience*, 24, 1193–1216.

*Journal of Neuroscience*, 31 (3), 928–943.

*Psychological Review*, 119 (4), 807–830.

*Attention, Perception, & Psychophysics*, 76 (7), 2071–2079.

*Nature Neuroscience*, 7 (9), 907–915.

*Psychological Science*, 17 (11), 981–988.

*Spatial Vision*, 16 (3), 255–275.

*Psychological Review*, 121 (1), 124–149.

*Proceedings of the National Academy of Sciences, USA*, 109 (22), 8780–8785.

*Journal of Vision*, 8 (3): 2, 1–15, http://www.journalofvision.org/content/8/3/2, doi:10.1167/8.3.2.

*Journal of Vision*, 4 (12): 11, 1120–1135, http://www.journalofvision.org/content/4/12/11, doi:10.1167/4.12.11.

*Current Opinion in Neurobiology*, 22, 996–1003.

*Psychonomic Bulletin & Review*, 11 (2), 269–274.

*Nature*, 453, 233–236.

^{1}Although “bits” are commonly associated with digital computers and binary coding, this unit of measure is equally applicable to analog systems, and defining capacity in this way does not make any assumptions about the nature of information coding. By analogy, a foot is a unit of length, but this does not require an item measuring 0.75 feet in length be constructed out of (fractions of) physical “feet.”

^{4}At the time of writing, the datasets and accompanying model code can be obtained from the website of Ronald van den Berg, http://www.ronaldvandenberg.org/code.html.

*SD* = 0.16). Figure 12b plots the histogram of reconstruction errors across all 1,400 artificial datasets, using 2,000 equal-width bins in the range