**Rhesus monkeys are widely used as an animal model for human memory, including visual working memory (VWM). It is, however, unknown whether the same principles govern VWM in humans and rhesus monkeys. Here, we tested both species in nearly identical change-localization paradigms and formally compared the same set of models of VWM limitations. These models include the classic item-limit model and recent noise-based (resource) models, as well as hybrid models that combine a noise-based representation with an item limit. By varying the magnitude of the change in addition to the typical set size manipulation, we were able to show large differences in goodness of fit among the five models tested. In spite of quantitative performance differences between the species, we find that the variable-precision model—a noise-based model—best describes the behavior of both species. Adding an item limit to this model does not help to account for the data. Our results suggest evolutionary continuity of VWM across primates and help establish the rhesus monkey as a model system for studying the neural substrates of multiple-item VWM.**

**Figure 1**


Three rhesus monkeys (*Macaca mulatta*; weights: M1 = 16.5 kg, M2 = 14.5 kg, M3 = 13.5 kg; ages: M1 = 17.5 years, M2 = 16.5 years, M3 = 12.5 years) were tested in a change-localization experiment five days a week. Food and water were regulated prior to experimental sessions. After completing daily testing, animals were returned to their caging room, where they were housed individually and received primate chow and water to maintain their normal body weight. All animal procedures were performed in accordance with the National Institutes of Health guidelines, approved by the institutional review board of the University of Texas Health Science Center at Houston, and supervised by the Institutional Animal Care and Use Committee.

^{2} displayed on a black background. Based on the average distance of the monkey from the screen (approximately 35 cm), the stimuli subtended a visual angle of approximately 2.9° × 0.65°. Stimuli were presented in six possible locations on the screen, arranged on an imaginary circle of radius 7.4° (see Apparatus).

*π*) by multiplying all orientations and orientation-change magnitudes by 2 before analysis. All equations in this article are consistent with this convention, but orientations and orientation changes in the figures are back in actual orientation space.

*K* items. When *N* ≤ *K*, all items are stored. The probability of being correct is then 1 − *ε*, where *ε* accounts for lapses of attention and unintended responses. When *N* > *K*, *K* randomly selected items from the sample display are stored. When the test display appears, there are three scenarios to consider:

- Both test items correspond to stored sample items. This happens with probability $\frac{K(K-1)}{N(N-1)}$. The probability of being correct is then 1 − *ε*.
- One test item corresponds to a stored sample item and the other does not. This happens with probability $\frac{2K(N-K)}{N(N-1)}$. The probability of being correct is then 1 − *ε*.
- Neither test item corresponds to a stored sample item. This happens with probability $\frac{(N-K)(N-K-1)}{N(N-1)}$. The observer then has to guess about which item changed, and the probability of being correct is 0.5.

*N* items (*K* = *N*) yields the same proportion correct, namely 1 − *ε*, as storing only *N* − 1 items, since even if one test item is not stored, the trial can be answered correctly by using the other test item. As can be seen from the equation, in the IL model the proportion correct depends on set size but not on change magnitude.
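The three scenarios above can be combined into a single proportion correct. The following is a minimal sketch, assuming that the *K* stored items are a uniformly random subset of the *N* sample items; the function and argument names are illustrative and not taken from the original analysis code.

```python
def il_proportion_correct(n_items: int, capacity: int, lapse: float) -> float:
    """Proportion correct under the item-limit (IL) model.

    n_items  -- set size N (at least 2, because two test items are shown)
    capacity -- item limit K (non-negative integer)
    lapse    -- lapse rate epsilon
    """
    if n_items < 2:
        raise ValueError("change localization needs at least two sample items")
    if n_items <= capacity:
        # All items are stored; only lapses produce errors.
        return 1.0 - lapse
    n, k = n_items, capacity
    # Probability that neither of the two test items was stored
    # (assumes a uniformly random stored subset of size K).
    p_neither = (n - k) * (n - k - 1) / (n * (n - 1))
    # At least one stored test item suffices to answer; otherwise guess (0.5).
    return (1.0 - lapse) * (1.0 - p_neither) + 0.5 * p_neither
```

Note that the result depends only on *N*, *K*, and *ε*: the change magnitude never enters, which is why this model predicts flat performance across change magnitudes.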

*φ*_{1} and *φ*_{2}, are known noiselessly to the observer, because they remain on the screen until the subject responds. We model the memories of the orientations in the sample display as noisy. Noise can stem from encoding (presentation time was limited) or maintenance of memories; we do not distinguish between these sources. We model the noisy memory of the *i*th item in the sample display, denoted *x*_{i} (*i* = 1, …, *N*), as following a von Mises distribution (a circular analog of a Gaussian distribution, used because orientation space is periodic) centered at the true stimulus *θ*_{i} with concentration parameter *κ*_{i}:

$$p(x_i \mid \theta_i) = \frac{1}{2\pi I_0(\kappa_i)}\, e^{\kappa_i \cos(x_i - \theta_i)}, \tag{1}$$

where *I*_{0} is the modified Bessel function of the first kind of order 0 (Mardia & Jupp, 1999). The concentration parameter controls the width of the noise distribution, and the Bessel function serves as a normalization. We have postulated previously that the role of precision is played by the Fisher information in this memory representation, denoted *J*_{i} (Keshvari et al., 2013; van den Berg et al., 2012). Fisher information determines the best possible performance of any unbiased estimator through the Cramér–Rao bound (Cover & Thomas, 1991). When the measurement *x* follows a Gaussian distribution, Fisher information is equal to inverse variance, $J = 1/\sigma^2$. When neural variability is Poisson-like, Fisher information is proportional to the gain of a population (Seung & Sompolinsky, 1993). Thus, our choice of using Fisher information for precision is consistent with an interpretation of neural activity as “memory resource” (Bays, 2014; Ma et al., 2014; van den Berg et al., 2012). For Equation 1, Fisher information is related to the concentration parameter through

$$J_i = \kappa_i\,\frac{I_1(\kappa_i)}{I_0(\kappa_i)}, \tag{2}$$

where *I*_{1} is the modified Bessel function of the first kind of order 1. The relationship between precision and the concentration parameter is nearly the identity mapping, and none of our results would qualitatively change if we were to replace *J*_{i} with *κ*_{i}.
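To make the mapping between precision and concentration concrete, here is a minimal sketch that evaluates *J* = *κ* *I*_{1}(*κ*)/*I*_{0}(*κ*) and inverts it numerically. It is an illustration only: the Bessel ratio is computed from the standard integral representation with a trapezoid rule (so that only the Python standard library is needed), and the function names, step count, and tolerance are assumptions rather than the authors' implementation.

```python
import math


def bessel_ratio(kappa: float, n_steps: int = 2000) -> float:
    """I1(kappa) / I0(kappa) from I_n(k) = (1/pi) * int_0^pi exp(k cos t) cos(n t) dt.

    The integrand is rescaled by exp(-kappa) to avoid overflow; the factor
    cancels in the ratio.
    """
    h = math.pi / n_steps
    num = den = 0.0
    for j in range(n_steps + 1):
        t = j * h
        w = 0.5 if j in (0, n_steps) else 1.0  # trapezoid end-point weights
        e = w * math.exp(kappa * (math.cos(t) - 1.0))
        num += e * math.cos(t)
        den += e
    return num / den


def fisher_information(kappa: float) -> float:
    """Precision J of a von Mises memory with concentration kappa."""
    return kappa * bessel_ratio(kappa)


def kappa_from_fisher(j: float, tol: float = 1e-9) -> float:
    """Invert J(kappa) by bisection; J is increasing in kappa."""
    if j <= 0.0:
        return 0.0
    lo, hi = 0.0, 1.0
    while fisher_information(hi) < j:  # grow the bracket until it contains j
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if fisher_information(mid) < j:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For large concentrations the two quantities differ by roughly one half (for example, *κ* = 20 gives *J* ≈ 19.5), which illustrates why the mapping is described as nearly the identity.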

*K* items can be stored. Thus the number of stored items is min(*N*, *K*). The precision of a stored item is inversely related to the number of stored items through a power law:

$$J = J_{N=1}\,\min(N, K)^{\alpha},$$

where $J_{N=1}$ is the precision of a single item and *α* is the power-law exponent. When *N* ≤ *K*, the EPF model is equal to the EP model. The slots-plus-averaging model is very similar to this model (as was quantitatively shown by van den Berg, Awh, & Ma, 2014).

*N* items are drawn independently from a gamma distribution with mean *J̄* and scale parameter *τ* (a flexible family of distributions on the positive real line). We further assume that its mean is inversely related to set size through a power law:

$$\bar{J} = \bar{J}_{N=1}\, N^{\alpha},$$

where *J̄*_{N=1} is the mean precision of a single item (Figure 2).
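As an illustration of the variable-precision assumption, the sketch below draws per-item precisions from a gamma distribution whose mean follows the power law in set size. It assumes the usual shape/scale parameterization, so that a gamma distribution with mean *J̄* and scale *τ* has shape *J̄*/*τ*; the function names and example parameter values are hypothetical.

```python
import random


def mean_precision(n_items: int, j1_bar: float, alpha: float) -> float:
    """Power-law mean precision: J_bar = J_bar(N=1) * N ** alpha."""
    return j1_bar * n_items ** alpha


def sample_precisions(n_items: int, j1_bar: float, alpha: float,
                      tau: float, rng: random.Random) -> list:
    """Draw one precision per item, independently, for a single trial."""
    j_bar = mean_precision(n_items, j1_bar, alpha)
    shape = j_bar / tau  # mean = shape * scale, with scale = tau
    # random.gammavariate(alpha, beta) takes the shape first and the scale second.
    return [rng.gammavariate(shape, tau) for _ in range(n_items)]
```

For example, `sample_precisions(4, 20.0, -1.0, 5.0, random.Random(0))` returns four precisions whose expected value is 5; with a more negative exponent the mean would fall off faster with set size.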

**Figure 2**


min(*N*, *K*); thus, no more than *K* items can be stored. The precision of a stored item is again drawn from a gamma distribution with mean *J̄* and scale parameter *τ*, and the mean is inversely related to the number of stored items through a power law:

$$\bar{J} = \bar{J}_{N=1}\,\min(N, K)^{\alpha}.$$

*N*-alternative change localization and change-detection tasks (Keshvari et al., 2012, 2013; van den Berg et al., 2012), but differs in the details.

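*L* of the change (1 or 2), the magnitude Δ of the change, the relevant sample orientations *θ*_{1} and *θ*_{2} (all other sample items are irrelevant to the decision), their noisy memories *x*_{1} and *x*_{2}, and the two test orientations *φ*_{1} and *φ*_{2}. The ideal observer responds that the change occurred at location 1 when the log posterior ratio is positive:

$$\log\frac{p(L=1 \mid x_1, x_2, \varphi_1, \varphi_2)}{p(L=2 \mid x_1, x_2, \varphi_1, \varphi_2)} = \kappa_2\cos(x_2-\varphi_2) - \kappa_1\cos(x_1-\varphi_1) + \log\frac{I_0(\kappa_1)}{I_0(\kappa_2)} > 0. \tag{3}$$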

*κ*_{1} and *κ*_{2} on each trial. The decision rule is valid for both the VP and EP models. In the VP model, precision per item is a random variable, and therefore *κ*_{1} and *κ*_{2} will generally not be equal to each other. However, in the case of the EP model, we have *κ*_{1} = *κ*_{2} and the inequality simplifies to

$$\cos(x_2-\varphi_2) > \cos(x_1-\varphi_1). \tag{4}$$

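*N* > *K*. Then, a noisy measurement has a probability of not being stored. This is equivalent to setting the concentration parameter of the corresponding memory to 0. Thus, we can immediately obtain the decision rule for the EPF model by taking special cases of Equation 3. With *κ* denoting the common concentration parameter of the stored items, the observer responds that the change occurred at location 1 when

$$\begin{cases}
\cos(x_2-\varphi_2) > \cos(x_1-\varphi_1) & \text{if both items are stored,}\\
\log I_0(\kappa) - \kappa\cos(x_1-\varphi_1) > 0 & \text{if only item 1 is stored,}\\
\kappa\cos(x_2-\varphi_2) - \log I_0(\kappa) > 0 & \text{if only item 2 is stored,}
\end{cases} \tag{5}$$

and guesses when neither item is stored.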

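*N* ≤ *K*. When *N* > *K*, just as in the EPF model, a noisy measurement has a probability of not being stored (precision = 0). But unlike in the EPF model, the concentration parameters *κ*_{1} and *κ*_{2} in the VPF model are independent. With these modifications, we can again take the special cases of Equation 3 and obtain the decision rules for the VPF model. The observer responds that the change occurred at location 1 when

$$\begin{cases}
\kappa_2\cos(x_2-\varphi_2) - \kappa_1\cos(x_1-\varphi_1) + \log\dfrac{I_0(\kappa_1)}{I_0(\kappa_2)} > 0 & \text{if both items are stored,}\\
\log I_0(\kappa_1) - \kappa_1\cos(x_1-\varphi_1) > 0 & \text{if only item 1 is stored,}\\
\kappa_2\cos(x_2-\varphi_2) - \log I_0(\kappa_2) > 0 & \text{if only item 2 is stored,}
\end{cases} \tag{6}$$

and guesses when neither item is stored.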

*x*_{1} and *x*_{2} on each trial, the models would predict the observer's response exactly. Since we do not know *x*_{1} and *x*_{2}, the best we can do is to compute the *probability* of being correct for a given stimulus condition. Under the assumptions in our generative model, the stimulus condition is determined completely by set size *N* and change magnitude Δ, and the values of *θ*_{1} and *θ*_{2} are irrelevant. Thus, we are interested in the probability that the decision rule (Equation 3 for VP, Equation 4 for EP, Equation 5 for EPF, and Equation 6 for VPF) returns the correct location when the memories *x*_{1} and *x*_{2} follow their model-specific distributions given *N* and Δ. Without loss of generality, we compute the proportion correct by taking *θ*_{1} = *θ*_{2} = 0 and *L* = 1, so that *φ*_{1} = Δ and *φ*_{2} = 0.

*x*_{1} and *x*_{2} are different: where *κ*_{i} is related to *J*_{i} through Equation 2.

(*N*, Δ) combination, we drew 10,000 random samples of *x*_{1} and *x*_{2} (and in the case of the VP and VPF models, of *J*_{1} and *J*_{2} first). For each sample, we evaluated the decision rule and then computed the proportion of correct responses across all samples.
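The sampling procedure can be sketched for the VP model as follows. This is a simplified illustration, not the authors' code. It assumes that (a) the change is at location 1, so the test orientation at location 1 is offset by Δ from its sample orientation, (b) the decision rule is the log posterior ratio that follows from uniform priors over orientations and change magnitudes, and (c) *κ* is approximated by *J* (the text notes that the mapping is nearly the identity) instead of inverting Equation 2. All names and default values are hypothetical.

```python
import math
import random


def log_i0(kappa: float, n_steps: int = 200) -> float:
    """log I0(kappa) via a trapezoid rule on the integral representation."""
    h = math.pi / n_steps
    total = 0.0
    for j in range(n_steps + 1):
        w = 0.5 if j in (0, n_steps) else 1.0
        total += w * math.exp(kappa * (math.cos(j * h) - 1.0))
    return kappa + math.log(total * h / math.pi)


def responds_location_1(x1, x2, phi1, phi2, kappa1, kappa2) -> bool:
    """True when the log posterior ratio favours a change at location 1."""
    d = (kappa2 * math.cos(x2 - phi2) - kappa1 * math.cos(x1 - phi1)
         + log_i0(kappa1) - log_i0(kappa2))
    return d > 0.0


def vp_proportion_correct(n_items, delta, j1_bar, alpha, tau,
                          n_samples=10_000, seed=0) -> float:
    """Monte Carlo estimate of proportion correct for one (N, delta) condition."""
    rng = random.Random(seed)
    shape = j1_bar * n_items ** alpha / tau  # gamma shape = mean / scale
    phi1, phi2 = delta, 0.0  # theta1 = theta2 = 0, change at location 1
    n_correct = 0
    for _ in range(n_samples):
        # Assumption: kappa is set equal to J instead of inverting Equation 2.
        kappa1 = rng.gammavariate(shape, tau)
        kappa2 = rng.gammavariate(shape, tau)
        x1 = rng.vonmisesvariate(0.0, kappa1)  # noisy memory of theta1 = 0
        x2 = rng.vonmisesvariate(0.0, kappa2)  # noisy memory of theta2 = 0
        if responds_location_1(x1, x2, phi1, phi2, kappa1, kappa2):
            n_correct += 1
    return n_correct / n_samples
```

Change magnitudes here are expressed in the doubled orientation space, so Δ ranges up to π. Repeating the call over a grid of (*N*, Δ) values and parameter combinations would give the table of model predictions needed for fitting.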

*N*, Δ) combination for one parameter combination.

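**t**, the log likelihood of **t** (the parameter log likelihood) is

$$\mathrm{LL}(\mathbf{t}) = \log p(\text{data} \mid \mathbf{t}) = \log \prod_{i=1}^{n_{\text{trials}}} p(\text{correctness}_i \mid N_i, \Delta_i, \mathbf{t}),$$

where the product is over trials (from 1 to *n*_{trials}) and correctness_{i} is 1 if the subject was correct on the *i*th trial and 0 if not. We can rewrite this as

$$\mathrm{LL}(\mathbf{t}) = \sum_{N,\,\Delta,\,\text{correct}} n(N, \Delta, \text{correct})\, \log p(\text{correct} \mid N, \Delta, \mathbf{t}),$$

where we grouped trials by set size *N*, change magnitude Δ, and whether the observer was correct or incorrect, and *n*(*N*, Δ, correct) is the number of trials with a particular *N*, Δ, and correctness.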
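The grouped form of the log likelihood translates directly into code. The sketch below assumes the data are summarized as counts of correct and incorrect trials per (*N*, Δ) condition and that model predictions are given as a proportion correct per condition; the dictionary layout and the small clipping constant (which guards against log 0) are illustrative choices, not part of the original method.

```python
import math


def log_likelihood(counts: dict, predicted: dict, eps: float = 1e-12) -> float:
    """Parameter log likelihood from grouped trial counts.

    counts    -- {(N, delta): (n_correct, n_incorrect)}
    predicted -- {(N, delta): model-predicted probability of a correct response}
    """
    ll = 0.0
    for condition, (n_correct, n_incorrect) in counts.items():
        # Clip predictions away from 0 and 1 (an added safeguard).
        p = min(max(predicted[condition], eps), 1.0 - eps)
        ll += n_correct * math.log(p) + n_incorrect * math.log(1.0 - p)
    return ll
```

Evaluating this function at every point of a parameter grid yields both the maximum log likelihood and the values needed for the marginal likelihood.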

**t**) by LL_{max}.

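*log marginal likelihood* of a model *m* given the data: LML(model) = log *p*(data|model). The attribute “marginal” refers to an integration (marginalization) over the parameters:

$$\mathrm{LML}(\text{model}) = \log \int p(\text{data} \mid \mathbf{t}, \text{model})\, p(\mathbf{t} \mid \text{model})\, d\mathbf{t}.$$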

*p*(**t**|model), we chose a product of uniform distributions (one for each parameter), with their domains just covering the grid used for model predictions and parameter estimation (see above). We denote the size of the range of the *j*th parameter by *R*_{j}, where *j* = 1, …, *k*. We also peak-normalize the exponential term so as to avoid highly negative numbers in the exponent, which could cause numerical underflow; we add a correction to compensate for this. This gives

$$\mathrm{LML}(\text{model}) = \mathrm{LL}_{\max} - \sum_{j=1}^{k} \log R_j + \log \int e^{\mathrm{LL}(\mathbf{t}) - \mathrm{LL}_{\max}}\, d\mathbf{t}. \tag{9}$$

*R*_{j} are specified in Table A1. Our choices for these ranges were initially guided by parameter estimates from previous publications (Keshvari et al., 2013; van den Berg et al., 2012). These ranges worked well for our human data. In the monkey data, however, we noticed that the parameter estimates of *J̄*_{N=1} and *τ* tended to be much smaller than the upper limits of these ranges. Since the computational time required for numerically evaluating the parameter likelihood in the Riemann sum is determined by the number of grid points, we reduced the ranges for those parameters so that, keeping the number of grid values within each range constant, we could obtain a finer resolution for our parameter estimates. This more efficient use of computational resources comes at the cost of no longer being able to interpret the uniform distribution *p*(**t**|model) as a prior, because it is now (albeit weakly) informed by the data. We will comment on the consequences of this choice in the Results.
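A peak-normalized Riemann sum of this kind can be sketched as follows. The sketch assumes a regular grid with the log likelihoods stored in a flat list, a uniform prior of width *R*_{j} per parameter, and a constant grid spacing per parameter; the function signature is hypothetical.

```python
import math


def log_marginal_likelihood(ll_grid: list, ranges: list, steps: list) -> float:
    """Approximate LML under uniform priors with a peak-normalized Riemann sum.

    ll_grid -- log likelihood at every grid point (flattened)
    ranges  -- width R_j of the uniform prior for each parameter
    steps   -- grid spacing for each parameter
    """
    ll_max = max(ll_grid)
    cell_volume = math.prod(steps)
    # Subtracting ll_max keeps the exponentials in a safe numerical range;
    # ll_max is added back outside the logarithm.
    riemann = sum(math.exp(ll - ll_max) for ll in ll_grid) * cell_volume
    log_prior_volume = sum(math.log(r) for r in ranges)
    return ll_max - log_prior_volume + math.log(riemann)
```

Without the subtraction of the maximum, log likelihoods on the order of minus several thousand would all underflow to zero when exponentiated.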

*J̄*_{N=1} and *τ* in the VP and VPF models. Those parameters were sometimes both overestimated or both underestimated, indicating that the data were approximately equally well fitted by a lower *J̄*_{N=1} and a lower *τ* as by a higher *J̄*_{N=1} and a higher *τ*. We conclude that we can trust the model comparison results but that the estimates of *J̄*_{N=1} and *τ* should be taken with a grain of salt.

LL_{max}, with a correction for the number *k* of free parameters in the model. These metrics are the corrected Akaike information criterion (Akaike, 1974; Hurvich & Tsai, 1989) and the Bayesian information criterion (Schwarz, 1978). In order to make these metrics comparable in magnitude to the log marginal likelihood LML(model), we report each of them multiplied by −0.5, so that the leading term is LL_{max}: AICc* = −0.5AICc and BIC* = −0.5BIC.
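For reference, the rescaled criteria can be computed as in the sketch below, which uses the standard definitions AICc = −2LL_{max} + 2*k* + 2*k*(*k* + 1)/(*n* − *k* − 1) and BIC = −2LL_{max} + *k* log *n*, with *n* the number of trials. The function names are illustrative.

```python
import math


def aicc_star(ll_max: float, k: int, n_trials: int) -> float:
    """-0.5 * AICc, so that the leading term is LL_max."""
    aicc = -2.0 * ll_max + 2.0 * k + 2.0 * k * (k + 1) / (n_trials - k - 1)
    return -0.5 * aicc


def bic_star(ll_max: float, k: int, n_trials: int) -> float:
    """-0.5 * BIC, so that the leading term is LL_max."""
    bic = -2.0 * ll_max + k * math.log(n_trials)
    return -0.5 * bic
```

On this scale, higher values indicate a better model, and differences can be read in the same units as differences in log marginal likelihood.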

*R*^{2}, and computed AICc*, BIC*, and LML. The mean of each of these was computed by averaging across all bootstrapped data sets from the same monkey, and the standard deviations served as estimates of the standard errors of the means.

*F*(3, 27) = 64.05, *p* < 0.001; change magnitude: *F*(8, 72) = 80.36, *p* < 0.001.

**Figure 3**


**Figure 4**


**Table 1**

*R*_{j} in Equation 9, which for monkey subjects were (weakly) informed by the data. Fortunately, our qualitative results are reasonably robust to these choices. If we assume that the parameter likelihood is zero outside the narrower [0, 30] ranges of the *J̄*_{N=1} and *τ* parameters in the VP model chosen for monkey subjects, then the effect of changing these ranges to the wider [0, 100] ranges we used in humans is to reduce the VP log marginal likelihoods by −2log(30) − (−2log(100)) = 2.4, which would not change the finding that VP outperforms IL, EP, and EPF. Moreover, the effect of changing the [0, 30] ranges of the *J̄*_{N=1} and *τ* parameters in the VPF model for monkeys to the wider [0, 200] ranges we chose in humans is to reduce the VPF log marginal likelihoods by 3.8, which would not change the finding that VP and VPF are indistinguishable. This robustness against the choice of parameter ranges arises because our differences in log marginal likelihoods are largely driven by the LL_{max} term.
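The size of these corrections can be checked with a few lines of arithmetic. The sketch below assumes, as in the text, that the likelihood is zero outside the narrower range, so that widening a uniform prior changes the log marginal likelihood only through the −log *R*_{j} terms; the function name is hypothetical.

```python
import math


def lml_change_from_widening(n_params: int, old_width: float, new_width: float) -> float:
    """Change in LML when n_params uniform priors are widened (negative = reduction)."""
    return -n_params * (math.log(new_width) - math.log(old_width))


vp_change = lml_change_from_widening(2, 30.0, 100.0)   # about -2.4
vpf_change = lml_change_from_widening(2, 30.0, 200.0)  # about -3.8
```

Both changes are reductions of a few log units, consistent with the values of 2.4 and 3.8 quoted above.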


**Figure 5**


**Table 2**

*α* in the relationship between mean precision and set size was similar across monkeys and somewhat higher in humans. In both species, however, the values were more negative than −1, indicating steep decreases in mean precision as set size increases. In the VPF model, the number of remembered items *K* was fitted as 3.5 in monkeys and 4.1 in humans; while consistent with earlier reports of *K*, this parameter should be interpreted with caution: in light of the finding that the VPF model is indistinguishable from the VP model, we cannot rule out the possibility that there is no item limit at all.

**Table 3**

*IEEE Transactions on Automatic Control*, 19 (6), 716–723.

*The Journal of Neuroscience*, 31 (3), 1128–1138.

*Psychological Science*, 18 (7), 622–628.

*The Journal of Neuroscience*, 18 (18), 7519–7534.

*The Journal of Neuroscience*, 34 (10), 3632–3645.

*Science*, 321 (5890), 851–854.

*Psychological Review*, 120 (1), 85–109.

*Proceedings of the National Academy of Sciences, USA*, 108 (27), 11252–11255.

*Nature Neuroscience*, 11 (6), 693–702.

*The Journal of Neuroscience*, 30 (45), 15241–15253.

*Elements of information theory*. New York: John Wiley & Sons.

*Behavioral and Brain Sciences*, 24 (1), 87–114.

*An introduction to the bootstrap*. New York: Chapman & Hall.

*Current Biology*, 21 (11), 975–979.

*Nature Communications*, 3, 1229–1242.

*Current Opinion in Neurobiology*, 20 (2), 177–182.

*Journal of Neurophysiology*, 61 (2), 331–349.

*Science*, 173 (3997), 652–654.

*Nature Neuroscience*, 14 (7), 926–932.

*Nature Neuroscience*, 17 (6), 858–865.

*Biometrika*, 76 (2), 297–307.

*Journal of the American Statistical Association*, 90 (430), 773–795.

*7*(6), e40216.

*9*(2), NN.

*Nature Neuroscience*, 17 (6), 876–883.

*Nature*, 390 (6657), 279–281.

*Trends in Cognitive Sciences*, 17 (8), 391–400.

*Nature Neuroscience*, 17 (3), 347–356.

*Information theory, inference, and learning algorithms*. Cambridge, UK: Cambridge University Press.

*Directional statistics*. London: John Wiley and Sons.

*The Journal of Neuroscience*, 16 (16), 5154–5167.

*Journal of Experimental Psychology: Human Perception and Performance*, 16 (2), 332–350.

*Perception & Psychophysics*, 44 (4), 369–378.

*The Annals of Statistics*, 6 (2), 461–464.

*Proceedings of the National Academy of Sciences, USA*, 90 (22), 10749–10753.

*Attention and performance*. ( pp. 277–296). Hillsdale, NJ: Erlbaum.

*Psychological Review*, 121 (1), 124–149.

*Attention, Perception, & Psychophysics*, 76 (7), 2117–2135.

*Proceedings of the National Academy of Sciences, USA*, 109 (22), 8780–8785.

*Cerebral Cortex, 17*(suppl 1), i41–i50.

*Comparative cognition: Experimental explorations of animal intelligence*. Oxford, UK: Oxford University Press.

*Nature*, 453 (7192), 233–235.

*L* of the change (1 or 2), the magnitude Δ of the change, the relevant sample orientations *θ*_{1} and *θ*_{2} (all other sample items are irrelevant to the decision), their noisy memories *x*_{1} and *x*_{2}, and the two test orientations *φ*_{1} and *φ*_{2}. Each variable has an associated probability distribution.

**Figure A1**


- Since both test locations are equally likely to contain the change, we have *p*(*L*) = 0.5.
- The change magnitude Δ and each of the sample orientations have discrete distributions, but we approximate them by uniform distributions $p(\Delta) = \frac{1}{2\pi}$ and $p(\theta_i) = \frac{1}{2\pi}$. We chose continuous uniform distributions rather than discrete distributions at the 18 presented orientations (or change magnitudes) because we think that it is unlikely that an observer learns those exact orientations (or change magnitudes); the choice of continuous uniform distributions also allows for a closed form for the decision rule.
- We assume that the noisy memories *x*_{1} and *x*_{2} are conditionally independent given the sample orientations *θ*_{1} and *θ*_{2}. Formally, *p*(*x*_{1}, *x*_{2} | *θ*_{1}, *θ*_{2}) = *p*(*x*_{1} | *θ*_{1}) *p*(*x*_{2} | *θ*_{2}).
- When the change happens in the first location (*L* = 1), then *φ*_{1} = *θ*_{1} + Δ and *φ*_{2} = *θ*_{2}. When the change happens in the second location (*L* = 2), then *φ*_{1} = *θ*_{1} and *φ*_{2} = *θ*_{2} + Δ. We can formally denote this by (*φ*_{1}, *φ*_{2}) = (*θ*_{1}, *θ*_{2}) + Δ**1**_{L}, where **1**_{L} is equal to (1, 0) when *L* = 1 and (0, 1) when *L* = 2.

*L* based on the noisy memories *x*_{1} and *x*_{2} and the test orientations *φ*_{1} and *φ*_{2}; we also assume that the observer knows the values of *κ*_{1} and *κ*_{2}. An ideal observer infers *L* by computing the posterior distribution over *L*, *p*(*L* | *x*_{1}, *x*_{2}, *φ*_{1}, *φ*_{2}). Since *L* is binary, all information about the posterior is contained in the log posterior ratio, which can be rewritten using Bayes's rule:

$$\log\frac{p(L=1 \mid x_1, x_2, \varphi_1, \varphi_2)}{p(L=2 \mid x_1, x_2, \varphi_1, \varphi_2)} = \log\frac{p(x_1, x_2 \mid L=1, \varphi_1, \varphi_2)}{p(x_1, x_2 \mid L=2, \varphi_1, \varphi_2)},$$

since *p*(*L* = 1) = *p*(*L* = 2). We evaluate the likelihood of *L* = 1 (the probability of the memories *x*_{1} and *x*_{2} if the change happened at the first location):
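Under the assumptions listed above (uniform distributions over Δ and the sample orientations, conditionally independent von Mises memories), this evaluation can be sketched as follows. The intermediate steps are a reconstruction from those assumptions rather than the authors' own presentation. The key observation is that if the change happened at location 1, the sample orientation there is *φ*_{1} − Δ with Δ unknown and uniform, so *x*_{1} carries no information, whereas *θ*_{2} = *φ*_{2} is known exactly.

```latex
\begin{align*}
p(x_1, x_2 \mid L=1, \varphi_1, \varphi_2)
  &= \left[\int_0^{2\pi} p(x_1 \mid \theta_1 = \varphi_1 - \Delta)\, p(\Delta)\, d\Delta\right]
     p(x_2 \mid \theta_2 = \varphi_2) \\
  &= \frac{1}{2\pi} \cdot \frac{e^{\kappa_2 \cos(x_2 - \varphi_2)}}{2\pi I_0(\kappa_2)}, \\[4pt]
p(x_1, x_2 \mid L=2, \varphi_1, \varphi_2)
  &= \frac{e^{\kappa_1 \cos(x_1 - \varphi_1)}}{2\pi I_0(\kappa_1)} \cdot \frac{1}{2\pi}, \\[4pt]
\log\frac{p(x_1, x_2 \mid L=1, \varphi_1, \varphi_2)}{p(x_1, x_2 \mid L=2, \varphi_1, \varphi_2)}
  &= \kappa_2 \cos(x_2 - \varphi_2) - \kappa_1 \cos(x_1 - \varphi_1)
     + \log\frac{I_0(\kappa_1)}{I_0(\kappa_2)}.
\end{align*}
```

The observer reports location 1 when this quantity is positive. Intuitively, a change at location 1 is favored when the memory at location 2 matches its test orientation well and the memory at location 1 does not.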

**Table A1**

**Figure A2**


**Figure A3**
