Abstract
Rhesus monkeys are widely used as an animal model for human memory, including visual working memory (VWM). It is, however, unknown whether the same principles govern VWM in humans and rhesus monkeys. Here, we tested both species in nearly identical change-localization paradigms and formally compared the same set of models of VWM limitations. These models include the classic item-limit model and recent noise-based (resource) models, as well as hybrid models that combine a noise-based representation with an item limit. By varying the magnitude of the change in addition to the typical set size manipulation, we were able to show large differences in goodness of fit among the five models tested. In spite of quantitative performance differences between the species, we find that the variable-precision model, a noise-based model, best describes the behavior of both species. Adding an item limit to this model does not help to account for the data. Our results suggest evolutionary continuity of VWM across primates and help establish the rhesus monkey as a model system for studying the neural substrates of multiple-item VWM.
Introduction
Understanding cognition requires understanding its limitations. While cognitive limitations have been extensively characterized in humans, a complete understanding of their neural basis requires invasive studies in nonhuman animals. It cannot, however, be blindly assumed that findings from such studies will transfer to humans. This important concern can be partially preempted by demonstrating that the cognitive behavior of humans and of nonhuman animals is best described by the same models. In other words, model-defined similarity of human and nonhuman behavior might help justify claims that invasive studies in nonhuman animals can teach us about human cognition.
Here, we pursue this goal as it pertains to visual working memory (VWM; Luck & Vogel, 2013; Ma, Husain, & Bays, 2014). VWM is limited in two aspects: time (how long memories are maintained) and content (what is remembered and how well). Monkey studies of VWM have traditionally focused on the process of maintaining a single memory item over time (Funahashi, Bruce, & Goldman-Rakic, 1989; Fuster & Alexander, 1971; Miller, Erickson, & Desimone, 1996). More recently, they have started to address issues related to VWM content, in particular the effect of the number of items in a display (Buschman, Siegel, Roy, & Miller, 2011; Elmore et al., 2011; Heyselaar, Johnston, & Paré, 2011; Lara & Wallis, 2012, 2014; Warden & Miller, 2007). However, these studies did not focus on formally comparing mathematical models of VWM limitations. Here, we quantitatively compare multiple VWM models in both humans and monkeys based on a nearly identical experimental paradigm.
VWM content limitations have traditionally been described using item-limit models (Awh, Barton, & Vogel, 2007; Cowan, 2001; Fukuda, Awh, & Vogel, 2010; Luck & Vogel, 1997; Pashler, 1988). According to these models, only a fixed number of items (the capacity) are held in memory with high quality, and no information is retained about any other items. Recently, an alternative category of models, based on human behavioral studies, has risen to prominence. In these “noise-based” or “resource” models, all items are remembered, but memories are noisy and memory precision is inversely related to the number of items (Bays & Husain, 2008; Keshvari, van den Berg, & Ma, 2013; van den Berg, Shin, Chou, George, & Ma, 2012; Wilken & Ma, 2004).
We tested fixed-capacity models against noise-based models in parallel in monkeys and in humans. We used a change localization paradigm that is similar to paradigms that have been successfully used to compare VWM models in humans (Keshvari et al., 2012, 2013; van den Berg et al., 2012). In our experimental design (Figure 1A), the subject viewed a sample array consisting of oriented bars, of which only the orientation was relevant. This display was followed by a 1-s blank screen, then by a test array consisting of two oriented bars selected from the sample array but with one having changed its orientation. The subject touched the location of the changed bar.
In addition to varying the number of sample items (set size), we varied the magnitude of the change in orientation randomly from trial to trial. While fixed-capacity and noise-based models can equally well account for observers' performance as a function of only set size, we have previously shown that the parametric variation of the magnitude of change allows one to effectively distinguish fixed-capacity from noise-based models (Keshvari et al., 2013; van den Berg et al., 2012).
For each individual monkey and human subject, we compared four leading models of VWM limitations (Figure 1B). According to the item-limit (IL) model, a fixed number of items (the capacity) are kept in memory with infinite precision, while remaining items are absent from memory (Cowan, 2001; Luck & Vogel, 1997; Pashler, 1988). The equal-precision (EP) model postulates that all items are remembered with equal precision and that precision per item decreases with increasing set size (Palmer, 1990; Shaw, 1980). Decreasing precision is associated with increasing noise; that is, at a larger set size, each item is remembered in a noisier fashion. The equal-precision-with-fixed-capacity (EPF) model is a hybrid model that combines elements of the IL and EP models: Only a fixed number of items can be remembered, but a fixed precision budget is distributed across the remembered items (Zhang & Luck, 2008). For set sizes smaller than or equal to the capacity, this model predicts that precision will decrease with increasing set size. The variable-precision (VP) model is similar to the EP model in that all items are remembered with finite precision, but precision varies from item to item and trial to trial (Fougnie, Suchow, & Alvarez, 2012; Keshvari et al., 2013; van den Berg et al., 2012).
We also tested a recently proposed hybrid model, variable precision with fixed capacity (VPF), that combines elements of the IL and VP models: Only a fixed number of items can be remembered, but precision varies randomly across items and trials (van den Berg & Ma, 2014). The four finite-precision models (EP, EPF, VP, and VPF) attribute all (EP and VP) or some (EPF and VPF) change localization errors to the difficulty of separating the signal from memory noise. For these four models, we used Bayesian inference to model the decision stage; on each trial, the observer reports the location that has the highest probability of containing the changed item (see Theory). The IL, EP, EPF, VP, and VPF models have two, two, three, three, and four free parameters, respectively.
Methods: Experiments
Monkeys
Subjects
Three adult male rhesus monkeys (Macaca mulatta; weights: M1 = 16.5 kg, M2 = 14.5 kg, M3 = 13.5 kg; ages: M1 = 17.5 years, M2 = 16.5 years, M3 = 12.5 years) were tested in a change localization experiment five days a week. Food and water were regulated prior to experimental sessions. After completing daily testing, animals were returned to their caging room, where they were housed individually and received primate chow and water to maintain their normal body weight. All animal procedures were performed in accordance with the National Institutes of Health guidelines, approved by the institutional review board of the University of Texas Health Science Center at Houston, and supervised by the Institutional Animal Care and Use Committee.
Apparatus
During experimental sessions, the monkeys were placed unrestrained in a custom-made aluminum experimental chamber (47.5 cm wide × 53.1 cm deep × 66.3 cm high). An infrared touch screen detected touch responses to a 17-in. computer monitor. The touch responses were guided using a Plexiglas template with six cutouts (each a circle with a diameter of 2.5 cm) that were arranged on an imaginary circle with a diameter of 9.0 cm, matching the six possible locations of the stimuli, and a cutout in the center for touches to a fixation point. Using a computer-controlled relay interface (Model P1012; Metrabyte, Taunton, MA), correct responses were rewarded with either a banana pellet or Tang orange drink (M1) or a banana pellet or cherry Kool-Aid (M2 and M3). The relay interface controlled the illumination of the chamber using a 25-W green light bulb located outside of the chamber. The offset of the green light illuminating the chamber through a small gap between the touch screen and the monitor marked the start of the next trial. Throughout testing, the monkeys were monitored with a video camera outside the chamber and focused through a small glass-covered port on the right side of the chamber. Experimental sessions were designed, operated, and recorded using a custom program written in Microsoft Visual Basic 6.0.
Stimuli
Stimuli consisted of 1.8 cm × 0.4 cm gray bars with a luminance of 190 cd/m^{2} displayed on a black background. Based on the average distance of the monkey from the screen (approximately 35 cm), the stimuli subtended a visual angle of approximately 2.9° × 0.65°. Stimuli were presented in six possible locations on the screen, arranged on an imaginary circle of radius 7.4° (see Apparatus).
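The quoted visual angles follow from the physical bar size and the viewing distance. As a quick check, the sketch below applies the standard formula 2·atan(size / (2·distance)); the function name is hypothetical, and the 35 cm distance is the approximate average stated in the text.

```python
import math


def visual_angle_deg(size_cm: float, distance_cm: float) -> float:
    """Visual angle (degrees) subtended by an object of a given size
    viewed head-on from a given distance."""
    return math.degrees(2.0 * math.atan(size_cm / (2.0 * distance_cm)))


if __name__ == "__main__":
    # Bar length and width at the approximate 35 cm viewing distance.
    print(round(visual_angle_deg(1.8, 35.0), 2))
    print(round(visual_angle_deg(0.4, 35.0), 2))
```

Because the distance is only an average for an unrestrained animal, the resulting angles should be read as approximate.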
Trial procedure
Each trial began with a red fixation point in the center of the screen. The monkey had to make a one-touch response to the fixation point, which initiated the presentation of a sample display. This display contained two or more items (see later), and had a duration that differed between monkeys and between training and testing (see later). After a delay of 1000 ms, the test display was presented, which always consisted of two items placed at the same locations as two items from the sample display. One test item had the same orientation as the corresponding item in the sample display, and the other test item had a different orientation. The monkey's task was to identify which item had changed and to touch that item. The test display remained on the screen until response. Correct responses were rewarded. An intertrial interval of 3 s followed the response, during which a green light illuminated the chamber and the screen was dark.
Training
Two of the monkeys that participated in this study (M2 and M3) had been previously trained in a change-localization task using clip art images and colored squares (Elmore et al., 2011). For these two monkeys, we intermixed trials of oriented bars (new stimuli) with trials of colored squares for initial task acquisition. Once the monkeys' performance on these orientation trials was similar to their baseline color-trial performance, we began training them with only orientation trials. Since M1 had not been previously trained on this task, we directly trained him with oriented bars. All three monkeys were first trained at set sizes 2 and 3, change magnitudes of 22.5°, 45°, 67.5°, and 90°, and a sample viewing time of 1000 ms. Once overall accuracy reached approximately 70%, set sizes 4 and 5 and finer change magnitudes (10° to 90° in 10° increments) were gradually introduced. Finally, we gradually reduced sample-viewing times while maintaining approximately 70% accuracy on trials with set size 2. For M1 and M3, this led to a viewing time of 300 ms, and for M2 to a viewing time of 600 ms. Total training lasted approximately 8 months.
Testing
The sample display was shown for 300 ms for M1 and M3, and 600 ms for M2. Set size was 2, 3, 4, or 5. Set sizes were pseudorandomized within each 192-trial block (48 trials per set size). The orientations of the sample items were drawn independently from a discrete uniform distribution over 18 possible orientations (−90° to 80° in increments of 10°). The orientation of the changed item in the test display was drawn from the same distribution, except that the orientations of the other sample stimuli were excluded. (This exception is unnecessary and potentially problematic because it slightly changed the statistics of the task. However, since at most four out of 18 orientations were excluded, and observers probably did not notice it, we expect the impact to be small and we did not model it.) Testing consisted of 60 sessions, with one 192-trial block per session, for a total of 11,520 trials per monkey.
Humans
Subjects
Ten human subjects (eight women, two men) aged 21–33 years (mean age = 27.1 years) participated. Each subject visited the lab for two 1.5-hr sessions and was compensated $10 per session. Study procedures were approved by the institutional review board of the University of Texas Health Science Center at Houston.
Apparatus and stimuli
Subjects were seated in a chair in a small room equipped with a computer. At the beginning of the experiment, the distance between the chair and the screen was adjusted so that the stimuli and display would subtend approximately the same visual angles as for the monkeys. Subjects were asked to maintain approximately the same distance. The monitor and touch screen were identical to those used for monkeys. Two 25-W light bulbs were mounted on the wall behind the subjects to provide feedback. Stimuli were identical to those used for monkeys.
Trial procedure
The trial procedure was identical to that for the monkeys, except for the feedback. Feedback consisted of a green light that was illuminated for 1 s and accompanied by a tone for correct responses or a red light illuminated for 1 s for incorrect responses.
Training and testing
Each subject completed two testing sessions, each consisting of three 192-trial blocks, for a total of 1,152 trials per subject. Subjects were given a 10-min break in between blocks. Each subject completed eight practice trials at the beginning of the first session.
Theory
We compared five models of behavior in this task. In the IL model, noise does not play a role. In the other four models, noise does play a role, which will require a model for how subjects integrate information from noisy measurements.
For simplicity, we mapped orientation space to the interval [0, 2π) by multiplying all orientations and orientation-change magnitudes by 2 before analysis. All equations in this article are consistent with this convention, but orientations and orientation changes in the figures are back in actual orientation space.
IL model
In the IL model (Cowan, 2001; Luck & Vogel, 1997; Pashler, 1988), observers cannot store more than K items. When N ≤ K, all items are stored. The probability of being correct is then 1 − ε, where ε accounts for lapses of attention and unintended responses. When N > K, K randomly selected items from the sample display are stored. When the test display appears, there are three scenarios to consider:

Both test items correspond to stored sample items. This happens with probability K(K − 1)/[N(N − 1)]. The probability of being correct is then 1 − ε.

One test item corresponds to a stored sample item and the other does not. This happens with probability 2K(N − K)/[N(N − 1)]. The probability of being correct is then 1 − ε.

Neither test item corresponds to a stored sample item. This happens with probability (N − K)(N − K − 1)/[N(N − 1)]. The observer then has to guess about which item changed, and the probability of being correct is 0.5.
The overall proportion correct is then

$$
\mathrm{PC}(N) =
\begin{cases}
1 - \varepsilon, & N \le K, \\[4pt]
(1 - \varepsilon)\left[1 - \dfrac{(N-K)(N-K-1)}{N(N-1)}\right] + \dfrac{1}{2}\,\dfrac{(N-K)(N-K-1)}{N(N-1)}, & N > K.
\end{cases}
$$

Storing all N items (K = N) yields the same proportion correct, namely 1 − ε, as storing only N − 1 items, since even if one test item is not stored, the trial can be answered correctly by using the other test item. As can be seen from the equation, in the IL model the proportion correct depends on set size but not on change magnitude.
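The three scenarios can be collapsed into a few lines of code. The sketch below is a minimal illustration, not the authors' implementation: it assumes the K stored items are a uniformly random subset of the N sample items and that the two test items are a random pair, so the probability that neither test item is stored is (N − K)(N − K − 1)/[N(N − 1)]. The function and argument names are hypothetical.

```python
def il_proportion_correct(n_items: int, capacity: int, lapse: float) -> float:
    """Predicted proportion correct under the item-limit (IL) model.

    Assumes n_items >= 2 (two test items are always shown).
    """
    stored = min(capacity, n_items)
    # Probability that neither of the two test items is in memory.
    p_neither = (n_items - stored) * (n_items - stored - 1) / (n_items * (n_items - 1))
    # One stored test item is enough to answer correctly (up to lapses);
    # otherwise the observer guesses between the two locations.
    return (1.0 - p_neither) * (1.0 - lapse) + p_neither * 0.5


if __name__ == "__main__":
    for n in (2, 3, 4, 5):
        print(n, il_proportion_correct(n, capacity=2, lapse=0.05))
```

Note that the prediction has no argument for change magnitude, which is what makes this model distinguishable from the noise-based models when change magnitude is varied.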
Noisebased models
We now turn to models in which VWM is noisy (Ma et al., 2014; Wilken & Ma, 2004). We assume that both orientations in the test display, which we denote φ_{1} and φ_{2}, are known noiselessly to the observer, because they remain on the screen until the subject responds. We model the memories of the orientations in the sample display as noisy. Noise can stem from encoding (presentation time was limited) or maintenance of memories; we do not distinguish between these sources. We model the noisy memory of the ith item in the sample display, denoted x_{i} (i = 1, …, N), as following a von Mises distribution (a circular analog of a Gaussian distribution, used because orientation space is periodic) centered at the true stimulus θ_{i} with concentration parameter κ_{i}:

$$
p(x_i \mid \theta_i) = \frac{1}{2\pi I_0(\kappa_i)}\, e^{\kappa_i \cos(x_i - \theta_i)}, \tag{1}
$$

where I_{0} is the modified Bessel function of the first kind of order 0 (Mardia & Jupp, 1999). The concentration parameter controls the width of the noise distribution, and the Bessel function serves as a normalization. We have postulated previously that the role of precision is played by the Fisher information in this memory representation, denoted J_{i} (Keshvari et al., 2013; van den Berg et al., 2012). Fisher information determines the best possible performance of any unbiased estimator through the Cramér–Rao bound (Cover & Thomas, 1991). When the measurement x follows a Gaussian distribution, Fisher information is equal to inverse variance, J = 1/σ^{2}. When neural variability is Poisson-like, Fisher information is proportional to the gain of a population (Seung & Sompolinsky, 1993). Thus, our choice of using Fisher information for precision is consistent with an interpretation of neural activity as “memory resource” (Bays, 2014; Ma et al., 2014; van den Berg et al., 2012). For Equation 1, Fisher information is related to the concentration parameter through

$$
J_i = \kappa_i\, \frac{I_1(\kappa_i)}{I_0(\kappa_i)}, \tag{2}
$$

where I_{1} is the modified Bessel function of the first kind of order 1. The relationship between precision and the concentration parameter is nearly the identity mapping, and none of our results would qualitatively change if we were to replace J_{i} with κ_{i}.
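To make the mapping between precision and concentration concrete, the sketch below evaluates the standard von Mises relation J = κ I_1(κ)/I_0(κ) and inverts it numerically. It is an illustration under stated assumptions: the Bessel functions are computed from their integral representation with the trapezoid rule (standard library only), the inverse uses bisection, and all function names are hypothetical.

```python
import math


def bessel_i_scaled(order: int, kappa: float, steps: int = 2000) -> float:
    """exp(-kappa) * I_order(kappa), from the integral representation
    I_n(k) = (1/pi) * integral_0^pi exp(k cos t) cos(n t) dt."""
    h = math.pi / steps
    total = 0.0
    for j in range(steps + 1):
        t = j * h
        weight = 0.5 if j in (0, steps) else 1.0  # trapezoid end points
        total += weight * math.exp(kappa * (math.cos(t) - 1.0)) * math.cos(order * t)
    return total * h / math.pi


def fisher_information(kappa: float) -> float:
    """Precision J for a von Mises memory with concentration kappa."""
    return kappa * bessel_i_scaled(1, kappa) / bessel_i_scaled(0, kappa)


def kappa_from_precision(j_target: float, tol: float = 1e-6) -> float:
    """Invert J(kappa) by bisection; J is increasing in kappa."""
    lo, hi = 0.0, 1.0
    while fisher_information(hi) < j_target:
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if fisher_information(mid) < j_target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    for kappa in (0.5, 2.0, 10.0, 50.0):
        print(kappa, fisher_information(kappa))
```

For large κ the two quantities differ by roughly one half, while at small κ the precision grows approximately as κ²/2.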
In the EP model (Bays & Husain, 2008; Palmer, 1990), the precision of each item is inversely related to set size through a power law with power α:

$$
J = J_{N=1}\, N^{\alpha},
$$

where J_{N=1} is the precision with which a single item is stored. The precision of all items in a display is equal.
In the EPF model (also known as slots-plus-resources; Zhang & Luck, 2008), no more than K items can be stored. Thus the number of stored items is min(N, K). The precision of a stored item is inversely related to the number of stored items through a power law with power α:

$$
J = J_{N=1}\, \min(N, K)^{\alpha}.
$$

The precision associated with a nonstored item is zero. When N ≤ K, the EPF model is equal to the EP model. The slots-plus-averaging model is very similar to this model (as was quantitatively shown by van den Berg, Awh, & Ma, 2014).
In the VP model (Fougnie et al., 2012; Keshvari et al., 2013; van den Berg et al., 2012), precision exhibits fluctuations across both space and time. To be concrete, we assume that the precision values associated with the N items are drawn independently from a gamma distribution (a flexible family of distributions on the positive real line) with mean J̄ and scale parameter τ. We further assume that its mean is inversely related to set size through a power law with power α:

$$
\bar{J} = \bar{J}_{N=1}\, N^{\alpha},
$$

where J̄_{N=1} is the mean precision of a single item (Figure 2).
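The encoding stage of the VP model can be sketched as a draw from a gamma distribution whose mean shrinks with set size. In the sketch below, a gamma distribution with mean J̄ and scale τ is parameterized by shape J̄/τ; the exponent name `alpha` and its sign convention (negative when precision decreases with set size) are assumptions made for illustration, as are the function names and example parameter values.

```python
import random


def mean_precision(set_size: int, j1_mean: float, alpha: float) -> float:
    """Power-law dependence of mean precision on set size
    (alpha < 0 gives decreasing precision; an assumed convention)."""
    return j1_mean * set_size ** alpha


def sample_precisions(set_size, j1_mean, alpha, tau, rng):
    """Draw one precision value per item, independently across items."""
    shape = mean_precision(set_size, j1_mean, alpha) / tau
    # random.gammavariate takes (shape, scale); its mean is shape * scale.
    return [rng.gammavariate(shape, tau) for _ in range(set_size)]


if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_precisions(4, j1_mean=40.0, alpha=-1.0, tau=5.0, rng=rng))
```

Setting τ close to zero makes the draws cluster tightly around the mean, which approaches the equal-precision case.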
The VPF model (van den Berg & Ma, 2014) is equal to the VP model, but the number of stored items is min(N, K); thus, no more than K items can be stored. The precision of a stored item is again drawn from a gamma distribution with mean J̄ and scale parameter τ, and the mean is inversely related to the number of stored items through a power law with power α:

$$
\bar{J} = \bar{J}_{N=1}\, \min(N, K)^{\alpha}.
$$

The precision associated with a nonstored item is zero.
The IL, EP, EPF, VP, and VPF models have two, two, three, three, and four free parameters, respectively.
Decision rules
So far, we have described the encoding stage: how stimuli give rise to noisy memories. What is also needed in each of the noise-based models is a description of how the observer makes the two-alternative change-localization decision based on the noisy memories and the test display. We use an ideal (Bayesian) observer to describe this process. The resulting decision rule is similar to the ideal-observer models of related N-alternative change localization and change-detection tasks (Keshvari et al., 2012, 2013; van den Berg et al., 2012), but differs in the details.
We begin by describing the decision process for the EP and VP models. The relevant variables are the location L of the change (1 or 2), the magnitude Δ of the change, the relevant sample orientations θ_{1} and θ_{2} (all other sample items are irrelevant to the decision), their noisy memories x_{1} and x_{2}, and the two test orientations φ_{1} and φ_{2}. The ideal observer responds that the change occurred at location 1 when the log posterior ratio is positive:

$$
\log\frac{p(L = 1 \mid x_1, x_2, \varphi_1, \varphi_2)}{p(L = 2 \mid x_1, x_2, \varphi_1, \varphi_2)}
= \kappa_2 \cos(x_2 - \varphi_2) - \kappa_1 \cos(x_1 - \varphi_1) + \log\frac{I_0(\kappa_1)}{I_0(\kappa_2)} > 0. \tag{3}
$$

The derivation of this decision rule can be found in Appendix A; we have assumed that the observer knows the values of κ_{1} and κ_{2} on each trial. The decision rule is valid for both the VP and EP models. In the VP model, precision per item is a random variable, and therefore κ_{1} and κ_{2} will generally not be equal to each other. However, in the case of the EP model, we have κ_{1} = κ_{2} and the inequality simplifies to

$$
\cos(x_2 - \varphi_2) > \cos(x_1 - \varphi_1). \tag{4}
$$

This rule is intuitive: The observer reports that the change occurred at location 1 when the angular distance between the noisy memory at location 2 and the test orientation at location 2 is smaller than the corresponding distance at location 1 (and thus the cosine is larger). There is then more evidence that the change occurred at location 1. One can think of Equation 3 as a precision-weighted version of Equation 4.
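A precision-weighted comparison of this kind can be written as a short function. The sketch below is an illustration, not the derivation in Appendix A: it assumes equal prior probability for the two locations and a change magnitude that is uniform on the circle, under which each location's evidence for "no change here" is κ cos(x − φ) − log I_0(κ). The function names are hypothetical, and log I_0 is computed from its integral representation using only the standard library.

```python
import math


def log_bessel_i0(kappa: float, steps: int = 2000) -> float:
    """log I_0(kappa), computed stably as kappa + log(exp(-kappa) I_0(kappa))."""
    h = math.pi / steps
    total = 0.0
    for j in range(steps + 1):
        weight = 0.5 if j in (0, steps) else 1.0  # trapezoid end points
        total += weight * math.exp(kappa * (math.cos(j * h) - 1.0))
    return kappa + math.log(total * h / math.pi)


def change_at_location_1(x1, x2, phi1, phi2, kappa1, kappa2) -> bool:
    """True if the observer reports location 1 as the changed item.

    A memory that matches its test orientation is evidence that the
    change happened at the *other* location.
    """
    match_2 = kappa2 * math.cos(x2 - phi2) - log_bessel_i0(kappa2)
    match_1 = kappa1 * math.cos(x1 - phi1) - log_bessel_i0(kappa1)
    return match_2 > match_1


if __name__ == "__main__":
    # Equal precision: memory 1 is far from test 1, memory 2 matches test 2.
    print(change_at_location_1(1.0, 0.0, 0.0, 0.0, 5.0, 5.0))
```

With equal concentrations the Bessel terms cancel and the rule reduces to a plain comparison of the two cosines. Setting a concentration to 0 removes that memory's contribution, which is how a nonstored item would be represented.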
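The EPF model is very similar to the EP model, but with one difference when N > K. Then, a noisy measurement has a probability of not being stored. This is equivalent to setting the concentration parameter of the corresponding memory to 0. Thus, we can immediately obtain the decision rule for the EPF model by taking special cases of Equation 3. Writing κ for the common concentration parameter of the stored items and using I_{0}(0) = 1, the observer reports location 1 when

$$
\begin{cases}
\cos(x_2 - \varphi_2) > \cos(x_1 - \varphi_1) & \text{both items stored}, \\
\kappa \cos(x_1 - \varphi_1) < \log I_0(\kappa) & \text{only item 1 stored}, \\
\kappa \cos(x_2 - \varphi_2) > \log I_0(\kappa) & \text{only item 2 stored}, \\
\text{random guess (probability 0.5)} & \text{neither item stored}.
\end{cases} \tag{5}
$$

The second and third inequalities may seem counterintuitive, since they only involve one memory. However, they make sense: Even when the observer has only the memory corresponding to one of the test items, the discrepancy between the memory and the test is still informative about whether or not the change occurred in that one item.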
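The VPF model is identical to the VP model when N ≤ K. When N > K, just as in the EPF model, a noisy measurement has a probability of not being stored (precision = 0). But unlike in the EPF model, the concentration parameters κ_{1} and κ_{2} in the VPF model are independent. With these modifications, we can again take the special cases of Equation 3 and obtain the decision rules for the VPF model. The observer reports location 1 when

$$
\begin{cases}
\kappa_2 \cos(x_2 - \varphi_2) - \log I_0(\kappa_2) > \kappa_1 \cos(x_1 - \varphi_1) - \log I_0(\kappa_1) & \text{both items stored}, \\
\kappa_1 \cos(x_1 - \varphi_1) < \log I_0(\kappa_1) & \text{only item 1 stored}, \\
\kappa_2 \cos(x_2 - \varphi_2) > \log I_0(\kappa_2) & \text{only item 2 stored}, \\
\text{random guess (probability 0.5)} & \text{neither item stored}.
\end{cases} \tag{6}
$$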
Model predictions
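If we had access to the observer's noisy memories x_{1} and x_{2} on each trial, the models would predict the observer's response exactly. Since we do not know x_{1} and x_{2}, the best we can do is to compute the probability of being correct for a given stimulus condition. Under the assumptions in our generative model, the stimulus condition is determined completely by set size N and change magnitude Δ, and the values of θ_{1} and θ_{2} are irrelevant. Thus, we are interested in the probability that the decision rule (Equation 3 for VP, Equation 4 for EP, Equation 5 for EPF, and Equation 6 for VPF) returns the correct location when the memories x_{1} and x_{2} follow their model-specific distributions given N and Δ. Without loss of generality, we compute the proportion correct by taking θ_{1} = θ_{2} = 0 and L = 1, so that φ_{1} = Δ and φ_{2} = 0.
For the EP model, then,

$$
\mathrm{PC}_{\mathrm{EP}}(N, \Delta) = \Pr\big(\cos x_2 > \cos(x_1 - \Delta);\; x_1 \sim \mathrm{VM}(0, \kappa),\; x_2 \sim \mathrm{VM}(0, \kappa)\big),
$$

where VM(μ, κ) denotes the von Mises distribution with mean μ and concentration parameter κ, and we use the notation Pr(statement involving X; X ∼ distribution) to indicate the probability that the statement is true when X follows the given distribution.
For the VP model, both the decision rule and the distributions of x_{1} and x_{2} are different:

$$
\mathrm{PC}_{\mathrm{VP}}(N, \Delta) = \Pr\big(\kappa_2 \cos x_2 - \log I_0(\kappa_2) > \kappa_1 \cos(x_1 - \Delta) - \log I_0(\kappa_1);\; x_i \sim \mathrm{VM}(0, \kappa_i),\; J_i \sim \mathrm{Gamma}(\bar{J}/\tau, \tau)\big),
$$

where κ_{i} is related to J_{i} through Equation 2, and Gamma(J̄/τ, τ) denotes the gamma distribution with shape J̄/τ and scale τ (and thus mean J̄).
For the EPF model, the proportion correct is computed as a sum across the four possibilities for which items were stored (see Equation 5). Writing K′ = min(N, K), p_{12} = K′(K′ − 1)/[N(N − 1)] for the probability that both test items are stored, p_{1} = p_{2} = K′(N − K′)/[N(N − 1)] for the probability that only item 1 or only item 2 is stored, and p_{0} = (N − K′)(N − K′ − 1)/[N(N − 1)] for the probability that neither is stored,

$$
\mathrm{PC}_{\mathrm{EPF}}(N, \Delta) = p_{12} \Pr\big(\cos x_2 > \cos(x_1 - \Delta)\big) + p_1 \Pr\big(\kappa \cos(x_1 - \Delta) < \log I_0(\kappa)\big) + p_2 \Pr\big(\kappa \cos x_2 > \log I_0(\kappa)\big) + \tfrac{1}{2}\, p_0,
$$

with each stored memory distributed as VM(0, κ).
For the VPF model, the proportion correct is computed as a sum across the four possibilities for which items were stored (see Equation 6), with the same probabilities p_{12}, p_{1}, p_{2}, and p_{0}:

$$
\mathrm{PC}_{\mathrm{VPF}}(N, \Delta) = p_{12} \Pr\big(\kappa_2 \cos x_2 - \log I_0(\kappa_2) > \kappa_1 \cos(x_1 - \Delta) - \log I_0(\kappa_1)\big) + p_1 \Pr\big(\kappa_1 \cos(x_1 - \Delta) < \log I_0(\kappa_1)\big) + p_2 \Pr\big(\kappa_2 \cos x_2 > \log I_0(\kappa_2)\big) + \tfrac{1}{2}\, p_0,
$$

where each stored memory is distributed as VM(0, κ_{i}) and the precision J_{i} of each stored item is drawn independently from the gamma distribution of the VPF model.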
Each of these proportions correct was determined through Monte Carlo simulation. For each (N, Δ) combination, we drew 10,000 random samples of x_{1} and x_{2} (and in the case of the VP and VPF models, of J_{1} and J_{2} first). For each sample, we evaluated the decision rule and then computed the proportion of correct responses across all samples.
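A Monte Carlo estimate of this kind can be sketched for the simplest case, the EP model, where both memories share one concentration parameter and the decision rule reduces to comparing two cosines. The sketch assumes the change is at location 1 (so the test orientation there differs from the sample orientation by Δ while location 2 is unchanged) and takes κ directly as an input rather than deriving it from J. The function name and the seed argument are hypothetical.

```python
import math
import random


def ep_proportion_correct(delta: float, kappa: float,
                          n_samples: int = 10_000, seed: int = 0) -> float:
    """Monte Carlo estimate of proportion correct for the EP model.

    delta is the change magnitude in radians (after the doubling of
    orientation space); both sample orientations are set to 0.
    """
    rng = random.Random(seed)
    n_correct = 0
    for _ in range(n_samples):
        x1 = rng.vonmisesvariate(0.0, kappa)  # noisy memory of changed item
        x2 = rng.vonmisesvariate(0.0, kappa)  # noisy memory of unchanged item
        # Report location 1 when memory 2 matches its test item better.
        if math.cos(x2 - 0.0) > math.cos(x1 - delta):
            n_correct += 1
    return n_correct / n_samples


if __name__ == "__main__":
    for delta_deg in (10, 30, 50, 70, 90):
        delta = math.radians(2 * delta_deg)  # doubled orientation space
        print(delta_deg, ep_proportion_correct(delta, kappa=3.0))
```

Repeating this for every (N, Δ) pair and every grid value of the parameters yields one table entry per combination; the VP and VPF cases would add a draw of the precisions before the memories are sampled.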
Finally, for each model, we discretized parameter space (Table A1) and calculated a lookup table in which each entry gave the predicted probability of a correct response at one (N, Δ) combination for one parameter combination.
Methods: Model fitting and model comparison
Model fitting
Denoting all parameters of a model by a vector t, the log likelihood of t (the parameter log likelihood) is

$$
\mathrm{LL}(\mathbf{t}) = \log p(\text{data} \mid \mathbf{t}, \text{model}) = \log \prod_{i=1}^{n_{\text{trials}}} p(\text{correctness}_i \mid N_i, \Delta_i, \mathbf{t}),
$$

where the product is over trials (from 1 to n_{trials}) and correctness_{i} is 1 if the subject was correct on the ith trial and 0 if not. We can rewrite this as

$$
\mathrm{LL}(\mathbf{t}) = \sum_{N,\, \Delta,\, \text{correctness}} n(N, \Delta, \text{correctness}) \, \log p(\text{correctness} \mid N, \Delta, \mathbf{t}), \tag{8}
$$

where we grouped trials by set size N, change magnitude Δ, and whether the observer was correct or incorrect, and n(N, Δ, correctness) is the number of trials with a particular N, Δ, and correctness.
For each subject data set, we used Equation 8 and the precomputed lookup table of model predictions mentioned before to find the log likelihood of each parameter combination. The parameter combination on this grid that maximized the log likelihood gave the estimates of the parameters. The model predictions corresponding to that parameter combination were then used to compute the model fits to the psychometric curves. We denote the maximum of the parameter log likelihood LL(t) by LL_{max}.
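The grid-based fit amounts to scoring every row of the lookup table against the trial counts and keeping the best one. The sketch below illustrates this with hypothetical data structures: `counts` maps each (N, Δ) condition to the numbers of correct and incorrect trials, and `lookup` maps each parameter combination to its predicted proportion correct per condition. Clipping the predictions away from 0 and 1 is an added safeguard against log(0), not something described in the text.

```python
import math


def log_likelihood(counts, predicted_pc):
    """Grouped binomial log likelihood.

    counts:       {(N, delta): (n_correct, n_incorrect)}
    predicted_pc: {(N, delta): predicted probability of a correct response}
    """
    total = 0.0
    for condition, (n_correct, n_incorrect) in counts.items():
        p = min(max(predicted_pc[condition], 1e-12), 1.0 - 1e-12)
        total += n_correct * math.log(p) + n_incorrect * math.log(1.0 - p)
    return total


def fit_on_grid(counts, lookup):
    """Return (best parameter combination, LL_max) over the lookup table."""
    best = max(lookup, key=lambda params: log_likelihood(counts, lookup[params]))
    return best, log_likelihood(counts, lookup[best])


if __name__ == "__main__":
    counts = {(2, 10): (8, 2), (2, 90): (10, 0)}
    lookup = {
        ("candidate A",): {(2, 10): 0.5, (2, 90): 0.6},
        ("candidate B",): {(2, 10): 0.8, (2, 90): 0.95},
    }
    print(fit_on_grid(counts, lookup))
```

Because the table is precomputed, fitting a new data set only requires the counts, which is what makes refitting many resampled data sets affordable.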
Bayesian model comparison
To compare models, we used Bayesian model comparison (MacKay, 2003), which should not be confused with the Bayesian observer model that we used earlier. Bayesian model comparison is based on the log marginal likelihood of a model given the data: LML(model) = log p(data | model). The attribute “marginal” refers to an integration (marginalization) over the parameters:

$$
\mathrm{LML}(\text{model}) = \log \int p(\text{data} \mid \mathbf{t}, \text{model})\, p(\mathbf{t} \mid \text{model})\, d\mathbf{t} = \log \int e^{\mathrm{LL}(\mathbf{t})}\, p(\mathbf{t} \mid \text{model})\, d\mathbf{t}.
$$

For the parameter prior p(t | model), we chose a product of uniform distributions (one for each parameter), with their domains just covering the grid used for model predictions and parameter estimation (see before). We denote the size of the range of the jth parameter by R_{j}, where j = 1, …, k. We also peak-normalize the exponential term so as to avoid highly negative numbers in the exponent, which could cause numerical underflow; we add a correction to compensate for this. This gives

$$
\mathrm{LML}(\text{model}) = \mathrm{LL}_{\max} - \sum_{j=1}^{k} \log R_j + \log \int e^{\mathrm{LL}(\mathbf{t}) - \mathrm{LL}_{\max}}\, d\mathbf{t}.
$$

Finally, we approximate the integral through a Riemann sum (grid sum) over the same grid as used for model predictions and parameter estimation (see earlier). We denote the grid spacing of the jth parameter by δt_{j}. This leads to the equation we actually implemented:

$$
\mathrm{LML}(\text{model}) \approx \mathrm{LL}_{\max} + \sum_{j=1}^{k} \log \frac{\delta t_j}{R_j} + \log \sum_{\mathbf{t} \in \text{grid}} e^{\mathrm{LL}(\mathbf{t}) - \mathrm{LL}_{\max}}.
$$

The difference of the log marginal likelihood between two models is also called the log Bayes factor of those two models (Kass & Raftery, 1995).
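The peak normalization and grid sum can be illustrated in a few lines. The sketch below assumes a uniform prior over a rectangular grid and takes as input the log likelihood at every grid point, the range of each parameter, and the grid spacing of each parameter; the function and argument names are hypothetical.

```python
import math


def log_marginal_likelihood(grid_log_likelihoods, ranges, spacings):
    """Riemann-sum approximation of the log marginal likelihood.

    grid_log_likelihoods: LL(t) at every grid point (flattened)
    ranges:               size R_j of each parameter's prior range
    spacings:             grid spacing delta_t_j of each parameter
    """
    ll_max = max(grid_log_likelihoods)
    # Subtracting the peak keeps every exponent <= 0, avoiding underflow
    # of the whole sum; ll_max is added back outside the logarithm.
    normalized_sum = sum(math.exp(ll - ll_max) for ll in grid_log_likelihoods)
    return (ll_max
            + math.log(normalized_sum)
            + sum(math.log(d) for d in spacings)
            - sum(math.log(r) for r in ranges))


if __name__ == "__main__":
    # One parameter, five grid points with spacing 1 over a range of 5.
    print(log_marginal_likelihood([-1000.0, -1001.0, -1003.0, -1006.0, -1010.0],
                                  ranges=[5.0], spacings=[1.0]))
```

Without the subtraction of the peak, exponentiating log likelihoods on the order of −1000 would return exactly zero in double precision, and the logarithm would fail.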
Numerical values of the ranges R_{j} are specified in Table A1. Our choices for these ranges were initially guided by parameter estimates from previous publications (Keshvari et al., 2013; van den Berg et al., 2012). These ranges worked well for our human data. In the monkey data, however, we noticed that the parameter estimates of J̄_{N=1} and τ tended to be much smaller than the upper limits of these ranges. Since the computational time required for numerically evaluating the parameter likelihood in the Riemann sum is determined by the number of grid points, we reduced the ranges for those parameters so that, keeping the number of grid values within each range constant, we could obtain a finer resolution for our parameter estimates. This more efficient use of computational resources comes at the cost of no longer being able to interpret the uniform distribution p(t | model) as a prior, because it is now (albeit weakly) informed by the data. We will comment on the consequences of this choice in the Results.
Parameter recovery and model recovery
To validate our methods, we applied them to data sets for which we knew the ground truth, namely synthetic data sets generated using one of the models. For each of the five models, we generated 10 synthetic data sets by independently drawing the parameter values from uniform distributions on the ranges specified in the “Humans” column of Table A1. For each of these 50 data sets, we fitted all five models and computed their LMLs. We found that in each of the 50 data sets and for each of the three metrics (LML, AICc*, and BIC*; the latter two are described under Other model comparison metrics), the model that was used to generate the data scored highest. In addition, for the correct model the parameter estimates were close to the parameters that were used to generate the synthetic data, with the exception of J̄_{N=1} and τ in the VP and VPF models. Those parameters were sometimes both overestimated or both underestimated, indicating that the data were approximately equally well fitted by a lower J̄_{N=1} and a lower τ as by a higher J̄_{N=1} and a higher τ. We conclude that we can trust the model comparison results but that the estimates of J̄_{N=1} and τ should be taken with a grain of salt.
Other model comparison metrics
We also used two other model comparison metrics, which are based not on marginalizing over the parameters but solely on the maximum of the parameter log likelihood LL_{max}, with a correction for the number k of free parameters in the model. These metrics are the corrected Akaike information criterion

$$
\mathrm{AICc} = -2\, \mathrm{LL}_{\max} + 2k + \frac{2k(k+1)}{n_{\text{trials}} - k - 1}
$$

(Akaike, 1974; Hurvich & Tsai, 1989) and the Bayesian information criterion

$$
\mathrm{BIC} = -2\, \mathrm{LL}_{\max} + k \log n_{\text{trials}}
$$

(Schwarz, 1978). In order to make these metrics comparable in magnitude to the log marginal likelihood LML(model), we report each of them multiplied by −0.5, so that the leading term is LL_{max}: AICc* = −0.5AICc and BIC* = −0.5BIC.
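The scaled metrics are simple functions of LL_{max}, the number of free parameters, and the number of trials. The sketch below uses the standard definitions of AICc and BIC and multiplies each by −0.5; the function names and the example value of LL_{max} are hypothetical.

```python
import math


def aicc_star(ll_max: float, k: int, n_trials: int) -> float:
    """-0.5 * AICc, so that the leading term is LL_max."""
    return ll_max - k - k * (k + 1) / (n_trials - k - 1)


def bic_star(ll_max: float, k: int, n_trials: int) -> float:
    """-0.5 * BIC, so that the leading term is LL_max."""
    return ll_max - 0.5 * k * math.log(n_trials)


if __name__ == "__main__":
    # Example: a three-parameter model fitted to 1,152 trials.
    print(aicc_star(-600.0, 3, 1152), bic_star(-600.0, 3, 1152))
```

On this scale, higher values indicate a better model, and with thousands of trials the BIC* penalty per parameter is several times larger than the AICc* penalty.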
Bootstrapping
Since we had only three monkey subjects, we used bootstrapping (Efron, 1993) for each monkey separately to estimate the standard errors on all summary statistics. The original data set for each monkey consisted of 11,520 rows (each row represented a trial) and three columns (set size, change magnitude of the changed item, and whether the trial was correct or incorrect). We sampled the rows (trials) with replacement from the original data set to create 11,520-trial bootstrapped data sets. We repeated this process to create 100 bootstrapped data sets for each monkey. For each bootstrapped data set, we estimated the parameters, computed psychometric curves, calculated R^{2}, and computed AICc*, BIC*, and LML. The mean of each of these was computed by averaging across all bootstrapped data sets from the same monkey, and the standard deviations served as estimates of the standard errors of the means.
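The resampling procedure can be sketched generically: draw as many trials as the original data set contains, with replacement, recompute the statistic, and repeat. In the sketch below the function name, the `statistic` callback, and the toy data are hypothetical; in the analysis described above the statistic would be a fitted parameter, a psychometric curve value, or a model comparison metric.

```python
import random
import statistics


def bootstrap_standard_error(trials, statistic, n_boot=100, seed=0):
    """Return (mean, standard deviation) of a statistic across
    bootstrapped data sets of the same size as the original."""
    rng = random.Random(seed)
    n = len(trials)
    estimates = []
    for _ in range(n_boot):
        # Sample rows (trials) with replacement.
        resample = [trials[rng.randrange(n)] for _ in range(n)]
        estimates.append(statistic(resample))
    return statistics.mean(estimates), statistics.stdev(estimates)


if __name__ == "__main__":
    # Toy rows: (set size, change magnitude in degrees, correct).
    data_rng = random.Random(1)
    rows = [(2, 10 * data_rng.randint(1, 9), int(data_rng.random() < 0.7))
            for _ in range(2000)]
    accuracy = lambda sample: sum(row[2] for row in sample) / len(sample)
    print(bootstrap_standard_error(rows, accuracy))
```

The standard deviation across resamples is used as the standard error because each resample mimics a fresh draw of the same number of trials from the same observer.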
Results
Data
For both species, the proportion correct decreased monotonically as a function of set size, with humans being substantially more accurate than monkeys (Figure 3A). A more detailed representation of the data is provided by the proportion correct as a function of change magnitude for each of the four set sizes (Figure 3B, C). We found large effects of both set size and change magnitude on VWM performance in both species (humans: two-way repeated-measures ANOVA; set size: F(3, 27) = 64.05, p < 0.001; change magnitude: F(8, 72) = 80.36, p < 0.001).
Model fitting
We used maximum-likelihood estimation to fit the parameters in each model. For humans, we fitted the data of individual subjects. For each monkey, we fitted the individual data sets that we sampled using bootstrapping from the monkey's raw data (this gives error bars on parameter estimates). Parameter estimates are given in Appendix B (Table A1). Model fits to the monkeys' actual data (without bootstrapping) are given in Appendix C.
Model comparison
In spite of the large performance differences between species, it is possible that the underlying VWM mechanisms are the same. To test this possibility, we compared the four leading models of VWM limitations as well as a new hybrid model (VPF) for each individual monkey and human. We first used Bayesian model comparison, a likelihood-based method that automatically corrects for the number of free parameters (see Methods: Model fitting and model comparison). We found that the mean log marginal likelihoods of the VP and VPF models exceed those of the EPF, EP, and IL models for both species (Figure 4; Table 1); the VP and VPF models are not distinguishable. Moreover, the results are highly consistent across individual monkey and human subjects. Model comparison results on the monkeys' actual data (without bootstrapping) are given in Appendix C.
Table 1 Model comparison. Notes: For model comparison metrics, we use scaled versions of the AICc and BIC defined by AICc* = −0.5AICc and so on, so that the leading term is the maximum log likelihood LL_{max} and these measures can be compared directly to the log marginal likelihood (LML). Values shown are the mean differences in the model comparison metrics between the IL, EP, EPF, and VPF models on the one hand and the VP model on the other hand. A negative value means that the VP model fits better. The standard error of the mean is the same between the AICc* and BIC* because these measures differ only in their penalty terms.
Under Methods: Model fitting and model comparison, we commented on the choices of the parameter ranges R_{j} in Equation 9, which for monkey subjects were (weakly) informed by the data. Fortunately, our qualitative results are reasonably robust to these choices. If we assume that the likelihood is zero outside the narrower [0, 30] ranges chosen for monkey subjects for the mean-precision and τ parameters in the VP model, then changing these ranges to the wider [0, 100] ranges we used in humans reduces the VP log marginal likelihoods by 2 log(100) − 2 log(30) = 2.4, which would not change the finding that VP outperforms IL, EP, and EPF. Moreover, changing the [0, 30] ranges of the same two parameters in the VPF model for monkeys to the wider [0, 200] ranges we chose in humans reduces the VPF log marginal likelihoods by 3.8, which would not change the finding that VP and VPF are indistinguishable. This robustness to the choice of parameter ranges arises because our differences in log marginal likelihoods are largely driven by the LL_{max} term.
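The two range corrections quoted above can be checked with a short calculation. The sketch assumes uniform priors over each parameter range and a likelihood that vanishes outside the narrower range, so that widening a range changes only the prior normalization; the function name is illustrative.

```python
import math

def lml_shift_from_wider_ranges(narrow, wide, n_params):
    """Change in log marginal likelihood when uniform prior ranges widen.

    Assumes the likelihood is zero outside the narrow range, so only the
    prior normalization 1/width changes for each affected parameter.
    """
    narrow_width = narrow[1] - narrow[0]
    wide_width = wide[1] - wide[0]
    return -n_params * math.log(wide_width / narrow_width)

# Two parameters per model are affected by the change of range.
vp_shift = lml_shift_from_wider_ranges((0, 30), (0, 100), n_params=2)
vpf_shift = lml_shift_from_wider_ranges((0, 30), (0, 200), n_params=2)

# Reductions in log marginal likelihood for VP and VPF, respectively.
print(round(-vp_shift, 1), round(-vpf_shift, 1))  # prints 2.4 3.8
```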
Our results for both monkeys and humans also remain unchanged when we use AICc or BIC as alternative model comparison metrics (Table 1). Unlike the log marginal likelihood, these metrics do not depend on parameter ranges. Again, this consistency follows from the model differences being dominated by differences in the LL_{max} term.
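For concreteness, the scaled metrics can be written out as below. This sketch assumes the standard definitions of AICc (Hurvich & Tsai, 1989) and BIC (Schwarz, 1978), with k free parameters and n trials, and applies the scaling by −0.5 described in the notes to Table 1; the numerical inputs in the usage lines are hypothetical.

```python
import math

def aicc_star(ll_max, k, n):
    """Scaled AICc (-0.5 * AICc), so that the leading term is LL_max."""
    aicc = -2.0 * ll_max + 2.0 * k + (2.0 * k * (k + 1)) / (n - k - 1)
    return -0.5 * aicc

def bic_star(ll_max, k, n):
    """Scaled BIC (-0.5 * BIC), so that the leading term is LL_max."""
    bic = -2.0 * ll_max + k * math.log(n)
    return -0.5 * bic

# Hypothetical example: two models fitted to the same 1,000 trials.
ll_a, k_a = -520.0, 3   # model A: maximum log likelihood, parameter count
ll_b, k_b = -540.0, 2   # model B
n_trials = 1000

# Positive differences favor model A under both metrics.
print(aicc_star(ll_a, k_a, n_trials) - aicc_star(ll_b, k_b, n_trials))
print(bic_star(ll_a, k_a, n_trials) - bic_star(ll_b, k_b, n_trials))
```

Because both metrics equal LL_{max} minus a penalty that depends only on k and n, a large difference in LL_{max} dominates the comparison regardless of which penalty is used.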
Model checking
We substituted the fitted parameters into their respective models to create fits (predictions) for the summary statistics in Figure 3. The model fits to the psychometric curve of performance as a function of set size were good for all models (Figure 5A). The added manipulation of change magnitude, however, clearly separates these model fits (Figure 5B and Figure 5C). The psychometric curves from both species are best described by the two variable-precision models, VP and VPF, followed by the EPF model, the EP model, and the IL model (Table 2).
Table 2 R^{2} values of the fits of the five models to the full psychometric curves (proportion correct as a function of set size and change magnitude) of both species. Notes: R^{2} is a much less principled measure of goodness of fit than AICc, BIC, or LML. If any conflicts were to exist, the latter three should be preferred. However, results are consistent across measures. See Figure 5B and C.
Comparison between species
Our model comparison suggests that the fundamental nature of VWM limitations is the same in both species (Figure 5), with quantitative differences reflected only in the parameter values within the same model (Table A1). For example, mean precision was much lower in monkeys than in humans, which might reflect attentional differences between the two species. The exponent α in the relationship between mean precision and set size was similar across monkeys and somewhat higher in humans. In both species, however, the values were more negative than −1, indicating steep decreases in mean precision as set size increases. In the VPF model, the number of remembered items K was fitted as 3.5 in monkeys and 4.1 in humans. Although these values are consistent with earlier reports of K, this parameter should be interpreted with caution: because the VPF model is indistinguishable from the VP model, we cannot rule out the possibility that there is no item limit at all.
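To illustrate why an exponent more negative than −1 indicates a steep decline, the following sketch assumes that mean precision per item follows a power law in set size, J_1 N^α. The power-law form, the names `j1` and `alpha`, and the numerical values are assumptions made for illustration; the fitted values are those in Table A1 and are not reproduced here.

```python
def mean_precision(set_size, j1, alpha):
    """Mean precision per item under an assumed power law j1 * N**alpha."""
    return j1 * set_size ** alpha

def total_precision(set_size, j1, alpha):
    """Mean precision summed over all N items in the display."""
    return set_size * mean_precision(set_size, j1, alpha)

j1 = 10.0  # hypothetical mean precision at set size 1
for alpha in (-1.0, -1.5):
    totals = [round(total_precision(n, j1, alpha), 2) for n in (1, 2, 4)]
    print(alpha, totals)
```

Under this assumption, α = −1 keeps the summed precision constant across set sizes, whereas any α below −1 makes the summed precision itself shrink as items are added.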
Models with lapse rate
We have seen that the EP and EPF models do not describe the data as well as the VP model. However, it might be that subjects randomly guess on some fixed proportion of trials. This would be different from guessing due to an item limit, because the proportion of item-limit guesses depends on set size, whereas a lapse rate does not. Therefore, we tested the EP and EPF models augmented with a lapse rate (Table 3). In both monkeys and humans, adding a lapse parameter improves the goodness of fit of the EP and EPF models. However, in both species, the VP model outperforms the EP and EPF models with lapse.
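A lapse rate of this kind is commonly implemented as a mixture of the model's predicted accuracy and chance performance. The sketch below assumes that form and a chance level of 0.5 (two test locations); the function and parameter names are illustrative and the exact parameterization used in the fits is not specified here.

```python
def with_lapse(p_model, lapse_rate, guess_rate=0.5):
    """Mix model-predicted accuracy with random guessing on lapse trials.

    p_model: proportion correct predicted by the model without lapses.
    lapse_rate: fixed proportion of trials on which the subject guesses,
        independent of set size.
    guess_rate: chance accuracy (0.5 assumed for two test locations).
    """
    if not 0.0 <= lapse_rate <= 1.0:
        raise ValueError("lapse_rate must lie in [0, 1]")
    return lapse_rate * guess_rate + (1.0 - lapse_rate) * p_model

# Hypothetical example: a 10% lapse rate lowers 95% accuracy to 90.5%.
print(round(with_lapse(0.95, 0.10), 3))  # prints 0.905
```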
Table 3 The EP and EPF models with a lapse rate fit better than the corresponding models without a lapse rate; however, they still both fit worse than the VP model.
Discussion
We tested monkeys and humans in a nearly identical change localization paradigm and compared five models of VWM limitations. As in all previous change detection and change localization studies, both in humans (Keshvari et al., 2012, 2013; van den Berg et al., 2012; Wilken & Ma, 2004) and in monkeys (Buschman et al., 2011; Elmore et al., 2011; Heyselaar et al., 2011; Lara & Wallis, 2012), we found a decrease in performance with set size. Following Keshvari et al. (2012, 2013), Lara and Wallis (2012), and van den Berg et al. (2012), we systematically varied change magnitude to obtain a richer description of behavior, which we exploited to distinguish models that otherwise could not be distinguished.
Although change detection and change localization are classic paradigms in humans, formally comparing models on data from these paradigms is relatively new (van den Berg et al., 2012; Wilken & Ma, 2004), and no previous study has compared models in parallel across species. We tested the item-limit (IL) model, in which there is a fixed limit on the number of items that can be remembered and items are stored in an all-or-none fashion, as well as noise-based (or resource) models, in which items are encoded in VWM in a noisy way. The data from both species were well accounted for by a noise-based model in which memory precision is variable across items and trials (VP), but not by a noise-based model in which memory precision is equal across items and trials (EP), and not by the classic IL model. These findings are consistent with earlier ones in humans (Fougnie et al., 2012; Keshvari et al., 2012, 2013; van den Berg et al., 2012; van den Berg et al., 2014; van den Berg & Ma, 2014).
We also tested hybrid models that combine the concepts of noisy storage and an item limit. Adding an item limit to the EP model (as has been proposed by Zhang & Luck, 2008, and Anderson, Vogel, & Awh, 2011) helped, but not enough to make it fit as well as the VP model. Adding an item limit to the VP model did improve the fit, but not enough to convincingly exceed the penalty associated with adding an extra parameter to the model. Thus our model comparison neither yields any evidence for the existence of an item limit nor rules it out. This conclusion is consistent with a recent detailed model comparison on multiple data sets obtained using a delayed-estimation paradigm (van den Berg et al., 2014).
The success of the VP model brings to the fore the question of its mechanistic underpinnings. The essential components of the model are noisy storage, a decrease of average precision with increasing set size, and variability of precision across items and trials around this average. At the neural level, noisy storage could take the form of a Poisson-like neural population responding to the stimulus, in which case precision might correspond to either the gain or the total spike count in this population (Ma et al., 2014; van den Berg et al., 2012). A decrease of gain with set size has been observed in area LIP (Churchland, Kiani, & Shadlen, 2008) and superior colliculus (Basso & Wurtz, 1998), and might be implemented using divisive normalization (Bays, 2014; Ma & Huang, 2009). A Poisson-like population with gain fixed across items and trials (i.e., at a given set size) might already behave like a VP model (Bays, 2014). In addition, gain itself might be variable (Goris, Movshon, & Simoncelli, 2014), for example due to fluctuations in attention (Cohen & Maunsell, 2010) or to variability in memory decay rates (Fougnie et al., 2012). Other factors are also expected to contribute to fluctuations in precision, such as eye movements and stimulus-related differences such as those due to cardinal orientations (Girshick, Landy, & Simoncelli, 2011) and configural grouping (Brady & Tenenbaum, 2013). Thus, although much more work is needed, the VP model is currently supported by a range of behavioral, physiological, and computational studies.
In the field of comparative cognition, much research has been devoted to comparing absolute performance differences across various species (including pigeons, rats, rhesus monkeys, baboons, and humans) on attention, visual search, spatial navigation, and categorization tasks (Wasserman & Zentall, 2009). However, in order to disentangle whether these performance differences are due to qualitative differences in the underlying mechanisms (differences in models) or are simply quantitative in nature (differences in parameters), formal model comparison is needed. Here we have shown that despite interspecies performance differences, the same model fitted the data from both species best. This suggests qualitative similarity and evolutionary continuity of basic VWM mechanisms. This qualitative similarity supports the use of rhesus monkeys as a model system for studying the neural mechanisms of multiple-item VWM.
Acknowledgments
This research was supported by NIH grants MH091038 and MH072616 to AAW and NIH grant R01EY02095801 and ARO grant W911NF1210262 to WJM. We thank John Magnotti for useful discussions and assistance with programming.
Commercial relationships: none.
Corresponding author: Deepna T. Devkar.
Email: deepna.devkar@nyu.edu.
Address: Department of Psychology, New York University, New York, NY, USA.
References
Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6), 716–723.
Anderson, D. E., Vogel, E. K., & Awh, E. (2011). Precision in visual working memory reaches a stable plateau when individual item limits are exceeded. The Journal of Neuroscience, 31(3), 1128–1138.
Awh, E., Barton, B., & Vogel, E. K. (2007). Visual working memory represents a fixed number of items regardless of complexity. Psychological Science, 18(7), 622–628.
Basso, M. A., & Wurtz, R. H. (1998). Modulation of neuronal activity in superior colliculus by changes in target probability. The Journal of Neuroscience, 18(18), 7519–7534.
Bays, P. M. (2014). Noise in neural populations accounts for errors in working memory. The Journal of Neuroscience, 34(10), 3632–3645.
Bays, P. M., & Husain, M. (2008, Aug 8). Dynamic shifts of limited working memory resources in human vision. Science, 321(5890), 851–854.
Brady, T. F., & Tenenbaum, J. B. (2013). A probabilistic model of visual working memory: Incorporating higher order regularities into working memory capacity estimates. Psychological Review, 120(1), 85–109.
Buschman, T. J., Siegel, M., Roy, J. E., & Miller, E. K. (2011). Neural substrates of cognitive capacity limitations. Proceedings of the National Academy of Sciences, USA, 108(27), 11252–11255.
Churchland, A. K., Kiani, R., & Shadlen, M. N. (2008). Decision-making with multiple alternatives. Nature Neuroscience, 11(6), 693–702.
Cohen, M. R., & Maunsell, J. H. (2010). A neuronal population measure of attention predicts behavioral performance on individual trials. The Journal of Neuroscience, 30(45), 15241–15253.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: John Wiley & Sons.
Cowan, N. (2001). The magical number 4 in short-term memory: A reconsideration of mental storage capacity. Behavioral and Brain Sciences, 24(1), 87–114.
Efron, B. (1993). An introduction to the bootstrap. New York: Chapman & Hall.
Elmore, L. C., Ma, W. J., Magnotti, J. F., Leising, K. J., Passaro, A. D., Katz, J. S., & Wright, A. A. (2011). Visual short-term memory compared in rhesus monkeys and humans. Current Biology, 21(11), 975–979.
Fougnie, D., Suchow, J. W., & Alvarez, G. A. (2012). Variability in the quality of visual working memory. Nature Communications, 3, 1229–1242.
Fukuda, K., Awh, E., & Vogel, E. K. (2010). Discrete capacity limits in visual working memory. Current Opinion in Neurobiology, 20(2), 177–182.
Funahashi, S., Bruce, C. J., & Goldman-Rakic, P. S. (1989). Mnemonic coding of visual space in the monkey's dorsolateral prefrontal cortex. Journal of Neurophysiology, 61(2), 331–349.
Fuster, J. M., & Alexander, G. E. (1971, Aug 13). Neuron activity related to short-term memory. Science, 173(3997), 652–654.
Girshick, A. R., Landy, M. S., & Simoncelli, E. P. (2011). Cardinal rules: Visual orientation perception reflects knowledge of environmental statistics. Nature Neuroscience, 14(7), 926–932.
Goris, R. L. T., Movshon, J. A., & Simoncelli, E. P. (2014). Partitioning neuronal variability. Nature Neuroscience, 17(6), 858–865.
Heyselaar, E., Johnston, K., & Paré, M. (2011). A change detection approach to study visual working memory of the macaque monkey. Journal of Vision, 11(3):11, 1–10, doi:10.1167/11.3.11.
Hurvich, C. M., & Tsai, C.-L. (1989). Regression and time series model selection in small samples. Biometrika, 76(2), 297–307.
Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90(430), 773–795.
Keshvari, S., van den Berg, R., & Ma, W. J. (2012). Probabilistic computation in human perception under variability in encoding precision. PLoS ONE, 7(6), e40216.
Keshvari, S., van den Berg, R., & Ma, W. J. (2013). No evidence for an item limit in change detection. PLoS Computational Biology, 9(2), NN.
Lara, A. H., & Wallis, J. D. (2012). Capacity and precision in an animal model of visual short-term memory. Journal of Vision, 12(3):13, 1–12, doi:10.1167/12.3.13.
Lara, A. H., & Wallis, J. D. (2014). Executive control processes underlying multi-item working memory. Nature Neuroscience, 17(6), 876–883.
Luck, S. J., & Vogel, E. K. (1997). The capacity of visual working memory for features and conjunctions. Nature, 390(6657), 279–281.
Luck, S. J., & Vogel, E. K. (2013). Visual working memory capacity: From psychophysics and neurobiology to individual differences. Trends in Cognitive Sciences, 17(8), 391–400.
Ma, W. J., & Huang, W. (2009). No capacity limit in attentional tracking: Evidence for probabilistic inference under a resource constraint. Journal of Vision, 9(11):3, 1–30, doi:10.1167/9.11.3.
Ma, W. J., Husain, M., & Bays, P. M. (2014). Changing concepts of working memory. Nature Neuroscience, 17(3), 347–356.
MacKay, D. J. C. (2003). Information theory, inference, and learning algorithms. Cambridge, UK: Cambridge University Press.
Mardia, K. V., & Jupp, P. E. (1999). Directional statistics. London: John Wiley and Sons.
Miller, E. K., Erickson, C. A., & Desimone, R. (1996). Neural mechanisms of visual working memory in prefrontal cortex of the macaque. The Journal of Neuroscience, 16(16), 5154–5167.
Palmer, J. (1990). Attentional limits on the perception and memory of visual information. Journal of Experimental Psychology: Human Perception and Performance, 16(2), 332–350.
Pashler, H. (1988). Familiarity and visual change detection. Perception & Psychophysics, 44(4), 369–378.
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2), 461–464.
Seung, H., & Sompolinsky, H. (1993). Simple model for reading neuronal population codes. Proceedings of the National Academy of Sciences, USA, 90(22), 10749–10753.
Shaw, M. L. (1980). Identifying attentional and decision-making components in information processing. In R. S. Nickerson (Ed.), Attention and performance (pp. 277–296). Hillsdale, NJ: Erlbaum.
van den Berg, R., Awh, E., & Ma, W. J. (2014). Factorial comparison of working memory models. Psychological Review, 121(1), 124–149.
van den Berg, R., & Ma, W. (2014). “Plateau”-related summary statistics are uninformative for comparing working memory models. Attention, Perception, & Psychophysics, 76(7), 2117–2135.
van den Berg, R., Shin, H., Chou, W.-C., George, R., & Ma, W. J. (2012). Variability in encoding precision accounts for visual short-term memory limitations. Proceedings of the National Academy of Sciences, USA, 109(22), 8780–8785.
Warden, M. R., & Miller, E. K. (2007). The representation of multiple objects in prefrontal neuronal delay activity. Cerebral Cortex, 17(suppl 1), i41–i50.
Wasserman, E., & Zentall, T. (2009). Comparative cognition: Experimental explorations of animal intelligence. Oxford, UK: Oxford University Press.
Wilken, P., & Ma, W. J. (2004). A detection theory account of change detection. Journal of Vision, 4(12):11, 1120–1135, doi:10.1167/4.12.11.
Zhang, W., & Luck, S. J. (2008). Discrete fixed-resolution representations in visual working memory. Nature, 453(7192), 233–235.
Appendix A: Derivation of the decision rule
Step 1: Generative model
Figure A1 shows the relevant variables: the location L of the change (1 or 2), the magnitude Δ of the change, the relevant sample orientations θ_{1} and θ_{2} (all other sample items are irrelevant to the decision), their noisy memories x_{1} and x_{2}, and the two test orientations φ_{1} and φ_{2}. Each variable has an associated probability distribution.

Since both test locations are equally likely to contain the change, we have p(L) = 0.5.

The change magnitude Δ and each of the sample orientations have discrete distributions, but we approximate them by continuous uniform distributions over Δ and over each θ_{i}. We chose continuous uniform distributions rather than discrete distributions at the 18 presented orientations (or change magnitudes) because we think it unlikely that an observer learns those exact orientations (or change magnitudes); the choice of continuous uniform distributions also allows for a closed form for the decision rule.

We assume that the noisy memories x_{1} and x_{2} are conditionally independent given the sample orientations θ_{1} and θ_{2}. Formally, p(x_{1}, x_{2} | θ_{1}, θ_{2}) = p(x_{1} | θ_{1}) p(x_{2} | θ_{2}).

We assume that p(x_{i} | θ_{i}) is a von Mises distribution (Equation 1).

When the change happens in the first location (L = 1), then φ_{1} = θ_{1} + Δ and φ_{2} = θ_{2}. When the change happens in the second location (L = 2), then φ_{1} = θ_{1} and φ_{2} = θ_{2} + Δ. We can formally denote this by (φ_{1}, φ_{2}) = (θ_{1}, θ_{2}) + Δ1_{L}, where 1_{L} is equal to (1, 0) when L = 1 and (0, 1) when L = 2.
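The generative model can be made concrete by sampling single trials from it. The sketch below is an illustration under stated assumptions: all angles live on a 2π-periodic circle (orientations, which have period π, would first be mapped onto this circle by doubling), and the function name and concentration values are hypothetical.

```python
import numpy as np

def sample_trial(kappa1, kappa2, rng):
    """Draw one trial from the generative model on a 2*pi-periodic circle."""
    L = int(rng.integers(1, 3))                 # change location, p(L) = 0.5
    delta = rng.uniform(-np.pi, np.pi)          # change magnitude, uniform
    theta = rng.uniform(-np.pi, np.pi, size=2)  # sample orientations, uniform
    # Noisy memories: independent von Mises noise around each sample value.
    x = np.array([rng.vonmises(theta[0], kappa1),
                  rng.vonmises(theta[1], kappa2)])
    # Test orientations: add delta at the changed location only.
    one_hot = np.array([1.0, 0.0]) if L == 1 else np.array([0.0, 1.0])
    phi = theta + delta * one_hot
    # Wrap the test orientations back into [-pi, pi).
    phi = (phi + np.pi) % (2 * np.pi) - np.pi
    return L, delta, theta, x, phi

rng = np.random.default_rng(0)
L, delta, theta, x, phi = sample_trial(kappa1=5.0, kappa2=5.0, rng=rng)
print("change location:", L)
print("test minus sample:", phi - theta)
```

At the unchanged location the test orientation equals the sample orientation, while the memory of that item still deviates from both because of the von Mises noise.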
Step 2: Inference
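Now that we have specified the generative model, we can do inference. The observer infers L based on the noisy memories x_{1} and x_{2} and the test orientations φ_{1} and φ_{2}; we also assume that the observer knows the values of κ_{1} and κ_{2}. An ideal observer infers L by computing the posterior distribution over L, p(L | x_{1}, x_{2}, φ_{1}, φ_{2}). Since L is binary, all information about the posterior is contained in the log posterior ratio, which can be rewritten using Bayes's rule:
log [p(L = 1 | x_{1}, x_{2}, φ_{1}, φ_{2}) / p(L = 2 | x_{1}, x_{2}, φ_{1}, φ_{2})] = log [p(x_{1}, x_{2} | L = 1, φ_{1}, φ_{2}) / p(x_{1}, x_{2} | L = 2, φ_{1}, φ_{2})],
since p(L = 1) = p(L = 2). We evaluate the likelihood of L = 1 (the probability of the memories x_{1} and x_{2} if the change happened at the first location):
p(x_{1}, x_{2} | L = 1, φ_{1}, φ_{2}) = p(x_{2} | θ_{2} = φ_{2}) ∫ p(x_{1} | θ_{1} = φ_{1} − Δ) p(Δ) dΔ.
Similarly, the likelihood of L = 2 (the probability of the memories if the change happened at the second location) is
p(x_{1}, x_{2} | L = 2, φ_{1}, φ_{2}) = p(x_{1} | θ_{1} = φ_{1}) ∫ p(x_{2} | θ_{2} = φ_{2} − Δ) p(Δ) dΔ.
Because p(Δ) is uniform and the von Mises density depends only on the difference between its argument and its mean, the two integrals equal the same constant and cancel in the ratio. Combining, we find the log posterior ratio
log p(x_{2} | θ_{2} = φ_{2}) − log p(x_{1} | θ_{1} = φ_{1}).
The ideal observer responds that the change occurred at location 1 when the log posterior ratio is positive:
log p(x_{2} | θ_{2} = φ_{2}) > log p(x_{1} | θ_{1} = φ_{1}).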
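The decision rule can be sketched in code. With uniform priors on the change magnitude and von Mises memory noise, the log posterior ratio reduces to comparing how well each memory matches the test item at its own location. The sketch assumes a von Mises density on a 2π-periodic circle (orientations with period π would be doubled first); the exact form of Equation 1 is not reproduced here, and all function names are illustrative.

```python
import numpy as np

def log_von_mises(x, mu, kappa):
    """Log density of a von Mises distribution on a 2*pi-periodic circle."""
    return kappa * np.cos(x - mu) - np.log(2 * np.pi * np.i0(kappa))

def log_posterior_ratio(x1, x2, phi1, phi2, kappa1, kappa2):
    """log p(L = 1 | data) - log p(L = 2 | data) for the ideal observer."""
    # If the change is at location 1, memory 2 should match test 2;
    # if the change is at location 2, memory 1 should match test 1.
    return log_von_mises(x2, phi2, kappa2) - log_von_mises(x1, phi1, kappa1)

def respond(x1, x2, phi1, phi2, kappa1, kappa2):
    """Report location 1 when the log posterior ratio is positive, else 2."""
    d = log_posterior_ratio(x1, x2, phi1, phi2, kappa1, kappa2)
    return 1 if d > 0 else 2

# Memory 2 matches its test item exactly; memory 1 is far from its test item.
print(respond(x1=0.0, x2=1.0, phi1=2.0, phi2=1.0, kappa1=5.0, kappa2=5.0))
```

Note that the concentrations enter the rule through both the cosine term and the normalization term, so a mismatch on a precisely remembered item counts as stronger evidence for a change than the same mismatch on a poorly remembered one.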
Appendix B: Parameter estimates
Table A1 shows our approximations to the maximum-likelihood estimates of all parameters in all models in all subjects (but averaged over human subjects).
Table A1 Parameter ranges and parameter estimates. Notes: For monkeys, means and standard errors of the mean (SEMs) were estimated from 100 bootstrapped data sets. For humans, means and standard errors were computed across subjects.
Appendix C: Model fits to nonbootstrapped monkey data sets
Figures A2 and A3 show model fits and Bayesian model comparison on the nonbootstrapped data sets for each individual monkey. Our results are consistent with those on bootstrapped data sets.