**The article presents Bayesian hierarchical modeling frameworks for two measurement models for visual working memory. The models can be applied to the distributions of responses on a circular feature dimension, as obtained with the continuous reproduction (a.k.a. delayed estimation) task. The first measurement model is a mixture model that describes the response distributions as a mixture of one (Zhang & Luck, 2008) or several (Bays, Catalao, & Husain, 2009) von-Mises distribution(s) and a uniform distribution. The second model is a novel, interference-based measurement model. We present parameter recovery simulations for both models, demonstrating that the hierarchical framework enables precise parameter estimates when a small number of trials are compensated by a large number of subjects. Simulations with the mixture model show that the Bayesian hierarchical framework minimizes the previously observed estimation bias for memory precision in conditions of low performance. Unbiased and reasonably precise parameter estimates can also be obtained from the interference measurement model, though some parameters of this model demand a relatively large amount of data for precise measurement. Both models are applied to two experimental data sets. Experiment 1 measures the effect of memory set size on the model parameters. Experiment 2 provides evidence for the assumption in the interference model that the target feature tends to be confused with features of those nontargets that are close to the target on the dimension used as retrieval cue.**

*Pmem*, and the precision of the von-Mises, κ (higher precision implying a smaller standard deviation).

*Pnt*. For a memory set size of

*n*, this component is the sum of

*n −*1 von-Mises distributions centered on the

*n*− 1 nontarget features, respectively. For simplicity, the precision of these distributions is assumed to be the same as that of the target-related distribution, so that the model has three parameters:

*Pmem*and

_{,}Pnt_{,}*κ*.

*x̂*is the response,

**X**is the vector of memory features in the array, with

*x*

_{1}as the target and

*x*

_{2}to

*x*as the nontargets, and

_{n}*VM*is the von-Mises distribution: where

*I*

_{0}(

*κ*) is the modified Bessel function of order 0. The two-parameter model is a special case of the three-parameter model with

*Pnt*fixed to 0.

*Pmem*and

*Pnt*jointly determine the probabilities of each of these three states occurring for a person in a given experimental condition. The precision parameter κ reflects the precision of a feature represented in VWM, given that the feature is retrievable at all.

^{1}Proponents of discrete-capacity models of VWM (Cowan, 2005; Fukuda, Awh, & Vogel, 2010; Zhang & Luck, 2008) can use

*Pmem*to estimate the number of “slots”,

*K*, that characterizes a person's VWM capacity:

*K*=

*N*×

*Pmem*(for N >

*K*).

*x*generated by a retrieval cue at the target location

*L*in cue-feature space is given by: with

_{θ}*D*(

*L*,

_{i}*L*) for the distance between the cued target location

_{θ}*L*

_{θ}and the location

*L*of an array object

_{i}*i*, and

*s,*a free parameter for the generalization gradient on the cue-feature dimension reflecting the precision of bindings on that dimension.

*A*reflects cue-independent activation of the memory features of all objects in the memory set, and component

_{a}*A*reflects stimulus-independent background noise in the memory system. The background noise is uncorrelated with the features of all objects in VWM, and can therefore be modelled as a uniform distribution on the circle, although the model is not committed to the assumption that background noise is actually uniform.

_{b}*x̂*out of

*N*response options is obtained by Luce's choice rule: The IMM has five parameters:

*a*,

*b, c*,

*s*, and

*κ*. Parameter estimation requires that one of the three weight parameters (

*a*,

*b,*and

*c*) is fixed to set the scale of the other two. For mathematical convenience we fixed

*c*= 1. Thus, the IMM has four free parameters.

*s*is fixed to a very high value (e.g.,

*s*= 20 in our applications, measuring distance on the cue-feature dimension in radians), the gradient becomes so steep that only the target item receives any activation from the retrieval cue. As a consequence, nontarget items don't contribute to component

*A*; they are included only in

_{c}*A*, which sums activation across all items independent of the cue. In this model version (IMM-

_{a}*abc*), nontarget items can be recalled with above-chance probability, but their recall probability is independent of their proximity to the target on the cue-feature dimension. As a consequence, the IMM with

*s*fixed to a high value is equivalent to the three-parameter mixture model of Bays et al. (2009). Constraining the IMM further by setting

*a*to zero eliminates the

*A*component reflecting information from the nontargets. This further constrained model (IMM-

_{a}*bc*) is equivalent to the two-parameter mixture model of Zhang and Luck (2008). Appendix1 explains this equivalence in more detail. Finally, we will also consider a model version in which we constrain only

*a*to zero while leaving

*s*free, implying that nontarget intrusions are entirely governed by the cue-generalization gradient parameter

*s*(version IMM-

*bsc*). This model can be used to investigate whether the assumption of cue-independent memory information (i.e., component

*A*) is necessary.

_{a}*rjags*package for R that interfaces between R and JAGS. One obstacle was that JAGS has no built-in von Mises distribution. Therefore we built a JAGS extension, following the tutorial of Wabersich and Vandekerckhove (2014). The von Mises extension for JAGS can be downloaded from GitHub [https://github.com/yeagle/jags-vonmises/releases]. The modelling code and the data of the two experiments reanalyzed below can be found on the Open Science Framework (osf.io/wjg7y/).

*Pm*is the probability that the response comes from any of the objects in the memory set, including the target and all nontargets, and

*Pt*is the conditional probability that the response reflects the feature of the target, given that it reflects the feature of any memory object. Hence, The hierarchical mixture model (HMM) predicts each response

*x̂*of participant

_{i,j}*j*in trial

*i*as distributed according to a von Mises distribution: The mean of the von Mises,

*m*, is the true feature of the array object from which the response derives. The object determining the response is sampled from the array objects according to a categorical distribution with probabilities

_{i,j}**P**

_{i,j}**M**

_{i}_{,}

*is the vector of memory features in array*

_{j}*i*presented to person

*j*, the first element being the target, and

**P**

_{i}_{,}

*is the vector of probabilities of a response reflecting each of the array features.*

_{j}*m*with probability

_{i,j}*Pm*, whereas with probability 1 −

_{j}*Pm*the response is drawn from a uniform distribution. Conveniently, the von Mises distribution with precision κ = 0 is a uniform distribution on the circle. Therefore, we can multiply the precision parameter κ

_{j}_{j}, with a binary variable

*z*drawn from a Bernoulli distribution with parameter

_{i,j}*Pm*: When

_{j}*z*= 1 (with probability

_{i,j}*Pm*), the response is modeled as drawn from a von Mises centered on

_{j}*m*whereas when

_{i,j}*z*= 0 (with probability 1 −

_{i,j}*Pm*) the response is drawn from a uniform distribution.

_{j}*M*and

_{κ}*σ*, respectively, and derive scale and rate from them. Conversely, the means and the variances of the beta distributions of the probability parameters can be calculated as:

_{κ}*Pmem*is very small, such as when memory set size is large, or recall is from long-term memory (Brady, Konkle, Gill, Oliva, & Alvarez, 2013). The third simulation demonstrates the power of hierarchical modeling for pooling information from many subjects, each contributing only few trials, for estimating group-level parameters.

*N*(trials) = 60, 120, 180, or 360. Each simulation had true population mean parameters

*M*(

*Pm*),

*M*(

*Pt*), and

*M*(

*κ*). We varied these mean parameters in a fully crossed design, with Subject parameters for

*Pm*and

*Pt*were drawn from Beta distributions, with parameters

*A*= 10*M(

_{Pm}*Pm*) and

*B*= 10*(1 −

_{Pm}*M*(

*Pm*)) for

*Pm*, and analogously for

*Pt*. Subject parameters for κ were drawn from a Gamma distribution with mean =

*M*(

*κ*) and variance = 25.

*M*(

*Pmem*) and

*M*(

*Pnt*) were calculated from the true and estimated values, respectively, of

*M*(

*Pm*) and

*M*(

*Pt*).

*M*(

*Pmem*) and

*M*(

*Pnt*) is accurate (i.e., unbiased) and fairly precise already with 60 trials per person, and hardly improves with larger numbers of trials. Recovery of the precision parameter

*M*(

*κ*), and the corresponding values of

*M*(

*SD*), is less satisfactory at 60 trials per person, but improves with larger number of trials.

*M*(

*Pm*) = 0.8,

*M*(

*Pt*) = 0.8, and

*M*(

*κ*) = 10, corresponding to

*SD*= 18.6). Changing the values of

*M*(

*Pt*) had no noticeable impact on the recovery of

*M*(

*Pm*), and vice versa. Higher values of κ resulted in somewhat smaller HDIs for recovery of

*M*(

*Pm*) and

*M*(

*Pt*). In contrast, the values of

*M*(

*Pm*) and

*M*(

*Pt*) had a pronounced influence on how well

*M*(

*κ*), and hence,

*M*(

*SD*) was recovered.

*M*(

*κ*) estimates for varying levels of

*M*(

*Pm*), shows that the precision of estimating

*M*(

*κ*) decreases noticeably as M(

*Pm*) decreases. The reason for this is that with smaller

*M*(

*Pm*), a larger proportion of trials are drawn from the uniform component, and these trials are uninformative for estimating the precision of the von-Mises component. Figure 6 also shows that, as long as

*M*(

*Pm*) is not too small, reasonably precise estimates of M(κ) can already be obtained with as little as 60 trials per subject, assuming a subject sample size as is typical for experiments in cognitive psychology.

*Pmem*values of 0.4 or smaller, estimates of κ are biased downward, and hence, estimates of

*SD*are biased upward. This bias limits the usability of the mixture model in conditions of poor memory, such as VWM tests with large set sizes, or tests of visual long-term memory. Simulation 2 served to analyze whether the hierarchical Bayesian framework protects against such bias. We generated data from the three-parameter mixture model with

*M*(

*SD*) = 20,

*M*(

*Pt*) = 0.8 or 1, and varying

*M*(

*Pmem*) in small steps between 0.05 and 0.5. We simulated data for 100 trials from each of 20 subjects. Figure 7 (left panels) shows the parameter recovery for M(

*SD*). An upward bias in

*M*(

*SD*) is noticeable, but only for

*M*(

*Pmem*) < 0.1. With

*M*(

*Pmem*) = 0.2 or higher, the recovery of

*M*(

*SD*) was unbiased and had a reasonably narrow HDI.

*M*(

*Pm*) = 0.8,

*M*(

*Pt*) = 0.8,

*M*(

*κ*) = 10, and varied the number of subjects (from 10 to 200) and the number of trials per subject (from 10 to 150). Figures 8 to 10 show the parameter recovery for

*M*(

*Pmem*),

*M*(

*Pnt*), and

*M*(

*κ*), respectively. The perhaps surprising finding is that with a moderately large number of subjects (

*N*= 50 or more), accurate and reasonably precise parameter estimates can be obtained even with very small numbers of trials per subject. For instance, HDIs of parameter estimates for

*N*= 50 and 20 trials per subject are about as narrow as those for

*N*= 20 and 90 trials per subject. This means that researchers can use the hierarchical mixture model for experiments with many conditions, for which each subject can contribute only a small number of trials (Bae, Olkkonen, Allred, & Flombaum, 2015), or for data from online experiments, in which many subjects participate for only a short time.

*x̂*as the normalized distribution of activation over the response scale. This distribution is difficult to handle because it changes with the constellation of features in each memory array, and is often multimodal. Fortunately, we can break the activation distribution down into a weighted mixture of von-Mises distributions: The weights

*w*for

_{i}*i*between 1 and set size

*n*are given by the combination of the weights of cue-based retrieval (governed by parameters

*c*= 1, and

*s*), and of cue-independent memory (governed by parameter

*a*): The weight

*w*is the contribution of uniform background noise,

_{n+1}*b*. The von-Mises distribution with precision = 0 is the uniform distribution on the circle, so that we can model the entire activation distribution as a mixture of von-Mises distributions with different mean and precision parameters. The mixture weights can be translated into mixture probabilities by normalizing: We can now express the likelihood of the IMM as a probability mixture of von-Mises distributions:

*x̂*of participant

_{i,j}*j*in trial

*i*as distributed according to a von-Mises distribution: The mean of the von-Mises,

*m*, is the true feature of one array object, sampled from the array objects according to a categorical distribution with probabilities

_{i,j}**P**

*, which is the vector of mixture probabilities*

_{i,j}*p*to

_{1}*p*

_{n+}_{1}. The precision κ

_{j}is multiplied by an indicator variable

*z*that equals 1 for values of

_{i,j}*object*between 1 and setsize

_{i,j}*n*, and equals 0 in case

*object*=

_{i,j}*n*+1, so that precision is set to 0, and the von-Mises turns into a uniform distribution, whenever category

*n*+1 is sampled from the categorical distribution.

**M**

_{i}_{,}

*is the vector of memory features in array*

_{j}*i*presented to person

*j*;

**Z**is the vector of indicator variables, and

**P**

_{i}_{,}

*is the vector of mixture probabilities.*

_{j}*a*,

_{j}*b*, and

_{j}*κ*were drawn from Gamma distributions: The parameter

_{j}*s*of the exponential generalization gradient on the cue-feature dimension was drawn for each person from a normal distribution As priors for the group-level parameters we used moderately informative priors, informed by maximum-likelihood fits of the IMM to several data sets:

*a*) = 0.01, Var(

*b*) = 0.015, Var(

*s*) = 3, and Var(

*κ*) = 15. These parameter means and variances were chosen for representing the distributions of typical values that were estimated when we fitted the IMM separately to data from individual participants in a VWM experiment (set size = 4), using a maximum-likelihood algorithm. Each simulation generated data for 20 subjects for a variable number of trials, N(trials) = 60, 120, 180, or 360. Parameter values of individual subjects were drawn from Gamma distributions with means set to the values of M(a), M(b), M(s), and M(κ) chosen for the current simulation, and the corresponding between-subject variances.

*b*, but difficulties with recovering

*a*(in particular at low values of

*s*) and

*s*(in particular at high values of

*a*). This difficulty arises from the fact that both

*a*and

*s*account for nontarget intrusions, one (

*s*) in a distance-dependent, the other (

*a*) in a distance-independent way. The data from an experiment of typical size provide only very sparse information to distinguish between distance-dependent and distance-independent intrusions.

*abc*(

*s*constrained to 20) does not include distance-dependent intrusions of nontargets. Conversely, IMM version

*bsc*(

*a*constrained to 0) does not include distance-independent intrusions of nontargets. The most appropriate model out of these three can be determined via model comparison using a criterion such as the Deviance Information Criterion (DIC) (Spiegelhalter, Best, Carlin, & van der Linde, 2002) or the Widely Applicable Information Criterion (WAIC; Watanabe, 2010), which measure model fit for hierarchical models, taking model complexity into account. Following the advice of Gelman, Hwang, and Vehtari (2014), we recommend the WAIC because it is the only information criterion that is fully Bayesian. In addition, a simulation study comparing DIC and WAIC for the IMM found that WAIC is more stable by two orders of magnitude.

^{2}We also found that model selection using DIC recovered the true model (i.e., the full version of HIMM) in only half the simulation runs of Simulation 1, whereas model selection using WAIC recovered the true model in almost all simulation runs.

*a*= 0.2,

*b*= 0.35,

*s*= 3,

*κ*= 10) and varied the number of subjects (from 10 to 200) and the number of trials per subject (from 10 to 500). Figures 15 to 18 show the results. They confirm the difficulty of obtaining precise estimates of parameters

*a*,

*b*, and

*s*with 20 subjects and less than 100 trials. The situation becomes much better, however, with 30 or more subjects and more than 100 trials. Moreover, the hierarchical modeling framework enables a trade-off between number of trials and number of subjects: Decent estimates can even be obtained with 10 trials when the sample size is increased to 200 subjects.

*abc*version of the HIMM, which is mathematically equivalent to the HMM (see Appendix1). Apparently, the differences in how the models are parameterized affect their information criteria, as measured by WAIC. The WAIC is calculated from the likelihood of the data under the model, integrating over all possible parameter values weighted by their posterior probability. As such, it depends on the posterior distribution of parameter values, which in turn is influenced by their prior distributions. Because the two equivalent models are parameterized differently, we could not set equivalent priors on their parameters, and as a consequence, comparing their WAIC values could be biased in favor of one model. Researchers interested in whether their data provide evidence for distance-dependent nontarget intrusions (as incorporated in the s parameter of the IMM) or evidence for distance-independent nontarget intrusions (as incorporated in the

*a*parameter) should not compare the fit of the IMM to that of the mixture model but rather compare different versions of the IMM to each other.

*abc*(which eliminates the distance dependence of nontarget intrusions and is equivalent to the three-parameter mixture model) to each probing condition separately. For the full HIMM, we determined the distance between nontarget and target locations,

*D*(

*L*,

_{i}*L*

_{θ}), on the relevant cue-feature dimension for each condition: spatial location for the spatial-probe condition, and location in color space for the color-probe condition. Because both spatial location and colors varied on a circular scale, we measured them both in radians, so that the effects of distance on the two cue-feature dimensions can be compared with each other. For both probing conditions, the WAIC favored the full HIMM over the constrained HIMM-

*abc*(color probe: WAIC for full HIMM = 5354, for HIMM-

*abc*= 5398; for spatial probe: full HIMM = 6031, HIMM-

*abc*= 6091). This is due to the fact that, in this data set, there is a pronounced distance gradient of nontarget intrusions: As the distance of nontargets from the target on the cue-feature dimension increased, the contribution of nontarget features to the response distribution declined. This can be seen in the histograms of responses, centered on the target and on each of the five nontargets, ordered by their distance from the target on the cue-feature dimension (Figure 23). Figure 24 shows the posteriors of the parameters of the full HIMM.

*SD*parameter when

*Pmem*is low). Another advantage is that Bayesian parameter estimation provides information not only about the location of a parameter (i.e., its best point estimate) but also on the credible range of values and the precision of estimation, as reflected in the highest-density intervals of the posterior distributions. More generally, the JAGS extension for the von Mises distribution enables researchers to investigate other measurement models, as well as explanatory models of continuous reproduction of features on circular dimensions (e.g., the full gamut of models compared by van den Berg et al., 2014) in a Bayesian hierarchical framework.

*bac*and

*bc—are*mathematically equivalent to the three-parameter and the two-parameter mixture model, respectively. This equivalence implies that any data fit well by the mixture model can also be fit well by a version of the IMM. Researchers can use one or the other model depending on which parameters they want to measure.

*bsc*) makes one prediction that distinguishes it from the mixture model: Intrusions from nontargets should depend on their proximity to the target on the cue-feature dimension. This prediction has received support in some previous experiments (Bays, 2016; Rerko, Oberauer, & Lin, 2014), and again in the experiments reanalyzed here (in particular Experiment 2). At the same time, the recovery simulations for the HIMM show that the signal of distance-dependent nontarget intrusions is likely to be weak in data from experiments with conventional numbers of participants and trials, and this was also noticeable in Experiment 1. Researchers interested in precise measurement of the parameters of the full IMM will have to invest more trials and/or participants than required for estimating the mixture-model parameters.

*s*). Future research could use the IMM, for instance, to analyze data on memory for serial order of visual stimuli (Kool, Conway, & Turk-Browne, 2014). Experiments on serial-order memory for verbal stimuli have revealed a “locality constraint” on confusions between list items: The correct item in a given list position is more likely to be replaced by another item nearby in the list than by an item in a more distant position (Hurlstone, Hitch, & Baddeley, 2014). The locality constraint is an instance of distance-dependent nontarget intrusions on the dimension of list position. We could use the IMM—with list position as the cue-feature dimension—to ask whether serial recall of continuously varying visual features also shows this locality constraint. In this way, the IMM can help to bring the research traditions on working memory for visual and for verbal information closer together.

*Journal of Experimental Psychology: General*, 144, 744–763, doi:10.1037/xge0000076.

*Scientific Reports, 6*, 19203, doi:10.1038/srep19203.

*Journal of Vision*, 9 (10): 7, 1–11, doi:10.1167/9.10.7. [PubMed] [Article]

*Journal of Vision*, 11 (10): 6, 1–15, doi:10.1167/11.10.6. [PubMed] [Article]

*Psychological Science*, 24, 981–990, doi:10.1177/0956797612465439.

*Psychological Review*, 114, 539–576.

*Working memory capacity*. New York, NY: Psychology Press.

*Psychological Bulletin*, 53, 134–140.

*Current Opinion in Neurobiology*, 20, 177–182.

*Statistics and Computing*, 24, 997–1016, doi:10.1007/s11222-013-9416-2.

*Psychonomic Bulletin & Review*, 7, 185–207.

*Psychological Bulletin*, 140, 339–373. doi:10.1037/a0034221.

*Attention, Perception & Psychophysics*, 76, 1885–1901.

*Doing Bayesian data analysis: A tutorial with R and BUGS*. New York, NY: Academic Press.

*Journal of Mathematical Psychology*, 55, 1–7, doi:10.1016/j.jmp.2010.08.013.

*Bayesian modeling for cognitive science: A practical course*. Cambridge, UK: Cambridge University Press.

*Nature Neuroscience Reviews*, 17, 347–356, doi:10.1038/nn.3655.

*Journal of Experimental Psychology: Learning, Memory, and Cognition*, 10, 103–113.

*Psychological Review*, 124, 21–59.

*Mixed-Effect Models in S and S-Plus*. Berlin, Germany: Springer.

*Journal of Experimental Psychology: Human Perception and Performance*, 24, 261–282.

*Quarterly Journal of Experimental Psychology*, 67, 3–15, doi:10.1080/17470218.2013.789543.

*Journal of the Royal Statistical Society, B*, 64, 583–639.

*Attention, Perception & Psychophysics*, 76, 2071–2079, doi:10.3758/s13414-014-0690-7.

*Principles of memory*. New York, NY: Taylor & Francis.

*Psychonomic Bulletin & Review*, 23, 831–841, doi:10.3758/s13423-015-0937-x.

*Psychological Review*, 121, 124–149, doi:10.1037/a0035234.

*Behavioral Research Methods*, 46, 15–28, doi:10.3758/s13428-013-0369-3.

*Journal of Machine Learning Research*, 11, 3571–3594.

*Journal of Vision*, 4 (12): 11, 1120–1135, doi:10.1167/4.12.11. [PubMed] [Article]

*Psychological Review*, 114, 152–176.

*Nature*, 453, 233–236.

^{2}Both DIC and WAIC are calculated from the MCMC samples of the deviance (Gelman et al., 2014) and therefore are estimated with limited precision. We estimated the precision of DIC and WAIC by running the three versions of IMM on the same data (i.e., Experiment 1) 50 times, and calculating the

*SD*of the information criteria across the 50 replications. The

*SD*of DIC was 908, whereas the

*SD*of WAIC was 10.

*s*is fixed to a very high value, the resulting IMM version

*abc*is equivalent to the three-parameter mixture model of Bays et al. (2009). In the IMM-

*abc*, the distribution of activation is a sum of three components, weighted by parameters

*b*,

*a*, and

*c*. In this sum, activation of the color distribution of the target comes from component weighted with

*c*and the component weighted with

*a*; the weight of this component is

*c*+

*a*. Activation of each of the

*n*-1 nontargets comes only from component weighted by

*a*, its joint weight is (

*n*-1)

*a*. Uniformly distributed activation for every color is added with weight

*b*. To obtain the predicted probability distribution of responses, the activation distribution is normalized.

*Pmem*for target responses,

*Pnt*for nontarget responses, and

*Pg*= 1 −

*Pmem*−

*Pnt*for random guessing. Thus, the only difference between the three-parameter mixture model and the IMM version

*abc*is that in the former the normalization occurs before the weighted summation, whereas in the IMM it occurs afterwards. Therefore, the weights of the mixture model can be obtained from the weights of the IMM by normalization. To that end we need to divide the IMM weight parameters by their sum. For a trial with set size

*n*, the sum of the weights equals

*b*+

*na*+

*c*. Thus:

*κ*governs the dispersion of the activation distribution (IMM) or the response probability distribution (mixture model) over content space, which is unaffected by normalization. Therefore, the κ parameters of both models map directly onto each other.

*a*= 0 we obtain the IMM version

*bc*, which is equivalent to the two-parameter mixture model (Zhang & Luck, 2008). The correspondence of parameters is easily obtained by setting

*a*= 0 in the equations above. Parameter

*Pnt*, the probability of reporting a nontarget, drops out, resulting in the two-parameter mixture model in which

*Pg*= 1 −

*Pmem*.