Open Access
Article  |   July 2024
Visual working memory models of delayed estimation do not generalize to whole-report tasks
Author Affiliations
  • Benjamin Cuthbert
    Centre for Neuroscience Studies, Queen's University, Kingston, ON, Canada
    0bec@queensu.ca
  • Dominic Standage
    Centre for Neuroscience Studies, Queen's University, Kingston, ON, Canada
    Department of Biomedical and Molecular Sciences, Queen's University, Kingston, ON, Canada
    standage@queensu.ca
  • Martin Paré
    Department of Biomedical and Molecular Sciences, Queen's University, Kingston, ON, Canada
    Department of Psychology, Queen's University, Kingston, ON, Canada
    School of Computing, Queen's University, Kingston, ON, Canada
    martin.pare@queensu.ca
  • Gunnar Blohm
    Centre for Neuroscience Studies, Queen's University, Kingston, ON, Canada
    Department of Biomedical and Molecular Sciences, Queen's University, Kingston, ON, Canada
    Department of Psychology, Queen's University, Kingston, ON, Canada
    School of Computing, Queen's University, Kingston, ON, Canada
    gunnar.blohm@queensu.ca
Journal of Vision July 2024, Vol.24, 16. doi:https://doi.org/10.1167/jov.24.7.16
Abstract

Whole-report working memory tasks provide a measure of recall for all stimuli in a trial and afford single-trial analyses that are not possible with single-report delayed estimation tasks. However, most whole-report studies assume that trial stimuli are encoded and reported independently, and they do not consider the relationships between stimuli presented and reported within the same trial. Here, we present the results of two independently conducted whole-report experiments. The first dataset was recorded by Adam, Vogel, and Awh (2017) and required participants to report color and orientation stimuli using a continuous response wheel. We recorded the second dataset, which required participants to report color stimuli using a set of discrete buttons. We found that participants often group their reports by color similarity, contradicting the assumption of independence implicit in most encoding models of working memory. Next, we showed that this behavior was consistent across participants and experiments when reporting color but not orientation, two circular variables often assumed to be equivalent. Finally, we implemented an alternative to independent encoding where stimuli are encoded as a hierarchical Bayesian ensemble and found that this model predicts biases that are not present in either dataset. Our results suggest that assumptions made by both independent and hierarchical ensemble encoding models—which were developed in the context of single-report delayed estimation tasks—do not hold for the whole-report task. This failure to generalize highlights the need to consider variations in task structure when inferring fundamental principles of visual working memory.

Introduction
Many recent models of visual working memory (VWM) have been developed in the context of the delayed estimation task (Wilken & Ma, 2004), where participants report the value of a presented stimulus—such as the color of a square or orientation of a bar—after a blank delay period. In this task, report error is measured by taking the difference between presented and reported values, and error distributions over many trials are assumed to reflect the structure of stimulus encodings. Models that reproduce delayed estimation error distributions have therefore been used to support and refute numerous claims about VWM encoding (Bays & Husain, 2008; Brady & Alvarez, 2011; Nassar, 2018; Oberauer, 2017; Schurgin, Wixted, & Brady, 2020; Swan & Wyble, 2014; van den Berg, Shin, Chou, George, & Ma, 2012; Zhang & Luck, 2008). 
Recently, debate over the existence of a fixed storage limit in VWM led Adam and colleagues to introduce a novel delayed estimation variant: the whole-report task (Adam, Mance, Fukuda, & Vogel, 2015; Adam, Vogel, & Awh, 2017). Unlike typical delayed estimation tasks—where only one stimulus is reported—each trial of the whole-report task requires participants to report all presented stimuli. This has the advantage of measuring recall of the entire stimulus display and allowed Adam et al. to investigate item limits at the single-trial level. 
The whole-report task is now seeing more widespread use (Adam & Vogel, 2018; deBettencourt, Keene, Awh, & Vogel, 2019; Hao, Li, Zhang, & Ku, 2020; Killebrew, Gurariy, Peacock, Berryhill, & Caplovitz, 2018; Peters et al., 2018; Peters, Rahm, Kaiser, & Bledowski, 2019; Robison & Unsworth, 2019; Utochkin & Brady, 2020), and the original publicly available dataset has been used for model evaluation (Schneegans, Taylor, & Bays, 2020). Critically, these studies treat the whole-report task as an extension of single-report delayed estimation, focusing on error distributions accumulated across trials and reports. Only a few studies have reported within-trial effects such as inter-item interference and bias (Hao, Li, Zhang, & Ku, 2021; Udale, Gramm, Husain, & Manohar, 2021; Utochkin & Brady, 2020), and none has considered the joint distribution of reports made within the same trial. 
Within-trial joint distributions are a key affordance of the whole-report task because they allow us to test VWM model assumptions in ways that are not possible with single-report delayed estimation data. For example, many models assume that stimuli are encoded in VWM independently (Bays & Husain, 2008; Fougnie, Suchow, & Alvarez, 2012; Schneegans et al., 2020; van den Berg et al., 2012; Zhang & Luck, 2008). If this were the case, a given stimulus encoding would have no effect on other stimuli encoded within the same trial, and the joint distribution of reports would be uniform. Alternative models suggest that stimuli are encoded together in an “ensemble” (Alvarez & Oliva, 2009; Brady & Alvarez, 2011; Nassar, 2018; Orhan & Jacobs, 2013), which would result in dependencies in within-trial joint distributions. 
Here, we investigated previously uncharacterized within-trial behavior in the whole-report task and found that neither independent nor hierarchical ensemble encoding models can explain the results. We used two whole-report datasets: the original dataset recorded by Adam et al. (2017) and data from a whole-report task variant independently conducted in our lab (Cuthbert, Standage, Paré, & Blohm, 2018). In both datasets, we found strong evidence that color reports made in the same trial are not independent. In many cases, participants grouped consecutive within-trial reports by color similarity. In addition, reports made later in the trial—past the canonical capacity limit of three or four items—were consistently biased away from colors reported earlier in the trial. 
We determined that this effect was consistent across participants and set sizes when recalling color stimuli but was either weaker or absent in task conditions with orientation stimuli. This is surprising, because color and orientation are typically treated as equivalent circular variables (Schneegans et al., 2020; van den Berg et al., 2012), and aggregate error distributions for color and orientation stimuli are qualitatively very similar. 
Finally, we considered whether hierarchical ensemble encoding models might account for within-trial dependencies in color reports. In these models, dependencies arise due to Bayesian integration over a hierarchical prior (Brady & Alvarez, 2011; Orhan & Jacobs, 2013). We implemented multiple Bayesian ensemble models and found no evidence for any form of their predicted biases—even after restricting our analyses to specific set sizes or contexts where ensembles might confer a significant storage advantage. 
Taken together, our results suggest that whole-report task data cannot be entirely accounted for by VWM models that assume independent encoding, equivalence between circular stimuli, or hierarchical ensemble encoding. 
Methods
We analyzed two whole-report datasets, which we refer to as the discrete whole-report and continuous whole-report datasets. 
Discrete whole-report dataset
The discrete whole-report dataset was recorded at Queen's University and has been previously presented in conference proceedings (Cuthbert et al., 2018). 
Task and participants
The discrete whole-report task was adapted from Adam et al. (2015). Sixteen healthy adult participants (18–40 years of age; seven female) completed two experimental sessions each, with 12 blocks of 30 trials completed per session. Participants were required to memorize an array of colored squares (“stimuli”) and then report the color of each stimulus after a retention period. 
Each trial began with the simultaneous presentation of two, three, four, six, or eight stimuli for 500 ms. This was followed by a blank gray screen presented for 1000 ms, during which participants were required to maintain fixation on the central point. Participants were then presented with “response matrices,” each comprising eight selectable buttons (one for each possible color; see Figure 1). All response matrices were identical within each trial, but button colors were randomized between trials. Response matrices appeared at all locations previously occupied by stimuli, and participants were required to report the color of each stimulus. 
Figure 1.
 
Whole-report task comparison. (A) Continuous whole-report task. Colored squares were briefly presented, and after a blank delay period the participant reported all colors by clicking a color wheel. Report order was either participant selected or randomly generated. Stimuli were sampled from 360 “continuous” color values. (B) Discrete whole-report task. Same as (A), but participants reported by clicking a square in the response array (inset). Stimuli were sampled from eight “discrete” color values. (C) Continuous error distributions, where 360 possible error values are divided into 90 bins for visualization. Orange and blue histograms show errors from participant-selected and randomly generated report conditions, respectively. In the upper panel, all report errors are separated by set size; the lower panel shows set size six report errors, separated by report order. (D) Discrete error distributions. Same as (C), but bins correspond to all possible error values.
There were two report conditions (one session each): In the participant-ordered report condition, participants were instructed to report stimulus colors in order of confidence; in the randomly ordered report condition, the report order was random and cued with a black radial line. Color reporting was unspeeded, and each trial concluded when all stimuli had been reported. 
Experimental setup
Stimuli were generated with MATLAB R2016a (MathWorks, Natick, MA) and presented on a VIEWPixx /3D LCD monitor (VPixx Technologies, Saint-Bruno, QC) using the Psychophysics Toolbox (Kleiner et al., 2007). Fixation compliance during stimulus presentation and the retention period was verified using an EyeLink 1000 Tower Mount eye tracker (SR Research, Ottawa, ON). The sampling rate was set to 1000 Hz, and eye position and velocity traces were examined against exclusion criteria. Trials were excluded if participants did not maintain fixation during stimulus presentation (blinks not permitted) or during the retention period (blinks permitted). In total, 771 trials were excluded (13.4%). 
Stimulus generation
Colors were selected from eight equidistant points on a circle in CIE L*a*b* color space centered at L* = 70, a* = 20, and b* = 38, with a radius of 60. Luminance and chromaticity were calibrated prior to each experimental session using a ColorCAL MKII colorimeter (Cambridge Research Systems, Rochester, UK), and the coordinates above were selected to ensure that all colors fell within the gamut of the LCD monitor. Stimuli each measured 1.5 × 1.5° (visual angle) and were spaced equidistantly around a central fixation point with a 7.5° radius. 
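For concreteness, the following is a minimal NumPy sketch of how eight equidistant CIE L*a*b* coordinates can be generated from the circle described above. The calibration step and the conversion to monitor RGB values are omitted, and all variable names are illustrative rather than taken from the project code.

```python
import numpy as np

# Circle in CIE L*a*b* space used for the discrete task stimuli
# (center L* = 70, a* = 20, b* = 38; radius 60).
lightness, center_a, center_b, radius = 70.0, 20.0, 38.0, 60.0

# Eight equidistant angles around the circle.
angles = np.linspace(0, 2 * np.pi, 8, endpoint=False)

lab_colors = np.column_stack([
    np.full(8, lightness),               # L* is constant for all stimuli
    center_a + radius * np.cos(angles),  # a* coordinate
    center_b + radius * np.sin(angles),  # b* coordinate
])
```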
Continuous whole-report dataset
The continuous whole-report dataset was recorded by Adam and colleagues at the University of Chicago and the University of Oregon. The following is a brief summary of their methods, with particular emphasis on the differences between their task and the discrete whole-report task above. For complete experimental details, see Adam et al. (2017). 
Task and participants
Here, we consider data from four task conditions recorded by Adam et al. (2017): participant-ordered report with color stimuli (22 participants), randomly ordered report with color stimuli (17 participants), participant-ordered report with orientation stimuli (23 participants), and randomly ordered report with orientation stimuli (21 participants). The continuous task used set sizes of one, two, three, four, and six (vs. two, three, four, six, and eight for the discrete task). There were also timing differences; continuous task stimuli were presented for 150 ms (vs. 500 ms), and the delay period lasted 1300 ms (vs. 1000 ms). 
Discrete versus continuous stimuli
Although color stimuli for both tasks were sampled from equidistant points around circles in CIE L*a*b* color space, the discrete task only used eight distinct colors and the continuous task used 360. The area of color space used was also different—the continuous task used a circle centered at L* = 54, a* = 18, and b* = –8—and the monitors were not calibrated before the continuous experiment, so the displayed colors cannot be considered perceptually equivalent to the discrete task stimuli. The continuous task also included conditions with orientation stimuli, which were sampled from 360 equally spaced angles. 
Discrete versus continuous report method
As illustrated in Figure 1, the two tasks employed different report methods. In the discrete task, eight possible report colors were presented in the form of clickable buttons at the stimulus location being probed. In the continuous task, 360 possible report colors were presented in the form of a clickable color wheel surrounding the entire display. Similarly, the continuous task with orientation stimuli required participants to click on a surrounding (uncolored) wheel to report remembered orientation. 
Analyses
Kullback–Leibler divergence estimation
To quantify within-trial dependence, we estimated the Kullback–Leibler (KL) divergence between within-trial relative distance distributions and a circular uniform distribution. KL divergence is a measure of the difference between two distributions that involves computing their relative entropy (in bits). A low KL divergence between a discrete distribution P and a nominal uniform distribution Q is evidence that P has high entropy and is close to uniformly distributed. A high KL divergence provides evidence that P is not uniformly distributed. This approach is useful because, unlike correlation, it does not assume a linear relationship between variables and can therefore capture higher order dependencies such as those evident in Figure 2. 
Figure 2.
 
Dependence of within-trial color reports. (A) Relative distance calculation, which was identical for both tasks. (B) Example joint distribution of reported stimulus values. Data from the participant-ordered continuous whole-report task (set size six; aggregated across all participants). Each panel plots the first color reported on each trial against a later report in the same trial. (C, upper) Distribution of relative distances for the continuous task (set size six). (C, lower) Median bootstrapped estimates of the KL divergence between each participant's relative distance distribution and size-matched samples from a uniform distribution. (D) Same as (C) but for the discrete whole-report task (set size six).
For each participant, report condition, and pair of within-trial reports, we performed 1000 bootstrapped KL divergence estimates between the empirical distribution P and a nominal circular uniform distribution Q: 
\begin{equation*}D_{KL}\left( P,\ Q \right) = \sum_{x = 1}^{n} P\left( x \right)\log_2\left( \frac{P\left( x \right)}{Q\left( x \right)} \right)\end{equation*}
where n is the number of equal bins used to define the probability space (n = 36 for the continuous task and n = 8 for the discrete task). 
To generate a baseline estimate, we repeated this process but replaced each empirical distribution P with a size-matched sample from a circular uniform distribution. Median estimates for all participants are shown alongside their respective baseline median estimates in Figures 2C and 2D. 
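The following is a minimal sketch of this bootstrap procedure, assuming NumPy. The bin counts follow the values given above (36 for the continuous task, eight for the discrete task), but the function names and data layout are illustrative rather than taken from the project code.

```python
import numpy as np

def kl_from_uniform(samples, n_bins):
    """KL divergence (bits) between a binned circular sample and a uniform distribution."""
    counts, _ = np.histogram(samples, bins=n_bins, range=(-np.pi, np.pi))
    p = counts / counts.sum()
    q = np.full(n_bins, 1.0 / n_bins)
    nonzero = p > 0                       # 0 * log(0) terms contribute nothing
    return np.sum(p[nonzero] * np.log2(p[nonzero] / q[nonzero]))

def bootstrap_kl(rel_distances, n_bins, n_boot=1000, rng=None):
    """Median bootstrapped KL estimates for the data and a size-matched uniform baseline."""
    if rng is None:
        rng = np.random.default_rng()
    data_kl, baseline_kl = [], []
    for _ in range(n_boot):
        resample = rng.choice(rel_distances, size=len(rel_distances), replace=True)
        uniform = rng.uniform(-np.pi, np.pi, size=len(rel_distances))
        data_kl.append(kl_from_uniform(resample, n_bins))
        baseline_kl.append(kl_from_uniform(uniform, n_bins))
    return np.median(data_kl), np.median(baseline_kl)
```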
Hierarchical encoding models
Two-level hierarchical Bayesian model
The two-level hierarchical Bayesian model (HBM) is adapted from the simplest model introduced by Brady and Alvarez (2011). In the HBM implemented here, stimuli are probabilistically represented in working memory at two levels: (1) individual stimulus color estimates and (2) a joint “ensemble” of all stimuli. Here, both color and orientation are considered circular stimulus spaces, so we substitute the original Gaussian distributions with the von Mises (circular normal) distribution: 
\begin{equation*}VM\left( x \mid \mu, \kappa \right) = \frac{e^{\kappa \cos\left( x - \mu \right)}}{2\pi I_0\left( \kappa \right)}\end{equation*}
where I0(•) is the modified Bessel function of order 0, µ is the mean circular direction, and κ is the concentration parameter (analogous to 1/σ²). 
The HBM assumes that on each trial, N stimuli \(\theta_{i=1}^{N}\) are sampled from the same “ensemble” von Mises distribution:  
\begin{equation*}\theta_i \mid \mu_e, \kappa_e \sim VM\left( \mu_e, \kappa_e \right) \qquad i = 1, \ldots, N\end{equation*}
where µe and κe are the ensemble mean and precision, respectively, with the following priors:  
\begin{equation*}{{\mu }_e} \sim U\left( { - {\rm{\pi }},\pi } \right)\end{equation*}
 
\begin{equation*}{{\kappa }_e} \sim U\left( {0,100} \right)\end{equation*}
 
The HBM observer only has access to noisy von Mises observations of \(\theta_{i=1}^{N}\):  
\begin{equation*}x_i \mid \theta_i \sim VM\left( \theta_i, \kappa_{obs} \right) \qquad i = 1, \ldots, N\end{equation*}
where κobs is the precision of noisy observations \(x_{i = 1}^N\) centered on \(\theta _{i = 1}^N\). κobs is therefore the only free parameter in this model. 
We estimated plausible values of κobs by fitting a von Mises function—via maximum likelihood estimation—to error distributions from set size one of the continuous whole-report task (see Results; Figure 3B). To generate a range of model predictions, all simulations were conducted in triplicate using κobs = 5, 10, and 20. 
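A minimal sketch of this fitting step is shown below, assuming SciPy's von Mises distribution and placeholder data; in the actual analysis, the errors would be each participant's set size one report errors in radians.

```python
import numpy as np
from scipy.stats import vonmises

# Placeholder data standing in for one participant's set size one report errors (radians).
errors = np.random.default_rng(0).vonmises(mu=0.0, kappa=10.0, size=300)

# Maximum-likelihood von Mises fit; fixing the scale at 1 gives the standard
# (mu, kappa) parameterization, and the first returned value is the concentration.
kappa_hat, loc_hat, _ = vonmises.fit(errors, fscale=1)
print(f"estimated kappa_obs = {kappa_hat:.2f}")
```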
Figure 3.
 
Continuous whole-report results for orientation stimuli. (A) Orientation error distributions, where 360 possible error values are divided into 90 bins for visualization. Green and purple histograms show errors from participant-selected and randomly generated report conditions, respectively. The upper panel shows all report errors, separated by set size; the lower panel shows set size six report errors, separated by report order. (B) Distribution of relative distances for orientation reports (set size six). Color results are overlayed in black for comparison. (C) Median bootstrapped estimates of the KL divergence between each participant's relative distance distribution and size-matched samples from a uniform distribution. Color results are overlayed in gray for comparison.
Bayesian finite mixture model
We also implemented a more complex generalization of the HBM known as a Bayesian finite mixture model (BFMM) (Orhan & Jacobs, 2013). In addition to individual- and ensemble-level representations, the BFMM also includes an intermediate level where stimuli are generated from a set of K components, such that similar stimulus values are assumed to be generated from the same component. Performing inference with the BFMM can be thought of as finding a posterior distribution over all possible assignments of N stimuli to K clusters. 
Formally, each stimulus θi is assumed to be generated by one of K weighted von Mises components. The components are assumed to be generated by the following process:  
\begin{equation*}G \sim DP\left( {{{G}_0},{\rm{\alpha }}} \right)\end{equation*}
 
\begin{equation*}{{G}_0}\left( {{{{\rm{\mu }}}_k},{{{\rm{\kappa }}}_k}} \right) = U\left( { - {\rm{\pi }},{\rm{\pi }}} \right)\mathcal{G}\left( {1,{{{\rm{\beta }}}_\sigma }} \right)\end{equation*}
where G is a discrete distribution over the parameters of the K components. G0 is the prior distribution over the two-dimensional parameter space for µk and κk, the mean and precision of component k. The βσ variable for the gamma prior is given a \(\mathcal{G}( {1,1} )\) hyperprior. 
G can be expressed as a weighted sum of K “atoms” or points p in the two-dimensional parameter space G0:  
\begin{equation*}G\left( p \right) = \mathop \sum \limits_{k = 1}^K {{{\rm{\pi }}}_k}{\rm{\delta }}\left( {p = {{p}_k}} \right)\end{equation*}
where πk is the weighting of component k and δ is the Dirac delta function. The component weights π are drawn from a symmetric Dirichlet prior with concentration parameters α/K:  
\begin{equation*}{\rm{\pi }} \sim Dirichlet\left( {{\rm{\alpha }}/K,\ ...,\ \alpha /K} \right)\end{equation*}
 
The assumed stimulus-generating and observation processes are then:  
\begin{equation*}{{{\rm{\theta }}}_i}|{{\mu }_k},{{\kappa }_k}\ \sim VM\left( {{{\mu }_k},{{\kappa }_k}} \right)\ \ \ \ \ \ \ i = 1, \ldots ,N\end{equation*}
 
\begin{equation*}{{x}_i}|{{{\rm{\theta }}}_i}\ \sim \ VM\left( {{{{\rm{\theta }}}_i},{{\kappa }_{obs}}} \right)\ \ \ \ \ \ \ i = 1, \ldots ,N\end{equation*}
where \(x_{i = 1}^N\) are noisy von Mises observations centered on true stimulus values \(\theta_{i = 1}^N\) with precision κobs. 
The BFMM has two free parameters: κobs, which indicates the observation precision, and the Dirichlet clustering parameter α. As above, simulations were performed in triplicate with κobs = 5, 10, and 20. The clustering parameter α was fixed at 1 for all simulations, roughly corresponding to a uniform distribution over component weights (Orhan & Jacobs, 2013). 
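For intuition, the assumed BFMM generative process can be forward-simulated as follows. This is a NumPy sketch under our own naming, treating βσ = 1 as a rate parameter for the gamma prior; posterior inference itself was performed with PyMC3 as described in the next section.

```python
import numpy as np

rng = np.random.default_rng()

def simulate_bfmm_display(n_stimuli, max_components, kappa_obs, alpha=1.0):
    """Forward-simulate one display and its noisy observations under the BFMM assumptions."""
    # Component weights: symmetric Dirichlet prior with concentration alpha/K.
    weights = rng.dirichlet(np.full(max_components, alpha / max_components))
    # Component parameters from the base distribution G0 (beta_sigma = 1 assumed here).
    mu_k = rng.uniform(-np.pi, np.pi, size=max_components)
    kappa_k = rng.gamma(shape=1.0, scale=1.0, size=max_components)
    # Assign each stimulus to a component and draw its true value from that component.
    assignments = rng.choice(max_components, size=n_stimuli, p=weights)
    stimuli = rng.vonmises(mu_k[assignments], kappa_k[assignments])
    # The model observer only sees noisy observations of the true stimulus values.
    observations = rng.vonmises(stimuli, kappa_obs)
    return stimuli, observations

stimuli, observations = simulate_bfmm_display(n_stimuli=6, max_components=6, kappa_obs=10.0)
```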
Model simulations
Model simulations were performed in the same way for both the HBM and the BFMM. To simulate a whole-report trial at set size N, a noisy estimate was first generated for each stimulus by sampling from a von Mises distribution with precision κobs. The models do not have access to the true values of θ but instead perform inference based on these noisy observations. Model specification and inference were implemented using the probabilistic programming Python package PyMC3 (Salvatier, Wiecki, & Fonnesbeck, 2016). Model posteriors were sampled using a Hamiltonian Monte Carlo algorithm known as the No-U-Turn Sampler (Hoffman & Gelman, 2014). For each trial, 5000 samples were drawn (after 2000 tuning steps). To simulate reporting, the mode of each stimulus posterior was used as a point estimate (taking the mean did not change our results), and point estimates were rounded to one of either 360 or eight possible values for the continuous and discrete tasks, respectively. 
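As an illustration of this inference step, the following is a minimal PyMC3 sketch of the two-level HBM posterior for a single simulated set size three trial. The priors follow the Methods above, but the variable names, placeholder data, and point-estimate choice (posterior mean rather than mode) are simplifications of the actual simulation code.

```python
import numpy as np
import pymc3 as pm

kappa_obs = 10.0                                     # one of the three values used (5, 10, 20)
rng = np.random.default_rng(0)
true_colors = rng.uniform(-np.pi, np.pi, size=3)     # one simulated set size three trial
observations = rng.vonmises(true_colors, kappa_obs)  # noisy observations available to the model

with pm.Model():
    # Ensemble-level parameters with the priors given above.
    mu_e = pm.Uniform("mu_e", lower=-np.pi, upper=np.pi)
    kappa_e = pm.Uniform("kappa_e", lower=0.0, upper=100.0)
    # Individual stimulus values drawn from the shared ensemble distribution.
    theta = pm.VonMises("theta", mu=mu_e, kappa=kappa_e, shape=len(observations))
    # Noisy observations of each stimulus, with fixed precision kappa_obs.
    pm.VonMises("x", mu=theta, kappa=kappa_obs, observed=observations)
    trace = pm.sample(5000, tune=2000)               # NUTS is PyMC3's default sampler

# Simple per-stimulus point estimates (the analysis used the posterior mode).
theta_estimates = trace["theta"].mean(axis=0)
```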
Data and code availability
All Python code required to reproduce data processing, analysis, modeling, and visualization is available at the GitHub repository for this project. The repository also contains both datasets in their original raw .mat format, as well as pickled binary versions (stored in Pandas DataFrames) suitable for analysis with Python. Installation and usage instructions are provided, and all code is thoroughly documented. 
Results
We analyzed behavioral data from two independent whole-report experiments to characterize within-trial report behavior and evaluate the applicability of encoding models developed in the context of single-report tasks. These whole-report experiments used two subtly different tasks (Figures 1A and 1B; see also Methods), so we began by replicating the continuous whole-report results of Adam et al. (2017) with discrete whole-report data. We found that cumulative error distributions were qualitatively similar despite differences in experimental setup, stimuli, timing, and report method. 
We then analyzed within-trial behavior and found that participants tended to group consecutive reports by color similarity—reports made later in a trial were often biased away from early-trial reports. This pattern of dependence was consistent across datasets, participants, and set sizes when reporting color but not when reporting orientation. 
Finally, we implemented two hierarchical Bayesian ensemble encoding models to determine whether color report dependencies could be explained by the joint encoding of within-trial stimuli. We simulated both whole-report tasks using both models and found no empirical evidence for the biases that either model predicted. 
Continuous and discrete whole-report error distributions are qualitatively similar
To verify that differences in task timing, stimuli, and report method between the continuous and discrete tasks did not impact participant behavior, we replicated the aggregate results of Adam et al. (2017) with the discrete whole-report dataset. When comparing data between tasks, we considered both report order conditions (participant- and randomly ordered) and restricted continuous task data to conditions using color stimuli (the discrete experiment did not include orientation stimuli). 
Error distributions from both tasks—collapsed across trials and participants—are shown in Figure 1. A classic delayed estimation result is that recall performance decreases as a function of set size, and Adam et al. (2017) found that continuous whole-report error distributions follow the same trend. Error distributions were less precise at higher set sizes (Figure 1C, upper) in both the participant- and randomly ordered report conditions. 
Discrete whole-report error distributions are qualitatively similar (Figure 1D, upper). Following Adam et al. (2017), we quantified this effect using mean resultant vector length (MRVL), a measure of circular dispersion. MRVL decreased as a function of set size in both report conditions, and repeated-measures analyses of variance (ANOVAs) showed a significant effect of set size on MRVL in both the participant-ordered condition, F(4, 52) = 339.66, p = 1.40 × 10–36, and the randomly ordered condition, F(4, 52) = 268.22, p = 5.00 × 10–34. 
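For reference, MRVL can be computed directly from circular report errors as the length of the mean unit vector; a one-line NumPy sketch is shown below, where values near 1 indicate concentrated errors and values near 0 indicate uniform dispersion.

```python
import numpy as np

def mean_resultant_vector_length(errors_rad):
    """Mean resultant vector length of circular report errors (in radians)."""
    return np.abs(np.mean(np.exp(1j * np.asarray(errors_rad))))
```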
Whole-report error distributions can be further separated by the order in which stimuli were reported within each trial. This is illustrated for set size six in the bottom panels of Figures 1C and 1D (see Supplementary Figure S1 for all set sizes). Adam et al. (2017) found that, when report order is freely selected by participants, MRVL decreases as a function of report number (reproduced in Supplementary Figure S2A). 
Discrete whole-report error distributions remain similar to continuous distributions when separated by report order. In the participant-ordered condition, MRVL decreased as a function of report number (Supplementary Figure S2B), and repeated-measures ANOVA showed a significant effect of report order for all set sizes. Notably, in this condition discrete whole-report participants were instructed to report stimuli “in order of confidence,” and participants in the continuous task appear to have done the same despite receiving no instruction. 
Within-trial color reports are not independent
As discussed above, the whole-report task allowed us to analyze the joint distribution of stimulus reports within a trial. Several models of delayed estimation assume that stimuli are encoded independently—if this is the case, then the joint distribution of within-trial reports should be uniform. 
Figure 2B shows example within-trial joint distributions for set size six of the participant-ordered continuous whole-report task. Each panel corresponds to the joint distribution between the first and the nth color reported within the same trial. Note that the joint distribution of two circular variables lies on the surface of a torus. This means that a point in the upper-left or bottom-right of panels in Figure 2B represents two values very near each other in circular space. Two dependencies are apparent: Participants tended to report similar colors for the first and second reports in a trial, and they tended to avoid reporting colors similar to the first with later reports. This is not unique to the first report, as consecutive reports are similar regardless of when in the trial they occurred. 
Another way to visualize this is to compute the relative angular distance between two colors reported in the same trial (Figure 2A). The upper panel of Figure 2C shows relative distance distributions computed for the continuous whole-report task at set size six. The pattern of dependencies described above is evident for the participant-ordered condition. 
Deviations from uniformity are also evident for the randomly ordered report condition, even though the corresponding distributions of presented values were uniform. Critically, this effect was not driven by a few participants making idiosyncratic reports but was consistent across all participants (Supplementary Figure S3) and set sizes (Supplementary Figure S4). 
Deviations from uniformity are evident for the discrete whole-report task, and patterns of dependence were qualitatively similar to those evident in the continuous task data, with one key difference—in both conditions of the discrete task, participants rarely repeated their first report later in the trial. This suggests that participants were not simply reporting the same color repeatedly but explicitly avoiding previous colors while consecutively reporting similar colors. 
We quantified dependence between within-trial reports using KL divergence, a measure of the statistical distance between two distributions. For each participant, report condition, and pair of reports, we estimated the KL divergence between the relative distance distribution and a circular uniform distribution. We compared these KL divergence estimates to a baseline divergence obtained by repeating this process using a size-matched circular uniform sample. This process was bootstrapped to obtain an estimate of uncertainty (see Methods for details), and median KL divergence estimates for each participant are shown alongside corresponding baseline estimates in the lower panels of Figures 2C and 2D. This analysis confirmed that distributions of within-trial relative distance often deviated from uniformity for all participants of both tasks. Divergences tended to be higher in the participant-ordered condition than in the randomly ordered condition and were highest for consecutive reports. 
Repeated-measures ANOVAs using estimate type (i.e., data vs. baseline) and report pair as within-subject factors confirmed a significant effect of estimate type on KL divergence in both the participant-ordered, F(1, 21) = 70.49, p = 3.76 × 10–8, and randomly ordered, F(1, 16) = 10.55, p = 0.005, conditions of the continuous task. The same analysis confirmed a significant effect in both the participant-ordered, F(1, 13) = 55.13, p = 5.0 × 10–6, and randomly ordered, F(1, 13) = 32.74, p = 7.0 × 10–5, conditions of the discrete task. These analyses also showed significant effects of report order, as well as an interaction between report order and estimate type. 
Within-trial orientation reports do not show clear dependence
Many models of delayed estimation tacitly assume that different types of “circular” stimuli—such as colored squares and oriented bars—are equivalently encoded into working memory. This is supported by qualitative similarities in error distributions in single-report delayed estimation tasks (Bays, Catalao, & Husain, 2009; van den Berg et al., 2012; Zhang & Luck, 2008). Adam et al. (2017) found that aggregate error distributions for orientation and color stimuli were qualitatively similar in the whole-report task. Error distributions for reported orientations—collapsed across reports and participants—are less precise at higher set sizes (Figure 3A, upper) in both the participant- and randomly ordered report conditions. They also found that, when report order was freely selected by participants, MRVL decreased as a function of report number, again mirroring results for color stimuli. 
Despite this similarity at the level of aggregate errors, we found marked differences between color and orientation when analyzing within-trial reports (Figures 3B and 3C). Relative distance distributions for orientation did not exhibit clear dependence between consecutive reports in the participant-ordered condition, nor did participants avoid repeating early reports later in the trial. As above, we quantified dependence between within-trial orientation reports using KL divergence, a measure of the difference between two distributions. Median KL divergence estimates for each participant are shown alongside corresponding baseline estimates in Figure 3C (results for color reports are overlayed in gray for comparison). Note that baseline KL divergence estimates are higher for the color distributions—this is because participants in the orientation task completed 200 trials per set size (vs. 99 for the color task). In contrast to the results for color (Figure 2), this analysis revealed that for most participants, deviations from uniform were either greatly reduced or absent. 
We performed repeated-measures ANOVAs using estimate type (i.e., data vs. baseline) and report pair as within-subject factors. This analysis found a significant effect of estimate type on KL divergence in both the participant-ordered, F(1, 29) = 39.10, p = 0.01 × 10–4, and randomly ordered, F(1, 18) = 34.12, p = 1.6 × 10–5, conditions, but no significant effects of report order or interaction between report order and estimate type. Although this suggests some dependence between orientation reports, this effect was greatly reduced in comparison to results for color. 
Hierarchical ensemble encoding predicts biases that are not present in whole-report data
As described above, we found clear evidence that within-trial color reports were not independent in the whole-report task. Participants appeared to group stimuli by color when reporting, raising the possibility that they were encoding higher order statistics of color displays. This is a proposal made by “ensemble” encoding models of VWM, where visual scenes are hierarchically encoded at multiple levels of abstraction and individual stimulus representations are marginally dependent due to shared ensemble statistics (Brady, Konkle, & Alvarez, 2009; Brady & Alvarez, 2011; Orhan & Jacobs, 2013). This introduces potentially unwanted biases but could reduce variance in stimulus estimates (Orhan & Jacobs, 2013) and improve encoding efficiency (Nassar, 2018). 
We implemented two hierarchical encoding models to simulate the whole-report task and test this possibility. The first was a simple adaptation of Brady and Alvarez (2011): a two-level HBM that encodes both individual stimuli and the ensemble statistics—mean and precision—of all stimuli in the display. This model makes the simple, testable prediction that color reports will be biased toward the mean color of the display. 
To address the possibility that participants may be encoding multiple groups of similar colors, we also adapted the hierarchical encoding model BFMM (Orhan & Jacobs, 2013). The BFMM has another level of abstraction where stimuli are grouped by similarity, and the statistics of each group are also encoded. This model predicts that color reports will be biased toward the mean of the group to which they belong. 
Hierarchical Bayesian model
The biases produced by an HBM with two levels of representation are illustrated in Figure 4A. In the HBM, an observer assumes that all stimuli in a trial are samples from the same underlying ensemble distribution, and this internal model effectively forms a prior over the values of all stimuli. When integrating over this prior and the uncertainty in individual stimulus encodings, the observer produces posterior estimates that are biased toward the ensemble mean. 
Figure 4.
 
Hierarchical Bayesian model. (A) Illustration of the HBM encoding three color stimuli (vertical dashed lines). The model infers that all colors are drawn from the same ensemble distribution (gray). The solid lines show posterior estimates for each stimulus value, and black arrows illustrate the resulting bias toward the ensemble mean. (B) Justification for free parameter κobs, the precision of noisy observations. Left: Error distributions for color reports at set size 1, with 360 error values divided into 90 bins for visualization. Maximum-likelihood von Mises distribution fit is overlayed. Right: Median bootstrapped estimate of the precision of a von Mises fit to set size one report errors for each participant.
The HBM implemented here assumes that individual stimulus and ensemble encodings take the form of a von Mises (or “circular normal”) distribution, each parameterized by a mean µ and precision κ. These parameters are all inferred from presented stimuli via Bayesian inference (see Methods for full equations and priors). The model also assumes that stimulus values are only available as a noisy observation sampled from a von Mises distribution with precision κobs centered on the true value. κobs is the only free parameter of the HBM and is assumed to represent the combined effect of sensory and memory noise (Orhan & Jacobs, 2013). Intuitively, lower values of κobs result in noisier individual representations and a greater bias toward the ensemble mean. 
To estimate reasonable values for κobs, we took advantage of continuous whole-report error distributions at set size one, where there are theoretically no higher order statistics to encode. We fit a von Mises distribution to each participant's report errors at set size one, giving us a range of plausible κobs values (Figure 4B). For all HBM and BFMM results presented here, simulations were conducted in triplicate using κobs = 5, 10, and 20. 
Here, we focus on simulation results from set size three, reasoning that the HBM would tend to infer a more precise generative distribution—and therefore predict stronger biases—with fewer stimuli. HBM simulations of the continuous and discrete whole-report tasks are presented in Figures 5A and 5C, respectively. Report biases can be visualized by comparing two angles: (1) the angular distance between a presented color and the trial ensemble mean, and (2) the error between presented and reported colors. A correlation between the signs of these two angles provides evidence that reported colors are biased toward the ensemble mean. 
Figure 5.
 
HBM simulation results for set size three. (A, upper) Biases predicted by HBM simulations of the continuous task. Mean report error plotted as a function of the reported color's distance to the ensemble mean. 360 error values are divided into 90 bins for visualization, and the shaded area shows standard deviations. (A, lower) OLS regression of simulated report error on distance to ensemble mean (dashed black lines). (B, upper) Empirical biases for the continuous task at set size three (collapsed across all reports). (B, lower) OLS regression of empirical report error on distance to ensemble mean. Each column shows results for a different report number. (C) Same as (A) but for simulations of the discrete task. (D) Same as (B) but for empirical results of the discrete task.
As expected, HBM simulations of both tasks demonstrate a clear bias toward the ensemble mean. Bias magnitude was a function of the distance to the mean color—the farther a stimulus was from the ensemble mean, the greater the reported bias toward the mean. Bias magnitude was also dependent on the choice of κobs, as the lowest κobs value resulted in the greatest bias and vice versa. We quantified bias using ordinary least-squares (OLS) regression of report error on the distance to ensemble mean. This analysis was restricted to include only mean distances between –π/2 and π/2, where the bias function was approximately linear (see Methods for details). Regression was done at two levels: (1) all simulated errors for a given set size, collapsed across reports; and (2) simulated errors separated by report number. Note that reports were not simulated sequentially—neither the HBM nor the BFMM includes a temporal or sequential component. For visualization of both simulations and data, the bias functions in Figure 5 include all reports, but the simulation regression plots in Figures 5A and 5C include only the “first” simulated report to facilitate comparison with data regressions, which are separated by report order. 
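A minimal sketch of this regression is shown below, assuming statsmodels; the restriction to mean distances between –π/2 and π/2 follows the text, while the function name and return values are ours.

```python
import numpy as np
import statsmodels.api as sm

def bias_regression(dist_to_mean, report_error):
    """Regress report error on distance to the ensemble mean, within the near-linear range."""
    dist_to_mean = np.asarray(dist_to_mean)
    report_error = np.asarray(report_error)
    keep = np.abs(dist_to_mean) < np.pi / 2            # approximately linear range only
    X = sm.add_constant(dist_to_mean[keep])            # intercept + distance-to-mean predictor
    fit = sm.OLS(report_error[keep], X).fit()
    return fit.params[1], fit.rsquared, fit.pvalues[1]  # slope, R^2, slope p-value
```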
OLS regression was significant for all κobs in continuous whole-report HBM simulations at set size three (collapsed across reports). The largest estimated effect was for κobs = 5 (R2 = 0.278, mean distance coefficient = 0.428, p = 2.97 × 10–124), and the smallest estimated effect was for κobs = 20 (R2 = 0.053, mean distance coefficient = 0.078, p = 3.63 × 10–63). OLS regressions were significant when simulated reports were separated by report order. 
OLS regression was also significant for all κobs in discrete whole-report HBM simulations at set size three (collapsed across reports). The largest estimated effect was for κobs = 5 (R2 = 0.252, mean distance coefficient = 0.454, p = 4.03 × 10–104), and the smallest estimated effect was for κobs = 20 (R2 = 0.024, mean distance coefficient = 0.053, p = 2.73 × 10–10). OLS regressions were significant when simulated reports were separated by report order. 
Empirical bias functions for the continuous and discrete whole-report tasks are shown in Figures 5B and 5D, respectively. Empirical report errors were more broadly distributed (less accurate) than simulated results, but report error was comparable when only the first report in each trial was considered (Figures 5B and 5D, first column). 
We repeated the regression analysis using experimental data from both report conditions of both tasks at set size three. Regression was significant for the participant-ordered continuous data (collapsed across reports), but the estimated mean distance coefficient was negative, indicating a bias away from the ensemble mean (R2 = 0.01, mean distance coefficient = –0.050, p = 0.004). We also found significant negative biases for the first and third reports when the data were separated by report order. There were no significant regression results for the randomly ordered condition of the continuous whole-report task at set size three or for either condition of the discrete whole-report task. 
We also simulated both report conditions of both tasks at set size six and repeated the OLS regression analysis for both simulated and experimental reports. In comparison to set size three simulations, the HBM tended to infer less precise generative distributions, resulting in smaller predicted biases. These biases were still significant for all continuous task simulations but were not significant for most discrete task simulations—likely because the predicted biases were smaller than the discrete stimulus increment (π/4) and were lost when estimates were rounded for report. 
There were several significant regression results for experimental data from both tasks at set size six. Similar to set size three, these regressions all estimated small negative mean distance coefficients. Finally, we repeated all analyses above with experimental data from the continuous task with orientation stimuli (Supplementary Figure S5). We found no significant regressions with positive coefficients in either report condition at set size three or set size six. In summary, we found no empirical evidence for a report bias toward within-trial mean stimulus values. 
Bayesian finite mixture model
The HBM presented above assumes that all stimuli are generated from a single distribution. The limitations of this approach are clear when considering the HBM illustration in Figure 4. Although encoding orange and red together is intuitive, it is not obvious that the blue stimulus should be included. It is possible that participants cluster similar stimuli together and encode ensemble statistics of these clusters in addition to statistics of the entire display (Brady & Alvarez, 2011; Orhan & Jacobs, 2013). 
To address this, we adapted a flexible generalization of Brady and Alvarez's HBM, the BFMM (Orhan & Jacobs, 2013). The BFMM assumes that each stimulus value is generated from one of K weighted von Mises components (Figure 6A shows an example where K = 2). Component weights and parameters are determined via Bayesian inference, so sampling from the posterior of the BFMM can be thought of as performing data-driven clustering (see Methods for the full model and priors). 
Figure 6.
 
Bayesian finite mixture model. (A) Illustration of a BFMM with K = 2 components encoding six color stimuli (vertical dashed lines). The model infers that colors were generated from a mixture of two von Mises components (gray). Solid lines show posterior estimates for each stimulus value, and black arrows illustrate resulting biases toward component means. (B) Impact of κobs on clustering in a BFMM with K = 6 components. Upper: Distribution of the number of components that the BFMM infers generated colors on each trial of the simulated continuous whole-report task at set size six. Higher values of κobs cause the BFMM to assume a higher number of more precise components. Lower: Distribution of distance to inferred component mean for all simulated reports.
K (the maximum number of components) was fixed to equal the simulated set size, and the same values of κobs (the precision of noisy observations) were used. Here, κobs has a subtly different impact on simulation results than in the HBM (Figure 6B). At lower values of κobs, the model tends to infer that stimuli were generated from a small number of low-precision components, whereas at higher values of κobs the model tends to infer a greater number of higher precision components (up to a maximum of K). 
BFMM simulation results were very similar to the HBM simulation results described above (Figure 7). To visualize bias, we plotted report error as a function of the distance to the mean of the component that the model assumes is responsible for generating a given stimulus. As expected, simulated reports for both the continuous and discrete tasks demonstrate a clear bias toward their respective component means. As in the HBM simulations above, bias magnitude was greater for larger distances to the component mean and for smaller values of κobs. Overall, BFMM report biases were greater than HBM biases. 
Figure 7.
 
BFMM simulation results for set size six. (A) OLS regression of simulated continuous report error on distance to component mean (dashed black lines). (B) Same as (A) but for discrete task simulations. (C) OLS regression of empirical report error on distance to component mean. Each column shows results for a different report number. (D) Same as (C) but for empirical results of the discrete task.
To quantify bias, we again used OLS regression of report error on the distance to component mean. As above, we restricted this analysis to include only mean distances between –π/2 and π/2 (in practice, this included nearly all reports; Figure 6B). 
OLS regression was significant for all κobs in continuous whole-report BFMM simulations at set size six (collapsed across reports). The largest estimated effect was for κobs = 5 (R2 = 0.725, mean distance coefficient = 0.725, p < 1.0 × 10–300), and the smallest estimated effect was for κobs = 20 (R2 = 0.046, mean distance coefficient = 0.429, p < 1.0 × 10–300). OLS regressions were also significant when simulated reports were separated by report order. OLS regression was also significant for all κobs in discrete whole-report BFMM simulations at set size six (collapsed across reports). 
The largest estimated effect was for κobs = 5 (R2 = 0.644, mean distance coefficient = 0.722, p < 1.0 × 10–300), and the smallest estimated effect was for κobs = 20 (R2 = 0.207, mean distance coefficient = 0.282, p = 1.41 × 10–247). OLS regressions were also significant when simulated reports were separated by report order. 
We repeated the OLS regressions using empirical data from both report conditions of both tasks at set size six. Regressions were not significant for either report condition of either task when data were collapsed across reports. 
When separated by report order, three regressions reached significance for the continuous whole-report task: the third and sixth report in the participant-ordered condition and the third report of the randomly ordered condition. These regressions all estimated small negative coefficients for the distance to the component mean. There were no significant regression results for either condition of the discrete whole-report task when data were separated by report order. 
Finally, we repeated all analyses above with experimental data from the continuous task with orientation stimuli (Supplementary Figure S6) and found no significant regressions with positive coefficients. Taken together, the many regressions we performed suggest that color reports were not consistently biased toward inferred ensemble means in any condition of the whole-report task. 
Discussion
Here, we analyzed data from two whole-report VWM experiments to characterize within-trial behavior and test the applicability of delayed estimation models. In contrast to models that assume independent encoding, we found that within-trial color reports were not independent. We also showed that this effect is either weaker or absent in task conditions using orientation stimuli, calling into question the assumption that color and orientation are equivalently encoded variables. Finally, we implemented two hierarchical Bayesian ensemble encoding models that explicitly include dependence between encoded stimuli and found no empirical evidence for the biases that they predict. 
We began by replicating the continuous whole-report results reported by Adam et al. (2017) with data from our own discrete whole-report dataset. Although we successfully replicated several key results, our analyses of within-trial behavior in both datasets raise a potential issue with their conclusions. Specifically, Adam et al. found that late-trial report distributions were best described as uniform and that participants typically self-reported that late-trial color reports were guesses. Reasoning that participants had no information about several items on each trial, they concluded that this was clear evidence for a fixed item limit in VWM. This has proven controversial, as some have argued that apparently uniform distributions could reflect very low-precision retrieval (Schneegans et al., 2020), and others have failed to reproduce late-trial uniform distributions altogether (Oberauer, 2022). Our results raise a different issue, as we found that late-trial color reports were consistently biased away from early reports, which suggests that participants had some information about the colors they were reporting—even when self-reporting guesses. 
Our analyses addressed the possibility that within-trial dependencies reflect hierarchical Bayesian ensemble encoding, but it is also possible that participants used explicit or implicit task strategies that introduced dependency. For example, discrete task participants tended to consecutively report similar but not identical colors, suggesting that they were exploiting the absence of stimulus repeats within each trial. It has also been shown that certain whole-report stimuli can induce structured, non-uniform guessing in late-trial reports (Ngiam, Foster, Adam, & Awh, 2023), and we consider it probable that whole-report error distributions reflect both “memory” and “non-memory” processes. Unfortunately, post hoc attribution of behavior to one process or the other is not always straightforward: What if a participant intentionally guesses a color because they do not remember it? More generally, distinguishing behavior that reflects encoding structure from behavior that reflects task strategy raises challenges for the interpretation of VWM tasks, which we revisit below. 
Adam et al. (2017) also reported very similar results for whole-report conditions using color stimuli and those using orientation stimuli, and they fit a set of models that assume a generic circular variable is encoded. Modeling color and orientation as equivalent is common (Schneegans et al., 2020; van den Berg et al., 2012), but recent work has challenged the treatment of color as a circular variable (Schurgin et al., 2020). Although our results do not directly address the nature of color representations, the behavioral differences we identified between color and orientation conditions highlight the need to consider stimulus-dependent effects. It is also worth noting that there is considerable variation in visual working memory stimuli even within the “same” stimulus type—few studies use standardized screen calibration methods for color (Bae, Allred, Wilson, & Flombaum, 2014), and orientation stimuli include Gabor patches ranging from –π/2 to π/2 (van den Berg et al., 2012) as well as various oriented “clock hands” and triangles ranging from –π to π (Adam et al., 2017; Bae & Luck, 2017; Utochkin & Brady, 2020). 
Our focus on ensemble encoding as a potential cause of within-trial report dependencies was motivated by a rich body of VWM literature. The assumption of independent stimulus encoding has been challenged before using a variety of single-report working memory tasks (Bae & Luck, 2017; Brady et al., 2009; Brady & Alvarez, 2011; Jiang, Olson, & Chun, 2000; Kahana & Sekuler, 2002; Lew & Vul, 2015; Nassar, 2018; Orhan & Jacobs, 2013) and at least one whole-report task (Utochkin & Brady, 2020), so it was unsurprising that we found evidence for dependence in the whole-report task. More surprising was that the hierarchical encoding models we implemented predicted report biases that were not present in either whole-report dataset. Our regression analyses of empirical reports failed to reproduce the consistent positive regression coefficients (report bias toward inferred ensemble or component mean values) that were present in HBM and BFMM simulations. A small number of regressions did yield significant negative coefficients, which could be interpreted as bias away from ensemble or component means, but the effect was not consistent enough for us to draw such a conclusion. These results contrast with previous studies showing report biases toward mean stimulus size (Brady & Alvarez, 2011), mean horizontal position (Orhan & Jacobs, 2013), and mean two-dimensional spatial position (Lew & Vul, 2015) that were accounted for by very similar hierarchical Bayesian encoding models. This discrepancy could be explained by differences in stimulus type and task structure, but that interpretation is complicated by the results of Utochkin and Brady (2020), in which reports were biased toward the mean orientation in a whole-report task. 
It is important to note that ensemble encoding can take many forms and that our results only contradict ensemble encoding models that assume a specific hierarchical Bayesian generative process. There are types of report bias—such as repulsion from dissimilar colors (Nassar, 2018)—that are not accounted for by hierarchical encoding but can still be considered evidence for ensemble encoding. Furthermore, a recent analysis of the same continuous whole-report dataset from Adam et al. (2017) found evidence for systematic biases that were explained by a non-hierarchical chunking model (Chunharas & Brady, 2023). 
Specifically, Chunharas and Brady (2023) showed that, in trials identified by the model as chunkable, early color reports were biased toward the “gist” color and later color reports were biased away from the “gist” color. The “gist” in their analysis differs subtly from the ensemble and component means inferred by the hierarchical Bayesian models we implemented in that it involves the convolution of an empirically derived psychophysical similarity function (Schurgin et al., 2020) with presented stimulus values. This method could be considered an empirical or phenomenological approach to identifying ensemble effects, in contrast to hierarchical Bayesian encoding models (Brady & Alvarez, 2011; Lew & Vul, 2015; Orhan & Jacobs, 2013) that make specific assumptions about the encoding structure used by participants. In our view, our results are compatible with those reported by Chunharas and Brady (2023); although we did not find evidence for hierarchical encoding, the dependencies we identified are consistent with ensemble encoding more broadly, and we agree with their rejection of independent-item-based accounts of visual working memory. 
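To make this distinction concrete, the following minimal sketch (Python) illustrates one way a trial “gist” could be computed by convolving a similarity kernel with the presented stimulus values on the color wheel. The exponential-decay kernel and its 30-degree width are illustrative assumptions; Chunharas and Brady (2023) instead use the empirically derived psychophysical similarity function of Schurgin et al. (2020) and additional criteria for identifying chunkable trials.

import numpy as np

def trial_gist(stim_deg, kernel_width=30.0):
    # Candidate gist colors on the 360-degree color wheel
    grid = np.arange(360)
    # Circular distance (degrees) from each candidate color to each stimulus
    d = np.abs((grid[:, None] - np.asarray(stim_deg)[None, :] + 180) % 360 - 180)
    # Assumed exponential-decay similarity kernel (stand-in for an empirical
    # psychophysical similarity function)
    similarity = np.exp(-d / kernel_width)
    # The gist is the candidate color with the greatest summed similarity
    return grid[np.argmax(similarity.sum(axis=1))]

print(trial_gist([10, 25, 40, 200, 215, 350]))  # returns a color near the 350-40 cluster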
Conclusions
We analyzed data from two whole-report visual working memory tasks and identified behavior that cannot be accounted for by models that assume independent encoding, equivalence between color and orientation stimuli, or hierarchical ensemble encoding. Given that these models were developed to fit single-report delayed estimation data and do not include sequential report mechanisms, it might be tempting to attribute such discrepancies to the idiosyncratic nature of the whole-report task. After all, the whole-report task plausibly suffers from the “task impurity” problem (Burgess, 1997) because it involves too many cognitive processes (e.g., attention, decision-making, strategies) that are incidental to the structure of VWM encodings. 
This problem is not unique to the current study and raises a difficult dilemma. Should we reject more complex working memory tasks in the pursuit of “pure” working memory encoding models that may not generalize, or should we attempt to model the many processes responsible for idiosyncratic phenomena specific to the whole-report task? There is growing recognition in the field that the proliferation of diverse VWM tasks, models, and empirical findings is outstripping our ability to develop theory capable of synthesizing them (Ngiam, 2024; Oberauer et al., 2018; Popov, 2023). In this context, neither option seems appropriate. Instead, we echo calls for a more integrative approach to theory development (Ngiam, 2024) and caution against inferring fundamental principles of visual working memory using data from any individual laboratory task. 
Acknowledgments
The authors sincerely thank K. Adam, E. Vogel, and E. Awh for publicly sharing their experimental data and thereby making this project possible. 
Financial support for this project was provided by the Natural Sciences and Engineering Research Council of Canada and the Canada Foundation for Innovation. 
Commercial relationships: none. 
Corresponding author: Benjamin Cuthbert. 
Email: b.cuthbert@queensu.ca. 
Address: Centre for Neuroscience Studies, Kingston, ON K7L 3N6, Canada. 
References
Adam, K. C. S., Mance, I., Fukuda, K., & Vogel, E. K. (2015). The contribution of attentional lapses to individual differences in visual working memory capacity. Journal of Cognitive Neuroscience, 27(8), 1601–1616, https://doi.org/10.1162/jocn_a_00811.
Adam, K. C. S., & Vogel, E. K. (2018). Improvements to visual working memory performance with practice and feedback. PLoS One, 13(8), e0203279, https://doi.org/10.1371/journal.pone.0203279.
Adam, K. C. S., Vogel, E. K., & Awh, E. (2017). Clear evidence for item limits in visual working memory. Cognitive Psychology, 97, 79–97, https://doi.org/10.1016/j.cogpsych.2017.07.001.
Alvarez, G. A., & Oliva, A. (2009). Spatial ensemble statistics are efficient codes that can be represented with reduced attention. Proceedings of the National Academy of Sciences, USA, 106(18), 7345–7350, https://doi.org/10.1073/pnas.0808981106.
Bae, G., Allred, S. R., Wilson, C., & Flombaum, J. I. (2014). Stimulus-specific variability in color working memory with delayed estimation. Journal of Vision, 14(4):7, 1–23, https://doi.org/10.1167/14.4.7.
Bae, G.-Y., & Luck, S. J. (2017). Interactions between visual working memory representations. Attention, Perception, & Psychophysics, 79(8), 2376–2395, https://doi.org/10.3758/s13414-017-1404-8.
Bays, P. M., Catalao, R. F. G., & Husain, M. (2009). The precision of visual working memory is set by allocation of a shared resource. Journal of Vision, 9(10):7, 1–11, https://doi.org/10.1167/9.10.7.
Bays, P. M., & Husain, M. (2008). Dynamic shifts of limited working memory resources in human vision. Science, 321(5890), 851–854, https://doi.org/10.1126/science.1158023.
Brady, T. F., & Alvarez, G. A. (2011). Hierarchical encoding in visual working memory: Ensemble statistics bias memory for individual items. Psychological Science, 22(3), 384–392, https://doi.org/10.1177/0956797610397956.
Brady, T. F., Konkle, T., & Alvarez, G. A. (2009). Compression in visual working memory: Using statistical regularities to form more efficient memory representations. Journal of Experimental Psychology: General, 138(4), 487–502, https://doi.org/10.1037/a0016797.
Burgess, P. (1997). Theory and methodology in executive function research. In Rabbitt, P. (Ed.), Methodology of frontal and executive function (pp. 87–121). London: Routledge.
Chunharas, C., & Brady, T. (2023). Chunking, attraction, repulsion and ensemble effects are ubiquitous in visual working memory. PsyArXiv, https://doi.org/10.31234/osf.io/es3b8.
Cuthbert, B., Standage, D., Paré, M., & Blohm, G. (2018). Strategic working memory performance may confound the interpretation of cumulative task statistics. Journal of Vision, 18(10), 685, https://doi.org/10.1167/18.10.685.
deBettencourt, M. T., Keene, P. A., Awh, E., & Vogel, E. K. (2019). Real-time triggering reveals concurrent lapses of attention and working memory. Nature Human Behaviour, 3(8), 808–816, https://doi.org/10.1038/s41562-019-0606-6.
Fougnie, D., Suchow, J. W., & Alvarez, G. A. (2012). Variability in the quality of visual working memory. Nature Communications, 3(1), 1229, https://doi.org/10.1038/ncomms2237.
Hao, Y., Li, X., Zhang, H., & Ku, Y. (2020). Free-recall benefit, inhomogeneity and between-item interference in working memory. PsyArXiv, https://psyarxiv.com/b69m8/.
Hao, Y., Li, X., Zhang, H., & Ku, Y. (2021). Free-recall benefit, inhomogeneity and between-item interference in working memory. Cognition, 214, 104739, https://doi.org/10.1016/j.cognition.2021.104739.
Hoffman, M. D., & Gelman, A. (2014). The No-U-Turn Sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(47), 1593–1623.
Jiang, Y., Olson, I. R., & Chun, M. M. (2000). Organization of visual short-term memory. Journal of Experimental Psychology: Learning, Memory, and Cognition, 26(3), 683–702, https://doi.org/10.1037//0278-7393.26.3.683.
Kahana, M. J., & Sekuler, R. (2002). Recognizing spatial patterns: A noisy exemplar approach. Vision Research, 42(18), 2177–2192, https://doi.org/10.1016/S0042-6989(02)00118-9.
Killebrew, K. W., Gurariy, G., Peacock, C. E., Berryhill, M. E., & Caplovitz, G. P. (2018). Electrophysiological correlates of encoding processes in a full-report visual working memory paradigm. Cognitive, Affective, & Behavioral Neuroscience, 18(2), 353–365, https://doi.org/10.3758/s13415-018-0574-8.
Kleiner, M., Brainard, D., Pelli, D., Ingling, A., Murray, R., & Broussard, C. (2007). What's new in Psychtoolbox-3. Perception, 36(14), 1–16.
Lew, T. F., & Vul, E. (2015). Ensemble clustering in visual working memory biases location memories and reduces the Weber noise of relative positions. Journal of Vision, 15(4):10, 1–14, https://doi.org/10.1167/15.4.10.
Nassar, M. R. (2018). Chunking as a rational strategy for lossy data compression in visual working memory. Psychological Review, 125(4), 486, https://doi.org/10.1037/rev0000101.
Ngiam, W. X. Q. (2024). Mapping visual working memory models to a theoretical framework. Psychonomic Bulletin & Review, 31(2), 442–459, https://doi.org/10.3758/s13423-023-02356-5.
Ngiam, W. X. Q., Foster, J. J., Adam, K. C. S., & Awh, E. (2023). Distinguishing guesses from fuzzy memories: Further evidence for item limits in visual working memory. Attention, Perception, & Psychophysics, 85(5), 1695–1709, https://doi.org/10.3758/s13414-022-02631-y.
Oberauer, K. (2017). An interference model of visual working memory. Psychological Review, 124(1), 21, https://doi.org/10.1037/rev0000044.
Oberauer, K. (2022). Little support for discrete item limits in visual working memory. Psychological Science, 33(7), 1128–1142, https://doi.org/10.1177/09567976211068045.
Oberauer, K., Lewandowsky, S., Awh, E., Brown, G. D. A., Conway, A., Cowan, N., ... Ward, G. (2018). Benchmarks for models of short-term and working memory. Psychological Bulletin, 144(9), 885, https://doi.org/10.1037/bul0000153.
Orhan, A. E., & Jacobs, R. A. (2013). A probabilistic clustering theory of the organization of visual short-term memory. Psychological Review, 120(2), 297–328, https://doi.org/10.1037/a0031541.
Peters, B., Rahm, B., Czoschke, S., Barnes, C., Kaiser, J., & Bledowski, C. (2018). Sequential whole report accesses different states in visual working memory. Journal of Experimental Psychology: Learning, Memory, and Cognition, 44(4), 588–603, https://doi.org/10.1037/xlm0000466.
Peters, B., Rahm, B., Kaiser, J., & Bledowski, C. (2019). Differential trajectories of memory quality and guessing across sequential reports from working memory. Journal of Vision, 19(7):3, 1–13, https://doi.org/10.1167/19.7.3.
Popov, V. (2023). If god handed us the ground-truth theory of memory, how would we recognize it? PsyArXiv, https://doi.org/10.31234/osf.io/ay5cm.
Robison, M. K., & Unsworth, N. (2019). Pupillometry tracks fluctuations in working memory performance. Attention, Perception, & Psychophysics, 81(2), 407–419, https://doi.org/10.3758/s13414-018-1618-4.
Salvatier, J., Wiecki, T. V., & Fonnesbeck, C. (2016). Probabilistic programming in Python using PyMC3. PeerJ Computer Science, 2, e55, https://doi.org/10.7717/peerj-cs.55.
Schneegans, S., Taylor, R., & Bays, P. M. (2020). Stochastic sampling provides a unifying account of visual working memory limits. Proceedings of the National Academy of Sciences, USA, 117(34), 20959–20968, https://doi.org/10.1073/pnas.2004306117.
Schurgin, M. W., Wixted, J. T., & Brady, T. F. (2020). Psychophysical scaling reveals a unified theory of visual memory strength. Nature Human Behaviour, 4(11), 1156–1172, https://doi.org/10.1038/s41562-020-00938-0.
Swan, G., & Wyble, B. (2014). The Binding Pool model of VWM: A model for storing individuated objects in a shared resource pool. Journal of Vision, 14(10), 160, https://doi.org/10.1167/14.10.160.
Udale, R., Gramm, K., Husain, M., & Manohar, S. G. (2021). How does working memory store more information at larger set sizes? A composite code model. PsyArXiv, https://doi.org/10.31234/osf.io/ndq9e.
Utochkin, I. S., & Brady, T. F. (2020). Individual representations in visual working memory inherit ensemble properties. Journal of Experimental Psychology: Human Perception and Performance, 46(5), 458–473, https://doi.org/10.1037/xhp0000727.
van den Berg, R., Shin, H., Chou, W.-C., George, R., & Ma, W. J. (2012). Variability in encoding precision accounts for visual short-term memory limitations. Proceedings of the National Academy of Sciences, USA, 109(22), 8780–8785, https://doi.org/10.1073/pnas.1117465109.
Wilken, P., & Ma, W. J. (2004). A detection theory account of change detection. Journal of Vision, 4(12):11, 1120–1125, https://doi.org/10.1167/4.12.11.
Zhang, W., & Luck, S. J. (2008). Discrete fixed-resolution representations in visual working memory. Nature, 453(7192), 233–235, https://doi.org/10.1038/nature06860.
Figure 1.
 
Whole-report task comparison. (A) Continuous whole-report task. Colored squares were briefly presented, and after a blank delay period the participant reported all colors by clicking a color wheel. Report order was either participant selected or randomly generated. Stimuli were sampled from 360 “continuous” color values. (B) Discrete whole-report task. Same as (A), but participants reported by clicking a square in the response array (inset). Stimuli were sampled from eight “discrete” color values. (C) Continuous error distributions, where 360 possible error values are divided into 90 bins for visualization. Orange and blue histograms show errors from participant-selected and randomly generated report conditions, respectively. In the upper panel, all report errors are separated by set size; the lower panel shows set size six report errors, separated by report order. (D) Discrete error distributions. Same as (C), but bins correspond to all possible error values.
Figure 2.
 
Dependence of within-trial color reports. (A) Relative distance calculation, which was identical for both tasks. (B) Example joint distribution of reported stimulus values. Data from the participant-ordered continuous whole-report task (set size six; aggregated across all participants). Each panel plots the first color reported on each trial against a later report in the same trial. (C, upper) Distribution of relative distances for the continuous task (set size six). (C, lower) Median bootstrapped estimates of the KL divergence between each participant's relative distance distribution and size-matched samples from a uniform distribution. (D) Same as (C) but for the discrete whole-report task (set size six).
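For reference, a minimal sketch of the bootstrapped KL divergence estimate shown in (C) and (D) could look as follows (Python/numpy). The bin count, bootstrap procedure, and smoothing constant are illustrative assumptions rather than the exact analysis code.

import numpy as np

def median_kl_vs_uniform(rel_dist, n_bins=90, n_boot=1000, seed=0):
    # rel_dist: one participant's relative distances (radians) in (-pi, pi]
    rng = np.random.default_rng(seed)
    edges = np.linspace(-np.pi, np.pi, n_bins + 1)
    eps = 1e-9  # smoothing to avoid log(0) in empty bins
    p, _ = np.histogram(rel_dist, bins=edges)
    p = (p + eps) / (p + eps).sum()
    kls = []
    for _ in range(n_boot):
        # Size-matched sample from a uniform distribution
        q, _ = np.histogram(rng.uniform(-np.pi, np.pi, len(rel_dist)), bins=edges)
        q = (q + eps) / (q + eps).sum()
        kls.append(np.sum(p * np.log(p / q)))
    return np.median(kls)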
Figure 3.
 
Continuous whole-report results for orientation stimuli. (A) Orientation error distributions, where 360 possible error values are divided into 90 bins for visualization. Green and purple histograms show errors from participant-selected and randomly generated report conditions, respectively. The upper panel shows all report errors, separated by set size; the lower panel shows set size six report errors, separated by report order. (B) Distribution of relative distances for orientation reports (set size six). Color results are overlaid in black for comparison. (C) Median bootstrapped estimates of the KL divergence between each participant's relative distance distribution and size-matched samples from a uniform distribution. Color results are overlaid in gray for comparison.
Figure 4.
 
Hierarchical Bayesian model. (A) Illustration of the HBM encoding three color stimuli (vertical dashed lines). The model infers that all colors are drawn from the same ensemble distribution (gray). The solid lines show posterior estimates for each stimulus value, and black arrows illustrate the resulting bias toward the ensemble mean. (B) Justification for the free parameter κobs, the precision of noisy observations. Left: Error distributions for color reports at set size one, with 360 error values divided into 90 bins for visualization. The maximum-likelihood von Mises distribution fit is overlaid. Right: Median bootstrapped estimate of the precision of a von Mises fit to set size one report errors for each participant.
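A minimal sketch of the precision estimate in (B, right) could look as follows (Python, using scipy). Fixing the scale parameter at one yields a maximum-likelihood fit of a standard von Mises distribution; the bootstrap size and function names are illustrative assumptions rather than the exact analysis code.

import numpy as np
from scipy.stats import vonmises

def median_kappa(errors_rad, n_boot=1000, seed=0):
    # errors_rad: one participant's set size one report errors (radians)
    errors_rad = np.asarray(errors_rad)
    rng = np.random.default_rng(seed)
    kappas = []
    for _ in range(n_boot):
        sample = rng.choice(errors_rad, size=len(errors_rad), replace=True)
        kappa, loc, scale = vonmises.fit(sample, fscale=1)  # ML fit of a standard von Mises
        kappas.append(kappa)
    return np.median(kappas)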
Figure 5.
 
HBM simulation results for set size three. (A, upper) Biases predicted by HBM simulations of the continuous task. Mean report error plotted as a function of the reported color's distance to the ensemble mean, where 360 possible error values are divided into 90 bins for visualization and the shaded area shows standard deviations. (A, lower) OLS regression of simulated report error on distance to ensemble mean (dashed black lines). (B, upper) Empirical biases for the continuous task at set size three (collapsed across all reports). (B, lower) OLS regression of empirical report error on distance to ensemble mean. Each column shows results for a different report number. (C) Same as (A) but for simulations of the discrete task. (D) Same as (B) but for empirical results of the discrete task.
Figure 6.
 
Bayesian finite mixture model. (A) Illustration of a BFMM with K = 2 components encoding six color stimuli (vertical dashed lines). The model infers that colors were generated from a mixture of two von Mises components (gray). Solid lines show posterior estimates for each stimulus value, and black arrows illustrate resulting biases toward component means. (B) Impact of κobs on clustering in a BFMM with K = 6 components. Upper: Distribution of the number of components that the BFMM infers generated colors on each trial of the simulated continuous whole-report task at set size six. Higher values of κobs cause the BFMM to assume a higher number of more precise components. Lower: Distribution of distance to inferred component mean for all simulated reports.
Figure 7.
 
BFMM simulation results for set size six. (A) OLS regression of simulated continuous report error on distance to component mean (dashed black lines). (B) Same as (A) but for discrete task simulations. (C) OLS regression of empirical report error on distance to component mean. Each column shows results for a different report number. (D) Same as (C) but for empirical results of the discrete task.