Perceptual confidence is thought to arise from metacognitive processes that evaluate the underlying perceptual decision evidence. We investigated whether metacognitive access to perceptual evidence is constrained by the hierarchical organization of visual cortex, where high-level representations tend to be more readily available for explicit scrutiny. We found that the ability of human observers to evaluate their confidence did depend on whether they performed a high-level or low-level task on the same stimuli, but it was also affected by manipulations that occurred long after the perceptual decision. Confidence in low-level perceptual decisions degraded with more time between the decision and the response cue, especially in the presence of backward masking. Confidence in high-level tasks was immune to backward masking and benefitted from additional time. These results can be explained by a model in which confidence relies heavily on postdecisional internal representations of visual stimuli that degrade over time, with high-level representations persisting longer.

(https://osf.io/6brmt/). Ethical approval was granted by the local ethics committee (CER U-Paris), and the protocols adhered to the Declaration of Helsinki. The preregistrations were followed in terms of task methodology, sampling plan, and modelling approach, but we note that the preregistered analysis plan allowed some flexibility in the choice of statistical tests, which we explain in more detail in the analysis section.

^{2}), subtending 14° of visual angle. The original eyes were replaced with realistic counterparts to manipulate the direction of gaze and relative contrast of the irises according to precise angular coordinates and gray levels. The high-level task was to discriminate left from right gaze direction (with random contrast difference), and the low-level task was to discriminate whether the left or right eye was “darker” (higher contrast, with random gaze directions), using the left and right arrow keys of a standard keyboard.

µ_{s}, and the observer has a noisy sensory representation of this evidence, *s*, on which to base their perceptual decisions, \(s = {\mu _s} + {\varepsilon _s}\), where the sensory noise ε_{s} is drawn from a zero-mean Gaussian with variance \(\sigma _s^2\). The ideal confidence observer is defined as relying on this same evidence, normalised to the perceptual response criterion, *c*, and sensory noise, σ_{s}, to form the confidence evidence, *w*, upon which a confidence judgment is made, \(w = ( {s - c} )/{\sigma _s}\). The actual confidence evidence is further corrupted by confidence noise, ε_{c}, which is drawn from a zero-mean Gaussian with independent variance, \(\sigma _c^2\). The simplest suboptimal confidence observer, with confidence noise \(\sigma _c^2\) (a fitted parameter), attributes all the noise in the confidence evidence as confidence noise (forcing the confidence boost to be 1), so that confidence efficiency is defined as the squared ratio of the sensitivity of the equivalent ideal confidence observer to the observer's perceptual sensitivity.
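A minimal simulation may help make the roles of sensory and confidence noise concrete. All parameter values below are illustrative, and the type-2 AUC is used only as a convenient stand-in for metacognitive sensitivity (it is not the confidence efficiency measure of Mamassian & de Gardelle, 2022):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
mu_s, sigma_s = 1.0, 1.0   # evidence strength and sensory noise (illustrative)
c = 0.0                    # perceptual response criterion
sigma_c = 1.0              # confidence noise (illustrative)

stim = rng.choice([-mu_s, mu_s], size=n)        # left/right stimulus on each trial
s = stim + rng.normal(0.0, sigma_s, size=n)     # noisy sensory evidence
choice = np.where(s > c, 1, -1)                 # perceptual decision
correct = choice == np.sign(stim)

w_ideal = (s - c) / sigma_s                            # ideal confidence evidence
w_noisy = w_ideal + rng.normal(0.0, sigma_c, size=n)   # plus independent confidence noise

def type2_auc(conf, correct):
    """P(a random correct trial gets higher confidence than a random error trial)."""
    ranks = conf.argsort().argsort().astype(float)
    n_pos, n_neg = correct.sum(), (~correct).sum()
    return (ranks[correct].sum() - n_pos * (n_pos - 1) / 2) / (n_pos * n_neg)

auc_ideal = type2_auc(w_ideal * choice, correct)   # confidence in the chosen response
auc_noisy = type2_auc(w_noisy * choice, correct)
```

Adding independent confidence noise on top of the ideal confidence evidence leaves perceptual accuracy untouched but degrades how well confidence discriminates correct from incorrect decisions.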

σ_{c}, and confidence boost, α, was estimated by bootstrapping (1,000 permutations of resampled participants).
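The bootstrapping procedure can be sketched as follows: participants' parameter estimates are resampled with replacement and the group statistic is recomputed on each resample (the values of `params` here are synthetic stand-ins for per-participant estimates):

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical per-participant parameter estimates (e.g., confidence noise)
params = rng.gamma(shape=2.0, scale=0.5, size=20)

n_boot = 1_000
boot_means = np.empty(n_boot)
for b in range(n_boot):
    # Resample participants with replacement and recompute the group statistic
    resample = rng.choice(params, size=params.size, replace=True)
    boot_means[b] = resample.mean()

# 95% bootstrap confidence interval on the group mean
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
```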

*t* tests may be inappropriate because the parameter values have a lower bound at 0. We preregistered that we would use Wilcoxon signed-rank tests unless the data were sufficiently normally distributed, and that Bayesian statistics would be used to examine evidence for the null. A Kolmogorov–Smirnov test suggested the data could be approximated by a normal distribution (Experiment 1: confidence efficiency, Kolmogorov–Smirnov statistic = 0.013, *p* > 0.99; confidence noise, Kolmogorov–Smirnov statistic = 0.022, *p* = 0.72; confidence boost, Kolmogorov–Smirnov statistic = 0.039, *p* = 0.1; Experiment 2: confidence efficiency, Kolmogorov–Smirnov statistic = 0.014, *p* > 0.99; confidence noise, Kolmogorov–Smirnov statistic = 0.02, *p* = 0.80; confidence boost, Kolmogorov–Smirnov statistic = 0.02, *p* = 0.78). So we initially used *t* tests; however, it became clear that it would be beneficial to make inferences about the probability of the null hypothesis of no difference between conditions, so we switched to Bayesian statistics across all comparisons for consistency. We note that the same conclusions were drawn from the *t* tests and the Bayesian statistics (with the exception of accepting the null in some Bayesian comparisons, which produced large *p* values in the *t* tests).
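For illustration, the Kolmogorov–Smirnov statistic against a fitted normal can be computed directly (synthetic data; note that when the normal's parameters are estimated from the data, the standard KS p-value is inflated, so a Lilliefors-style correction would be stricter):

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(4)
# Synthetic per-participant parameter estimates to screen for normality
vals = rng.normal(1.0, 0.3, size=40)

def ks_stat_normal(x):
    """One-sample Kolmogorov-Smirnov statistic against a normal
    distribution whose mean and SD are fitted to the data."""
    x = np.sort(x)
    n = x.size
    z = (x - x.mean()) / x.std(ddof=1)
    cdf = np.array([0.5 * (1 + erf(v / sqrt(2))) for v in z])
    d_plus = (np.arange(1, n + 1) / n - cdf).max()   # empirical CDF above the model
    d_minus = (cdf - np.arange(0, n) / n).max()      # empirical CDF below the model
    return max(d_plus, d_minus)

D = ks_stat_normal(vals)
```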

*N* quantiles of the bootstrapped data, where *N* is the number of participants. The models assumed a gamma distribution for confidence efficiency and confidence noise, and a beta distribution for confidence boost. The within-subjects comparison assumed that each subject had some effect, *x*, such that their value in condition *B* differs from their value in condition *A* by *x*. Across subjects, the effects, *x*, are normally distributed with mean µ_{x} and variance \(\sigma _x^2\), such that the effect size, δ, is \(\delta = \ \frac{{{\mu _x}}}{{{\sigma _x}}}\). Between-subject comparisons modelled δ directly, where the difference in the group means, *y*, is *y* = δ × σ_{AB}, with \(\sigma _{AB}^2\) computed as the combined variance across the two groups. The posterior was estimated by Markov chain Monte Carlo simulation (12,000 samples over three independent chains, with 1,000 samples of burn-in and thinning of every third sample), using the slice sampling method (Neal, 2003) implemented in MATLAB. For all parameters, uninformative priors were specified as a uniform distribution over all possible parameter values (or a large range thereof). Evidence in favor of the alternative hypothesis (δ ≠ 0) was based on the 95% highest density interval of the posterior distribution of δ values. Bayes factors were computed based on the Savage–Dickey ratio (Wagenmakers et al., 2010) using a unit information prior (Kass & Wasserman, 1995).
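A sketch of this inference on synthetic within-subject effects: the paper used slice sampling in MATLAB, whereas this stand-in uses a random-walk Metropolis sampler and a normal approximation to the posterior density at zero for the Savage–Dickey ratio:

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic within-subject effects, x (condition B minus condition A)
x = rng.normal(0.5, 1.0, size=40)

def log_post(mu, log_sigma):
    """Log posterior for (mu_x, log sigma_x) with flat priors over a wide range."""
    sigma = np.exp(log_sigma)
    if not (-10 < mu < 10 and -5 < log_sigma < 5):
        return -np.inf
    return -x.size * log_sigma - np.sum((x - mu) ** 2) / (2 * sigma ** 2)

# Random-walk Metropolis sampler (a stand-in for the paper's slice sampler)
n_samp, burn = 12_000, 1_000
theta = np.array([0.0, 0.0])
lp = log_post(*theta)
kept = []
for i in range(n_samp + burn):
    prop = theta + rng.normal(0.0, 0.15, size=2)
    lp_prop = log_post(*prop)
    if np.log(rng.uniform()) < lp_prop - lp:
        theta, lp = prop, lp_prop
    if i >= burn:
        kept.append(theta.copy())
kept = np.array(kept)
delta = kept[:, 0] / np.exp(kept[:, 1])   # effect size, delta = mu_x / sigma_x

# 95% highest density interval: the shortest interval containing 95% of samples
d = np.sort(delta)
k = int(np.ceil(0.95 * d.size))
widths = d[k - 1:] - d[:d.size - k + 1]
hdi_low = d[np.argmin(widths)]
hdi_high = hdi_low + widths.min()

# Savage-Dickey ratio with a unit information prior, delta ~ N(0, 1):
# BF10 = prior density at delta = 0 over posterior density at delta = 0
prior_at_0 = 1 / np.sqrt(2 * np.pi)
post_at_0 = np.exp(-0.5 * (delta.mean() / delta.std()) ** 2) / (
    delta.std() * np.sqrt(2 * np.pi))
bf10 = prior_at_0 / post_at_0
```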

*E*(δ | *x*_{high − low}) = 0.03, and the 95% highest density interval overlapped with 0 [−0.40; 0.46], suggesting little evidence for the alternative hypothesis of a difference in sensitivity. We computed a Bayes factor of evidence in favor of the alternative hypothesis using the Savage–Dickey ratio (Wagenmakers et al., 2010) with a unit information prior (Kass & Wasserman, 1995). We found BF_{10} = 0.21; computed in this direction, the larger the number relative to 1, the more evidence against the null hypothesis (and the smaller the number relative to 1, the more evidence in favor of the null). In general, a Bayes factor greater than 3.2 (or less than 0.31) can be considered substantial evidence in favor of the alternative hypothesis (or the null hypothesis, respectively) (Kass & Raftery, 1995). We therefore show evidence in favor of the null hypothesis of no difference in sensitivity across the high-level and low-level tasks.
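These interpretation thresholds can be encoded in a small helper (the labels are informal shorthand for the Kass & Raftery conventions quoted above):

```python
def interpret_bf10(bf10: float) -> str:
    """Informal labels for the Kass & Raftery (1995) thresholds quoted above:
    BF10 > 3.2 is substantial evidence for the alternative hypothesis,
    BF10 < 1/3.2 (about 0.31) is substantial evidence for the null."""
    if bf10 > 3.2:
        return "substantial evidence for the alternative"
    if bf10 < 1 / 3.2:
        return "substantial evidence for the null"
    return "inconclusive"

print(interpret_bf10(0.21))  # the sensitivity comparison above favors the null
```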

*E*(δ | *x*_{high − low}) = −0.18 [−0.38; 0.03] 95% highest density interval; BF_{10} = 0.43. The average psychometric functions from Experiments 1 and 2 are presented in Figure 2A, with the slopes of individual participants shown in Figure 2B. In addition, we performed an exploratory analysis examining median reaction times across the high-level and low-level tasks and found no evidence for a difference (*E*(δ | *x*_{high − low}) = −0.13 [−0.08; 0.34]; BF_{10} = 0.23; Figure 2C).

*E*(δ | *x*_{high − low}) = 0.75 [0.27; 1.24] 95% highest density interval; BF_{10} = 32.26 (Figure 2D and E, top left). Comparisons of confidence noise and confidence boost suggested that, although low-level confidence suffered relatively less confidence noise (*E*(δ | *x*_{high − low}) = 1.46 [0.89; 2.08]; BF_{10} > 1,000) (Figure 2D and E, middle left), low-level confidence also showed little sign of confidence boost, which substantially contributed to high-level confidence efficiency (*E*(δ | *x*_{high − low}) = 1.95 [1.24; 2.7]; BF_{10} > 1,000) (Figure 2D and E, bottom left). In this experiment, participants were presented with the same sets of stimuli across the two tasks and showed similar sensitivity in their perceptual decisions, and yet confidence efficiency was significantly superior for the high-level perceptual decisions.

*E*(δ | *x*_{high − low}) = −0.66 [−0.88; −0.42]; BF_{10} > 1,000 (Figure 2D and E, top right).

*E*(δ | *x*_{Exp 1 − 2}) = 0.31 [−0.25; 0.8]; BF_{10} = 0.55 (Figure 2D), although it was slightly lower in the low-level task (*E*(δ | *x*_{Exp 1 − 2}) = 0.48 [0.09; 0.8]; BF_{10} = 4.35). Confidence efficiency was greater in the low-level task of Experiment 2 (*E*(δ | *x*_{Exp 1 − 2}) = −1.22 [−1.92; −0.5]; BF_{10} = 25), but not substantially dissimilar in the high-level task (*E*(δ | *x*_{Exp 1 − 2}) = 0.49 [−0.12; 1.08]; BF_{10} = 0.93) (Figure 2F, top). Performance metrics were not overall worse in Experiment 2; this suggests that the difference in results is not due to differences in stimulus presentation or task engagement in the online format.

*E*(δ | *x*_{mask − nomask}) = −0.73 [−0.97; −0.51]; BF_{10} > 1,000 (Figure 3A, left), whereas perceptual sensitivity was unaffected (*E*(δ | *x*_{mask − nomask}) = 0.05 [−0.25; 0.16]; BF_{10} = 0.12) (Figures 3E and F). This was mainly due to a decrease in confidence noise in the no-mask condition (*E*(δ | *x*_{mask − nomask}) = 0.54 [0.32; 0.76]; BF_{10} > 1,000) (Figure 3A, middle), with some evidence for a decrease in confidence boost (*E*(δ | *x*_{mask − nomask}) = 0.27 [0.06; 0.47]; BF_{10} = 3.13; although both conditions showed close to no boost) (Figure 3A, right).

*E*(δ | *x*_{mask − nomask}) = 0.02 [−0.19; 0.22]; BF_{10} = 0.11, nor confidence efficiency (as predicted; *E*(δ | *x*_{mask − nomask}) = 0.1 [−0.29; 0.1]; BF_{10} = 0.16) (Figure 3A). The confidence noise and boost parameters did appear to differ between the mask and no-mask conditions in the high-level task, despite the effects cancelling to give equal confidence efficiency: there was more confidence noise in the no-mask condition (*E*(δ | *x*_{mask − nomask}) = −1.71 [−2.02; −1.38]; BF_{10} > 1,000) and more confidence boost (*E*(δ | *x*_{mask − nomask}) = −1.56 [−1.88; −1.25]; BF_{10} > 1,000) (Figure 3A).

*E*(δ | *x*_{Exp 1 − 2}) = −1.22 [−1.91; −0.41]; BF_{10} = 25, compared with the within-subject effect of *E*(δ | *x*_{mask − nomask}) = −0.73 in Experiment 3 (Figure 3B); we would expect the effect size to be at least as large (e.g., Figure 4C).

*E*(δ | *x*_{short − long}) = 0.46 [0.25; 0.68]; BF_{10} > 1,000 (Figure 4A, left), and underlying this was a decrease in noise (*E*(δ | *x*_{short − long}) = 0.56 [0.34; 0.77]; BF_{10} > 1,000) (Figure 4A, middle) and a greater decrease in boost (*E*(δ | *x*_{short − long}) = 0.82 [0.58; 1.06]; BF_{10} > 1,000) (Figure 4A, right). In the high-level task, there was a significant increase in confidence efficiency with longer duration between stimulus offset and the response cue (*E*(δ | *x*_{short − long}) = −0.75 [−0.98; −0.53]; BF_{10} > 1,000) (Figure 4A, left), with increased noise (*E*(δ | *x*_{short − long}) = −0.48 [−0.7; −0.26]; BF_{10} > 1,000) (Figure 4A, middle) and a larger increase in boost (*E*(δ | *x*_{short − long}) = −0.75 [−0.99; −0.52]; BF_{10} > 1,000) (Figure 4A, right). This finding was in line with our prediction that response cue timing interacts with the effect of task on confidence efficiency, increasing confidence efficiency with more time in the high-level task and decreasing it in the low-level task.

*s*_{t} is the accumulated evidence up to time *t*, which evolves over small time steps, Δ*t*, with added signal, µ_{s}, and noise, ε_{s,t + Δt}: \({s_{t + \Delta t}} = {s_t} + {\mu _s}\Delta t + {\varepsilon _{s,t + \Delta t}}\), where the noise samples are drawn from independent identically distributed Gaussian distributions with zero mean and variance \(\sigma _s^2\). Example evidence accumulation traces are shown in Figure 5A (left), simulating the different evidence strengths (signal-to-noise ratios) used in these experiments. The observer commits to a decision when the accumulated evidence reaches a decision bound (deciding “right” when the evidence reaches the upper bound, or “left” at the lower bound; the black curves in Figure 5A describe collapsing decision bounds, which capture how the observer can commit to a decision based on little evidence while avoiding very long decision times). The response time includes additional nondecision time (e.g., the time from committing to a decision to planning and executing the button press that reports the decision).
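This accumulation-to-bound process can be simulated directly; the exponentially collapsing bound shape and all parameter values below are illustrative rather than the fitted ones:

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate_trial(mu=0.8, sigma=1.0, dt=0.01, b0=1.5, tau=1.0, t_max=5.0):
    """One trial of evidence accumulation to an exponentially collapsing bound.
    Returns (choice, decision_time, accumulated_evidence)."""
    s = 0.0
    for i in range(1, int(t_max / dt) + 1):
        t = i * dt
        # Signal plus Gaussian noise, scaled for the time step
        s += mu * dt + rng.normal(0.0, sigma * np.sqrt(dt))
        bound = b0 * np.exp(-t / tau)   # collapsing decision bound
        if abs(s) >= bound:
            return (1 if s > 0 else -1), t, s
    return (1 if s > 0 else -1), t_max, s   # forced choice at timeout

trials = [simulate_trial() for _ in range(2000)]
choices, rts, _ = map(np.array, zip(*trials))
accuracy = (choices == 1).mean()   # drift is rightward, so +1 is the correct choice
```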

BF_{01} = 1.98, suggesting that those participants who took longer to respond were accumulating evidence with a lower signal-to-noise ratio for a longer duration. In other words, participants who took longer to respond had a lower predecision signal-to-noise ratio and so should experience relatively less benefit from ongoing accumulation in the high-level task, and relatively less harm from ongoing accumulation in the low-level task. The number of trials in each group was only sufficient to estimate confidence efficiency, not to fit the full model with confidence noise and boost (Mamassian & de Gardelle, 2022). The difference between confidence efficiency in the high-level and low-level tasks depended on the median reaction time (Figures 5C and 5D): for the fast responders, confidence efficiency was greater in the high-level task (*E*(δ | *x*_{high − low}) = 0.71 [0.33; 1.12]; BF_{10} > 1,000); for the median responders, confidence efficiency was slightly greater in the low-level task (*E*(δ | *x*_{high − low}) = −0.44 [−0.81; −0.07]; BF_{10} = 3.03); and for the slow responders, confidence efficiency was even greater in the low-level task (*E*(δ | *x*_{high − low}) = −0.89 [−1.30; −0.49]; BF_{10} > 1,000). This pattern of effects was predicted by simulating a model (Figure 5E) in which confidence in the high-level task benefited from continued accumulation with the same signal-to-noise ratio as before decision commitment, whereas in the low-level task observers accumulated only additional noise after the decision (the red bars of Figure 5A correspond to the medium response time observers, the middle dot of Figure 5E). Across RT groups, confidence efficiency decreased in the high-level task with increasing reaction time, whereas in the low-level task confidence efficiency slightly increased with increasing reaction time.
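The opposing predictions for the two tasks follow from a simple signal-to-noise argument: if the signal keeps accumulating after the decision (high-level task), the evidence signal-to-noise ratio grows with postdecision time, whereas if only noise accumulates (low-level task), it shrinks. A sketch with illustrative parameters:

```python
import numpy as np

mu, sigma, t_dec = 1.0, 1.0, 0.5   # drift, noise, and predecision time (illustrative)
T = np.linspace(0.0, 2.0, 50)      # postdecision accumulation time

# High-level task: signal and noise both keep accumulating after the decision,
# so the signal-to-noise ratio of the total evidence grows with postdecision time
snr_high = mu * (t_dec + T) / (sigma * np.sqrt(t_dec + T))

# Low-level task: only noise accumulates after the decision,
# so the signal-to-noise ratio of the total evidence shrinks
snr_low = mu * t_dec / (sigma * np.sqrt(t_dec + T))
```

The two curves start from the same value at T = 0 and then diverge monotonically, which is the qualitative pattern of confidence efficiency across tasks and response-time groups described above.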

*E*(δ | *x*_{long − short}) = 0.81 [0.36; 1.23]; BF_{10} > 1,000, whereas observers with median and long response times showed the opposite effect (median responders: *E*(δ | *x*_{long − short}) = −1.03 [−1.49; −0.59]; BF_{10} > 1,000; slow responders: *E*(δ | *x*_{long − short}) = −0.92 [−1.35; −0.48]; BF_{10} > 1,000) (summary in Figure 5G). The opposite pattern was visible in the high-level task: fast responders showed worse confidence efficiency in the short condition (*E*(δ | *x*_{long − short}) = −1.54 [−2.01; −1]; BF_{10} > 1,000), whereas median and slow responders showed better confidence efficiency in the short condition (median responders: *E*(δ | *x*_{long − short}) = 2.20 [1.55; 2.88]; BF_{10} > 1,000; slow responders: *E*(δ | *x*_{long − short}) = 0.56 [0.167; 0.94]; BF_{10} = 6.54) (Figure 5G). Note that a simple effect of “poststimulus time” would predict no difference in confidence efficiency in the long condition dependent on decision time in the short condition (poststimulus time is the same for the three groups of participants). Instead, we find that confidence efficiency is modulated across decision-time groups in the long condition, suggesting the effect is driven by the underlying “postdecision time.”

*E*(δ | *x*_{long − short}) = 0.31 [−0.06; 0.69]; BF_{10} = 0.66; median responders: *E*(δ | *x*_{long − short}) = −0.27 [−0.66; 0.1]; BF_{10} = 0.50; slow responders: *E*(δ | *x*_{long − short}) = −0.422 [−0.79; −0.05]; BF_{10} = 2.17. In the high-level task, the switch from the short condition showing better confidence efficiency to the long condition showing better confidence efficiency was delayed until the median and slow responder groups (Figure 5G, bottom; confidence efficiency was the same across conditions for fast responders: *E*(δ | *x*_{long − short}) = −0.01 [−0.38; 0.37]; BF_{10} = 0.21; median responders: *E*(δ | *x*_{long − short}) = −0.44 [−0.80; −0.05]; BF_{10} = 2.00; and slow responders: *E*(δ | *x*_{long − short}) = 0.85 [0.42; 1.27]; BF_{10} > 1,000).

**Author contributions:** Conceptualization: T. Balsdon, V. Wyart, P. Mamassian; Methodology: T. Balsdon, V. Wyart, P. Mamassian; Software: T. Balsdon, P. Mamassian; Formal Analysis: T. Balsdon, P. Mamassian; Investigation: T. Balsdon; Resources: V. Wyart, P. Mamassian; Data Curation: T. Balsdon; Writing – Original Draft: T. Balsdon; Writing – Review & Editing: T. Balsdon, V. Wyart, P. Mamassian; Visualization: T. Balsdon, V. Wyart, P. Mamassian; Funding Acquisition: V. Wyart, P. Mamassian.

*Trends in Cognitive Sciences*, 8(10), 457–464. [PubMed]

*Visual masking: Studying perception, attention, and consciousness*. Academic Press.

*Attention, Perception, & Psychophysics*, 79, 1993–2006. [PubMed]

*eLife*, 10, e68491. [PubMed]

*Nature Communications*, 11(1), 1–11. [PubMed]

*Journal of Experimental Psychology: General*, 148(3), 437. [PubMed]

*Spatial Vision*, 10(4), 433–436. [PubMed]

*Current Biology*, 17(1), 20–25.

*Current Biology*, 21(21), 1817–1821.

*Nature*, 395, 896–900. [PubMed]

*Psychological Science*, 25(6), 1286–1288. [PubMed]

*PLoS One*, 10(3), e0120870. [PubMed]

*Nature Communications*, 13(1), 1–12. [PubMed]

*Journal of Experimental Psychology: General*, 129, 481–507. [PubMed]

*Proceedings of the National Academy of Sciences of the United States of America*, 114(43), E9115–E9124.

*Cerebral Cortex*, 1(1), 1–47. [PubMed]

*Psychological Review*, 124(1), 91. [PubMed]

*Nature Human Behaviour*, 6, 294–305. [PubMed]

*Journal of Neuroscience*, 28(10), 2539–2550. [PubMed]

*Neuron*, 36(5), 791–804. [PubMed]

*Eye, brain, and vision*. Scientific American Library.

*Cognition*, 27(2), 117–143. [PubMed]

*Neuron*, 15, 843–856. [PubMed]

*Journal of the American Statistical Association*, 90(430), 773–795.

*Journal of the American Statistical Association*, 90(431), 928–934.

*Journal of Consumer Research*, 15(4), 411–421.

*Perception*, 36(14), 1–16.

*Developmental Science*, 14(5), 1075–1088. [PubMed]

*Current Opinion in Neurobiology*, 8, 529–535, https://doi.org/10.1016/S0959-4388(98)80042-1. [PubMed]

*Trends in Neurosciences*, 23(11), 571–579. [PubMed]

*The Cognitive Neurosciences*. MIT Press.

*Scientific Reports*, 10(1), 1–11. [PubMed]

*Vision Research*, 190, 107963. [PubMed]

*Inattentional Blindness*. MIT Press.

*Perception*, 49(6), 616–635. [PubMed]

*Psychological Review*, 129(5), 976–998, https://doi.org/10.1037/rev0000312. [PubMed]

*Attention, Perception, & Psychophysics*, 78(3), 923–937. [PubMed]

*Annual Review of Neuroscience*, 10(1), 363–401. [PubMed]

*Journal of Experimental Psychology: General*, 149(9), 1788. [PubMed]

*Psychological Science*, 12, 9–17. [PubMed]

*Annals of Statistics*, 31(3), 705–767.

*Extrastriate Cortex in Primates* (pp. 205–241). Springer.

*Science*, 292(5516), 510–512. [PubMed]

*Spatial Vision*, 10(4), 437–442. [PubMed]

*Behavior Research Methods*, 51(1), 195–203, https://doi.org/10.3758/s13428-018-01193-y. [PubMed]

*Psychological Review*, 117(3), 864. [PubMed]

*Journal of Experimental Psychology: Human Learning*, 2, 509–522.

*Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences*, 358(1431), 435–445. [PubMed]

*Neuroscience of Consciousness*, 2022(1), niac014. [PubMed]

*Nature*, 387, 281–284. [PubMed]

*Science*, 173, 1104–1107.

*Trends in Cognitive Sciences*, 25(1), 12–23. [PubMed]

*Nature*, 381(6582), 520–522. [PubMed]

*Scientific American*, 255(5), 114B–125.

*Cognitive Psychology*, 12(1), 97–136. [PubMed]

*Journal of Vision*, 2(5), 2–2.

*Brain and Neuroscience Advances*, 2, 2398212818810591. [PubMed]

*Cognitive Psychology*, 60(3), 158–189. [PubMed]

*Journal of Experimental Psychology: General*, 144(2), 489. [PubMed]

*Journal of Neuroscience*, 16(22), 7376–7389.