Human ability to simultaneously track multiple items declines with set size. This effect is commonly attributed to a fixed limit on the number of items that can be attended to, a notion that is formalized in limited-capacity and slot models. Instead, we propose that observers are constrained by stimulus uncertainty that increases with the number of items but use Bayesian inference to achieve optimal performance. We model five data sets from published deviation discrimination experiments that varied set size, number of deviations, and magnitude of deviation. A constrained Bayesian observer better explains each data set than do the traditional limited-capacity model, the recently proposed slots-plus-averaging model, a fixed-uncertainty Bayesian model, a Bayesian model with capacity limit, and a simple averaging model. This indicates that the notion of limited capacity in attentional tracking needs to be revised. Moreover, it supports the idea that Bayesian optimality of human perception extends to high-level perceptual computations.

*K*, who perfectly encodes *K* attended items and does not encode other items, if any, at all (Hulleman, 2005); (b) the slots-plus-averaging model, originally proposed for working memory (Zhang & Luck, 2008), in which *K* items are attended (and others not at all) but their encoding is noisy; (c) an unconstrained (fixed-uncertainty) Bayesian observer; (d) a Bayesian observer with capacity limit *K* instead of a resource constraint; and (e) an observer who extracts a global motion signal by averaging over all trajectories. In this paper, we mean by “Bayesian” that the observer takes into account probabilities over variables in making a decision rather than only best estimates, even though the prior distribution might be flat (for a concrete example, see the paragraph below Equation 18).

*N* dots moved from left to right in linear trajectories (Figure 1). Of these, *D* dots changed direction while on the vertical midline (1 ≤ *D* ≤ *N*). In conditions where multiple dots deviate, they do so by the same angle and in the same direction. Subjects reported whether the deviation was counterclockwise (*C* = 1) or clockwise (*C* = −1).

*D* = *N*). The percentage of “counterclockwise” responses was measured as a function of the magnitude of deviation, when all trajectories deviate. The 84.1% correct (*d*′ = 1) threshold was computed as a function of *N* and found to be virtually independent of *N* (Tripathy & Barrett, 2004). In Experiment 2, one trajectory deviated (*D* = 1). Threshold was found to increase steeply with *N*, suggesting a very low attentional capacity (Tripathy & Barrett, 2004). In Experiment 3, *D* = 1 but fixed, suprathreshold deviations of magnitudes Δ = 19°, 38°, 76° were used. As *N* was varied, performance decreased, but the curves for the different angles of deviation were clearly separated (Tripathy et al., 2007). In Experiment 4, the same suprathreshold deviations were used, but *N* was fixed at 6 or 8, and *D* was varied between blocks. Effective number tracked was defined as the capacity of a hypothetical limited-capacity observer achieving the measured percentage correct. To the surprise of the investigators, this number was found to depend strongly on deviation magnitude, but only weakly on *N* and *D* (Tripathy et al., 2007). In Experiment 5, different values of Δ and *D* were interleaved within a block, making it impossible for subjects to know the difficulty of a trial beforehand. The values *N* = 10, *D* = 1, 2, and Δ = 19°, 38°, 57° were used (Tripathy et al., 2007). Effective number tracked was again strongly dependent on Δ, and only weakly on *D*.

*N* items is roughly independent of *N*. For a single item, the gain *g*, which is the mean amplitude of the population pattern of activity, is then roughly proportional to 1/*N*. (An exactly equal division of spikes over locations is as unlikely as it is unnecessary. The allocation proportions are flexible and will be influenced by spatial attention.) This could be implemented through divisive normalization, already a key operation in many models of attention (Reynolds & Heeger, 2009). Under Poisson-like variability, gain is proportional to Fisher information, *I*(*s*), which in turn is inversely proportional to the square of the stimulus uncertainty *σ*: *g* ∝ *I*(*s*) ∝ 1/*σ*^{2} (Seung & Sompolinsky, 1993) (Figure 2). It follows that uncertainty increases as *σ* ∝ 1/√*g* ∝ √*N*.

*σ* ∝ √*N* relationship (Bays & Husain, 2008; Wilken & Ma, 2004) might be due to correlations. However, the argument above allows for correlations and still predicts *σ* ∝ √*N*. Instead, deviations that are still of power-law form, i.e., *σ* ∝ *N*^{α} with *α* ≠ 1/2

*N*. In 3, we comment on the consequences of *α* ≠ 1/2.

*σ* ∝ √*N* has been proposed earlier based on a sampling argument (Palmer, 1990; Shaw, 1980), but without neural justification. If a fixed total number of *S* samples is available, then on average, *S*/*N* samples will be available per item. By averaging these observations, a better estimate of a single item is obtained. Since the standard deviation of an average of a number of observations generated by the same random process is inversely proportional to the square root of the number of terms in the average, we find *σ* ∝ 1/√(*S*/*N*) ∝ √*N*. While intuitive, this argument does not specify the nature of these samples. Moreover, it is independent of the form of neural variability, while we claim that a different form of variability would produce a different increase of uncertainty with *N*. For example, if neural variability were additive and Gaussian, then *I*(*s*) ∝ *g*^{2} and *σ* ∝ *N*, a completely different relationship.
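The sampling argument can be checked numerically. The sketch below is illustrative only: it assumes unit-variance Gaussian samples and a hypothetical budget of 120 samples (neither is specified in the text), divides the budget evenly over the items, and estimates the spread of the per-item average.

```python
import random
import statistics

def std_of_item_estimate(n_items, total_samples=120, n_repeats=2000, seed=0):
    """Empirical std of one item's estimate when a fixed sample budget is
    divided evenly over n_items (assumption: unit-variance Gaussian samples)."""
    rng = random.Random(seed)
    per_item = total_samples // n_items  # samples available for a single item
    estimates = [
        statistics.fmean(rng.gauss(0.0, 1.0) for _ in range(per_item))
        for _ in range(n_repeats)
    ]
    return statistics.pstdev(estimates)

sd_1 = std_of_item_estimate(1)
sd_4 = std_of_item_estimate(4)
# Quadrupling the set size should roughly double the uncertainty (sqrt(4) = 2).
ratio = sd_4 / sd_1
```

Under these assumptions the ratio comes out close to 2, which is the √*N* scaling rather than a linear one.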

*C* = 1 (or *C* = −1) is inferred based on *N* noisy observations of pre- and post-midline motion directions, which we denote by **x** = (*x*_{1},…,*x*_{N}) and **y** = (*y*_{1},…,*y*_{N}). Since *C* is a binary variable, this probability is uniquely specified by the log odds,

*d* in terms of the observations **x** and **y**. To do this, we need to specify the generative model of the task, i.e., a description of the stochastic processes through which the observations are generated by the task-relevant variables. The generative model is depicted graphically in Figure 3b. Besides *C*, *D*, **x**, and **y**, this diagram contains the following variables: Δ, the angle of deviation; **I**, the indices of deviating trajectories (a subset of 1,…,*N*); ***θ*** and ***φ***, the vectors of pre- and post-midline directions, respectively (known to the experimenter, but not to the observer). The vectors **x** and **y** consist of sensory observations generated by ***θ*** and ***φ***, respectively; they are known to the observer but not to the experimenter.

**x** given the actual stimuli ***θ*** is then drawn from the following product of Gaussians:

$$p(\mathbf{x} \mid \boldsymbol{\theta}) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma_{\text{pre}}^{2}}} \exp\!\left(-\frac{(x_i-\theta_i)^{2}}{2\sigma_{\text{pre}}^{2}}\right)$$

The same form holds for *p*(**y** ∣ ***φ***) but with variance *σ*_{post}^{2}.

***φ*** = ***θ*** + **Δ**, so that *p*(***φ*** ∣ ***θ***, **Δ**) = *δ*(***φ*** − ***θ*** − **Δ**), where *δ* is the Dirac delta distribution. In turn, the vector of deviations, **Δ**, is determined by the deviation magnitude Δ, the sign *C*, and the indices of the deviating trajectories, **I**: *p*(**Δ** ∣ *C*, Δ, **I**) = *δ*(**Δ** − *C*Δ**1**_{I}), where **1**_{I} is a vector of length *N* with 1's at the indices in **I** and 0's everywhere else.

**I**, is a randomly chosen subset of size *D* of the set {1,…,*N*}. Since there are $\binom{N}{D}$ subsets of size *D*, the probability of **I** given *D* is equal to $1/\binom{N}{D}$ if ∣**I**∣ = *D* and 0 otherwise. Throughout, *N* is assumed fixed and known to the observer.

*C*, *D*, Δ, and ***θ***. What remains is to specify the probability distributions over these variables. Since they are top-level, they do not depend on other variables. Therefore, their distributions are prior distributions, reflecting assumed or learned knowledge about the statistics of the stimuli. Throughout, we will assume flat prior distributions for *C* and ***θ***, that is, *p*(*C* = 1) = *p*(*C* = −1) = 1/2 and *p*(*θ*_{i}) = constant for all *i*.

*D* and Δ depend on the experiment. The distribution over *D* is a delta distribution (when there is a fixed number of deviating trajectories on each trial, as in Experiments 1–4) or a sum of delta distributions (when the number of deviating trajectories takes one of multiple possible values, as in Experiment 5). The distribution over Δ is uniform (in the threshold Experiments 1 and 2), a delta distribution (in the suprathreshold Experiments 3, 4, and 6), or a sum of delta distributions (when trials with multiple different deviation angles are interleaved, as in Experiment 5).

*C* to report on each trial. Specifically, the observer computes the posterior probabilities *p*(*C* = 1 ∣ **x**, **y**) and *p*(*C* = −1 ∣ **x**, **y**) and reports “*C* = 1” if *p*(*C* = 1 ∣ **x**, **y**) > *p*(*C* = −1 ∣ **x**, **y**). We will derive this decision rule in terms of **x** and **y** for each of the five experiments. Once the decision rule is known, average behavior over a large number of trials can be simulated (or sometimes computed analytically), so that the Bayesian observer can be compared with behavioral data.

*C*, it is convenient to consider the log posterior ratio (log odds), Equation 1, which can, using Equation 3, be rewritten as the sum of a log likelihood ratio and a log prior ratio:

*C*, the log odds reduce to the log likelihood ratio. The difficulty in computing the likelihoods lies in the fact that although *C* is the only task-relevant variable, the probability of **x** and **y** is also influenced by unknowns which are themselves not of interest, such as ***θ***, Δ (in the threshold experiments), and *D* (when the number of deviating trajectories is unknown). A Bayesian observer solves this marginalization problem by averaging over these random variables, as we will now examine case by case.
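Written out, the decomposition of the log odds has the following form. This is a sketch reconstructed from the verbal definitions in the text (log posterior ratio, Bayes' rule, flat prior over *C*); the exact typeset form and numbering of the original equations are not assumed here.

```latex
\begin{align}
d &\equiv \log\frac{p(C=1\mid \mathbf{x},\mathbf{y})}{p(C=-1\mid \mathbf{x},\mathbf{y})} \\
  &= \log\frac{p(\mathbf{x},\mathbf{y}\mid C=1)}{p(\mathbf{x},\mathbf{y}\mid C=-1)}
   + \log\frac{p(C=1)}{p(C=-1)} \\
  &= \log\frac{p(\mathbf{x},\mathbf{y}\mid C=1)}{p(\mathbf{x},\mathbf{y}\mid C=-1)}
   \qquad \text{when } p(C=1)=p(C=-1)=\tfrac{1}{2}.
\end{align}
```

The observer reports “*C* = 1” when *d* > 0. Each likelihood *p*(**x**, **y** ∣ *C*) is itself obtained by averaging over the nuisance variables listed above.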

*N* of *N*

**x** and **y** conditioned on both *C* and the scalar Δ is computed by integrating out the vector

**x** and **y** depend on *C* and Δ only through **1**, where **1** is a vector of length *N* consisting of only 1's. Next, we integrate out ***θ*** and ***φ***:

***θ*** (*p*(***θ***) is then a constant, *α*) and integrate over the real line (strictly speaking, motion direction is a periodic variable and lives on the circle, but there is little difference if the variability distributions are relatively narrow, as they are here). Then we have

*σ*^{2} = *σ*_{pre}^{2} + *σ*_{post}^{2}. Inserting this result back into Equation 5, we find

*C*^{2} = 1. Finally, we obtain the log odds from Equation 4:

If Σ_{i}(*y*_{i} − *x*_{i}) > 0, then the integrand in the numerator is larger than the integrand in the denominator for any Δ (since Δ/*σ*^{2} > 0). Moreover, both integrands are non-negative functions on the entire domain of Δ. It follows that *d* > 0. Similarly, if Σ_{i}(*y*_{i} − *x*_{i}) < 0, then *d* < 0. From this, we conclude that *d* > 0 is equivalent to Σ_{i}(*y*_{i} − *x*_{i})/*σ*^{2} > 0, regardless of the form of *p*(Δ). Note that we kept the constant factor 1/*σ*^{2}, in anticipation of situations where uncertainty might differ between dots (1/*σ*_{i}^{2}), a case of which we will discuss in the Predictions section. The decision rule is to report “*C* = 1” when the average post-midline motion direction is larger than the average pre-midline one. That this is the optimal strategy is intuitive and could have been guessed without doing any calculations: since each trajectory deviates by the same amount, maximum information about the deviation is obtained by averaging all *N* observations. However, the same calculation method will be used in more complex conditions.

*x*) is the error function). To relate the Bayesian model to human performance, we have to apply the Bayesian decision rule to a large number of trials. Usually, this requires simulation, but in this particular case we can do it analytically. In the Bayesian model, probability correct for a given value of Δ is the probability that *d* > 0 when *C* = 1 (or that *d* < 0 when *C* = −1, which is the same). When *C* = 1, each random variable *y*_{i} − *x*_{i} follows a Gaussian distribution with mean Δ and standard deviation *σ*, and their average follows a Gaussian distribution with mean Δ and standard deviation *σ*/√*N*. Therefore, the probability that their average is positive is equal to

$$\mathrm{PC}(\Delta) = \frac{1}{2}\left[1 + \operatorname{erf}\!\left(\frac{\Delta\sqrt{N}}{\sigma\sqrt{2}}\right)\right].$$

This reaches 84.1% when Δ√*N*/*σ* = 1, so that Δ_{thr} = *σ*/√*N*.

*N*: *σ* = *σ*_{1}√*N*, where *σ*_{1} is the combined pre- and post-midline uncertainty when only 1 trajectory is present. It follows that Δ_{thr} = *σ*_{1}, indicating that the threshold deviation is independent of *N*. In other words, the benefit gained from averaging *N* observations is exactly undone by the increase in uncertainty due to the spread of attention over *N* items. This also means that, in this task, it does not make a difference how many items are tracked since tracking additional items does not improve performance.
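The cancellation can be made concrete with a few lines of code. The sketch below assumes the Gaussian form described above (the mean of *N* differences, each with mean Δ and standard deviation *σ*) and an illustrative value `SIGMA_1 = 3.0` degrees; the function name and the `constrained` switch are hypothetical conveniences, not part of the original analysis.

```python
from math import sqrt
from statistics import NormalDist

def pc_all_deviate(delta, n_items, sigma_1, constrained=True):
    """Proportion correct when all N trajectories deviate by delta (degrees).

    The decision variable is the mean of N differences y_i - x_i, each with
    mean delta and standard deviation sigma.
    """
    # Constrained observer: sigma grows as sqrt(N); unconstrained: sigma fixed.
    sigma = sigma_1 * sqrt(n_items) if constrained else sigma_1
    return NormalDist().cdf(delta * sqrt(n_items) / sigma)

SIGMA_1 = 3.0  # assumed value, in degrees
set_sizes = (1, 2, 4, 8)
constrained = [pc_all_deviate(3.0, n, SIGMA_1) for n in set_sizes]
unconstrained = [pc_all_deviate(3.0, n, SIGMA_1, constrained=False) for n in set_sizes]
```

With the constraint, proportion correct at Δ = *σ*_{1} stays at about 0.841 for every set size; without it, performance improves with *N* because only the averaging benefit remains.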

*N*

*D* = 1 and **I** is a single index *j* randomly chosen from {1,…,*N*}. As a consequence, in Equation 6, *p*(**x**, **y** ∣ *C*, Δ) needs to be computed as an average over **I**:

*j*th trajectory is deviating, while all others are not. As in Experiment 1, we write this as an average over ***θ***,

*j*th factor is different from all others and therefore needs a separate factor:

*β* on a large interval. Then

*σ*, in an essential way (in Experiment 1, it was irrelevant if uncertainty was equal across items). Therefore, it requires that a neural population encoding motion direction also encodes, on a single trial, the uncertainty about a stimulus, and that this information is used in downstream computations. This utilization of implicit knowledge of one's uncertainty is what we mean by Bayesian inference (even though the prior distribution is flat in this case). Probabilistic population codes (Ma et al., 2006) provide a concrete neural implementation of a Bayes-optimal computation (cue combination).

**x** or **y**.

**p**, and starting points of second halves, **q**, from normal distributions centered at common actual positions **L**. Those positions are specified by the experiment, i.e., they are placed equidistantly on the real line and then corrupted by uniform jitter (Tripathy & Barrett, 2004; Tripathy et al., 2007). The standard deviations of the normal distributions are free parameters and are assumed to be equal. Like the observations of direction, the observations of position are subject to the spike constraint and therefore these standard deviations increase with √*N*. We will denote the positional uncertainty at set size 1 by *σ*_{pos,1}; it is √2 times the positional uncertainty in **p** or **q** separately.

**p** and second-half starting points **q** is established by picking the most likely pairing. This is the pairing in which the sorted version of **p** corresponds, entry by entry, to the sorted version of **q**: the smallest entry in **p** corresponds to the smallest entry in **q**, etc. (Figure 4b). Specifically, we denote by *S*_{p}(**p**) the permutation that sorts **p**. For example, if **L** = (8, 23, 27, 45), then **p** could be (10.5, 25.1, 24.6, 42). *S*_{p}(**p**) is then the permutation (1, 2, 3, 4) → (1, 3, 2, 4). Similarly, *S*_{q}(**q**) is the permutation that sorts **q**. We define permuted sets of motion direction observations by applying the same permutations, *S*_{p} and *S*_{q}, to **x** and **y**, respectively: **x**_{new} = *S*_{p}(**x**) and **y**_{new} = *S*_{q}(**y**).

**x**_{new} and **y**_{new} are entered into the decision rule, instead of **x** and **y**. For example, in Experiment 2, the final decision rule becomes

*N* factorial.
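The sort-based pairing can be sketched in a few lines. The values of `p` come from the example in the text; `q`, `x`, and `y` are hypothetical numbers chosen only for illustration, and the helper names are not from the original work. Indices are 0-based here, whereas the text uses 1-based notation.

```python
def sorting_permutation(positions):
    """Indices that sort `positions` in ascending order (0-based)."""
    return sorted(range(len(positions)), key=lambda i: positions[i])

def reorder(values, permutation):
    """Apply a permutation (list of source indices) to `values`."""
    return [values[i] for i in permutation]

p = [10.5, 25.1, 24.6, 42.0]   # observed positions for first halves (from the text)
q = [9.0, 22.0, 28.5, 44.0]    # hypothetical observed positions for second halves
x = [1.0, 2.0, 3.0, 4.0]       # hypothetical pre-midline direction observations
y = [1.5, 2.5, 3.5, 4.5]       # hypothetical post-midline direction observations

s_p = sorting_permutation(p)   # [0, 2, 1, 3], i.e. (1, 3, 2, 4) in 1-based notation
s_q = sorting_permutation(q)   # [0, 1, 2, 3]
x_new = reorder(x, s_p)        # [1.0, 3.0, 2.0, 4.0]
y_new = reorder(y, s_q)        # unchanged, since q is already sorted
```

In this example the positional noise swaps the second and third entries of `p`, so the second pre-midline observation is paired with the wrong post-midline observation. This is the kind of correspondence error that positional uncertainty introduces.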

*N*

*N* trajectories deviates, where the angle of deviation is relatively large. It might seem that the generative model is the same as that of Experiment 2. There, however, Δ takes on values over a wide range, whereas here, Δ is fixed within a block. A Bayesian observer incorporates this knowledge through *p*(Δ). This means that *p*(**x**, **y** ∣ *C*) = *p*(**x**, **y** ∣ *C*, Δ). Combined with Equations 4 and 8, this leads to the decision rule

*y*_{i} − *x*_{i} is much larger for one value of *i* than for all others, and similarly for *x*_{i} − *y*_{i}, then the sums on both sides are dominated by the largest terms, and this decision rule simplifies to the so-called signed-max rule (Baldassi & Verghese, 2002):

*N*.) However, this rule is not optimal outside of this limit or in other conditions. The optimal rule, Equation 20, can be regarded as a soft version of the signed-max rule, as the exponential function preferentially amplifies larger observed deviations *y*_{i} − *x*_{i} (or *x*_{i} − *y*_{i}). It is intuitive that this is optimal, as larger observed deviations provide more conclusive evidence about the true direction of deviation than smaller ones.
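One way to see the soft signed-max structure is to write out the log odds for a single deviating trajectory with known Δ and equal uncertainty *σ*. The following is a sketch derived from the generative model as described in the text (flat prior over the pre-midline directions, Gaussian noise with combined variance *σ*², uniform prior over the deviating index *j*, perfect correspondence); it is not a transcription of the original Equation 20.

```latex
\begin{align}
p(\mathbf{x},\mathbf{y}\mid C,\Delta)
  &\propto \frac{1}{N}\sum_{j=1}^{N}
     \exp\!\left(-\frac{(y_j-x_j-C\Delta)^2}{2\sigma^2}\right)
     \prod_{i\neq j}\exp\!\left(-\frac{(y_i-x_i)^2}{2\sigma^2}\right) \\
  &\propto \sum_{j=1}^{N}\exp\!\left(\frac{C\Delta\,(y_j-x_j)}{\sigma^2}\right)
     \qquad (\text{using } C^2 = 1), \\
d &= \log\sum_{j=1}^{N} e^{\Delta (y_j-x_j)/\sigma^2}
   - \log\sum_{j=1}^{N} e^{\Delta (x_j-y_j)/\sigma^2}.
\end{align}
```

The factors dropped in the second line do not depend on *C* or *j*, so they cancel in the ratio. When one difference is much larger in magnitude than the rest, each sum is dominated by its largest term and the sign of *d* is the sign of that largest difference.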

*D* of *N*, blocked

$\binom{N}{D}$ of them. The decision rule is in 1.

*D* of *N*, interleaved

*D* were fixed in a given block and only varied between blocks, whereas in Experiment 5, trials with different values of Δ and *D* were interleaved within a single block. In a Bayesian model, these designs correspond to different assumed distributions over Δ and *D*. In Experiment 5, the observer is not sure of the values of Δ and *D*, and therefore marginalizes over these variables. The decision rule is in 1.

**x** and **y** by drawing independently from Gaussian distributions centered at these motion directions. After generating the observations, we apply the Bayesian decision rule to the observations **x** and **y** on each trial. Then, we determine the proportion of trials on which the model response is correct. In terms of the log odds, *d*, this is the same as calculating a histogram of *d* for (**x**, **y**) pairs drawn under *C* = 1, and counting what portion of the histogram satisfies *d* > 0 (equivalently, *C* = −1 and *d* < 0). The log odds do not, in general, follow a normal distribution (see Figure 5a).
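A minimal Monte Carlo sketch of this procedure is given below for the case of one deviating trajectory with known Δ. It makes several simplifying assumptions that go beyond the text: the observer is the constrained one (*σ* = *σ*_{1}√*N*), the correspondence between trajectory halves is perfect (no positional noise), and the log odds take the log-sum-exp form that follows from equal uncertainty across items. The function names and the parameter values (Δ = 38°, *σ*_{1} = 11.3°) are chosen for illustration.

```python
import math
import random

def log_sum_exp(values):
    """Numerically stable log of a sum of exponentials."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def simulate_pc(n_items, delta, sigma_1, n_trials=10_000, seed=1):
    """Monte Carlo proportion correct for one deviating trajectory of N.

    Assumes sigma = sigma_1 * sqrt(N), a known delta, and perfect
    correspondence between the two halves of each trajectory.
    """
    rng = random.Random(seed)
    sigma = sigma_1 * math.sqrt(n_items)
    correct = 0
    for _ in range(n_trials):
        # By symmetry it suffices to simulate C = 1; trajectory 0 deviates.
        diffs = [rng.gauss(delta if i == 0 else 0.0, sigma) for i in range(n_items)]
        d = (log_sum_exp([delta * u / sigma**2 for u in diffs])
             - log_sum_exp([-delta * u / sigma**2 for u in diffs]))
        correct += d > 0  # the response is correct when the log odds favor C = 1
    return correct / n_trials

pc_by_set_size = {n: simulate_pc(n, delta=38.0, sigma_1=11.3) for n in (1, 2, 4, 8)}
```

Under these assumptions, proportion correct is near ceiling at *N* = 1 and falls with set size, without any limit on the number of items processed.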

Δ_{thr}. There is no reason why the psychometric curve should have a cumulative normal shape, and in many cases it does not. However, it is a reasonable approximation, and our analysis of model data parallels that of the behavioral data.

*D* is described in the next section. Since capacity is an integer, we interpolate percentage correct as a function of capacity using an exponential fit, again following Tripathy et al. (2007).

*σ*_{1}, and in Experiments 2–5, the positional uncertainty at set size 1, *σ*_{pos,1}. The values of these free parameters were taken to be consistent across experiments. In Experiment 1, we expect the constant deviation threshold to be approximately equal to *σ*_{1}. Moreover, the threshold for discriminating the deviation sign in a single trajectory was measured separately as a function of dot speed (Tripathy & Barrett, 2004); we used approximately those values for *σ*_{1}, keeping in mind that different experiments used different dot speeds. *σ*_{pos,1} was fitted by hand. The experiment-specific parameters are listed in 2. In all simulations, we used at least 10,000 trials per condition.

*N*, i.e., *σ* = *σ*_{1} and *σ*_{pos} = *σ*_{pos,1}.

*K* (a positive integer) is assumed, meaning that on each trial, *K* trajectories are randomly selected; if *K* ≥ *N*, all are selected. If the deviating trajectory is among these, the observer will report the correct sign of the deviation. If none of the *K* selected trajectories deviates, then the observer guesses about the sign of the deviation, picking “clockwise” or “counterclockwise” each with probability 1/2.

*N* > 1.46*K*. We will prove this below. Another characteristic is that performance is independent of deviation angle, which is relevant in Experiments 3–5.

*N* of *N*

*K* or *N*, since the selected trajectory or trajectories will always be deviating and their sign of deviation will be known with absolute certainty.

*N*

*N* trajectories is deviating. The proportion of trials on which the observer responds correctly is then 1 if *N* ≤ *K*, and (1 + *K*/*N*)/2 if *N* ≥ *K*. This is shown as a function of *N*, for different values of *K*, in Figure 6a. The limited-capacity model does not take into account the angle of deviation, Δ. For any set size, *N*, and any value of the capacity, *K*, percentage correct is independent of deviation angle (see Figure 6b). Since deviation threshold is defined as the smallest value of the deviation angle for which proportion correct exceeds 0.841, the limited-capacity model predicts only two possible threshold values, zero and infinity, depending on whether PC in Equation 22 is larger or smaller than 0.841, respectively. We can compute for which values of *N* the deviation threshold is infinite by solving the inequality PC(*N*, 1) < 0.841, which gives *N* > 1.46*K*. For example, when *K* = 3, the threshold deviation according to the limited-capacity model will be infinite whenever *N* ≥ 5. This leads to threshold-versus-set size curves that look like those in Figure 6c.

*D* of *N*

*D* of *N* trajectories deviate, the limited-capacity model predicts that PC(*N*, *D*) = 1 when *N* < *K* + *D* (since then at least one deviating trajectory is attended to). When *N* ≥ *K* + *D*, the model predicts, through a basic combinatorial argument, the following proportion correct:

*D* and for different values of *K* in Figures 6d and 6e (with *N* = 6 and *N* = 8, respectively). Again, percentage correct is independent of deviation angle (Figure 6f). Equation 24 is valid regardless of whether different values of *D* are blocked or interleaved.
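The combinatorial argument can be written as a short function. This sketch follows the verbal description of the limited-capacity observer (attend *K* randomly chosen trajectories, respond correctly if any attended trajectory deviates, otherwise guess); the closed form used here is one way to express that argument and is not claimed to match the typeset form of the original Equation 24. The function name is hypothetical.

```python
from math import comb

def pc_limited_capacity(n_items, n_deviating, capacity):
    """Proportion correct for an observer who attends `capacity` randomly
    chosen trajectories, answers correctly if any of them deviates, and
    otherwise guesses with probability 1/2."""
    k = min(capacity, n_items)
    # Probability that none of the k attended trajectories deviates.
    # math.comb returns 0 when k exceeds the number of non-deviating items.
    p_miss = comb(n_items - n_deviating, k) / comb(n_items, k)
    return 1.0 - 0.5 * p_miss

# One deviating trajectory: this reduces to (1 + K/N) / 2 for N >= K.
pc_1_of_4 = pc_limited_capacity(4, 1, 3)   # 0.875, above 0.841
pc_1_of_5 = pc_limited_capacity(5, 1, 3)   # 0.8, below 0.841
```

For *K* = 3 the predicted proportion correct drops below 0.841 between *N* = 4 and *N* = 5, so the predicted threshold is infinite from *N* = 5 onward. Whenever *N* < *K* + *D*, the miss probability is zero and the function returns 1.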

*D* and *N*, the effective capacity of the human observer is defined as the capacity of a hypothetical limited-capacity observer with the same percentage correct, where PC is interpolated between integer values of *K* using an exponential fit. We applied the same method to the simulated responses of different models, to obtain model effective capacity.

*K* items can receive resource. The model was developed for short-term memory but can also be applied to attentional tracking. The shortcomings of the limited-capacity model for visual short-term memory were pointed out in a set of two-interval suprathreshold feature change detection experiments (Wilken & Ma, 2004). This work suggested instead that short-term memory limitations originate from the variability in the sensory encoding of items combined with a finite but continuous resource. A direct estimation task was introduced, in which subjects estimated in the second interval the feature value of one item, which was among multiple items present in the first interval. This confirmed that the precision with which items are maintained in memory decreases smoothly with set size. In response, Zhang and Luck proposed the slots-plus-averaging model, which attempted to address the decline of precision with set size by postulating that resources come in a small number of discrete chunks, *K* (slots). When there are fewer slots than items (*N* ≥ *K*), the number of slots is equivalent to the capacity in the traditional limited-capacity model. However, when there are more slots than items (*N* < *K*), multiple slots will be allocated to the same item, thereby increasing the quality of its encoding, in a way similar to the sampling argument mentioned in “Neural constraint on uncertainty.” Items that do not receive any slot are not maintained at all. Therefore, this model is a hybrid between the limited-capacity model and a continuous-resource model like the constrained Bayesian observer. Here, we only describe aspects of the slots-plus-averaging model that can be directly applied to the tracking task we study.

*N*, *D*) to denote predicted proportion correct at deviation angle Δ, when *D* of *N* trajectories are deviating. For the deviation threshold when *D* of *N* trajectories are deviating, we use the notation Δ_{thr}(*N*, *D*).

*N* of *N*

*K*. When *N* ≥ *K*, each attended item will be encoded with a certain standard deviation *σ*_{K} (one slot each). This standard deviation corresponds to the combined pre- and post-midline uncertainty. Averaging *K* such observations will produce standard deviation *σ*_{K}/√*K*. When *N* ≤ *K*, each item will be encoded with standard deviation *σ*_{N} = *σ*_{K}√(*N*/*K*). Averaging *N* such observations will produce standard deviation *σ*_{K}/√*K*. We conclude that the deviation threshold will be Δ_{thr}(*N*, *N*) = *σ*_{K}/√*K*, independent of *N*. That this is the same as in the Bayesian model is not surprising, since the naïve averaging operation happens to be optimal when all trajectories deviate.

*N*

*N* ≥ *K*, there is a probability of *K*/*N* that the deviating trajectory is allocated a slot. When this happens, its internal representation will have standard deviation *σ*_{K}. Then, probability correct equals the probability that an observation drawn from a normal distribution with mean Δ (which is positive) and standard deviation *σ*_{K} is itself positive (so that the correct deviation sign will be reported). This probability equals 1/2 + (1/2) erf(Δ/(*σ*_{K}√2)). On the other hand, when the deviating trajectory is not allocated a slot, performance will be at chance. Consequently, the predicted proportion of correct responses is, after simplification,

$$\mathrm{PC}(\Delta, N, 1) = \frac{1}{2} + \frac{K}{2N}\,\operatorname{erf}\!\left(\frac{\Delta}{\sigma_K\sqrt{2}}\right).$$

When Δ is large compared to *σ*_{K}, the error function will tend to 1, and PC is the same as in the traditional limited-capacity model, Equation 22. It follows from Equation 25 that proportion correct is bounded by PC(Δ = ∞, *N*, 1) = (1 + *K*/*N*)/2. Just as in the traditional limited-capacity model, PC never reaches 0.841 if *N* > 1.46*K* (see Equation 23). For these values of *N*, Δ_{thr}(*N*, 1) = ∞. For other values of *N*, i.e., those which satisfy *K* ≤ *N* < 1.46*K*, we can compute the threshold deviation from Equation 25 as:

$$\Delta_{\text{thr}}(N, 1) = \sigma_K\sqrt{2}\;\operatorname{erf}^{-1}\!\left(0.682\,\frac{N}{K}\right).$$

*N* ≤ *K*, each item will be attended and have at least one slot allocated to it. The average number of slots allocated will be *K*/*N*. As a consequence, the standard deviation of its internal representation will be reduced by a factor √(*K*/*N*): *σ*_{N} = *σ*_{K}√(*N*/*K*).

*K* = *N*, this is equal to Equation 25, and when Δ is large compared to *σ*_{K}, it is equal to 1 as in the traditional limited-capacity model. Thus, for *N* ≤ *K*, threshold is Δ_{thr}(*N* ≤ *K*, 1) = *σ*_{K}√(*N*/*K*).

*N* in Figure 7a. The psychometric function for *N* = 1 is identical to the one when *N* of *N* trajectories are deviating (regardless of *N*). Performance saturates at a level far below 100% for any sufficiently large *N*, a distinctive feature that casts serious doubt on any limited-capacity model (Bays & Husain, 2009).

*N* and for *N*-of-*N*, in Figure 7b. In order to give each value of *K* a fair chance to fit the data, we chose *σ*_{K} = *σ*_{K=1}√*K*, with *σ*_{K=1} = 3°. (Note that *σ*_{K=1} = *σ*_{N=1}.) With this choice, Δ_{thr}(1, 1) = 3° for any *K*, close to the measured value.
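The slots-plus-averaging thresholds for one deviating trajectory can be computed directly. The sketch below is built from the verbal description only: the deviating item receives a slot with probability *K*/*N* when *N* > *K*, an attended item is read out through Gaussian noise, and extra slots sharpen the representation when *N* ≤ *K*. It uses the standard normal distribution function in place of the error function (the two are equivalent), and the function name is hypothetical.

```python
from math import inf, sqrt
from statistics import NormalDist

TARGET = NormalDist().cdf(1.0)  # 84.1% correct, i.e. d' = 1

def threshold_one_of_n(n_items, capacity, sigma_k1=3.0):
    """Deviation threshold (degrees) of the slots-plus-averaging model when
    one of N trajectories deviates, with sigma_K = sigma_{K=1} * sqrt(K)."""
    sigma_k = sigma_k1 * sqrt(capacity)
    if n_items <= capacity:
        # Every item is attended; extra slots sharpen its representation.
        return sigma_k * sqrt(n_items / capacity)
    # The deviating item gets a slot with probability K/N; otherwise guess.
    needed = 0.5 + (TARGET - 0.5) * n_items / capacity
    if needed >= 1.0:
        return inf  # 84.1% correct cannot be reached at any deviation
    return sigma_k * NormalDist().inv_cdf(needed)

thr_1 = threshold_one_of_n(1, 3)   # 3 degrees for any K
thr_4 = threshold_one_of_n(4, 3)   # roughly 9 degrees
thr_5 = threshold_one_of_n(5, 3)   # infinite, since 5 > 1.46 * 3
```

With *K* = 3 this gives a threshold of about 9° at *N* = 4 and an infinite threshold from *N* = 5 onward.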

*D* of *N*

*N* < *K*). The predictions are in 1.

*K*, but inference is optimal within the selected subset of *K* items. We consider this model in order to determine whether a capacity limit on a Bayesian ideal observer can describe the data better than a continuous resource constraint. Predictions are derived in 1.

*C* = 1” when

*σ*^{2} factor for generality). This rule is not optimal in the other experiments. However, since it relies solely on the computation of a single global motion signal, it does not require the tracking of individual dots. Therefore, if human observers were following this strategy, one could even question whether the task under study is a tracking paradigm.

*all* items to extract a global signal, while in the latter, averaging is over the observations provided by multiple discrete slots allocated to the *same* item.

*N* of *N*

*N*, with a value of about 3° (Figure 8a, green line). The slots-plus-averaging model (Figure 8c) and the constrained Bayesian model (Figure 8f) predict the same. The latter is because the benefit of averaging over *N* observations is canceled by the increase of uncertainty with *N*, as explained in the Theory and methods section. The traditional limited-capacity model predicts that threshold is exactly zero for any *N* (Figure 8b). The unconstrained Bayesian model and the averaging model predict that threshold decreases as 1/√*N* (Figures 8d and 8g), reflecting only the benefit of averaging without an increase in uncertainty. The Bayesian model with capacity limit predicts the same decrease for *N* ≤ *K*, and a constant threshold for *N* > *K* (Figure 8e). The traditional limited-capacity model, the unconstrained Bayesian model, the Bayesian model with capacity limit, and the averaging model can be ruled out based on this experiment.

*N*

*N*, taking values of more than 30° at *N* = 4 (Figure 8a, red line). The three limited-capacity models (traditional capacity limit, slots-plus-averaging, Bayesian with capacity limit) predict that threshold will rise to infinity at set sizes exceeding 1.46*K* (Equation 23; Figures 8b, 8c, and 8e). This occurs because performance, even for very large Δ, is limited by the fact that a subset of size *K* is chosen and all other items are ignored. Indeed, asymptotic proportion correct is (1 + *K*/*N*)/2 (Equations 22 and 25, Figures 6b and 7a). Moreover, for smaller values of *N*, the limited-capacity model and the slots-plus-averaging model predict no or only a slow increase of threshold with set size, because they ignore uncertainty in stimulus representation (Figure 8b) or uncertainty about which trajectory deviates (Figures 8b and 8c).

*N* for *N* ≤ *K* (Equation 27) and predicts Δ_{thr} = 9° when *K* = 3 and *N* = 4 (Equation 26), which is far from the observed value. The effects of changing *K* were explored in Figures 6c and 7b. According to the averaging model, threshold grows as √*N* throughout (Figure 8g).

*N* = 5. In the Bayesian models, an increased jitter in the initial motion directions leads to an increase in threshold, as was found, but not expected, by Tripathy and Barrett (2004).

*N*

*K* ≥ *N*, and thus the difference between performance and chance drops as 1/*N* (see Equations 25 and A12). They predict an abrupt transition at *N* = *K* for Δ = 76°, contrary to the data. Both also predict a much smaller separation between the curves at Δ = 38° and Δ = 76° than is indicated by the data. In the unconstrained Bayesian model and the averaging model, performance decline is too slow to fit the data (Figures 9d and 9g); this is because uncertainty is constant with set size. Only the constrained Bayesian model describes the dependence of performance on *N* and Δ reasonably well (Figure 9f) because of the increase of uncertainty with set size. All models, except for the otherwise implausible limited-capacity model, underpredict performance at *N* = 1 and Δ = 19°.

*D*of

*N,*blocked

*K*( Figure 10b). The slots-plus-averaging model ( Figure 10c) and the Bayesian model with capacity limit ( Figure 10e) are again virtually identical in this experiment, but both predict too little separation between the curves at Δ = 38° and Δ = 76°, as does the unconstrained Bayesian model. Limited-capacity models predict that performance is independent of Δ at least for sufficiently large Δ ( Figures 10b, 10c, and 10e), while the unconstrained Bayesian model and the averaging model overestimate performance ( Figures 10d and 10g). The constrained Bayesian model reproduces the correct values and dependencies ( Figure 10f), notably only with two free parameters. Since

*N*is fixed in this experiment, the differences between models predictions arise from the numerical value of

*σ*

_{1}rather than on the dependence of

*σ*on

*N*.

*σ*

_{1}was taken to be 11.3°, in rough accordance with a separate experiment (see 2). A larger value would allow the slots-plus-averaging model to fit better than it does now but be less consistent with that experiment.

*D* = 3 and Δ = 76° in Figure 11f indicates that the model overestimates performance there.

*D*of

*N,*interleaved

*D* (1, 2) were interleaved, with *N* = 10. The limited-capacity and slots-plus-averaging models behave the same as in Experiment 4. The Bayesian models do not, since Bayesian observers make use of the statistical structure of the task. In Experiment 5, the observer no longer knows Δ or *D* on a trial-by-trial basis. The key point of this experiment is again that effective capacity depends on the magnitude of the change, contrary to the prediction of the traditional limited-capacity model. The values and the separation of the data are best matched by the constrained Bayesian model (Figure 12).

Model | Number of free parameters | Free parameters |
---|---|---|
Traditional limited capacity | 1 | *K* |
Slots plus averaging | 2 | *K*, *σ*_{*K*} |
Unconstrained Bayes | 2 | *σ*_{1}, *σ*_{pos,1} |
Bayes with capacity limit | 3 | *K*, *σ*_{1}, *σ*_{pos,1} |
Constrained Bayes | 2 | *σ*_{1}, *σ*_{pos,1} |
Averaging | 1 | *σ*_{1} |

*D*of

*N*

*D* at fixed *N* will exhibit a rapid decrease of threshold between *D* = 1 and *D* = 2. The slots-plus-averaging model predicts a very different pattern than the constrained Bayesian model, with the largest differences occurring at *D* = 1 (Figure 13b).

*σ*, is the same for all trajectories. However, the theory is by no means restricted to this. Since Bayesian theory assumes that a probability distribution over each stimulus is encoded *on a single trial*, a Bayesian observer should be able to take into account uncertainty when making a perceptual decision (Knill & Pouget, 2004; Knill & Richards, 1996). In other words, if different trajectories within a single display come with different amounts of uncertainty, the more uncertain ones should be weighted less. A powerful test of Bayesian optimality is therefore to vary the reliability of different objects in the same display and predict performance. This is routinely done in cue combination studies.

*N* trajectories deviating (Experiment 3). We assume that each trajectory comes with its own uncertainty, *σ*_{*j*} for the *j*th trajectory. The Bayesian decision rule is now a modified version of Equation 20, namely

*σ*_{*i*} for each trajectory). From this expression, it is clear that the extent to which a trajectory contributes to a decision is inversely proportional to its variance. This is interesting when different values *σ*_{*i*} are present within the same display.
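The weighting principle stated above can be illustrated with a generic inverse-variance weighting, as used in cue combination. This is only a minimal sketch of the principle and not the decision rule of the paper; the function name and the example uncertainties are hypothetical.

```python
def inverse_variance_weights(sigmas):
    """Return normalized weights proportional to 1 / sigma**2."""
    precisions = [1.0 / s**2 for s in sigmas]
    total = sum(precisions)
    return [p / total for p in precisions]


# Hypothetical display: two low-uncertainty items and one item with
# twice the standard deviation (values chosen for illustration only).
sigmas = [11.3, 11.3, 22.6]
weights = inverse_variance_weights(sigmas)
```

Doubling the standard deviation quarters the weight, so the high-uncertainty item contributes a quarter as much as each low-uncertainty item.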

*N* trajectories deviating (*N* fixed), where on each trial, we randomly assign high uncertainty (low contrast) to *H* items and low uncertainty (high contrast) to the remaining *N* − *H* ones. This leads to displays in which the number of high-uncertainty items can be anywhere from 0 to *N*. We take *N* = 6 and Δ = 38° and call the single deviating trajectory the target. Then the blue line in the inset in Figure 14a indicates percentage correct as a function of *H*, according to the constrained Bayesian model (parameters were taken identical to those in Experiment 3). It is not surprising that overall performance declines monotonically with *H*, as high-uncertainty items contain less information. However, if we divide the trials into two classes by the uncertainty of the target, then class-conditioned performance *increases* with *H*. This happens because target uncertainty is fixed within each class, so all that changes when *H* increases is the number of high-uncertainty *distractors*. The Bayesian decision rule, Equation 29, suppresses evidence from high-uncertainty items, so if those items are distractors, this increases the relative contribution of the target and therefore the performance on the discrimination task.

*H*, the weights in the averaging are determined by the relative numbers of observations in each class at that value of *H*, but these proportions depend on *H*. The higher *H*, the higher the probability that the target has high uncertainty. This probability is *H*/*N*, giving increasing weight to the percentage for the high-uncertainty class as *H* increases. In an equation:

*H* = 3 and *H* = 5, conditioned performance increases while combined performance decreases or stays constant. Moreover, percentage correct is distinctly non-flat as a function of *H*.
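The reversal described above follows from the changing mixture weights alone. The sketch below mixes two class-conditioned proportions correct with weight *H*/*N* on the high-uncertainty class, as stated in the text. The conditioned values are hypothetical numbers chosen for illustration, not output of the constrained Bayesian model.

```python
def combined_pc(h, n, pc_low, pc_high):
    """Mix class-conditioned proportions correct.

    The target has high uncertainty with probability h / n.
    """
    p_high = h / n
    return p_high * pc_high + (1.0 - p_high) * pc_low


N = 6
# Hypothetical conditioned performance, increasing with H in both classes.
pc_low = {h: 0.80 + 0.01 * h for h in range(0, N)}       # H = 0..N-1
pc_high = {h: 0.60 + 0.01 * h for h in range(1, N + 1)}  # H = 1..N

combined = []
for h in range(N + 1):
    low = pc_low.get(h, 0.0)    # weight is 0 when h == N
    high = pc_high.get(h, 0.0)  # weight is 0 when h == 0
    combined.append(combined_pc(h, N, low, high))
```

With these illustrative numbers, both conditioned curves rise with *H* while the combined curve falls, because the weight shifts toward the lower-performing class.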

*K* is given, *K* < *N*, and that the internal representation of a particular item has standard deviation *σ*_{*K*,low} when uncertainty is low and *σ*_{*K*,high} when uncertainty is high. Conditioned performances PC_{low} and PC_{high} are obtained by using *σ*_{*K*,low} and *σ*_{*K*,high} in Equation 25, respectively. Neither depends on *H*, since this model assumes that the observer has perfect knowledge about which trajectory, if any, changed among the attended ones. Unconditioned performance is again a weighted average of the conditioned performances. This produces the prediction of Figure 14b, in which conditioned performance is independent of *H*. Conducting this experiment could provide further evidence to distinguish the constrained Bayesian model from the slots-plus-averaging model and possibly from other models.

*N*. However, any alternative model can now be tested against the benchmark set by the constrained Bayesian model. (Results not reported here suggest that the above alternatives are also inadequate.)

*D* of *N* trajectories deviate (Figure 13). We predict the occurrence of Simpson's paradox when dots are allowed to vary in contrast (Figure 14). These experiments are expected to yield additional evidence against several alternative models.

*N*. A different population should encode the Bayesian decision variable, *d*, and therefore the posterior distribution over the binary variable *C*.

*C* here). Our results provide new evidence that the brain also uses Bayesian inference in such judgments, albeit sometimes under the constraint that uncertainty increases with set size. Moreover, our findings emphasize the parallel importance of feature and positional uncertainty. An important open question is why the attentional tracking task studied here is subject to an increase of uncertainty with set size, while visual search tasks do not seem to be (see also Palmer, 1990).

*σ,*appears in many optimal decision rules, such as those in cue combination (Knill & Richards, 1996). In the present paper, Equations 18 and 20 are examples. Encoding uncertainty on a trial-to-trial and item-to-item basis is especially critical when uncertainty differs between items (e.g., some dots have lower contrast than the others) or between trials (e.g., contrast is changed between trials). This was not explored in the experiments modeled here, but we made a prediction about it (Prediction 2) and it constitutes an important future direction. (Point estimates of uncertainty might be used instead of posterior distributions, but this proposal requires more details, and clues as to how it generalizes to non-Gaussian distributions.)

*N* = 1 and Δ = 19° is predicted to be below 100% in Experiment 3 (Figure 9f), in contrast to the data. Effective capacity at higher deviation angles is overestimated in Experiment 4 (Figure 11f).

*N* increase of uncertainty with set size can be implemented using divisive normalization, a mechanism believed to play a central role in attentional processing (Reynolds & Heeger, 2009). A full neural model will have to integrate this mechanism, which presumably acts at the input level, with the mechanism of the probabilistic inference process.

*D*of

*N,*blocked

**I** is a set of trajectory indices, ∣**I**∣ is the number of elements in this set, and Σ_{∣**I**∣ = *D*} is shorthand for “sum over all index sets **I** of size *D*.” The log odds are

*D*of

*N,*interleaved

*A* possible values, Δ_{1},…, Δ_{*A*}, all with equal probability, 1/*A*. Similarly, *D* is drawn from a discrete distribution with *B* possible values, *D*_{1},…, *D*_{*B*}, also all with equal probability, 1/*B*. The derivation starts with Equation 5 applied to a discrete distribution, *p*(Δ):

*D*:

**x**_{new} and **y**_{new}, as discussed in the Theory and methods section. When *A* = 1 and *B* = 1, Equation A7 is the same as Equation A3.

*p*(Δ). The result of this is:

*N* ≥ *K*. Then each trajectory can receive no more than one slot, and we have to calculate the probability that *M* deviating trajectories receive a slot. This is the same problem as: you draw *K* balls from a vase that contains *D* blue and *N* − *D* red balls. What is the probability of drawing *M* blue balls and *K* − *M* red ones? The answer is given by the hypergeometric probabilities, $\binom{D}{M}\binom{N-D}{K-M}/\binom{N}{K}$. Here, *M* is restricted to max(*K* + *D* − *N*, 0) ≤ *M* ≤ min(*K*, *D*). When *M* deviating trajectories receive a slot, uncertainty is *σ*_{*K*}/√*M*, therefore proportion correct is that of an observer whose noise has standard deviation *σ*_{*K*}/√*M*. Total proportion correct is obtained by averaging over *M*, with weight factors given by the hypergeometric probabilities:

For *D* = 1, this is equal to Equation 25. Next, we consider the case *N* ≤ *K*. In this case, each trajectory receives at least one slot. Specifically, the average number of slots per trajectory is *K*/*N*. As a consequence, the uncertainty in the representation of a single trajectory will be *σ*_{*N*} = $\sigma_K\sqrt{N/K}$. All *D* deviating trajectories will be represented with this uncertainty. Therefore, averaging over these *D* observations yields a representation with standard deviation $\sigma_K\sqrt{N/(KD)}$.

*D*are blocked (as in Experiment 4) or interleaved (as in Experiment 5).

*N* ≤ *K*, it follows from Equation A11 that Δ_{thr}(*N* ≤ *K*, *D*) = $\sigma_K\sqrt{N/(KD)}$. When *N* ≥ *K*, we have Equation A10, so Δ_{thr}(*N* ≥ *K*, *D*) cannot be computed directly. Instead, we simply find the value of Δ at which percentage correct reaches 84.1%. If this never happens (as is always the case in a limited-capacity model when *N* is sufficiently large), then threshold is infinite. Figure 13b was obtained in this way, with *σ*_{*K* = 1} = 3°.
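The procedure for *N* ≥ *K* can be sketched numerically as follows. This is a minimal illustration, not the authors' implementation. It assumes that proportion correct for *M* attended deviating trajectories is a cumulative normal with standard deviation *σ*_{*K*}/√*M* (chance when *M* = 0). The function names, the grid step, and the search range `delta_max` are hypothetical choices.

```python
from math import comb, inf
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal cumulative distribution


def pc_slots_plus_averaging(delta, n, d, k, sigma_k):
    """Proportion correct for n >= k as a hypergeometric mixture over M."""
    m_min, m_max = max(k + d - n, 0), min(k, d)
    total = 0.0
    for m in range(m_min, m_max + 1):
        weight = comb(d, m) * comb(n - d, k - m) / comb(n, k)
        # M = 0 gives PHI(0) = 0.5, i.e. chance performance.
        total += weight * PHI(delta * m**0.5 / sigma_k)
    return total


def threshold(n, d, k, sigma_k, criterion=0.841, delta_max=180.0, step=0.01):
    """Smallest grid value of delta reaching the criterion; inf if none does."""
    for i in range(1, int(delta_max / step) + 1):
        delta = i * step
        if pc_slots_plus_averaging(delta, n, d, k, sigma_k) >= criterion:
            return delta
    return inf
```

For example, `threshold(5, 1, 3, 5.2)` returns `inf` under these assumptions, because asymptotic performance at *N* = 5, *K* = 3, *D* = 1 is 0.8, which is below the criterion.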

*N* ≤ *K*, all items are attended and performance is identical to that of the unconstrained Bayesian model (denoted by a subscript “UB”), Δ_{thr}(*N* ≤ *K*, *D*) = Δ_{thr,UB}(*N* ≤ *K*, *D*). We therefore only consider the case *N* ≥ *K*.

*K* of them are attended, the deviation threshold at *N* ≥ *K* should be equal to that at *N* = *K*, which is obtained from the unconstrained Bayesian model: Δ_{thr}(*N* ≥ *K*, *N*) = Δ_{thr,UB}(*K*, *N*).

*K*/*N*. When it is attended, proportion correct is equal to that in the Bayesian model at set size *K*. Otherwise, performance is at chance. This yields the following expression for proportion correct for the Bayesian model with limited capacity:

PC_{UB}(Δ, *N* = *K*) is proportion correct in the unconstrained Bayesian model at deviation angle Δ and set size *K*. Just like in all limited-capacity models, asymptotic performance is PC(Δ = ∞, *N*, 1) = $\tfrac{1}{2}(1 + K/N)$ and Δ_{thr}(*N*, 1) = ∞ when *N* > 1.46*K*. When *N*/*K* ∈ [1, 1.46), threshold is finite but higher than in the unconstrained Bayesian model. Since PC(Δ = ∞, *N*, 1) < 1, we can no longer fit a cumulative normal distribution to PC(Δ) in order to estimate threshold. Instead, we have to fit a rescaled function, PC(Δ) = $\tfrac{1}{2} + \tfrac{1}{2}\,\tfrac{K}{N}\,\mathrm{erf}\big(\Delta/(\alpha\sqrt{2})\big)$, which has the correct asymptote. Comparing this with Equation A12, we find PC_{UB}(Δ, *K*, 1) = $\tfrac{1}{2} + \tfrac{1}{2}\,\mathrm{erf}\big(\Delta/(\alpha\sqrt{2})\big)$, from which it is clear that the parameter *α* is nothing but the deviation threshold of the unconstrained Bayesian model at set size *K*, Δ_{thr,UB}(*K*, 1). We find threshold from Equation A12:

Δ_{thr}(*N*, 1) = Δ_{thr,UB}(*N*, 1) if *N* = *K*; this is correct, since the capacity limit has no effect when all items are attended.
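The threshold behavior described here can be checked numerically. The sketch below assumes that unconstrained performance at set size *K* is a cumulative normal with standard deviation `alpha` (the unconstrained threshold), that an unattended target yields chance performance, and that threshold is the 84.1% point. The function names are hypothetical, and the closed form is derived here from those assumptions rather than quoted from the paper.

```python
from math import inf
from statistics import NormalDist

STD_NORMAL = NormalDist()


def pc_capacity_limited(delta, n, k, alpha):
    """Proportion correct for 1 of n deviating with capacity k <= n."""
    pc_ub = STD_NORMAL.cdf(delta / alpha)  # unconstrained model at set size k
    p_attended = k / n
    return p_attended * pc_ub + (1.0 - p_attended) * 0.5


def threshold_capacity_limited(n, k, alpha):
    """Delta at which proportion correct equals PHI(1); inf if unreachable."""
    target = STD_NORMAL.cdf(1.0)             # about 0.841
    needed = 0.5 + (n / k) * (target - 0.5)  # required unconstrained PC
    if needed >= 1.0:
        return inf
    return alpha * STD_NORMAL.inv_cdf(needed)
```

Under these assumptions the threshold equals `alpha` when *N* = *K* and becomes infinite once *N*/*K* exceeds roughly 1.46, in line with the text.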

*D* of *N* trajectories are deviating, the logic is very similar to that of the slots-plus-averaging model. *K* trajectories are picked at random to be attended, and of these, *M* will be deviating, where *M* is constrained by max(*K* + *D* − *N*, 0) ≤ *M* ≤ min(*K*, *D*). When *M* of the *K* attended trajectories are deviating, performance is given by that of the unconstrained Bayesian model for *M*-of-*K* deviating trajectories:

*D*= 1, but it does. (Note that PC

_{UB}(Δ,

*N,*0) =

*N*.) Asymptotic performance is PC(Δ = ∞, *N* ≥ *K*, *D*) = $1 - \tfrac{1}{2}\binom{N-D}{K}/\binom{N}{K}$, again below 100%.

*N* = 1, 2, 3, 4, 6, 8. All trajectories deviated, *D* = *N*. Directional uncertainty at set size 1 was chosen *σ*_{1} = 2.8° (*σ*_{pre,1} = *σ*_{post,1} = 2°); this value is comparable to the single-trajectory threshold reported in Figure 2b of Tripathy and Barrett (2004) at the dot speed used, 32 deg/s. Positional uncertainty and the jitter in the initial motion directions are not relevant since the decision rule is to average over all trajectories.

*N* = 1, 2, 3, 4, 5. One trajectory deviated, *D* = 1. Directional and positional uncertainty were *σ*_{1} = 2.8° and *σ*_{pos,1} = 21′, respectively. Vertical distances between the midpoints (relevant for positional uncertainty) were taken to be 10′, with 10′ uniform jitter, as in the experiment. Initial motion directions were drawn from a uniform distribution on [−32°, 32°]. In the model, the mean initial motion direction is irrelevant.

*N* = 1, 2, 3, 4, 6, 8; *D* = 1; Δ = 19°, 38°, 76°. For directional uncertainty at set size 1, we took *σ*_{1} = 11.3°. This is larger than in Experiments 1 and 2 because dot speed is lower, 4 deg/s. The value is again comparable to, though somewhat higher than, the single-trajectory threshold reported in Figure 2b of Tripathy and Barrett (2004) (observer DB). Positional uncertainty at set size 1 was chosen *σ*_{pos,1} = 21′. Vertical distances between the midpoints: 40′ with 5′ jitter. Initial motion directions were drawn from a uniform distribution on [−80°, 80°].

*N* = 6, 8; *D* = 1, 2, 3, 5, 6, 8; Δ = 19°, 38°, 76°; *σ*_{1} = 11.3°; and *σ*_{pos,1} = 21′. Vertical distances between the midpoints: 30′ with 5′ jitter. Initial motion directions were drawn from a uniform distribution on [−80°, 80°]. In Experiment 5, parameters were identical except for *N* = 10; *D* = 1, 2.

*D* = 1,…, 5. Other parameters were as in Experiment 2. In Prediction 2, *N* = 6 and Δ = 38°. The number of high-uncertainty trajectories took values *H* = 0, 1, 2,…, 6. Other parameters: *σ*_{1,low} = 11.3° (directional uncertainty of low-uncertainty trajectory at *N* = 1); *σ*_{1,high} = 22.6°; *σ*_{pos,1,low} = 21′; and *σ*_{pos,1,high} = 42′. The vertical distances between midpoints and the initial motion directions were chosen as in Experiment 3.

*K,*was chosen to be 3.

*K*, which was chosen to be *K* = 3 or *K* = 4 in this paper. *K* = 3 often fits the data best (though still not very well), and *K* = 4 is believed to be the capacity limit in standard multiple-object tracking (Pylyshyn & Storm, 1988). The effect of varying *K* is explored in Figures 6 and 9b.

*K* = 3. The effect of changing *K* is explored for the near-threshold experiments in Figure 7b. For the suprathreshold experiments, changing *K* does not improve the resemblance of the model predictions to the data in Figures 9, 10, 11, and 12. We adjusted the uncertainty parameter of the model, *σ*_{*K*}, for different *K* to give the model a fair chance, as explained below Equation 27. In all of these except Figure 7b, *σ*_{*K* = 1} = 11.3°, as in the Bayesian model (keep in mind that *σ*_{*K* = 1} = *σ*_{*N* = 1}). Therefore, *σ*_{*K* = 3} = 11.3°·√3 = 19.6°. For Experiments 1 and 2 (Figures 7 and 8) and Prediction 1 (Figure 13), *σ*_{*K* = 1} = 3°, and *σ*_{*K* = 3} = 3°·√3 = 5.2°. In Prediction 2, we used *N* = 6; *K* = 3; Δ = 38°. The number of high-uncertainty items took values *H* = 0, 1, 2,…, 6. Directional uncertainty was *σ*_{*K*,low} = 11.3°·√3 = 19.6° and *σ*_{*K*,high} = 22.6°·√3 = 39.1°.
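The parameter conversions above all apply the same scaling, *σ*_{*K*} = *σ*_{*K* = 1}·√*K*. The short sketch below reproduces the arithmetic; the function name and dictionary labels are hypothetical.

```python
from math import sqrt


def sigma_k_from_sigma_1(sigma_1, k):
    """Per-slot uncertainty sigma_K = sigma_{K=1} * sqrt(K), in degrees."""
    return sigma_1 * sqrt(k)


# Values of sigma_{K=1} quoted in the text, converted to K = 3.
sigma_k3 = {
    "suprathreshold (11.3 deg)": round(sigma_k_from_sigma_1(11.3, 3), 1),
    "near-threshold (3 deg)": round(sigma_k_from_sigma_1(3.0, 3), 1),
    "high uncertainty (22.6 deg)": round(sigma_k_from_sigma_1(22.6, 3), 1),
}
```

Rounded to one decimal place, these give 19.6°, 5.2°, and 39.1°, matching the values stated above.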

*σ*_{1}, was chosen to be *σ*_{1} = 2.8° in Experiments 1 and 2, and *σ*_{1} = 11.3° in Experiments 3–5.

*σ* ∝ √*N*. This relationship was derived from a neural argument, and it accounts well for the behavioral data. However, in a recent short-term memory experiment, a power law with a higher exponent was found, *σ* ∝ *N*^{*α*} with *α* ≈ 0.74 (Bays & Husain, 2008). A higher exponent might also be consistent with earlier data (Wilken & Ma, 2004). A higher power could arise from several causes: (a) feature uncertainty might be confounded with positional uncertainty; (b) the total number of spikes expended might decrease with set size; or (c) the form of neural variability might differ from Poisson-like, leading to a different relationship between neural gain and uncertainty. The value of the exponent, and its origins, deserve further study. No data about *α* are available for attentional tracking. Here, we only examine the consequences of a higher power for the constrained Bayesian model.

*σ* ∝ √*N* caused threshold to be independent of set size, since this increase in uncertainty exactly canceled out the benefit from averaging *N* independent observations. In this scenario, the number of trajectories a subject attends to does not affect performance. This changes when *σ* ∝ *N*^{*α*}, with *α* > 0.5. Attending to all trajectories would lead to an increase in threshold with set size, namely Δ_{thr} = *σ*/√*N* ∝ *N*^{*α*}/√*N* ∝ *N*^{*α* − 1/2}.

*D* of *N* trajectories deviate (*D* < *N*).
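The scaling argument for the case in which all trajectories deviate can be checked with a few lines of arithmetic. This is a minimal sketch assuming *σ*(*N*) = *σ*_{1}·*N*^{*α*} and a threshold of *σ*(*N*)/√*N* when all *N* trajectories are averaged; the function name and the chosen set sizes are hypothetical.

```python
def averaging_threshold(n, sigma_1, alpha):
    """Threshold when all n trajectories deviate and are averaged."""
    sigma_n = sigma_1 * n**alpha  # assumed power-law growth of uncertainty
    return sigma_n / n**0.5       # benefit of averaging n observations


set_sizes = [1, 2, 4, 8]
# alpha = 0.5: growth in uncertainty cancels the averaging benefit.
flat = [averaging_threshold(n, 2.8, 0.5) for n in set_sizes]
# alpha = 0.74: threshold grows as n**0.24.
rising = [averaging_threshold(n, 2.8, 0.74) for n in set_sizes]
```

With *α* = 0.5 the threshold stays at *σ*_{1} for every set size, whereas with *α* = 0.74 it grows slowly with *N*.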

*N* = 5 and *D* = 3 in the near-threshold paradigm. Attending to a subset is equivalent to imposing a capacity limit, which may depend on *N* and *D*. We compute performance of an observer with a capacity limit *K* ≤ *N*, who follows the constrained Bayesian model for the items selected on each trial. This is done in a manner analogous to the Bayesian model with capacity limit (see Theory and methods section), except that base performance is now from the Bayesian model with constraint. Analogous to Equation A14, we have

*K* < 5 exceeds performance for *K* = 5 (attending to all items) over a range of Δ. However, this slight improvement comes at the cost of deteriorated performance in a different range of Δ. We have also ignored the fact that unattended items may get confused with attended ones because of positional uncertainty. Overall, we cannot say that attending to a subset is beneficial. Whether it is beneficial or not also depends on *N* and *D*, which might make this strategy impractical. Using the above parameters, attending to a subset always hurts when *α* = 0.5 (Figure C1b). We conclude that even if uncertainty grows faster than the square root of set size, this does not, within limits, greatly affect the optimal strategy.