Where the eyes fixate during search is not random; rather, gaze reflects the combination of information about the target and the visual input. It is not clear, however, what information about a target is used to bias the underlying neuronal responses. We here engage subjects in a variety of simple conjunction search tasks while tracking their eye movements. We derive a generative model that reproduces these eye movements and calculate the conditional probabilities that observers fixate, given the target, on or near an item in the display sharing a specific feature with the target. We use these probabilities to infer which features were biased by top-down attention: Color seems to be the dominant stimulus dimension for guiding search, followed by object size, and lastly orientation. We use the number of fixations it took to find the target as a measure of task difficulty. We find that only a model that biases multiple feature dimensions in a hierarchical manner can account for the data. Contrary to common assumptions, memory plays almost no role in search performance. Our model can be fit to average data of multiple subjects or to individual subjects. Small variations of a few key parameters account well for the intersubject differences. The model is compatible with neurophysiological findings of V4 and frontal eye fields (FEF) neurons and predicts the gain modulation of these cells.

*top-down attention* refers to the mechanism of selectively biasing the weights of specific feature channels regardless of spatial location (Desimone & Duncan, 1995; Itti & Koch, 2001; Muller, Heller, & Ziegler, 1995; Palmer, 1994). We construct a computational model, in close analogy to the neuronal processes underlying saccade planning, that generates realistic eye movements, and we compare these against our own data.

*x* and *y* axes, respectively. Noise was added to each grid position (uniformly distributed between ±1° and ±0.5° on the *x* and *y* axes, respectively). There were two types of distractor items, each sharing one feature with the target. Each of the distractor items was present 24 times. Thus, a given search display consisted of three unique items out of the four possible. The left-out item shares no features with the target. Each item occupied between 0.5° and 1.0° (see Figure 1). Items were presented on a light grey background to reduce contrast.

*SD*). Thus, the number of fixations to find the target was equal (up to a scale factor) to the time it takes to find the target. This was the case for all subjects except one, who was excluded because that subject's fixation durations depended on the task, so the relationship does not hold.

*x* in the *j*th feature map, *I*^{j}(*x*), is set to either 0 (absent) or *R* (present), where *R* = 10 is the baseline rate. The feature maps are combined linearly to yield a rate of a Poisson process at each location:

$$\lambda(x) = \sum_{j} w_{j}\, I^{j}(x) \tag{1}$$

*λ* can be thought of as a simplified saliency map representation (Itti & Koch, 2001) of the search array, modulated by knowledge about the target. The weights *w*_{j} are set based on the knowledge about the target (Wolfe, 1994). By default *w*_{j} = 1, except for the two feature maps which define the target (red and horizontal in Figure 3). The weight of the feature map that represents the *primary* feature (primary with regard to the hierarchy of features that we report; red in the example of Figures 1A and 3) is set to *p* + 1, and the weight of the feature map that represents the other feature of the target (*secondary*; here horizontal) is set to *sp* + 1. Thus, *w*_{1} = *p* + 1 and *w*_{3} = *sp* + 1 for the example in Figure 3; *p* and *s* are positive parameters. We refer to the weighted sum of all feature maps *λ*(*x*) as the *target map*. It combines top-down target knowledge with bottom-up visual information and is the basis for all further processing. Each element in the target map is the mean of a Poisson process from which a sample is drawn every time the target map for *x* is used. In the following, any reference to *λ*(*x*) means sampling a number from a Poisson process with mean *λ*(*x*). Such random sampling is an important property of any neurobiological circuit and has important consequences.
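The construction of the target map can be illustrated with a minimal sketch. It assumes that each item is present in exactly one feature map per stimulus dimension that varies in the display, that the display is a color/orientation search for a red horizontal target, and that *p* = 0.9 and *s* = 0.7. The feature names, the helper functions, and the choice of sampler are illustrative and not taken from the original implementation.

```python
import math
import random

R = 10.0  # baseline rate of a feature that is present (value given in the text)

def target_map_rate(item_features, weights):
    """Mean rate lambda(x): weighted sum of the binary feature maps at one location."""
    # Feature maps without top-down modulation keep the default weight of 1.
    return sum(weights.get(feature, 1.0) * R for feature in item_features)

def sample_poisson(mean, rng):
    """Draw one Poisson sample (Knuth's multiplication method, adequate for small means)."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count

# Hypothetical example: target is red and horizontal, primary feature is color.
p, s = 0.9, 0.7
weights = {"red": 1.0 + p, "horizontal": 1.0 + s * p}
items = {
    "target": ("red", "horizontal"),
    "color_distractor": ("red", "vertical"),
    "orientation_distractor": ("green", "horizontal"),
}
rates = {name: target_map_rate(features, weights) for name, features in items.items()}

# A fresh sample is drawn every time the target map is consulted.
rng = random.Random(0)
samples = {name: sample_poisson(rate, rng) for name, rate in rates.items()}
```

Under these assumptions the target has the highest mean rate, followed by distractors sharing the primary feature, but any single sample can reverse that ordering because of the Poisson noise.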

*F*(*x*) (Equation 2) for every element *x* and choosing the item *x* that has the highest value of *F*(*x*).

*x* = 1…49. The saccade planning process has a memory that has the capacity to store the last *m* fixated locations (with *m* = 0 corresponding to no memory). If the item with the highest *F*(*x*) is currently stored in memory, the next highest value is chosen (iteratively).
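The selection rule with memory can be sketched as follows. The sketch assumes that the values *F*(*x*) are already available as a list indexed by item, and that the memory is a bounded queue of the last *m* fixated indices; the function names and the fallback for a fully remembered display are assumptions made for illustration.

```python
from collections import deque

def make_memory(m):
    """Memory for the last m fixated items; maxlen=0 stays empty (m = 0, no memory)."""
    return deque(maxlen=m)

def next_fixation(scores, memory):
    """Return the index with the highest F(x) that is not currently in memory."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    for index in ranked:
        if index not in memory:
            return index
    # Assumption: if every item is remembered, fall back to the best-scoring one.
    return ranked[0]

# Hypothetical F(x) values for three items, with a memory of one fixation.
scores = [0.2, 0.9, 0.5]
memory = make_memory(1)
first = next_fixation(scores, memory)
memory.append(first)
second = next_fixation(scores, memory)
```

With *m* = 1 the model never refixates the item it has just left, so here the second choice moves on to the item with the next highest value.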

*SD*), which is larger than the mean interitem distance of (2.1° ± 2.1°). To reproduce saccades with such an amplitude distribution, we use a Gaussian “energy” constraint, *E*(*r*) = exp(…), with *r* the distance, in degrees, between the current fixation and the location of *x*. *E*(*r*) is maximal at the preferred saccade length *m*_{SAD} and is smaller for all other values. *m*_{SAD} is set equal to the median of the measured distribution of the saccade amplitudes (Figure 6C). Furthermore, saccades tend to continue in the same direction as the previous saccade. To reproduce this inertia, we added the term *δ*(*x*, *c*_{t}, *c*_{t−1}), corresponding to the angular difference in orientation of two lines: (i) the line connecting the previously fixated location *c*_{t−1} and the current fixation and (ii) the line between the current fixation location *c*_{t} and *x*. It is normalized such that the maximal difference (180°) is equal to 1. That is, *δ*(*x*, *c*_{t}, *c*_{t−1}) = |*α*_{1} − *α*_{2}|/180, where *α*_{1} ≤ 180 and *α*_{2} ≤ 180 are the orientations of the two lines relative to the horizontal line measured at the origin of the saccades (*c*_{t} and *c*_{t−1}).
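The direction term can be illustrated with a short sketch. It assumes that the angular difference is taken between the previous saccade vector and the candidate saccade vector, folded into the range 0° to 180° and divided by 180 so that a full reversal gives 1. The function name and the use of (x, y) tuples in degrees of visual angle are illustrative choices, not details taken from the model's implementation.

```python
import math

def direction_change(prev_fix, cur_fix, candidate):
    """Normalized angle (0..1) between the previous saccade and a candidate saccade."""
    # Direction of the saccade that brought the eye to the current fixation.
    a1 = math.atan2(cur_fix[1] - prev_fix[1], cur_fix[0] - prev_fix[0])
    # Direction of the candidate saccade from the current fixation to the item.
    a2 = math.atan2(candidate[1] - cur_fix[1], candidate[0] - cur_fix[0])
    diff = abs(math.degrees(a1 - a2)) % 360.0
    if diff > 180.0:
        diff = 360.0 - diff  # fold into 0..180 degrees
    return diff / 180.0

straight_on = direction_change((0, 0), (1, 0), (2, 0))   # continue in the same direction
right_angle = direction_change((0, 0), (1, 0), (1, 1))   # turn by 90 degrees
reversal = direction_change((0, 0), (1, 0), (0, 0))      # return to the previous fixation
```

Because a return saccade to the previous fixation receives the maximal value, a penalty on this term discourages immediate refixations even without an explicit memory.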

*k*_{0}, *k*_{1}, and *k*_{2} are constants that set the weight of the two mechanistic terms, relative to the target map.

*C* items within *D* degrees (radius) of the current fixation (both are parameters). If there are more than *C* items to evaluate within the given area, it evaluates the *C* items with the highest values of *λ*(*x*). The capacity limitation does not imply either a parallel or a serial detection process. Whether the detection process has a capacity limitation at all is determined by the value of *C*: if *C* is greater than or equal to the number of items within the radius *D*, the capacity of the process is effectively infinite. We here assume that the target detection process has no memory; that is, it does not take into account any information from previous fixations.
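The capacity-limited selection of candidates for detection can be sketched as below. The sketch assumes that item positions are given in degrees of visual angle and that the values of *λ*(*x*) have already been sampled for the current fixation; the function name and argument layout are hypothetical.

```python
import math

def detection_candidates(fixation, positions, rates, capacity, radius):
    """Indices of up to `capacity` items within `radius` degrees, highest lambda(x) first."""
    in_range = [
        i for i, (x, y) in enumerate(positions)
        if math.hypot(x - fixation[0], y - fixation[1]) <= radius
    ]
    in_range.sort(key=lambda i: rates[i], reverse=True)
    return in_range[:capacity]

# Hypothetical display: four items on a line, fixation on the first one.
positions = [(0, 0), (1, 0), (5, 0), (10, 0)]
rates = [10, 30, 20, 40]
limited = detection_candidates((0, 0), positions, rates, capacity=2, radius=6)
unlimited = detection_candidates((0, 0), positions, rates, capacity=49, radius=6)
```

When `capacity` is at least the number of items inside the radius, every item in range is evaluated, which corresponds to the effectively infinite capacity described above.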

*p* and *s* of the model such that both the number of fixations and the conditional probabilities are reproduced as closely as possible. The other parameters were kept constant (see the Results section). To find the values of *p* and *s*, we calculated the number of fixations *N*(*p*, *s*) and the conditional probabilities *P*(*p*, *s*) for all combinations of *p* and *s* between 0.1 and 1 in steps of 0.1. We then used a least squares error measurement to simultaneously fit the two functions *N*(*p*, *s*) and *P*(*p*, *s*).
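The grid search can be sketched as follows. How the two error terms were weighted against each other is not stated, so the scale factors below are an assumption, and `toy_simulate` is a made-up stand-in for running the full model; it only serves to make the sketch executable.

```python
def fit_p_s(simulate, target_n, target_p, n_scale=1.0, p_scale=1.0):
    """Grid search over p and s (0.1..1.0 in steps of 0.1) minimizing a squared error."""
    best = None
    for i in range(1, 11):
        for j in range(1, 11):
            p, s = i / 10, j / 10
            n_model, p_model = simulate(p, s)
            # Assumption: both errors are divided by a scale before being summed.
            error = ((n_model - target_n) / n_scale) ** 2 \
                + ((p_model - target_p) / p_scale) ** 2
            if best is None or error < best[0]:
                best = (error, p, s)
    return best[1], best[2]

def toy_simulate(p, s):
    """Hypothetical stand-in for the model; returns (number of fixations, probability)."""
    return 15.0 - 8.0 * p * (0.5 + s), 0.5 + 0.5 * p * (1.0 - s)

# Recover known parameters from data generated by the stand-in itself.
observed_n, observed_p = toy_simulate(0.9, 0.7)
fitted = fit_p_s(toy_simulate, observed_n, observed_p)
```

In practice `simulate` would run the stochastic model over many trials and return averages, so the error surface would be noisy and the scale factors would matter more than in this deterministic example.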

*N*) was statistically the same for the two tasks where color was available (CO and CS) (8.19 ± 0.85 and 7.85 ± 0.75; *P* = .48, *t* test), whereas the SO condition required significantly more fixations (13.18 ± 1.72; *P* < 10^{−6}, *t* test). Although this might indicate that orientation cannot be used at all as a feature, additional controls confirm that it can be used if necessary. The two control conditions were rotated Ts (letter T) and rotated corners with equal length of both edges (see Figure 2A and the Methods section). In the “rotated Ts” condition, only orientation was available. Although performance was worse than during a conjunctive search, it was clearly better than in the “corners” condition, where search appears to be random (none of the three features available). Also, if any of the three features uniquely defined the target (e.g., pop-out), the item was quickly found (right three bars in Figure 2A). This demonstrates that the subjects were able to detect all targets, no matter how they were defined. Note, however, that even in pop-out a substantial number of fixations were required until the target was found (on average 4.45 ± 0.45 fixations, ± *SE* with *n* = 445 trials, for all three pop-out tasks).

*P*(share color ∣ Target) and *P*(share orientation ∣ Target) for the CO condition). These two conditional probabilities are not independent because (by design) if an item does not share the color with the target, it shares its orientation and vice versa. We find that subjects primarily used one of the two features in all three search conditions (Figure 2). If color was available, it was strongly preferred (CO and CS condition, Figures 2B and 2C, respectively); if color defined the target, most eye movements were close to items whose color was identical to the color of the target. On the other hand, there was a preference for size over orientation in the SO condition (Figure 2D).
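A simplified way to estimate such conditional probabilities from fixation data is sketched below. It assumes that every fixation is assigned to its nearest item, that the item list contains only distractors, and that blank fixations have already been removed; the data layout and the function name are hypothetical.

```python
import math

def conditional_probabilities(fixations, items, target_features):
    """Fraction of fixations whose nearest item shares each target feature."""
    if not fixations:
        raise ValueError("at least one fixation is required")
    counts = {feature: 0 for feature in target_features}
    for fx, fy in fixations:
        nearest = min(items, key=lambda it: math.hypot(it["x"] - fx, it["y"] - fy))
        for feature in target_features:
            if feature in nearest["features"]:
                counts[feature] += 1
    return {feature: counts[feature] / len(fixations) for feature in target_features}

# Hypothetical CO display with a red horizontal target and two distractors.
items = [
    {"x": 0.0, "y": 0.0, "features": {"red", "vertical"}},
    {"x": 4.0, "y": 0.0, "features": {"green", "horizontal"}},
]
fixations = [(0.2, 0.0), (0.5, 0.1), (3.8, 0.0), (0.1, 0.3)]
probabilities = conditional_probabilities(fixations, items, ("red", "horizontal"))
```

Because each distractor shares exactly one feature with the target, the two estimates sum to 1 in this example, which mirrors the dependence between the two conditional probabilities noted above.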

*P* = .01, *t* test; Figures 2B and 2C). If it were ignored, these probabilities should be equal. It thus seems that the primary feature is not the sole factor for determining the fixation probability. Also note that the search difficulty (number of fixations) for the two conditions was not different, despite the different conditional probabilities. On the other hand, the conditional probability was approximately equal for the CS and SO tasks, but their difficulty was very different (8 vs. 13 fixations). This is a further indication that multiple features are used to guide the search. We will explore this seemingly contradictory fact with our computational model.

*t* test between data and chance control for CO, CS, SO, respectively, for third nearest neighbor: *P* = .34, .45, .14). Thus, only the nearest neighbor seems to induce a strong bias, and therefore saccades seem to be primarily targeted toward specific items (see the Discussion section).

*p* and *s*), memory capacity (*m*), detection capacity (*C*), and detection radius (*D*). The first two parameters (*p* and *s*) modulate the target map. Memory capacity determines how many fixations, relative to the current one, the saccade planning process remembers (and thus does not revisit). The other two parameters influence the detection process (*C*, *D*). The saccade planning process is thus parameterized by three parameters only (*p*, *s*, and *m*). Below, we evaluate each parameter in terms of search performance (Figure 4), quantified by number of fixations (*N*) as well as conditional probabilities (*P*). We only plot the conditional probability of the primary feature *P* = *P*(primary∣T) for each condition because the conditional probability of the secondary feature is approximately 1 − *P* (except for blank fixations, which are <1%).

*p* and *s* each from 0 (no modulation) to 1 (doubling the object's rate in the target map) while keeping all other parameters constant (*m* = 1, *C* = 2, *D* = 6). We specify *s* as a fraction of *p* to ensure that primary attention is always stronger than secondary attention; thus, *s* = 0.5 means that *s* is equal to 50% of *p*. For example, if *p* = 0.6 and *s* = 0.5, the weights for primary and secondary attention are set to 1.6 and 1.3, increasing the mean amplitude of these two features in the target map by 60% and 30%, respectively.
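The arithmetic of this parameterization can be checked with a small sketch, assuming the weights are *p* + 1 for the primary and *sp* + 1 for the secondary feature as defined in the model description; the function name is illustrative.

```python
def attention_weights(p, s):
    """Weights of the primary and secondary feature maps (s is a fraction of p)."""
    primary = 1.0 + p
    secondary = 1.0 + s * p
    return primary, secondary

primary, secondary = attention_weights(p=0.6, s=0.5)
# Percentage increase of each feature's mean rate relative to the default weight of 1.
primary_gain_percent = round((primary - 1.0) * 100)
secondary_gain_percent = round((secondary - 1.0) * 100)
```

Since *s* multiplies *p*, any *s* below 1 keeps the secondary weight below the primary weight regardless of the value of *p*.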

*p* and *s* have a strong, but differential, influence on search performance (Figure 4A). If only one feature is modulated (i.e., *s* = 0), the model cannot be fitted to the data (red line in Figure 4A). For example, for *p* = 1 and *s* = 0, *N* > 15, whereas subjects routinely required fewer fixations for any of the three tasks (e.g., ∼8 for the CO and CS task). This is the case despite the very high conditional probability of *P* = 1.0 (higher than measured). Two conclusions can be drawn at this stage: Attentional modulation of only one feature is not sufficient, and strictly fixating on elements that share the primary feature with the target does not guarantee high performance. This also implies that it is necessary to combine the feature maps with a sum instead of a Max operation: The Max method only allows deployment to one feature. In our model, replacing the sum with a Max operation in Equation 1 is equal to setting *s* = 0. As can be seen in Figure 4A, realistic levels of performance are impossible to reach with this setting.

*s* while keeping *p* constant (Figure 4B) increases the number of fixations *N* and decreases the conditional probability *P*, reproducing our data. Note that *s* is a fraction of *p* and not an absolute value to ensure that *p* > *s* at all times (see above). The tradeoff between low and high values of *s* can be visualized in terms of the target to distractor ratio (TDR) and target visibility. Here, TDR is defined as the ratio between the mean value of the target and the mean value of all distractors. The mean value of an item *x* is equal to the mean value of the Poisson process *λ*(*x*) in the target map. No secondary attention (*s* = 0) results in maximal difference between the values of the items that share the primary feature with the target and those that do not, but the target is indistinguishable from any other of the 23 distractors of the attended type (TDR = 1). On the other hand, increasing *s* reduces the difference between the two dimensions of the primary feature map but makes the target more visible (TDR > 1). This tradeoff explains why lower conditional probabilities for the primary feature do not necessarily result in lower search performance (e.g., CO vs. CS).

*m* only affects the planning process. Increasing the memory capacity from *m* = 0 to *m* = 1 improves performance strongly (Figure 4C). However, further increases in memory capacity (*m* > 1) have a negligible influence on performance. Even perfect memory (*m* = 49) hardly improves performance. See the Discussion section for an explanation of this finding.

*N* but not *P*. First, we explored the influence of *C* and *D* on performance under the assumption that the detection process is capacity limited. Increasing detection capacity while keeping the detection radius constant (*D* = 6) increases performance (Figure 4E), but only until *C* = 3. The detection radius with constant capacity (*C* = 2) only has a minor influence (Figure 4D). Both results can be explained by the properties of the search array: Given even a moderately sized search radius, there are more elements than can be processed within the detection radius. However, only a few (<4) of those candidates are likely to be the target (e.g., share the primary feature), and thus search performance does not increase if *C* and *D* are increased further.

*C* and *D* because we assume that the serial nature of the detection process does not allow more covert attentional shifts within the time of a single fixation. Next, we explored performance in cases of a more capable (possibly parallel, see the Discussion section) detection process (*C* > 2, *D* > 6). We find (Figure 5) that a higher capacity process leads to large performance improvements if the detection radius is sufficiently large. Realistic detection thresholds are typically below 10° (*D* ≤ 10, see the Discussion section). Assuming an infinite detection capacity (*C* > number of items within radius *D*), we find that increasing the detection radius from 2 to 10 increases search performance dramatically (Figure 5, *C* = 49). Thus, the model is capable of reproducing a wide range of possible fixation and detection behaviors.

*m* = 1), a detection capacity of *C* = 2, and a search radius of *D* = 6. As shown in Figure 4, the exact choice of these parameters is not critical.

*k*_{0} = 15, *k*_{1} = 10, and *k*_{2} = 5; these constants are adjusted such that the eye movements have realistic properties (saccade length and angle). Their exact values are not crucial, and changing them only affects the mechanistic properties of the generated eye movements.

*N* and *P* for all tasks (CO, CS, SO) by finding values for the parameters *p* and *s* that fit the experimental data (see the Methods section). We find that the averaged data can be fit well: Running the model with the identified parameters on the same data that the subjects saw reproduces both the number of fixations and the conditional probabilities observed (Figures 6A and 6B). Also, the distribution of the saccade amplitudes is similar to the experimental data (Figure 6C), indicating that the model produces realistic eye movements. Note that the shape of the SAD is not explicitly specified by the model description (Equation 2). Rather, it is a result of the interaction of the term *E*(*x*) in Equation 2 and the target map. If evidence for a given item (given by *λ*(*x*)) is strong, long saccades are made. This leads to the long tail of the SAD.

*p*/*s* values for the three tasks (CO, CS, and SO) were 0.9/0.7, 0.9/0.8, and 0.45/0.7. Thus, the increases in firing rate (mean of Poisson process in model) for the primary and secondary features are 90%/63%, 90%/72%, and 45%/32%, respectively. These values, which were fitted independently for each task, confirm the hierarchy of features that we observed experimentally. Color is the strongest feature (with *p* = 0.9), followed by size (*p* = 0.45). Orientation is never the primary feature.

*SE* over subjects). In a second experiment, we reran all tasks on three additional subjects on an eye tracker with higher temporal resolution (1,000 Hz, see the Methods section). This enabled us to individually fit the model to each subject and to compare parameters. We find that the individual subjects required a comparable number of fixations for the three tasks CO, CS, and SO. This is illustrated in Figure 8A with error bars as ± *SEM*. The number of trials for each task that each subject completed successfully varied between 60 and 70. The comparison with the population average (black, from Figure 2A) shows that the individual subjects behaved similarly to the mean of nine subjects (note that the three additional subjects are not part of the population average). One difference, however, is that all our individual subjects were significantly faster for the CS task compared to the CO task (CO vs. CS, *P* < .02 for all 3 subjects, *t* test). Although the same was the case for the population average, this difference was not significant (number of fixations 8.19 ± 0.85 vs. 7.85 ± 0.75; *P* = .48, *t* test). This indicates that the population average masked an important difference due to intersubject variability. The conditional probability of fixating on the primary feature in each task (color, color, size for CO, CS, and SO, respectively) was more variable between subjects (Figure 8B). However, the mean values of each conditional probability are well compatible with the population average (black). Next, we will explore differences between individual subjects with our model.

*p* and *s* (strength of top-down modulation of primary and secondary feature) to each task of each subject (Figures 8C and 8D). All other parameters were kept constant with the values established above: *D* = 6, *C* = 2, and *m* = 1. The mean absolute error between the experimental data and the data produced by the model was small: The difference for the number of fixations (*N*) was 0.16 ± 0.55 fixations (± *SD*) and 0.02 ± 0.02 (± *SD*) for the probabilities (*P*). We find that the individual differences between subjects (Figures 8A and 8B) can be accounted for with small variations of the parameters (Figures 8C and 8D). This confirms that individuals are highly consistent for each task, generally having differences in *p* and *s* of 10–20%. For example, Subjects 2 and 3 (red, green) have an approximately 55% probability of fixating color instead of size in the CS task (Figure 8B, middle). This is reflected in the parameters in an approximately equal increase for the primary and secondary feature (Figures 8C and 8D), which leads to a lower probability of fixating the primary feature (because both features are modulated equally). However, primary attention is slightly larger in both cases, reflecting the above chance probability of >.5 for both. On the other hand, Subject 1 (blue) had a high preference for color in the same task (CS, Figure 8B), which in turn is reflected in a higher value of *p* than *s* (Figures 8C and 8D; blue in CS). This confirms that our model can be fitted to individual subjects as well as to populations of subjects.

*p* and *s*). The strength of attention to the primary feature dominates the number of fixations it takes to find a target, whereas the strength of secondary attention primarily determines the conditional probability of fixation.

*p* and *s*. There is good neurophysiological evidence for multiplicative gain control by top-down attentional modulation of visual cortical neurons (McAdams & Maunsell, 1999a, 1999b; Reynolds, Pasternak, & Desimone, 2000; Treue & Martinez Trujillo, 1999).

*C*) in parallel at every fixation, as long as they are within the radius of detection (*D* parameter). Planning where to fixate next does assume that focal attention shifts to the new location. These assumptions are supported by a number of neurophysiological studies. In particular, neurons located in the FEF are known to be closely related to the initiation of eye movements (Bruce & Goldberg, 1985; Bruce et al., 1985; Schall, Hanes, et al., 1995). The response of FEF neurons is dominated by the visual input that is task relevant, whereas all other input is only weakly represented (regardless of its visual features). The firing rate of an FEF neuron is higher if the item in the RF shares features with the target compared to an item that shares no features with the target (Bichot & Schall, 1999). FEF neurons thus signal, for each item, the estimated probability that this position contains the target. Based on this, it has been proposed that FEF represents an integration of the visual input together with top-down information about the task (Thompson & Bichot, 2005; Thompson, Bichot, & Schall, 2001). In terms of our model, FEF neurons can be thought of as implementing our target map (Figure 3). Thus, each value *λ*(*x*) in the target map corresponds to the mean firing rate of neurons coding for a particular movement vector (relative to the current fixation). Also, the process of making a saccade where *λ*(*x*) is maximal has a close neuronal analogy: FEF neurons integrate their input until a threshold is reached (race-to-threshold model; Hanes & Schall, 1996). Thus, the neuron with the largest *λ*(*x*) will (on average) reach threshold fastest and evoke a saccade.

*I*^{j}(*x*), used in the feature maps of our model.

*λ*(*x*).

*d*′) are the same for both detection (at fixation) and saccades over a wide range of signal-to-noise values as well as tasks (Beutter, Eckstein, & Stone, 2003).

*what* information about the target is used to bias search? Another approach to answer this question is to embed the targets in 1/*f* noise and extract a small patch of the image around each fixation. These patches are then prewhitened (de-correlated) and averaged to yield a classification image (Rajashekar et al., 2006). The classification image shares some (but not all) features with the target. This supports our contention that some features of the target are used preferentially to bias the search.

*D* around the current fixation and that no more than *C* items can be considered. For large values of *C* (*C* larger than the number of items that fit within *D*), this is equal to a parallel model (Eckstein, Thomas, Palmer, & Shimozaki, 2000; Palmer, 1994; Palmer et al., 2000). For values of *C* that are smaller than the number of items within the radius *D*, the model is serial in the sense that it only processes a subset of all possible items. However, we do not make assumptions as to whether the *C* items processed at every fixation are processed serially (covert attentional shifts) or in parallel. The restricted search radius *D* is motivated by the fact that the ability to correctly detect the presence or absence of items decreases as a function of distance from the current fixation (Bouma, 1970; Carrasco, Evert, Chang, & Katz, 1995). Depending on the size of the items and signal-to-noise ratio, detection accuracy degrades quickly over a few degrees, and the usable detection radius is typically smaller than 10°.

*C*). Indeed, we find (Figure 7C) that the incidence of such trials is related to *C*. Assuming that the underlying detection model is a capacity limited model, values of *C* can be found that produce the same incidence of “return to target” trials as observed experimentally. We used *C* = 2, which somewhat overestimates the incidence of on-target trials.

*C* > 7 drops to zero (Figure 7C). This occurs because our model does not have an activation threshold (Wolfe, 1994). Thus, arbitrarily small activation values in the target map still attract attention. To account for all aspects of the data with a large (possibly infinite) capacity model, it is therefore necessary to introduce an activation threshold below which an item is never considered for detection. The value of this activation threshold could be fit to the data such that the same “on-target fixation” incidence is reproduced by the model. Our data do not allow us to conclude which is the valid model of detection.

*d*′ as a function of eccentricity (Najemnik & Geisler, 2005).

*N*th nearest neighbor (e.g., *N* = 1 is nearest, *N* = 2 is second nearest) quickly approaches chance for *N* > 1 (Figures 2B–2D, insets). For *N* = 3 the bias is entirely erased, and both types of distractors are equally likely to be the third nearest neighbor. Although this does not exclude targeting of saccades toward more complicated ways of grouping, it shows that saccades in our search displays were primarily targeted to single items and not to the center-of-gravity of multiple items. This is particularly relevant in the context of the optimal searcher (Najemnik & Geisler, 2005), which predicts center-of-gravity fixations. However, a suboptimal searcher that always fixates the location with maximum posterior was likewise found to be nearly equal in performance (Najemnik & Geisler, 2005). One extension of GS that takes into account eye movements is the area activation model (Pomplun, Reingold, & Shen, 2003). Here, the activation map is convolved with a 2-D Gaussian to account for the typical “area” that can be processed at every fixation. This model is built on the premise that fixations are guided by groups of items that share a feature with the target (rather than single elements). The number of fixations is equal to the number of peaks in the area activation map. This model assumes that only one feature guides the search (e.g., color). Here, we argue that guidance by both features is necessary to simultaneously account for both the number of fixations and the conditional probability.

*m* = 49 hardly improves performance. This counterintuitive result can be explained by treating search as a random search from an array of *n* elements with (*m* = 0) or without (*m* = *n*) replacement. The expected number of draws to find a specific item is *n* in the former and *n*/2 in the latter case. Because the number of fixations to find the target is typically <10, the probability that an element that is currently in memory is revisited is small. Thus, memory is not important for this particular search array size and configuration. However, this calculation also indicates that for search arrays with many fewer items that still require a substantial number of fixations, memory might well become important. Note that the constraints of the eye movements (smoothness, typical saccade length) also act as an implicit form of memory. Thus, because saccades tend to follow each other with minimal change in angle (see Figure 1), items previously fixated tend to be avoided even if no explicit memory exists. This additionally decreases the probability of revisiting items without having an explicit form of memory. This might explain some of the previous contradictory results (Horowitz & Wolfe, 1998; Klein, 1988; McCarley et al., 2003). Our model recomputes the entire target map for every fixation and does not have memory for any of the decisions made at the previous fixation. Thus, no trans-saccadic memory integration takes place, in agreement with a recent Ideal Bayesian Observer model (Najemnik & Geisler, 2005).
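The two expected values used in this argument can be verified with a short calculation. For purely random search, drawing with replacement follows a geometric distribution with success probability 1/*n*, and drawing without replacement places the target at a uniformly random position in the draw order, which gives exactly (*n* + 1)/2, close to the *n*/2 quoted above. The function names are illustrative.

```python
from fractions import Fraction

def expected_draws_with_replacement(n):
    """Mean of a geometric distribution with success probability 1/n."""
    return Fraction(n)

def expected_draws_without_replacement(n):
    """Target is equally likely to be at each position 1..n of a random draw order."""
    return Fraction(sum(range(1, n + 1)), n)

n = 49  # number of items in the search array
with_replacement = expected_draws_with_replacement(n)
without_replacement = expected_draws_without_replacement(n)
```

For the 49-item array, random search would therefore need 49 draws on average without memory and 25 with perfect memory, both far above the fewer than 10 fixations that guided search typically requires, which is why remembered items are rarely candidates for revisiting.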

*p* = 1 in our model. Using more complex items (cartoons of objects), Chelazzi, Miller, Duncan, and Desimone (2001) found a modulation by the search target of between 39% and 63% (in terms of our model, this corresponds to a variation of *p* of 0.39–0.63). Others (Bichot et al., 2005) discovered that the modulation for V4 neurons responsive to color was much stronger than modulation of neurons responsive to shape (mostly orientation). Another study found that V4 neurons tuned to orientation increased firing by on average only 20% (McAdams & Maunsell, 1999a), which is much less than for color. Although it seems puzzling that some features are modulated less than others, our model predicts exactly this situation. We also show that this is the optimal strategy to use: Increasing firing for one feature more than the other results in better performance than strongly modulating only one feature. It remains an open question why color rather than, say, orientation has primacy in terms of strength of top-down modulation.