How do reward outcomes affect early visual performance? Previous studies found a suboptimal influence, but they ignored the non-linearity in how subjects perceived the reward outcomes. In contrast, we find that when the non-linearity is accounted for, humans behave optimally and maximize expected reward. Our subjects were asked to detect the presence of a familiar target object in a cluttered scene. They were rewarded according to their performance. We systematically varied the target frequency and the reward/penalty policy for detecting/missing the targets. We find that 1) decreasing the target frequency will decrease the detection rates, in accordance with the literature. 2) Contrary to previous studies, increasing the target detection rewards will compensate for target rarity and restore detection performance. 3) A quantitative model based on reward maximization accurately predicts human detection behavior in all target frequency and reward conditions; thus, reward schemes can be designed to obtain desired detection rates for rare targets. 4) Subjects quickly learn the optimal decision strategy; we propose a neurally plausible model that exhibits the same properties. Potential applications include designing reward schemes to improve detection of life-critical, rare targets (e.g., cancers in medical images).

- the decision criterion was more conservative than the optimal decision criterion (i.e., shifted toward fewer target detections),
- changing the target frequency, but not the reward, led to near-optimal shifts in the decision criterion.

| | *r*_{00} (Correct rejection) | *r*_{01} (False alarm) | *r*_{10} (Miss) | *r*_{11} (Correct detection/hit) |
|---|---|---|---|---|
| (a) Neutral | +1 | −50 | −50 | +1 |
| (b) Airport | +1 | −50 | −900 | +100 |
| (c) Gain | +1 | −50 | −50 | +950 |

*τ* to optimize reward.

*p*-value < 0.05, two-tailed *t*-test) from close to 75% down to 55% (Figure 1, light gray bars; mean ± standard error in detection rates for the 50% and 10% target frequencies are 73 ± 4% and 56 ± 4%). This replicates the finding by Wolfe et al. (2005) that detection rates drop as the target frequency decreases. In Experiment 1B, we studied the effect of changing the reward structure from Neutral to Airport (Table 1b) when the target frequency decreases to 10% (infrequent). We found that the detection rate increases, but not significantly (*p*-value > 0.05, two-tailed *t*-test; Figure 1, red bars; the detection rate for the 10% target frequency is 62 ± 7%).

*d* (see the Comparing subjects and ideal observers section for details of the decision variable). The observer maintains a fixed internal representation of decision variables in target present and absent displays (the Comparing subjects and ideal observers section explains how this internal representation may be obtained from the subject's data; for now, let us assume these are known). The observer uses a linear discrimination threshold as a decision criterion (*τ*) to distinguish between the likelihood of target presence (*P*(*d* ∣ *T* = 1)) vs. absence (*P*(*d* ∣ *T* = 0)) in the display (Palmer et al., 2000). Upon seeing a new display, the observer decides whether the target is present or absent by checking whether the observed decision variable *d* exceeds the decision criterion *τ*. Detection performance varies as the decision criterion changes and is visualized by the Receiver Operating Characteristic (ROC) curve, which plots the probability of correct detection (hit) (*P*_{CD}) vs. false alarm (*P*_{FA}) for all possible values of the decision criterion *τ* (Figure 2b).
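The threshold rule described above can be sketched in a few lines, assuming Gaussian decision-variable distributions; the function name `roc_point` and the parameter values are illustrative, not the paper's fitted values:

```python
# Sketch of the threshold decision rule: for a criterion tau, compute the
# ROC point (P_FA, P_CD), assuming Gaussian decision variables (illustrative
# parameters, not the values fitted to subjects in the paper).
from statistics import NormalDist

def roc_point(tau, mu0=0.0, mu1=1.0, sigma=1.0):
    """Return (P_FA, P_CD): probabilities that d exceeds the criterion tau
    under the target-absent and target-present distributions."""
    absent = NormalDist(mu0, sigma)
    present = NormalDist(mu1, sigma)
    p_fa = 1.0 - absent.cdf(tau)   # false alarm: d > tau, target absent
    p_cd = 1.0 - present.cdf(tau)  # correct detection: d > tau, target present
    return p_fa, p_cd

# Sweeping tau traces out the ROC curve; raising the criterion (more
# conservative) lowers both the hit rate and the false alarm rate.
conservative = roc_point(1.5)
liberal = roc_point(-0.5)
```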

*τ* that maximizes the expectation of reward at each trial. The expected reward per trial *E*[*R*] can be computed for different points on the ROC curve (*P*_{FA}, *P*_{CD}) explicitly by summing the four possible outcomes, each weighted by the probability of the corresponding event, as described in Table 1 (refer to Table 1 for notation). *P*_{1} is the probability of target presence (*P*_{0} = 1 − *P*_{1} is that of target absence):

*E*[*R*] = *P*_{1}[*P*_{CD} *r*_{11} + (1 − *P*_{CD}) *r*_{10}] + *P*_{0}[*P*_{FA} *r*_{01} + (1 − *P*_{FA}) *r*_{00}] (Equation 1)

*P*_{CD} and *P*_{FA} are functions of *τ*. The ideal observer will choose a value of *τ* that produces *P*_{CD} and *P*_{FA} such that the previous expression is maximized.
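As a concrete illustration of this maximization, the sketch below evaluates *E*[*R*] on a grid of criteria and keeps the argmax, using the Neutral and Airport payoffs from Table 1. The Gaussian distributions, grid search, and names `expected_reward` and `optimal_tau` are illustrative assumptions, not the paper's implementation:

```python
# Sketch of the ideal observer's criterion choice: evaluate E[R] (Equation 1)
# on a grid of criteria and keep the argmax. Unit-variance Gaussian
# distributions are an illustrative assumption.
from statistics import NormalDist

def expected_reward(tau, p1, r, mu0=0.0, mu1=1.0, sigma=1.0):
    """E[R] per trial at criterion tau. r maps outcome codes
    '00','01','10','11' to payoffs; p1 is the target frequency."""
    p_cd = 1.0 - NormalDist(mu1, sigma).cdf(tau)  # hit rate
    p_fa = 1.0 - NormalDist(mu0, sigma).cdf(tau)  # false alarm rate
    p0 = 1.0 - p1
    return (p1 * (p_cd * r['11'] + (1 - p_cd) * r['10'])
            + p0 * (p_fa * r['01'] + (1 - p_fa) * r['00']))

def optimal_tau(p1, r, grid=None):
    """Grid search for the criterion that maximizes expected reward."""
    grid = grid or [i / 100.0 for i in range(-300, 401)]
    return max(grid, key=lambda t: expected_reward(t, p1, r))

neutral = {'00': 1, '01': -50, '10': -50, '11': 1}      # Table 1a
airport = {'00': 1, '01': -50, '10': -900, '11': 100}   # Table 1b
```

Under these assumptions, the Airport scheme's heavy miss penalty pulls the optimal criterion far lower (more liberal) than under the Neutral scheme at the same 10% target frequency, which is the compensation effect the reward manipulation is designed to exploit.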

*E*[*R*]) is higher in the middle left region of the ROC curve. Hence, the optimal behavioral strategy is to operate at the middle left of the ROC curve. This is better visualized in Figure 2d, where *E*[*R*] is shown as a function of the decision criterion *τ*. To maximize *E*[*R*], the ideal observer would shift the decision criterion *τ* slightly to the right (from the point of intersection of the curves *P*(*d* ∣ *T* = 1) and *P*(*d* ∣ *T* = 0)), similar to the red line in Figure 2a. Consequently, the false alarm rate decreases and the detection rate is poor (red symbol in Figure 2b). Thus, ideal observer theory predicts that detection performance for rare targets will be poor (as observed in previous studies (Wolfe et al., 2005) and replicated here). In the next section, we compare the behavior of the ideal observer and our subjects.

*i*th location (*i* ∈ {1…12}) is drawn from a Gaussian distribution, *x*_{i} ∼ *G*(*μ*, *σ*) (*μ* = 1 for the target, and *μ* = 0 for a distractor). Next, we assumed that the display is represented internally as a single number, the decision variable *d*. Examples of decision variables for our yes/no visual search task include the maximum response across all locations in the display (Palmer, Ames, & Lindsey, 1993; Verghese, 2001) and the likelihood ratio of target presence vs. absence in the display (see ideal observer, A.2.3, Palmer et al., 2000; for use of the ideal rule in 2AFC tasks and cueing tasks, see Eckstein, Shimozaki, & Abbey, 2002; Schoonveld, Shimozaki, & Eckstein, 2007). We chose the latter, as it is the ideal rule for integrating information across the display. According to this rule, the likelihood of target presence (or absence) in the display can be expressed in terms of the likelihood of target presence (or absence) at each location. Let *T* = 1 represent target presence in the display (*T* = 0 denotes target absence) and let *T*_{i} = 1 represent target presence at the *i*th location in the display (*T*_{i} = 0 denotes target absence).
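Under the standard assumptions that the display contains at most one target, equally likely at any of the *N* = 12 locations, and that responses are conditionally independent across locations, the display-level likelihood ratio decomposes into location-level ratios. The paper's exact equation is not shown in this excerpt, so the following is a sketch of the usual form of the ideal rule (Palmer et al., 2000):

```latex
d \;=\; \frac{P(x_1,\dots,x_N \mid T = 1)}{P(x_1,\dots,x_N \mid T = 0)}
  \;=\; \frac{1}{N}\sum_{i=1}^{N}\frac{P(x_i \mid T_i = 1)}{P(x_i \mid T_i = 0)},
  \qquad N = 12.
```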

*P*(*d* ∣ *T* = 1), *P*(*d* ∣ *T* = 0)) and varies the decision criterion *τ* to operate at different points on the ROC curve. We find the best-fitting ROC curve through a maximum likelihood estimation procedure that determines the value of *σ* that maximizes the likelihood of the subject's data. The resulting ROC curves are asymmetric (Figure 2b), reflecting the difference in shape of the distributions of decision variables for target present vs. absent displays (Figure 2a). Such asymmetric ROCs have also been observed in other studies (Wolfe et al., 2007).

*z*-test, *p*-value > 0.05). This replicates previous findings (Green & Swets, 1966; Healy & Kubovy, 1981; Kubovy & Healy, 1977; Lee & Janke, 1964, 1965; Lee & Zentall, 1966; Maddox, 2002) that changing the sensory prior (target frequency) causes an optimal shift in the decision criterion. In contrast, we find that changing the reward scheme from Neutral to Airport for a fixed target frequency (10%) yields a suboptimal shift in the decision criterion (Figures 3c and 3d): subjects fail to maximize expected reward in the 10% Airport condition (they have fewer target detections and fewer false alarms than the ideal observer). These results confirm previous findings (Green & Swets, 1966; Healy & Kubovy, 1981; Kubovy & Healy, 1977; Lee & Janke, 1964, 1965; Lee & Zentall, 1966; Maddox, 2002) that subjects adjust optimally to changes in target frequency but suboptimally to changes in the reward scheme.

*r*_{ij} (*i*, *j* ∈ {0, 1}) in Equation 1, are perceived differently by the subjects (not as the objective reward values set by the experimenter) due to filtering through a subjective utility function. The costs and benefits perceived by humans are not linearly related to monetary rewards; in particular, the utility of a dollar gained is typically lower than the negative utility of a dollar lost (Kahneman & Tversky, 1979), and the utility of a dollar diminishes with increasing gains (von Neumann & Morgenstern, 1953). Such non-linearity confounds the results of previous studies (replicated in Experiment 1), as the observed suboptimality (failure to maximize expected reward) may be due to the diminished utility of the reward values (e.g., a reward value of 100 points perceived as less than 100). A diminished value of rewards and penalties is consistent with the conservative strategy (low detection rates and low false alarm rates) adopted by subjects in Experiment 1 (and previous studies).

*p*-value < 0.05, two-tailed *t*-test) from close to 80% down to 30% (Figure 4, light gray bars; mean ± standard error in detection rates for the 50%, 10%, and 2% target frequencies are 78 ± 1%, 58 ± 4%, and 29 ± 4%, respectively). This replicates the finding by Wolfe et al. (2005) that rare targets are often missed. In Experiment 2B, we studied the effect of changing the reward structure from Neutral to Airport (Table 1b) when the target frequency decreases to 10% (infrequent) and 2% (rare). We found that the detection rate increases significantly (more than doubling when the target is rare), restoring performance to levels statistically indistinguishable from the 50% target frequency condition (Figure 4, red bars; detection rates for the 10% and 2% target frequencies are 77 ± 3% and 61 ± 11%). In fact, in two out of four subjects, the detection rate in the 10% Airport condition was significantly higher than in the 50% Neutral condition. This shows that reward can significantly influence target detection performance.

*d*′ = 1.52) and allows good quantitative predictions, as shown below. When the target frequency decreases, subjects move down the ROC curve toward lower detection and lower false alarm rates. In contrast, when the reward changes from the Neutral to the Airport scheme, subjects move up along the ROC curve toward higher detection and false alarm rates. This suggests that the internal representation of the stimulus display (*P*(*d* ∣ *T* = 0), *P*(*d* ∣ *T* = 1)) remains the same across all conditions (i.e., the subjects' attention/arousal levels remain the same), but the decision criterion changes.

*z*-test, *p*-value > 0.05). Not only does this show that our model predicts the subjects' data well, but also that the behavior of subjects is optimal, i.e., subjects maximize expected reward per trial.

*r*_{ij} = *k* in Equation 1) or reward information (value-based choice that ignores sensory information, i.e., setting *P*(*d* ∣ *T* = 0) = *P*(*d* ∣ *T* = 1) or *P*_{CD} = *P*_{FA} in Equation 1). As seen in Figure 7, we find that only a model that combines sensory and reward information optimally can explain the subjects' data across all target frequencies and reward schemes. Thus, subjects in our experiments combine sensory priors and reward outcomes optimally, rather than engaging in purely sensory or purely economic decision making.

*τ*_{opt} that maximizes expected reward per trial. This suggests that humans learned *τ*_{opt} within the block of 100 training trials that they received at the beginning of each experimental condition (i.e., before the start of a new combination of reward scheme and target frequency). We analyzed the training data as follows. For each experimental condition, we determined the correct detection/hit rate (*P*_{CD}) and false alarm rate (*P*_{FA}) as a function of the number of trials seen in the training sequence, used these to determine the subject's reward using Equation 1, and asked when the subject's reward becomes statistically indistinguishable from the ideal observer's reward (*z*-test, significance level 0.05). Analysis of the training data reveals that, on average, subjects learned *τ*_{opt} rapidly: within 14, 31, and 7 trials in the 50%, 10%, and 2% Neutral conditions, and within 24 and 42 trials in the 10% and 2% Airport conditions, respectively. What are the underlying computational and neural mechanisms of such rapid reward-based learning? How much information does a learner need to determine the optimal decision criterion (e.g., can the learner perform well without full knowledge of the statistics of responses in target present and absent displays)? How much memory does a learner need to determine and maintain the optimal decision criterion? In the next section, we address these issues and propose a neurally plausible model of learning.

*τ*_{opt}; 2) the optimal learner (with perfect and infinite memory); 3) the bio learner (a neurally plausible implementation of the optimal learner); 4) the unit memory learner (who decides based on the previous trial only); and 5) the finite memory learners (who decide based on the previous 32, 64, or 128 trials). All model learners except the bio learner have perfect memory but varying memory capacity (i.e., number of previous trials held in memory). We explain each model below.

**The ideal observer** knows everything about the experiment except the ground truth of whether the target is present or absent on every trial (*T*^{i} = 1 if the target is present on the *i*th trial). Thus, the ideal observer knows the probability densities of responses in target present displays, *P*(*d* ∣ *T* = 1) ∼ *G*(*μ*_{1}, *σ*_{1}), and target absent displays, *P*(*d* ∣ *T* = 0) ∼ *G*(*μ*_{0}, *σ*_{0}); the target frequency *P*_{1}; and the reward structure (*r*_{00}, *r*_{01}, *r*_{10}, *r*_{11}). This observer can compute the optimal decision criterion *τ*_{opt} that maximizes expected reward *E*[*R*] (Equation 1) using gradient ascent or other optimization techniques. This may be done before seeing any stimulus; thus, the ideal observer behaves optimally from the beginning.

*n* trials (*n* = 1, 32, 64, 128 for the finite memory learners, and infinite for the optimal learner). Rather than a slow process of explicitly learning the target frequency or display statistics, then determining the expected reward profile, and subsequently finding its peak through gradient ascent (or some other optimization method), the models use a faster algorithm, explained below.

**The optimal learner** (Figure 8) knows much less than the ideal observer: it knows only the ground truth *T*^{i} and reward feedback for the *n* trials seen so far, and the corresponding values of the decision variable (*d*^{i} ∣ *T*^{i}, *i* ∈ {1, …, *n*}). We call it the optimal learner because it learns *τ*_{opt} in a minimum number of trials (and using few parameters). For each decision criterion *τ*, the optimal learner computes the total reward *R*^{i}(*τ*) earned by operating at *τ* in the previous *i* trials and selects the *τ* with maximum reward as the decision criterion for the next trial (*τ*^{i+1}); if there are multiple *τ* with maximum reward, the learner chooses their mean. The algorithm works as follows. Initially, all decision criteria (spanning the range of decision variable values [*d*_{min}, *d*_{max}]) have equal reward scores (∀*τ*, *R*^{0}(*τ*) = 0) and hence are equally likely to be chosen. Upon observing a value of the decision variable *d*^{i} on the *i*th trial, the learner updates the score at each *τ* with the reward earned by operating at *τ* (which is positive for a correct response and negative for an incorrect response). The criterion for the next trial, *τ*^{i+1}, is chosen as the one that yields maximum reward *R*^{i}(*τ*).
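The update rule above can be sketched as follows, assuming a uniform grid of candidate criteria; the grid range, spacing, and function names are illustrative, not the paper's exact implementation:

```python
# Sketch of the optimal learner: every candidate criterion tau is credited
# with the reward it would have earned on the current trial, and the
# criterion for the next trial is the (mean of the) best-scoring tau.
def make_learner(d_min=-3.0, d_max=4.0, n_taus=71):
    """Uniform grid of candidate criteria with zero initial scores."""
    taus = [d_min + k * (d_max - d_min) / (n_taus - 1) for k in range(n_taus)]
    scores = [0.0] * n_taus  # R^0(tau) = 0 for all criteria
    return taus, scores

def update(taus, scores, d, target_present, r):
    """Update every criterion's score with this trial's outcome and return
    the criterion for the next trial (mean of ties, as in the text)."""
    for k, tau in enumerate(taus):
        respond_present = d > tau
        if target_present:
            scores[k] += r['11'] if respond_present else r['10']
        else:
            scores[k] += r['01'] if respond_present else r['00']
    best = max(scores)
    winners = [taus[k] for k, s in enumerate(scores) if s == best]
    return sum(winners) / len(winners)
```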

**The finite memory learner** is similar to the optimal learner except that it has memory of the previous *k* trials only (*k* ∈ {1, 32, 64, 128}).

**The bio learner** (Figure 8) is a neurally plausible implementation of the optimal learner. We assume that the optimal decision criterion is encoded by a population of neurons (with Gaussian tuning curves) tuned to different decision criteria. Upon seeing the *i*th display with decision variable value *d*^{i}, all neurons whose preferred decision criteria favor the correct response undergo response gain enhancement proportional to the sum of the reward gained by the correct response and the penalty avoided by the incorrect response. The optimal decision criterion is set to the preferred criterion of the most active neuron in the population, read out through a winner-take-all competition. For simulation purposes, we implemented a population of neurons with equi-spaced Gaussian tuning curves, with standard deviation *σ* = 0.25 and inter-neuron spacing equal to *σ*. This population consists of 33 neurons representing values of *τ* in [*d*_{min}, *d*_{max}].
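A minimal sketch of this population code follows. The tuning width (0.25) and spacing match the text; the criterion range, the weighting of the gain update by the tuning curve evaluated at the observed *d*, and the function names are our assumptions for illustration:

```python
# Sketch of the bio learner: a population of criterion-tuned neurons whose
# gains are enhanced when their preferred criterion favors the correct
# response; the criterion is read out by winner-take-all.
import math

SIGMA = 0.25                                  # tuning width, as in the text
TAUS = [-3.0 + k * SIGMA for k in range(33)]  # 33 equi-spaced criteria (range assumed)

def tuning(preferred, d):
    """Gaussian tuning curve of a neuron with the given preferred criterion."""
    return math.exp(-((d - preferred) ** 2) / (2 * SIGMA ** 2))

def reward_update(activities, d, target_present, r):
    """Enhance the gain of neurons whose preferred criterion would have
    produced the correct response; tuning-weighted credit is our assumption."""
    for k, pref in enumerate(TAUS):
        respond_present = d > pref
        if respond_present == target_present:  # neuron favored the correct response
            # reward gained plus penalty avoided, as described in the text
            gain = (r['11'] - r['10']) if target_present else (r['00'] - r['01'])
            activities[k] += gain * tuning(pref, d)
    # winner-take-all readout: preferred criterion of the most active neuron
    return TAUS[max(range(len(TAUS)), key=lambda k: activities[k])]
```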

*E*[*R*], Equation 1) and error bar (*SEM*, standard error of the mean reward per trial) earned by the ideal observer (who operates at the optimal decision criterion). The error bar is computed as *SEM* = *σ*[∑_{i} *R*_{i}/*N*] = *σ*[*R*]/√*N*, where *R*_{i} is the reward in the *i*th trial, *R* is the reward per trial, and *N* is the sample size. Rather than setting *N* to an arbitrarily large number of iterations, to ensure a fair comparison between the subjects and the models, we set *N* to the number of trials performed by the subject in each experimental condition; thus, *N* = {300, 300, 1000} in the 50%, 10%, and 2% target frequency conditions, respectively. We then simulated the models for 100 blocks of *N* trials (300 trials for the 50% and 10% conditions; 1000 trials for the 2% condition). On each trial, the decision variable (*d*) was drawn independently from the target present distribution (*P*(*d* ∣ *T* = 1)) with probability *P*_{1} (0.5, 0.1, or 0.02, depending on target frequency) or from the target absent distribution (*P*(*d* ∣ *T* = 0)) with probability (1 − *P*_{1}). All learners were fed the same pseudo-random sequences as input. The distributions *P*(*d* ∣ *T* = 1) and *P*(*d* ∣ *T* = 0) were determined from the subjects' ROC curves (Figure 3; see the Comparing subjects and ideal observers section for details on how to obtain these distributions from a subject's data). The models' decision criteria evolved as a function of the values of the decision variable in previously seen displays (see equations in the previous section). We determined the learning rate of each model learner as the number of trials (or displays) taken by the model to maximize expected reward (using a *z*-test to determine the statistical difference between the model's median reward and the ideal observer's expected reward).
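The trial-generation step of this simulation can be sketched as follows; the Gaussian parameters are illustrative (not the distributions fitted from the subjects' ROC curves), and the fixed seed mirrors feeding identical pseudo-random sequences to all learners:

```python
# Sketch of the simulation's stimulus generator: on each trial, with
# probability p1 the display contains a target and d is drawn from the
# target-present Gaussian, otherwise from the target-absent Gaussian.
import random

def simulate_trials(n, p1, mu0=0.0, mu1=1.0, sigma=1.0, seed=0):
    """Return n (decision variable, target_present) pairs."""
    rng = random.Random(seed)  # same seed -> same sequence for every learner
    trials = []
    for _ in range(n):
        present = rng.random() < p1
        d = rng.gauss(mu1 if present else mu0, sigma)
        trials.append((d, present))
    return trials
```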

| Experimental condition | Subject's learning rate | Optimal learner's rate |
|---|---|---|
| 50% Neutral | 14 | 8 |
| 10% Neutral | 31 | 16 |
| 2% Neutral | 7 | 2 |
| 10% Airport | 24 | 16 |
| 2% Airport | 42 | 64 |

*p* = 0.63 from a *t*-test).

*r*

_{ij}(

*i, j*∈{0, 1}) in Equation 1 may not be perceived by the subjects as the objective reward value (chosen by the experimenter). Non-linearity in subject's utility function such as diminished utility with increasing wealth (von Neumann & Morgenstern, 1953) may result in a gain of 10 points being perceived as less than 10, resulting in weak reward incentives. Thus, previously observed suboptimality (failure to maximize expected reward) may be due to the non-linearity of subject's utility function. We avoided this confound by rewarding subjects in a competitive setup—subjects may lose the contest even after earning 1000 points, hence to win the contest, subjects must try to maximize the number of points earned without slackening their effort or diminishing their utility. Under such competitive settings (Experiment 2), we find that subjects maximize expected reward. Thus, the drop in detection performance as the search targets become rare (Wolfe et al., 2005) can be compensated for by changing the reward scheme ( Equation 1 shows that reward and target frequency are multiplied together and therefore should have equivalent effects on observer's behavior).