Abstract
Composite visuo-motor behaviors can be synthesized from simpler behaviors. For example, in walking down a sidewalk, a pedestrian may have goals of staying on the sidewalk, avoiding other pedestrians, and picking up litter. [Sprague 03] showed in a virtual reality simulation that these individual behaviors can be learned by reinforcement learning. However, that simulation assumed that the rewards associated with the individual behaviors were known. In practice this is unreasonable, as only the total reward for the composite behavior is likely to be available. This is an instance of the long-standing credit assignment problem in learning.
[Chang 03] showed that estimates of the individual rewards could be obtained by assigning the total reward to each behavior and treating the variation in that reward as noise. This model made sense in their setting, where the individual behaviors were embedded in different agents, but it introduced a problem: the resultant reward estimates were biased and could be suboptimal.
We show that the credit assignment problem has a solution when the visuo-motor behaviors are all embedded in the same agent. Each behavior needs to know only which other behaviors are simultaneously active. It can then maintain a running estimate of its share of the reward, adjusting its current estimate toward the total instantaneous reward minus the reward estimates of the other concurrently active behaviors. Our simulations show that, as long as the behaviors are updated in a random order, the estimated reward for each behavior converges to its true value.
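As a minimal illustrative sketch of this update rule (assuming a fixed learning rate, stationary per-behavior rewards, and randomly varying subsets of active behaviors; the behavior names and constants below are hypothetical, not the paper's actual simulation):

    import random

    # Toy setup: three behaviors with fixed true rewards. The learner
    # never sees these individually, only the summed reward of the
    # behaviors active on each step.
    true_reward = {"follow_sidewalk": 1.0, "avoid_pedestrians": 0.5, "pick_up_litter": 2.0}
    estimate = {b: 0.0 for b in true_reward}
    alpha = 0.05  # assumed fixed learning rate

    for step in range(20000):
        # A random nonempty subset of behaviors is concurrently active.
        active = random.sample(list(true_reward), random.randint(1, len(true_reward)))
        total = sum(true_reward[b] for b in active)  # only this total is observed

        random.shuffle(active)  # behaviors are updated in a random order
        for b in active:
            # Credit for b: the total instantaneous reward minus the
            # current estimates of the other concurrently active
            # behaviors, blended into b's running estimate.
            others = sum(estimate[o] for o in active if o != b)
            estimate[b] += alpha * ((total - others) - estimate[b])

    print({b: round(v, 2) for b, v in estimate.items()})
    # expected: values near the true per-behavior rewards

Because the active subset varies from step to step, the per-behavior rewards are identifiable from the observed totals alone, and the running estimates settle at the true values.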
This work was supported by NIH grants EY05729 and RR09283.