Abstract
In the context of natural behaviors, humans must gather information to arbitrate between competing task demands, such as avoiding obstacles or heading towards a target. The brain's neural reward machinery has been implicated in these action choices, and a technique called Inverse Reinforcement Learning (IRL) can be used to estimate reward functions from behavioral data. A frequently overlooked variable in IRL is the discount factor: how much a future reward matters relative to an immediate one. If future rewards are discounted too heavily, a person could overlook a distant obstacle even though it incurs a large negative reward. We argue that the reward, together with the discount factor, defines a value surface for a single task: the reward controls the maximum height of the surface, and the discount factor controls how quickly value decreases with time or distance, i.e., the surface's shape. Because value surfaces are computationally easy to compose, multi-task behaviors can be modeled by combining them. This leads naturally to a divide-and-conquer approach to IRL, called modular IRL, which estimates the relative rewards of subtasks. We extend previous modular IRL models (Rothkopf and Ballard, 2013) to also estimate the discount factor, and we justify the method's correctness both theoretically and experimentally, through computer simulations as well as human experiments. We collect human navigation data in a virtual reality environment in which subjects are instructed to perform a combination of following a path, collecting targets, and avoiding obstacles. We show that the rewards and discount factors estimated by our algorithm reflect the task instructions and accurately predict human actions (average angular difference = 24°). Furthermore, with only two parameters per objective (a reward and a discount factor), a virtual agent is able to reproduce long, human-like navigation trajectories through the environment. We conclude that modular IRL with learned discount factors could be a powerful model for multi-task sensorimotor behaviors.
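For concreteness, one way to write down the composition the abstract describes is sketched below. The notation is ours, not the authors': each subtask i is assumed to have its own state s_i, reward r_i, and discount factor gamma_i, with a per-module value surface obeying a Bellman equation and the surfaces combining additively, in the spirit of the modular formulation of Rothkopf and Ballard (2013).

\[
Q_i(s_i, a) \;=\; r_i(s_i) \;+\; \gamma_i \max_{a'} Q_i(s_i', a'),
\qquad
Q(s, a) \;=\; \sum_i Q_i(s_i, a).
\]

Under this reading, r_i sets the height of module i's value surface while gamma_i controls how steeply value falls off with distance from the relevant object; modular IRL inverts this mapping, estimating the pair (r_i, gamma_i) for each subtask from observed actions.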
Meeting abstract presented at VSS 2017