Abstract
Deep reinforcement learning (RL) is a powerful machine learning approach for training AIs to solve visuomotor tasks. Recently, these algorithms have achieved human-level performance in tasks such as video games. However, the trained models are often difficult to interpret because they are represented as deep neural networks that map raw pixel inputs directly to decisions. It is hence unclear whether AIs and humans solve these tasks in similar or different ways, and why AIs and humans perform relatively well or poorly on certain tasks. To understand human visuomotor behaviors in Atari video games, Zhang et al. (2020) collected a dataset of human eye-tracking and decision-making data. Meanwhile, Greydanus et al. (2018) proposed a method to interpret deep RL agents by visualizing their "attention" in the form of saliency maps. Combining these two works allows us to shed light on the inner workings of RL agents by analyzing the pixels that they attend to during task execution and comparing them with the pixels attended to by humans. We ask: 1) How similar are the visual features learned by RL agents and humans when performing the same tasks? 2) How do similarities and differences in these learned features correlate with RL agents' performance? We show how the attention of RL agents develops and becomes more human-like during the learning process, as well as how varying the parameters of the reward function affects the learned attention. Additionally, compared to humans, RL agents still make simple mistakes in perception (e.g., failing to attend to important objects) and generalize poorly to unfamiliar situations. The insights provided have the potential to inform novel algorithms for closing the performance gap between RL agents and human experts. They also indicate the relative advantages and disadvantages of humans, compared to AIs, in performing these visuomotor tasks.