Abstract
Humans excel at finding objects in complex natural scenes, but understanding this behaviour has been difficult because natural scenes contain a mixture of targets, nontarget objects and coarse scene features whose contributions are hard to disentangle. Here we performed two studies to elucidate object detection in natural scenes. In Study 1, participants detected cars or people in a large set of natural scenes. For each scene, we extracted target-associated features, annotated the nontarget objects and extracted coarse scene structure, and used these to model detection performance. Our main finding is that target detection in both the person and car tasks could be predicted using target and coarse scene features, with no discernible contribution from nontarget objects. By contrast, nontarget objects predicted target rejection times in both tasks, with an additional contribution from target features for person rejection. In Study 2, we sought to understand the computational advantage of context. Context is commonly thought of as reducing computation by constraining the locations to be searched. But can it play a more fundamental role by making detection itself more accurate? For this to be possible, scene context must be learned independently of target features. Humans, unlike computers, can learn contextual expectations separately, because we also see scenes that contain no targets. To measure these expectations, we asked subjects to indicate the location, scale and likelihood at which targets could occur in scenes without targets. Humans showed highly systematic expectations that we could accurately predict using scene features. Importantly, we found that augmenting state-of-the-art deep neural networks with these human-derived expectations improved detection performance. This improvement came from accepting poor matches at highly likely locations and rejecting strong matches at unlikely locations. Taken together, our results show that humans exhibit systematic behaviour in detecting objects and forming expectations in natural scenes, and that this behaviour can be predicted and understood using computational modelling.
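To make the Study 1 modelling concrete, the sketch below illustrates one way such a feature-based analysis could be set up: detection (or rejection) times for each scene are regressed on target-associated, nontarget and coarse scene feature groups with cross-validation, and feature groups are dropped to assess their contribution. The abstract does not specify the model form, so the ridge regression, feature dimensions and variable names here are illustrative assumptions, and the data are placeholders.

```python
# Minimal sketch (not the authors' code): predicting per-scene detection times
# from three hypothetical feature groups using cross-validated ridge regression.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_scenes = 200

# Placeholder feature groups; in the study these would come from image analysis
# and object annotations for each scene.
target_feats = rng.normal(size=(n_scenes, 10))     # target-associated features
nontarget_feats = rng.normal(size=(n_scenes, 15))  # annotated nontarget objects
scene_feats = rng.normal(size=(n_scenes, 8))       # coarse scene structure
detection_times = rng.normal(loc=0.6, scale=0.1, size=n_scenes)  # seconds

def cv_fit(feature_groups, y):
    """Return mean cross-validated R^2 for a regression on the given feature groups."""
    X = np.hstack(feature_groups)
    return cross_val_score(Ridge(alpha=1.0), X, y, cv=5, scoring="r2").mean()

# Compare models built from different feature combinations, e.g. to ask whether
# nontarget objects add predictive power beyond target + coarse scene features.
r2_full = cv_fit([target_feats, nontarget_feats, scene_feats], detection_times)
r2_no_nontarget = cv_fit([target_feats, scene_feats], detection_times)
print(f"full R^2 = {r2_full:.2f}, without nontargets R^2 = {r2_no_nontarget:.2f}")
```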
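The context-augmentation result of Study 2 can likewise be illustrated with a hedged sketch: a deep network's raw detection confidence is combined with a human-derived likelihood map so that weak matches at highly likely locations are boosted and strong matches at unlikely locations are suppressed. The combination rule, the `weight` parameter and the function name below are assumptions made for illustration, not the method used in the study.

```python
# Minimal sketch (assumptions, not the authors' method): modulating detector
# confidence with a human-derived contextual likelihood map.
import numpy as np

def context_modulated_score(det_score, box, likelihood_map, weight=0.5):
    """Combine detector confidence with the contextual likelihood at the box centre.

    det_score      : raw confidence from the detector, in [0, 1]
    box            : (x, y, w, h) detection box in pixel coordinates
    likelihood_map : 2D array of human-derived target likelihoods, in [0, 1]
    weight         : how strongly context modulates the raw score (hypothetical)
    """
    x, y, w, h = box
    cy = min(int(y + h / 2), likelihood_map.shape[0] - 1)
    cx = min(int(x + w / 2), likelihood_map.shape[1] - 1)
    context = likelihood_map[cy, cx]
    # Weighted geometric mean: weight=0 keeps the raw detector score,
    # weight=1 relies entirely on the contextual expectation.
    return det_score ** (1 - weight) * context ** weight

# Toy likelihood map: targets expected near the lower part of the image.
lmap = np.full((240, 320), 0.1)
lmap[150:, :] = 0.9
print(context_modulated_score(0.3, (100, 180, 40, 60), lmap))  # weak match, likely location: boosted
print(context_modulated_score(0.9, (100, 20, 40, 60), lmap))   # strong match, unlikely location: suppressed
```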
Meeting abstract presented at VSS 2017