Abstract
The rise of multi-million-item dataset initiatives has enabled machine learning algorithms to reach near-human performance at object and scene recognition. Here we describe the Places Database, a repository of 10 million pictures labeled with semantic categories and attributes, comprising a quasi-exhaustive list of the types of environments encountered in the world. Using state-of-the-art Convolutional Neural Networks (CNNs), we demonstrate scene classification performance on images collected in the wild with a smartphone, and we visualize the image regions the model uses to identify the type of scene. Examining the representations learned by the networks' units, we find that meaningful units representing shapes, objects, and regions emerge as the diagnostic information for representing visual scenes. With its high coverage and high diversity of exemplars, Places offers an ecosystem of visual context to guide progress on currently intractable visual recognition problems. Such problems could include determining the actions happening in a given environment, spotting inconsistent objects or human behaviors for a particular place, and predicting future events or the cause of events given a scene.
Meeting abstract presented at VSS 2017
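For readers who want a concrete sense of the scene-classification setup described above, the following is a minimal sketch, not the authors' released code, of running a Places-trained CNN on a single smartphone photo. The architecture choice (ResNet-18), the number of scene categories (365), and the file names "places_resnet18.pth", "categories.txt", and "photo.jpg" are assumptions for illustration only.

```python
# Sketch: classify one photo with a CNN assumed to be trained on Places scene categories.
import torch
import torchvision.transforms as transforms
from torchvision.models import resnet18
from PIL import Image

# Standard ImageNet-style preprocessing, commonly reused for scene-classification CNNs.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Build a ResNet-18 with one output unit per scene category (365 is an assumption here).
model = resnet18(num_classes=365)
state = torch.load("places_resnet18.pth", map_location="cpu")  # hypothetical checkpoint
model.load_state_dict(state)
model.eval()

# Scene category names, one per line (hypothetical label file).
with open("categories.txt") as f:
    categories = [line.strip() for line in f]

# Classify a single in-the-wild photo and report the top-5 scene categories.
img = Image.open("photo.jpg").convert("RGB")
with torch.no_grad():
    logits = model(preprocess(img).unsqueeze(0))
probs = torch.softmax(logits, dim=1).squeeze(0)
top_p, top_i = probs.topk(5)
for p, i in zip(top_p.tolist(), top_i.tolist()):
    print(f"{categories[i]}: {p:.3f}")
```

The abstract also mentions identifying the regions a model relies on; in practice that kind of evidence is often obtained by visualizing activations of the final convolutional layer (e.g., class activation mapping), which this sketch omits.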