Abstract
The human visual system is remarkably tolerant to degradations in image resolution: in a scene recognition task, subjects perform similarly whether they are shown 32x32 color images or multi-megapixel images. Object recognition and segmentation are also performed robustly by the visual system, despite the objects being unrecognizable in isolation. We present a set of studies that evaluate the minimal image resolution required to perform a number of recognition tasks (scene recognition, object detection, and segmentation), and we show that a resolution of 32x32 color pixels is sufficient. Performance degrades rapidly below this resolution. The small size of each image carries two important benefits: (i) computer vision tools can be applied easily, and (ii) huge image databases can be collected easily. We present a database of 70,000,000 32x32 color images gathered from the Internet using image search engines. Each image is loosely labeled with one of the 70,399 non-abstract nouns in English, as listed in the WordNet lexical database. Hence the image database represents a dense sampling of all semantic categories. Computer vision has traditionally considered a few unrelated classes, treated independently of one another. In contrast, we use our database in conjunction with a semantic hierarchy, obtained from WordNet, to impose tree-structured dependencies between the 70,399 classes.
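As a rough illustration of what tree-structured dependencies between classes can mean in practice, the sketch below lets each loosely labeled image cast a vote for its own label and for every ancestor of that label. The toy hierarchy, the function names `ancestors` and `hierarchical_votes`, and the voting rule itself are assumptions made for this example; they are not taken from WordNet or from the method described here.

```python
from collections import Counter

# Hypothetical toy hierarchy (child -> parent). It only mimics the shape of
# a noun hierarchy; the entries are illustrative, not copied from WordNet.
PARENT = {
    "entity": None,
    "organism": "entity",
    "artifact": "entity",
    "animal": "organism",
    "plant": "organism",
    "dog": "animal",
    "cat": "animal",
    "tree": "plant",
    "car": "artifact",
}


def ancestors(label):
    """Return the path from a label up to the root, including the label."""
    path = []
    while label is not None:
        path.append(label)
        label = PARENT[label]
    return path


def hierarchical_votes(labels):
    """Give one vote per label to the label itself and to each ancestor."""
    votes = Counter()
    for label in labels:
        votes.update(ancestors(label))
    return votes


# Noisy labels, e.g. from a handful of visually similar images.
noisy_labels = ["dog", "cat", "dog", "car", "tree"]
votes = hierarchical_votes(noisy_labels)
```

In this toy case no single leaf label wins a majority, yet the internal node "animal" collects three of the five votes. Classes that share an ancestor reinforce one another, which treating the classes independently would not allow.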