Abstract
Image datasets are a pivotal resource for vision research. Here we introduce a preview of a new dataset called “ImageNet”, a large-scale ontology of images built upon the backbone of the WordNet structure. ImageNet aims to populate the majority of the 80,000 synsets (concrete and countable nouns and their synonym sets) of WordNet, each with an average of 500–1000 clean images. This will result in tens of millions of annotated images organized by the semantic hierarchy of WordNet. To construct ImageNet, we first collect a large set of candidate images (about 10,000 per synset) from Internet image search engines, of which typically only about 10% are suitable. We then deploy an image annotation task on the online labor market Amazon Mechanical Turk. To obtain a reliable rating, each image is evaluated by a dynamically determined number of online workers. At the time this abstract is written, we have six completed sub-ImageNets comprising more than 2500 synsets and roughly 2 million images in total (Mammal-Net, Vehicle-Net, MusicalInstrument-Net, Tool-Net, Furniture-Net, and GeologicalFormation-Net). Our analyses show that ImageNet is much larger in scale and diversity, and much more accurate, than existing image datasets. A particularly interesting question arises in the construction of ImageNet: the degree to which a concept (within concrete and countable nouns) can be visually represented. We call this the “imageability” of a synset. While a German shepherd is an easy visual category, it is not clear how one could represent a “two-year-old horse” with reliable images. We show that “imageability” can be quantified as a function of human subject consensus. Given the large scale of our dataset, ImageNet can offer, for the first time, an “imageability” measurement for a large number of concrete and countable nouns in the English language.
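The abstract mentions that each image is rated by a dynamically determined number of workers until consensus is reliable. The following is a minimal sketch of one way such adaptive voting could work; it is an illustration under assumed parameters (`threshold`, `min_votes`, `max_votes` are hypothetical), not the authors' actual algorithm.

```python
def needs_more_votes(votes, threshold=0.85, min_votes=3, max_votes=10):
    """Decide whether an image needs another worker's rating.

    votes: list of booleans, one per worker (True = image fits the synset).
    Collects more votes until the majority fraction reaches `threshold`,
    subject to a minimum and a hard cap on the number of votes.
    """
    if len(votes) < min_votes:
        return True          # always gather a baseline number of ratings
    if len(votes) >= max_votes:
        return False         # stop: vote budget exhausted
    majority = max(votes.count(True), votes.count(False)) / len(votes)
    return majority < threshold  # keep voting while agreement is weak

def consensus_label(votes):
    """Final label: True if most workers judged the image suitable."""
    return votes.count(True) > votes.count(False)

def imageability(votes):
    """Agreement fraction, a simple proxy for a synset's 'imageability'."""
    return max(votes.count(True), votes.count(False)) / len(votes)
```

Under this sketch, an unambiguous category like “German shepherd” would converge after few votes (high agreement), while a concept like “two-year-old horse” would exhaust the vote budget with low agreement, yielding a low imageability score.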
This work is supported by a Microsoft Research New Faculty Fellowship Award and a Princeton Frank Moss gift.