Abstract
Recognizing objects in images is one of the most important functions of our visual system. We can recognize not only individual objects, such as the Eiffel Tower or our grandmother's face, but also categories of objects, such as people, shoes, or automobiles. Considerable attention has been devoted to formulating models and algorithms that may explain visual recognition and classification; however, no theory yet describes how these models may be trained automatically in realistic conditions: Can a child, or a machine, learn to recognize ‘faces’ and ‘cars’ only by looking? This is at best a difficult task: natural scenes are cluttered and may not contain explicit information on the presence, location, and structure of new objects, even if such objects are plentiful. We present a computational theory of how object models may be learned from images of such scenes. We model object categories probabilistically, as collections of parts that appear in a characteristic spatial arrangement. We demonstrate that it is possible to train such models successfully on unsegmented, cluttered images without supervision, and that object categories are an emergent property of the learning process. Our method is based on maximizing the likelihood of the model with respect to the training data in two steps: first, features that appear often in the environment are selected as probable object parts; second, constellations of such parts that tend to appear in consistent mutual positions are selected. The probabilistic description of such constellations is the model for an object class.
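The two-step learning procedure sketched in the abstract can be illustrated with a toy example. The code below is a minimal sketch, not the paper's actual algorithm: the feature detector, the full constellation model, and its likelihood maximization are replaced by synthetic detections (types 0 and 1 play the role of true object parts among clutter), a simple frequency count for step one, and, for step two, a search for the pair of candidate parts whose relative position varies least across images (a stand-in for the Gaussian likelihood of a two-part constellation). All data, variable names, and parameters here are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic "images": each is a list of (feature_type, x, y)
# detections. Types 0 and 1 are true object parts appearing at a consistent
# mutual offset; types 2-9 are clutter at random positions.
def make_image():
    ox, oy = rng.uniform(0, 100, size=2)  # object location in this image
    dets = [(0, ox + rng.normal(0, 1), oy + rng.normal(0, 1)),
            (1, ox + 20 + rng.normal(0, 1), oy + 5 + rng.normal(0, 1))]
    for _ in range(5):  # clutter detections
        dets.append((int(rng.integers(2, 10)), *rng.uniform(0, 100, size=2)))
    return dets

images = [make_image() for _ in range(200)]

# Step 1: select feature types that appear often across images
# as probable object parts.
counts = np.zeros(10)
for img in images:
    for t, _, _ in img:
        counts[t] += 1
candidates = np.argsort(counts)[-4:]  # keep the 4 most frequent types

# Step 2: among candidate pairs, pick the constellation whose mutual
# position is most consistent (lowest variance of the displacement).
def displacement_var(a, b):
    disps = []
    for img in images:
        pa = [(x, y) for t, x, y in img if t == a]
        pb = [(x, y) for t, x, y in img if t == b]
        if pa and pb:
            disps.append((pb[0][0] - pa[0][0], pb[0][1] - pa[0][1]))
    if len(disps) < 2:
        return np.inf
    return float(np.var(np.array(disps), axis=0).sum())

best = min(((a, b) for i, a in enumerate(candidates) for b in candidates[i + 1:]),
           key=lambda p: displacement_var(*p))
best = tuple(int(v) for v in best)
print(sorted(best))  # the true object parts, types 0 and 1
```

The clutter types appear in many images too, so frequency alone (step one) only narrows the field; it is the spatial-consistency criterion of step two that singles out the true parts, mirroring how the full method separates object structure from background.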