Abstract
A major question in high-level vision is how to represent image structure for the task of recognition. Although there is unlikely to be a unitary answer for all tasks, it is important to look for strategies that are reasonably general and not too domain-dependent. To this end, we have been developing a simple qualitative representation scheme. Motivated by the rapidly saturating contrast-response functions of many early visual neurons, the scheme relies exclusively on local ordinal relationships between attributes of neighboring image regions. For instance, if the attribute of interest is luminance, an image is encoded as a collection of luminance inequalities, without maintaining absolute luminance levels or even the magnitudes of contrast. Last year at VSS, we described the use of this scheme for encoding faces. Tests with a computer implementation yielded encouraging face-detection performance on novel images. However, it was not clear whether the scheme was applicable only to encoding faces or whether it could be used more generally. To address this issue, we have examined the effectiveness of the qualitative representation scheme in a very different domain: natural and urban scenes. Each scene is represented as a collection of inequalities defined over the luminance and hue of neighboring regions. The set of regions is initially arbitrary but can be reduced in cardinality by learning across multiple inputs. We find that this representation yields good results on the tasks of scene indexing (retrieving scenes similar to a given image from a database) and scene categorization. Overall, we find it encouraging that the qualitative representation strategy appears to be applicable across two seemingly very different high-level tasks: object detection and scene categorization. We shall describe the strengths and limitations of this scheme and its prospects for serving as a versatile image-representation strategy.
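To make the encoding concrete, the following is a minimal sketch in Python of one possible instantiation of the scheme, not the authors' implementation: it partitions an image into a fixed grid of regions (the abstract leaves the initial region set arbitrary), keeps only the sign of the luminance difference between neighboring regions, and compares two images by the fraction of inequalities on which they agree. The grid partition, function names, and similarity measure are all illustrative assumptions.

```python
import numpy as np

def region_means(image, grid=(8, 8)):
    """Mean luminance of each cell in a fixed grid of regions.
    (The grid is an illustrative choice; the abstract leaves the
    initial set of regions arbitrary.)"""
    h, w = image.shape
    gh, gw = grid
    means = np.empty(grid)
    for i in range(gh):
        for j in range(gw):
            means[i, j] = image[i * h // gh:(i + 1) * h // gh,
                                j * w // gw:(j + 1) * w // gw].mean()
    return means

def ordinal_code(means):
    """Keep only the sign of the luminance difference between
    horizontally and vertically adjacent regions; absolute levels
    and contrast magnitudes are discarded."""
    horiz = np.sign(means[:, 1:] - means[:, :-1])
    vert = np.sign(means[1:, :] - means[:-1, :])
    return np.concatenate([horiz.ravel(), vert.ravel()])

def ordinal_similarity(means_a, means_b):
    """Fraction of neighboring-region inequalities on which two
    images agree (a crude stand-in for an indexing/matching score)."""
    return float(np.mean(ordinal_code(means_a) == ordinal_code(means_b)))

# The ordinal code is unchanged under monotone luminance rescaling:
rng = np.random.default_rng(0)
img = rng.random((64, 64))
rescaled = 0.5 * img + 0.2          # dimmer, lower-contrast version
print(ordinal_similarity(region_means(img), region_means(rescaled)))  # 1.0
```

Because only the signs of differences are retained, any affine rescaling of luminance (as in the final lines above) transforms region means monotonically and leaves the code untouched, which illustrates the invariance the abstract attributes to discarding absolute levels and contrast magnitudes.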