Abstract
Perceptual grouping is the problem of determining which features go together and in what configuration. Since this is a computationally hard problem, it is important to ask whether object perception really depends on perceptual grouping. For example, under ideal conditions, a collection of local features may be sufficient to classify an object. These features could be computed via a feedforward process, obviating the need for perceptual grouping. Indeed, this fast feedforward ‘bag of features’ conception of object processing is prevalent in both human and computer vision research.
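To make the weakness of this conception concrete, the following is a minimal toy sketch of a bag-of-features encoder (my own illustrative construction, not one of the models reviewed here; the patch size, codebook, and data are hypothetical stand-ins): local patches are coded independently and pooled into an orderless histogram.

```python
# A toy bag-of-features encoder (illustrative only; patch size, codebook,
# and data are hypothetical stand-ins, not the models reviewed here).
import numpy as np

rng = np.random.default_rng(0)

def extract_patches(image, size=8, stride=8):
    """Flatten every size x size patch on a regular grid."""
    h, w = image.shape
    return np.stack([image[i:i + size, j:j + size].ravel()
                     for i in range(0, h - size + 1, stride)
                     for j in range(0, w - size + 1, stride)])

def bag_of_features(patches, codebook):
    """Assign each patch to its nearest codeword, then pool into an
    orderless histogram; where each patch occurred is discarded."""
    dists = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    hist = np.bincount(dists.argmin(axis=1), minlength=len(codebook))
    return hist / hist.sum()

image = rng.standard_normal((64, 64))        # stand-in image
codebook = rng.standard_normal((32, 8 * 8))  # stand-in 32-word codebook

patches = extract_patches(image)
intact = bag_of_features(patches, codebook)
scrambled = bag_of_features(rng.permutation(patches), codebook)

# Identical representations: spatial configuration contributes nothing.
assert np.allclose(intact, scrambled)
```

Because the pooled histogram is invariant to how the patches are arranged, such a model cannot, even in principle, distinguish an intact object from a scrambled one; the psychophysics reviewed below probes exactly this distinction.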
Here I will review psychophysical and computational research that challenges the ability of this class of model to explain object perception. Psychophysical assessment shows that humans are largely unable to pool local shape features to make object judgements unless these features are configured holistically. Further, the formation of these perceptual groups is itself found to rely on holistic shape representations, pointing to a recurrent circuit that conditions local grouping computations on this holistic encoding.
While feedforward deep learning models for object classification are more powerful than earlier bag-of-features models, I find that these models also fail to capture human sensitivity to holistic shape and perceptual robustness to occlusion. This leads to the hypothesis that a computational model designed to solve perceptual grouping tasks as well as object classification will provide a better account of human object perception, and I will highlight how optimal solutions to these grouping tasks typically fuse feedforward local computations with holistic optimization and feedback.
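As a schematic illustration of that fusion (my own toy construction, not a model from the reviewed work; all parameters are hypothetical), consider grouping points into a circle by iteratively re-weighted fitting: a feedforward pass measures local features, a holistic model is fit to the weighted evidence, and feedback from that fit re-weights the local grouping.

```python
# A schematic fusion of feedforward local evidence with holistic
# optimization and feedback (a toy construction for illustration:
# grouping points into a circle by iteratively re-weighted fitting).
import numpy as np

rng = np.random.default_rng(1)

# Local features: points on a circle (the 'object') amid clutter.
theta = rng.uniform(0, 2 * np.pi, 40)
on_circle = np.c_[np.cos(theta), np.sin(theta)] + 0.02 * rng.standard_normal((40, 2))
pts = np.vstack([on_circle, rng.uniform(-2, 2, (20, 2))])

w = np.ones(len(pts))  # feedforward pass: every local feature starts equal
for _ in range(10):
    # Holistic step: weighted algebraic circle fit, x^2 + y^2 = ax + by + c.
    A = np.c_[pts, np.ones(len(pts))] * w[:, None]
    b = (pts ** 2).sum(axis=1) * w
    a1, a2, c = np.linalg.lstsq(A, b, rcond=None)[0]
    cx, cy = a1 / 2, a2 / 2
    r = np.sqrt(c + cx ** 2 + cy ** 2)
    # Feedback step: local grouping weights conditioned on the holistic fit.
    resid = np.abs(np.linalg.norm(pts - [cx, cy], axis=1) - r)
    w = np.exp(-(resid / 0.2) ** 2) + 1e-9

print(f"fit: center ({cx:.2f}, {cy:.2f}), radius {r:.2f}; "
      f"{(w > 0.5).sum()} of {len(pts)} features grouped into the object")
```

In this sketch the holistic fit and the local grouping weights mutually constrain one another across iterations, with the circle's points retained and the clutter rejected; this recurrent motif, rather than a single feedforward pass, is the computational structure the hypothesis points to.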