Abstract
Humans can accurately categorize natural scenes from both photographs and line drawings. Gestalt grouping principles, such as proximity, parallelism, and symmetry, are known to aid human performance in such complex perceptual tasks. Convolutional Neural Networks (CNNs) for scene categorization, on the other hand, rely heavily on color, texture, and shading cues in color photographs. These cues are largely absent from line drawings, which convey contour-based shape information. Here we show in computational experiments that CNNs pre-trained on color photographs can recognize line drawings of scenes, and that explicitly adding mid-level grouping cues, such as parallelism, symmetry, and proximity, can improve CNN performance. Our contributions are threefold: (1) In addition to artist-drawn line drawings, we introduce computer-generated line drawings extracted from two large scene databases, MIT67 and Places365, using a fast edge detection algorithm followed by post-processing. We demonstrate that off-the-shelf pre-trained CNNs perform contour-based scene classification well above chance on these datasets. (2) We evaluate computational methods for computing local measures of contour grouping based on medial axis representations of the scenes. Specifically, we compute salience measures for contour separation (corresponding to the Gestalt principle of proximity), ribbon symmetry (parallelism), and taper symmetry (mirror symmetry). We show that these grouping cues prioritize contour pixels according to how informative they are for scene categorization. The observed variations in CNN classification performance for subsets of these measures qualitatively match those in scene categorization by human observers. (3) Explicitly adding these salience measures to the line drawings boosts CNN performance over the use of line drawings alone.
Overall, our results indicate an important role for perceptually motivated Gestalt grouping cues in contour-based scene classification by state-of-the-art computer vision systems, as demonstrated on datasets of complexity not yet considered in human vision studies.
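To make the salience idea in contribution (2) concrete, the following is a minimal illustrative sketch, not the method used in the work: the abstract's measures are computed from medial axis representations, whereas here a simplified stand-in scores each contour pixel by inverse distance to the nearest pixel of a different contour, so closely spaced contours receive higher weight (the proximity cue). The function name `proximity_salience`, the labeled-contour input, and the `1/(1+d)` weighting are all our own illustrative assumptions.

```python
import numpy as np

def proximity_salience(edge_map, labels):
    """Toy 'proximity' salience (illustrative stand-in, not the paper's
    medial-axis measure): each contour pixel scores the inverse distance
    to the nearest pixel belonging to a *different* contour.

    edge_map : 2-D binary array marking contour pixels
    labels   : 2-D integer array assigning each contour pixel a contour id
    """
    ys, xs = np.nonzero(edge_map)
    pts = np.stack([ys, xs], axis=1).astype(float)
    labs = labels[ys, xs]
    sal = np.zeros(edge_map.shape, dtype=float)
    for p, lab in zip(pts, labs):
        other = pts[labs != lab]          # pixels on other contours
        if other.size == 0:
            continue                      # isolated contour: score stays 0
        d = np.sqrt(((other - p) ** 2).sum(axis=1)).min()
        sal[int(p[0]), int(p[1])] = 1.0 / (1.0 + d)
    return sal
```

A salience-weighted drawing in this sketch would then be `edge_map * proximity_salience(edge_map, labels)`, i.e. contour pixels carrying their grouping score instead of a binary value, which is the spirit of feeding the salience measures to the CNN alongside the line drawing.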
Acknowledgement: Natural Sciences and Engineering Research Council of Canada (NSERC)