Abstract
One classical criterion for finding good features is that they should be predictable over space and time, such that they explain a large part of the input, can be attributed to whole objects, and are distinguishable from noise. However, natural images frequently contain object boundaries, across which such predictions fail. Here, we optimize predictions that take boundaries into account, such that the locations of boundaries in an image can be extracted from the activations. We formalize the features to be optimized as the activations of a convolutional neural network and form a prediction for each location as a product of predictions from its neighboring locations. Each neighbor predicts a mixture of a uniform distribution and a Gaussian centered on its own feature vector, corresponding to the cases with and without a boundary between the neighbor and the predicted location, respectively. We optimize the features and the predictions such that the predicted probability is higher for the feature vector at the predicted location than for feature vectors from randomly chosen locations (a contrastive predictive coding loss). We used unlabeled natural images from the MS COCO database to learn linear and deeper feature maps. The early linear features converge towards local averages of opponent colors and Gabor-like grating patterns that point towards the neighboring locations for which they are predictive. This is superficially consistent with our knowledge of early visual processing. We also evaluate the boundaries extracted from our model against the human contour annotations of the Berkeley Segmentation Database. To extract contours from our model, we use a computer vision method called globalization. The resulting contours are reasonable without further adjustment (all models F >= 0.63). Thus, we present a probabilistic model of the feature maps in early visual processing that takes object boundaries into account and can be learned without supervision.
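The core mechanism described above can be summarized in a minimal sketch: each neighbor assigns the target feature vector a probability under a Gaussian/uniform mixture, the per-neighbor predictions are multiplied (summed in log space), and a contrastive loss asks the true target to score higher than randomly drawn negatives. All function names and parameter values (`sigma`, `pi_boundary`, the uniform density `u`) are illustrative placeholders, not the paper's actual settings.

```python
import numpy as np

def neighbor_log_prob(target, neighbor, sigma=1.0, pi_boundary=0.1, u=1e-3):
    """Log-probability a single neighbor assigns to the target feature vector.

    Mixture of a Gaussian around the neighbor's own features (no boundary)
    and a uniform density u (boundary between the two locations).
    sigma, pi_boundary, and u are illustrative placeholder values.
    """
    d = target - neighbor
    k = target.shape[-1]
    log_gauss = (-0.5 * np.sum(d ** 2, axis=-1) / sigma ** 2
                 - 0.5 * k * np.log(2 * np.pi * sigma ** 2))
    # log[(1 - pi) * N(target; neighbor, sigma^2 I) + pi * u], computed stably
    return np.logaddexp(np.log1p(-pi_boundary) + log_gauss,
                        np.log(pi_boundary) + np.log(u))

def contrastive_loss(target, neighbors, negatives, **kw):
    """Contrastive predictive coding loss for one predicted location.

    The joint prediction is the product over neighbors (sum of logs); the
    loss is the softmax cross-entropy of the true target against negatives
    drawn from randomly chosen locations.
    """
    pos = sum(neighbor_log_prob(target, n, **kw) for n in neighbors)
    negs = np.array([sum(neighbor_log_prob(f, n, **kw) for n in neighbors)
                     for f in negatives])
    scores = np.concatenate(([pos], negs))
    return -(pos - np.logaddexp.reduce(scores))
```

Features close to their neighbors' predictions receive a low loss; a feature vector far from all neighbors is better explained by the uniform (boundary) component, which is what makes boundary locations recoverable from the activations.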