Abstract
Deep learning models can be trained to recognize 3D objects from a point cloud, i.e., a discrete set of points randomly sampled from the surfaces of 3D objects. The Dynamic Graph Convolutional Neural Network (DGCNN) takes as input the 3D coordinates of 1024 points and reaches human-level recognition performance. DGCNN is trained to project the 3D coordinates of each point into a high-dimensional (256-dimensional) space of geometric features, and then makes recognition decisions based on these features. However, it remains unclear which geometric features DGCNN extracts to support object recognition, and whether DGCNN uses 3D shape representations similar to those of humans. We used an activation maximization method to identify the preferred input point-cloud pattern that maximally activates each neuron in DGCNN. We found that lower-level layers learn local geometric features in small regions (e.g., corners with different curvatures), while higher-level layers pick up more complex patterns in larger regions (e.g., surfaces with different curvatures, parallel segments, elongated segments). We next examined the robustness of humans and DGCNN in 3D object recognition. Human participants were asked to classify ten common objects shown as point clouds rotated in depth. The point-cloud displays contained either all 1024 points (100%) or a down-sampled cloud with fewer points (20%, 30%, etc.). For most objects (e.g., airplanes, chairs), human performance remained robust even with only 20% of the points. In contrast, DGCNN recognition performance degraded sharply when fewer than 60% of the points were included, dropping to chance level at 20%. These results imply that humans primarily rely on global shape in 3D object recognition, whereas DGCNN relies on local geometric features. Thus DGCNN (like a standard CNN) learns local geometric properties rather than the global shapes of objects and is therefore vulnerable to adversarial attacks that make minor alterations to local geometry.
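The two computational procedures named above, activation maximization and point-cloud down-sampling, can be sketched in a few lines. This is a minimal illustration only: the ReLU "neuron" below is a hypothetical stand-in for a DGCNN unit (the trained network itself is not reproduced here), and the weights `w`, bias `b`, learning rate, and step count are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "neuron": activation = sum over points of ReLU(w . p + b).
# A hypothetical stand-in for one DGCNN unit; weights are random, not trained.
w = rng.normal(size=3)
b = 0.1

def activation(points):
    z = points @ w + b          # per-point pre-activation, shape (N,)
    return np.maximum(z, 0.0).sum()

def grad_wrt_input(points):
    z = points @ w + b
    mask = (z > 0).astype(float)        # ReLU subgradient
    return mask[:, None] * w[None, :]   # d(activation)/d(points), shape (N, 3)

# Activation maximization: gradient ascent on the INPUT point cloud,
# holding the neuron's parameters fixed.
points = rng.normal(scale=0.1, size=(1024, 3))
init_act = activation(points)
lr = 0.01
for _ in range(100):
    points += lr * grad_wrt_input(points)

# Down-sampling as in the robustness test: keep a random 20% of the points.
keep = rng.choice(len(points), size=int(0.2 * len(points)), replace=False)
subset = points[keep]
```

In the study itself the ascent is run through the full trained DGCNN (e.g., via automatic differentiation) rather than an analytic gradient, and the preferred input pattern is read off from the optimized point cloud.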