Abstract
One popular toy for toddlers involves sorting block shapes into their respective holes. While toddlers require trial-and-error actions to sort blocks correctly, adults can rapidly see the appropriate solution through visual inspection alone. This feat requires an understanding of 3D shape and mental rotation. We study this task in a simplified vision-only setting by generating “shapes” of varying complexity using square matrices filled with connected binary regions, and “holes” by taking the negative region. “Fits” and “doesn’t fit” conditions are created while ensuring that shapes do not match exactly and that the total filled area is the same in both conditions. These matrices are rendered into black-and-white images (“bw”) and into more realistic rendered scenes. Human observers performed a single-interval fits / doesn’t fit task at two complexity levels for both bw and rendered scenes. Performance was high for both bw (average d’ high complexity = 2.6, low complexity = 3.1) and rendered scenes (d’ high = 2.7, low = 3.1), showing that humans can indeed perform this task well. To assess whether current machine vision systems can learn this task, we fine-tuned the weights of a convolutional neural network (CNN; ResNet-50) on 250k bw images at four complexity levels. The network achieved a test accuracy of 94% (same complexity) and 87% (generalisation to higher complexity). For the same images seen by humans, the network performed better than humans at both complexity levels (d’ high = 3.5, low = 4.4), but there was no correlation between human response time and network logit. This suggests that the network solves the task in a non-humanlike way. While this CNN can learn to exceed human performance at this particular task, we expect the model to fail further tests of generalisation because it does not understand the physical properties of the hole-in-the-wall task.
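The stimulus construction described above can be illustrated with a minimal sketch. Everything specific here is an assumption rather than the procedure actually used: the grid size, the cell counts, the random-growth rule, the use of 4-connectivity, and the hypothetical helper names `grow_region`, `is_connected` and `fits`. The sketch ignores rotation and does not enforce the equal-area constraint between conditions; it only shows a connected binary "shape", a "hole" obtained as the negative of a region, and a fit test in which the shape lies inside the opening without matching it exactly.

```python
import numpy as np

# 4-connectivity is an assumption; the abstract only says "connected".
NEIGHBOURS = ((-1, 0), (1, 0), (0, -1), (0, 1))


def grow_region(size, n_cells, rng, seed_grid=None):
    """Grow a connected binary region to n_cells cells on a size x size grid.

    Hypothetical helper: starts from the centre cell (or from seed_grid)
    and repeatedly fills a random empty neighbour of a random filled cell.
    """
    if n_cells > size * size:
        raise ValueError("n_cells exceeds the number of grid cells")
    if seed_grid is None:
        grid = np.zeros((size, size), dtype=bool)
        grid[size // 2, size // 2] = True
    else:
        grid = seed_grid.copy()
    filled = [tuple(int(v) for v in rc) for rc in np.argwhere(grid)]
    while len(filled) < n_cells:
        r, c = filled[rng.integers(len(filled))]
        dr, dc = NEIGHBOURS[rng.integers(len(NEIGHBOURS))]
        nr, nc = r + dr, c + dc
        if 0 <= nr < size and 0 <= nc < size and not grid[nr, nc]:
            grid[nr, nc] = True
            filled.append((nr, nc))
    return grid


def is_connected(grid):
    """Flood-fill check that all filled cells form one 4-connected region."""
    cells = {tuple(int(v) for v in rc) for rc in np.argwhere(grid)}
    if not cells:
        return False
    start = next(iter(cells))
    stack, seen = [start], {start}
    while stack:
        r, c = stack.pop()
        for dr, dc in NEIGHBOURS:
            nb = (r + dr, c + dc)
            if nb in cells and nb not in seen:
                seen.add(nb)
                stack.append(nb)
    return seen == cells


def fits(shape, opening):
    """True if every filled shape cell lies inside the opening (no rotation)."""
    return bool(np.all(opening[shape]))


rng = np.random.default_rng(0)
shape = grow_region(8, 10, rng)                      # the "shape"
opening = grow_region(8, 14, rng, seed_grid=shape)   # slightly larger region
wall = ~opening                                      # the "hole": negative of the region

print(fits(shape, opening), np.array_equal(shape, opening))
```

Growing the opening from the shape guarantees a "fits" pair whose outlines differ; a "doesn't fit" pair would need at least one shape cell outside the opening, with areas then matched across conditions as the abstract requires.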
Acknowledgement: Funded by the German Federal Ministry of Education and Research (BMBF) through the Bernstein Computational Neuroscience Program Tuebingen (FKZ: 01GQ1002), the German Excellency Initiative through the Centre for Integrative Neuroscience Tuebingen (EXC307), and the German Science Foundation (DFG priority program 1527, BE 3848/2-1 and SFB 1233, Robust Vision: Inference Principles and Neural Mechanisms, TP03).