Abstract
Position and length are two dominant visual channels that underpin our ability to reason about the visual world, in settings ranging from natural scenes to charts and diagrams. Despite the progress of convolutional neural networks (CNNs) in image classification, we know little about how they handle cognitive tasks that require understanding the relationships between objects in charts. When asked, "What is the ratio between two bars?", a CNN must identify the bars in an image and the relationship between them to answer correctly. In this work, we ask how well a CNN copes with new images after it has learned from a particular set of ratios or bar heights, when the test ratios or heights are spread further apart from one another or shifted to values never seen before. Are machines more effective than humans at ratio-estimation tasks? In two experiments, we gave the models the same tests taken by human participants in order to compare the two kinds of experts. We controlled four train-test configurations in which the training and test distributions were either identical or shifted along two features: (1) the task label (ratios) and (2) the visual feature (bar heights). We found that sampling methods, when we aggregated their predictions, lowered errors up to three-fold relative to human errors. However, CNNs were not robust to distribution shift: accuracy dropped significantly when test instances fell outside the span of the training data. Next, we introduce a trial-by-trial comparison to measure whether the two experts systematically make errors on the same inputs. We found that, unlike humans, CNNs were remarkably consistent with one another and showed only a marginal preference for position over length. We further tested for the presence of Weber's law in CNNs and humans by selecting bar heights such that the height differences computed over pairs of images spanned a wide range. We observed Weber's law in neither CNNs nor humans.