In simulation 1, we performed the most basic and stringent test of abstract relational reasoning. We trained several models on the original problem #1 and then presented them with 5,600 images from each of the 10 stimulus test sets. That is, our testing conditions consisted of new images from the original training set (replicating Funke et al., 2021) and novel images from the other nine test datasets that were not seen during training. As noted elsewhere in this article, a model that has learned the abstract same and different relations should generalize its learning on the same-different task independently of the pixel-level similarity to the original SVRT data.
In simulation 1, we tested three sets of models based on the ResNet architecture. The first set consisted of four ResNet-50 classifiers. All models consisted of a ResNet-50 convolutional front end followed by a hidden layer with 1,024 units and ReLU activation (see Figure 3A). In simulation 1, the output layer consisted of a single sigmoid unit that predicted the probability that the input image belonged to the category same. We pretrained the models’ convolutional front end on either ImageNet (Deng et al., 2009) or TU-Berlin (Eitz et al., 2012), a dataset of human-generated sketches. Furthermore, we varied how we treated the output of the convolutional front end before passing it to the hidden layer: we either applied a global average pooling (GAP) operation to the output, as in Funke et al. (2021), or flattened it, as in Messina et al. (2021).
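As a concrete illustration, the following is a minimal sketch of one of these classifiers. The framework (PyTorch/torchvision), input resolution, and variable names are our assumptions for illustration; only the architecture described above comes from the text.

```python
# Minimal sketch of a ResNet-50 same-different classifier (assumed
# PyTorch/torchvision implementation; feat_hw = 4 assumes 128x128 inputs).
import torch
import torch.nn as nn
from torchvision import models

class SameDifferentClassifier(nn.Module):
    def __init__(self, pooling="gap", feat_hw=4):
        super().__init__()
        backbone = models.resnet50(weights="IMAGENET1K_V1")  # ImageNet front end
        # Keep the convolutional stages; drop the avgpool/fc head.
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        self.pooling = pooling
        in_dim = 2048 if pooling == "gap" else 2048 * feat_hw * feat_hw
        self.hidden = nn.Linear(in_dim, 1024)  # 1,024-unit hidden layer
        self.out = nn.Linear(1024, 1)          # single sigmoid unit: P(same)

    def forward(self, x):
        f = self.features(x)                   # (B, 2048, H, W)
        if self.pooling == "gap":
            f = f.mean(dim=(2, 3))             # GAP, as in Funke et al. (2021)
        else:
            f = torch.flatten(f, 1)            # flattening, as in Messina et al. (2021)
        h = torch.relu(self.hidden(f))
        return torch.sigmoid(self.out(h))
```

The TU-Berlin variant would load sketch-pretrained weights into the same front end in place of the ImageNet weights.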
The second set of models comprised different versions of the ResNet architecture that varied in depth. In particular, we used ResNet-18, ResNet-34, ResNet-101, and ResNet-152 front ends with GAP pooling and ImageNet pretraining, because this was the most successful condition in the first set of models. The goal of testing these models was to assess the potential role of network depth in the generalization of same-different discrimination.
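Under the same assumed setup as the sketch above, the depth manipulation amounts to swapping the front end. Note that torchvision's ResNet-18/34 end in 512 feature channels rather than 2,048, so the hidden layer's input size must follow the backbone (an implementation detail we assume here).

```python
# Sketch of the depth manipulation: swap the ResNet front end, keep the
# GAP + hidden-layer head (feature widths per torchvision's ResNets).
import torch.nn as nn
from torchvision import models

RESNETS = {
    18: (models.resnet18, 512),
    34: (models.resnet34, 512),
    101: (models.resnet101, 2048),
    152: (models.resnet152, 2048),
}

def build_variant(depth):
    ctor, feat_dim = RESNETS[depth]
    backbone = ctor(weights="IMAGENET1K_V1")    # ImageNet pretraining
    features = nn.Sequential(*list(backbone.children())[:-2])
    head = nn.Sequential(
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),  # GAP pooling
        nn.Linear(feat_dim, 1024), nn.ReLU(),   # hidden layer
        nn.Linear(1024, 1), nn.Sigmoid(),       # P(same)
    )
    return nn.Sequential(features, head)
```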
The third set of models consisted of two variations of a relation network (Santoro et al., 2017). This architecture is especially relevant for the present study because it was explicitly designed to perform relational reasoning in the visual domain and it is fully compatible with DCNNs. As illustrated in Figure 3B, a relation network consists of a convolutional front end that outputs a series of filters and a relation module. The relation module organizes the filter activations into columns that correspond to specific positions across filters (denoted by different colors in Figure 3B) and generates all possible pairs of columns. All these pairs are processed by a single multilayer perceptron, \(g_{\theta}\), yielding a vector per pair. These vectors are summed and passed through a second multilayer perceptron, \(f_{\phi}\), which yields the final same-different prediction. Note that the feature columns input to the relation module do not necessarily represent objects or object parts. Instead, they represent whatever is in their corresponding receptive fields (e.g., the background, a texture, or even multiple objects at the same time). We created two versions of the relation network by varying the filter inputs to the relation module. In the first version, we used the output of the last convolutional layer of ResNet-50 (pretrained on ImageNet), which consisted of 2,048 \(4 \times 4\) filters. Because the original relation network of Santoro et al. (2017) used a CNN front end with filter outputs of size \(8 \times 8\), in the second version we used the 1,024 output filters of the last convolutional layer of ResNet-50 with filter size \(8 \times 8\).
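The relation module itself can be written compactly. The following is a minimal sketch following Santoro et al. (2017); the MLP widths are illustrative assumptions, not the hyperparameters used here.

```python
# Minimal sketch of a relation module (after Santoro et al., 2017).
# MLP widths are illustrative; channels = 2048 (4x4 maps) or 1024 (8x8 maps)
# for the two versions described above.
import torch
import torch.nn as nn

class RelationModule(nn.Module):
    def __init__(self, channels, hidden=512):
        super().__init__()
        # g_theta processes every pair of feature columns.
        self.g = nn.Sequential(
            nn.Linear(2 * channels, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # f_phi maps the summed pair codes to the same-different prediction.
        self.f = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, fmap):                    # fmap: (B, C, H, W)
        B, C, H, W = fmap.shape
        cols = fmap.flatten(2).transpose(1, 2)  # (B, N, C): one column per position
        N = H * W
        a = cols.unsqueeze(2).expand(B, N, N, C)
        b = cols.unsqueeze(1).expand(B, N, N, C)
        pairs = torch.cat([a, b], dim=-1)       # all N*N pairs of columns
        g_out = self.g(pairs).sum(dim=(1, 2))   # sum the per-pair vectors
        return torch.sigmoid(self.f(g_out))     # P(same)
```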
Following the recommendations of Mehrer et al. (2020), who argue that conclusions about network behavior should be based on groups of network instances, we trained 10 instances of each model. We used the Adam optimizer (Kingma & Ba, 2014). Training proceeded in two stages. In the first stage, the pretrained ResNet network was frozen while the rest of the network was trained with a learning rate of 0.0003. In the second stage, the complete model was trained with a learning rate of 0.0001. The training data consisted of the original data from SVRT problem #1. In the first stage, the model was trained on 28,000 images for 5 epochs with batches of 64 samples. In the second stage, the model was trained on the same images for 10 epochs with the same batch size.
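A sketch of this two-stage schedule, under the same assumed PyTorch setup as above (the data loader and the model's features attribute are our assumptions):

```python
# Two-stage training: freeze the pretrained front end, then fine-tune all
# weights at a lower learning rate (rates and epoch counts from the text).
import torch

def run_epochs(model, loader, opt, loss_fn, epochs):
    model.train()
    for _ in range(epochs):
        for x, y in loader:                     # batches of 64 samples
            opt.zero_grad()
            loss = loss_fn(model(x).squeeze(1), y.float())
            loss.backward()
            opt.step()

def train_two_stage(model, loader):
    bce = torch.nn.BCELoss()

    # Stage 1: frozen ResNet front end, lr = 0.0003, 5 epochs.
    for p in model.features.parameters():
        p.requires_grad = False
    opt = torch.optim.Adam(
        (p for p in model.parameters() if p.requires_grad), lr=3e-4)
    run_epochs(model, loader, opt, bce, epochs=5)

    # Stage 2: full model, lr = 0.0001, 10 epochs.
    for p in model.features.parameters():
        p.requires_grad = True
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    run_epochs(model, loader, opt, bce, epochs=10)
```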
Because same-different decisions were often performed on test datasets with distributions different from that of the training data, it is possible that there is a different optimal classification threshold for each test dataset. To account for this, we used the area under the receiver operating characteristic (ROC) curve (AUC), a performance measure that takes into consideration all possible classification thresholds. AUC values range from 0.0 to 1.0, where 0.5 corresponds to chance-level responding and 1.0 to perfect classification. The AUC can be interpreted as the probability that a randomly sampled example of the positive category (same) will be assigned a higher predicted probability than a randomly sampled example of the negative category (different) (Hanley & McNeil, 1982). We interpreted the AUC values according to the general guidelines of Hosmer et al. (2013; see Table 1).
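Because the AUC is threshold-free, it can be computed directly from the models' sigmoid outputs. A minimal sketch using scikit-learn (the helper name and loader are our assumptions):

```python
# Per-test-set AUC from the model's sigmoid outputs (scikit-learn's
# roc_auc_score sweeps all classification thresholds).
import numpy as np
import torch
from sklearn.metrics import roc_auc_score

def test_set_auc(model, loader):
    model.eval()
    labels, scores = [], []
    with torch.no_grad():
        for x, y in loader:                     # y: 1 = same, 0 = different
            scores.append(model(x).squeeze(1).numpy())
            labels.append(y.numpy())
    return roc_auc_score(np.concatenate(labels), np.concatenate(scores))
```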