Since some specific neural network architectures were used in multiple works that will be presented in this article, we will briefly give an overview of how they work and why they might be used in certain circumstances:
CNNs: When applied to image data, the inputs of a neuron can be organized so that the output of the neuron is equivalent to the application of a filter (e.g., Gabor filters) to a specific image region. The kernel of the filter directly corresponds to the weights of the neuron. Since, in most cases, a filter for one region of an image will be equally helpful for other regions, applying the same “filter neuron” for all positions of an image is common. This procedure is equivalent to a convolution between the filter kernel and the image. Hence, layers of such neurons are called convolutional layers and neural networks making use of such layers are called CNNs. In essence, a CNN is purposefully designed to efficiently process and learn from two-dimensional data and utilize spatial invariance, which is present in many images to a certain degree.
Note that learning the weights means that the kernels of the filters used in a CNN are also learned from the data and are not predetermined. For CNNs, it has been shown that this training procedure leads to the early layers extracting simple features, which are then combined into increasingly complex features as the information flows toward higher layers. This hierarchical extraction of features has been demonstrated exceptionally well in a series of articles by
Cammarata et al. (2020).
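As a minimal illustration of this weight sharing, the following sketch (using PyTorch; the hand-set kernel values are purely illustrative and would normally be learned) applies one and the same 3 × 3 kernel at every position of an image:

```python
import torch
import torch.nn as nn

# A single "filter neuron": one 3x3 kernel shared across all image positions.
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, padding=1, bias=False)

# The kernel corresponds to the weights of the neuron. Here it is set by hand
# to a simple horizontal edge detector; in a trained CNN these values are learned.
with torch.no_grad():
    conv.weight[:] = torch.tensor([[[[-1., -1., -1.],
                                     [ 0.,  0.,  0.],
                                     [ 1.,  1.,  1.]]]])

image = torch.randn(1, 1, 28, 28)   # (batch, channels, height, width)
feature_map = conv(image)           # the same kernel is slid over every position
print(feature_map.shape)            # torch.Size([1, 1, 28, 28])
```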
Residual networks (ResNets), introduced by
He, Zhang, Ren, and Sun (2016), are one of the standard CNN architectures widely used in practice because they overcome one shortcoming of plain CNN architectures: The expressivity of a neural network (i.e., the complexity of the computed function) grows exponentially with the number of layers, but only linearly with the number of trainable parameters. This has been shown theoretically for fully connected networks by
Raghu, Poole, Kleinberg, Ganguli, and Sohl-Dickstein (2017), and empirical evidence shows that this likely also holds for CNNs. So deeper networks would generally be preferred to shallower ones. Unfortunately, just stacking more layers leads to the so-called
degradation problem, where the accuracy a network achieves when being trained on a specific dataset decreases as the network gets deeper. This is somewhat counterintuitive since unneeded layers could just be optimized to resemble an identity function, resulting in an output identical to that of a shallower network. However, this does not happen in practice, indicating that deeper networks are generally harder to optimize if their architecture is not adapted.
Residual networks mitigate this problem by not only sending an input
\(\mathbf {x}\) through some of the network layers themselves but also adding the input to the output of the layers at a later point (see
Figure 2). This forwarding of the input to deeper layers is called a shortcut or skip connection. If the layers themselves calculate
\(\mathcal {F}(\mathbf {x})\), this whole block calculates
\(\mathcal {F}(\mathbf {x}) + \mathbf {x}\) and is called a
residual block. By optimizing the weights of the layers, we are optimizing a residual term, hence the name residual networks.
For the standard residual networks, the layers themselves are convolutional layers, and the whole network consists mainly of a sequence of such residual blocks shown in
Figure 2. Although such a residual block should theoretically not be able to learn more than the same network without the skip connection, currently used optimization schemes seem to have a much easier time optimizing this alternative residual reformulation of the original problem. One reason is that instead of learning an identity function, the layers in a residual block only have to be pushed to output zero since the skip connection already implements the identity function. Another advantage might be that the skip connections always provide a path through which the training signal (via the gradient) can reach a layer without having to pass through all the layers in between.
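Such a residual block can be sketched as follows (a simplified PyTorch sketch; the original blocks of He et al. (2016) additionally use batch normalization and handle changes in the number of channels and the feature-map size):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Computes F(x) + x, where F is a small stack of convolutional layers."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = self.conv2(self.relu(self.conv1(x)))  # F(x)
        return self.relu(residual + x)                   # F(x) + x via the skip connection

block = ResidualBlock(channels=64)
x = torch.randn(1, 64, 32, 32)
print(block(x).shape)  # torch.Size([1, 64, 32, 32])
```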
In
Peer, Stabinger, and Rodríguez-Sánchez (2021), we presented another reason why such skip connections improve the training outcome. We were able to detect layers in neural networks that we named
conflicting layers, where inputs with different labels collapse to a single point in the activation vector space. We showed theoretically and empirically that conflicting layers degenerate the gradient during training so that the weights of the neural network are updated in the wrong direction, leading to worse training outcomes. We could also show that residual connections skip these conflicting layers.
All these reasons might explain why skip connections seem to perform well in practice and are among the standard architectural components of most modern deep neural networks. Because of this, residual networks have become one of the most widely used architectures for computer vision applications.
Long short-term memory networks (LSTM-networks) are a type of neural network architecture developed by
Hochreiter and Schmidhuber (1997) for processing sequences of inputs and are an example of so-called recurrent neural networks (RNNs), in contrast to feed-forward neural networks like CNNs. Given a sequence
\((\mathbf {x_0}, \mathbf {x_1}, \cdots , \mathbf {x_n})\), each vector
\(\mathbf {x_t}\) of this sequence is iteratively fed to the LSTM as an input, which produces a hidden state
\(\mathbf {h_t}\) as well as a cell state
\(\mathbf {c_t}\).
\(\mathbf {h_t}\) is used as the output of the LSTM for step
\(t\), but the contents of
\(\mathbf {h_t}\) and
\(\mathbf {c_t}\) are also used, together with
\(\mathbf {x_{t+1}}\), as the input to the LSTM for step
\(t+1\). The network can therefore forward information to itself in the future (i.e., it can “remember” information).
Figure 3 shows how such an LSTM-network is applied to a sequence of inputs.
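In code, this iterative application corresponds to a simple loop (a PyTorch sketch using a single LSTM cell; the input and hidden sizes are arbitrary):

```python
import torch
import torch.nn as nn

cell = nn.LSTMCell(input_size=10, hidden_size=20)

sequence = [torch.randn(1, 10) for _ in range(5)]  # (x_0, ..., x_4), batch size 1
h = torch.zeros(1, 20)                             # hidden state h_t
c = torch.zeros(1, 20)                             # cell state c_t

outputs = []
for x_t in sequence:
    # h_t and c_t from the previous step are fed back in together with the next input.
    h, c = cell(x_t, (h, c))
    outputs.append(h)                              # h_t is the output for step t
```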
In practice, the LSTM is not iteratively applied to the sequence, but the iterations are unrolled. During unrolling, for a sequence of length
\(n\), the same LSTM is replicated for each of the
\(n\) iterations, transforming recurrent connections to feed-forward connections, and the resulting bigger system is treated as a single neural network, which can consume the whole sequence at once (see the right side of
Figure 3).
What information is encoded in \(\mathbf {c_t}\) and \(\mathbf {h_t}\) is not predefined but is learned from the training data by the LSTM via multiple internal neural networks. The unrolled network is trained like any other neural network using a loss function and gradient descent. That is, an expected output sequence \((\mathbf {y_0}, \mathbf {y_1}, \cdots , \mathbf {y_n})\) is compared to the actual output of the LSTM \((\mathbf {h_0}, \mathbf {h_1}, \cdots , \mathbf {h_n})\) via an appropriate loss function, a gradient with respect to the network weights is calculated, and gradient descent is used to change the weights of the LSTM in the right direction. Note that all copies of the LSTM that were “produced” during unrolling are still the same network and have to stay identical during and after training. Therefore, the weight updates of all instances of the LSTM are aggregated and applied to all instances of the LSTM. Since, after unrolling, the gradient propagates through all the duplicates of the LSTM for all the elements of the sequence, the LSTM can “learn” to remember some information because it will be helpful later.
Often, the output needed from an LSTM is not a sequence of vectors but a single vector (e.g., for classifying a sequence), in which case only the output \(\mathbf {h_n}\) for the last element in the sequence is compared to an expected output, and all the other hidden states \((\mathbf {h_0}, \mathbf {h_1}, \cdots , \mathbf {h_{n-1}})\) are ignored for the loss.
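A sequence-classification setup of this kind might be sketched as follows (assuming PyTorch's nn.LSTM, which performs the unrolling internally; the sizes, the two classes, and the random data are arbitrary choices for illustration):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=10, hidden_size=20, batch_first=True)
classifier = nn.Linear(20, 2)                # e.g., two sentiment classes
loss_fn = nn.CrossEntropyLoss()

batch = torch.randn(4, 7, 10)                # 4 sequences of length 7
labels = torch.tensor([0, 1, 1, 0])

outputs, (h_n, c_n) = lstm(batch)            # outputs holds all h_t; h_n is the last hidden state
logits = classifier(h_n[-1])                 # only the last hidden state is used for the loss
loss = loss_fn(logits, labels)
loss.backward()                              # gradients for the LSTM and classifier weights
```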
LSTMs and RNNs, in general, have three advantages over feed-forward networks: (1) They can operate over sequences of arbitrary length because the unrolling can be done dynamically. Imagine we want to classify sentences: We can interpret the sentence as a sequence of symbols that we can feed to an LSTM and use the final output of the LSTM to classify some property of the sentence (e.g., its sentiment). Since we can unroll the LSTM to any length we want, we are not restricted by the length of the sentence. At least not in theory; in practice, using an LSTM for much shorter or longer sequences than it was trained on might lead to diminished performance. (2) The fact that the same neural network processes each element of the sequence means that the LSTM can generalize across positions in the sequence (like a CNN can generalize across positions on the two dimensions of an image). For example, if we have to put different panels from a Raven’s Progressive Matrix (RPM) test (see
Figure 6) into relation to each other, it is intuitive that features extracted for the upper left panel are probably also going to be helpful for the lower right panel and so on. (3) Through the structure of a sequence, we implicitly model that all elements of the sequence are closely related to each other (e.g., all symbols of a sentence, or all panels from an RPM in our case) and that most of the relevant information can be inferred by putting them in relation to each other (e.g., the individual symbols in a sentence only really become informative once they are seen as words, etc.), which is helpful if we want to learn relational concepts. One problem with LSTMs when modeling relational concepts is that the entities to be put into relation with each other already have to be separated to feed them into the LSTM as a sequence. This splitting does work for many synthetic datasets, but for real images, the entities first have to be separated, which requires some form of attention, supporting Hypothesis 1.
Relation networks (RNs; see
Figure 4), introduced by
Santoro et al. (2017), are based on the principle of applying a neural network
\(g_\theta\) to all possible “object” pairings to detect relationships between them. The big advantage of this is that the application of
\(g_\theta\) to the object pairs can be done iteratively. Therefore, the network size does not increase with the number of objects to be compared, similar to how the size of an LSTM does not increase with the length of the sequence to be processed. Objects, in this case, are simply features for which a relationship should be detected. The output of
\(g_\theta\) for all pairs is added to integrate the information of possible relationships between all object pairs, and the result is sent through an additional neural network
\(f_\phi\) to produce a final classification. This network architecture was able to achieve superhuman performance on the Compositional Language and Elementary Visual Reasoning (CLEVR) dataset by
Johnson et al. (2017), which consists of rendered scenes containing different simple objects of varying sizes, colors, and materials (see
Figure 5). The dataset also includes written questions that, in part, require relational reasoning to be solved (e.g., “Are there any rubber things that have the same size as the blue metallic sphere?”).
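The core computation of an RN can be sketched as follows (a simplified PyTorch sketch; the question embedding that Santoro et al. (2017) concatenate to each object pair is omitted, and all layer sizes are arbitrary):

```python
import torch
import torch.nn as nn
from itertools import combinations

obj_dim, hidden = 16, 64

# g_theta: applied to every object pair to detect a possible relationship.
g_theta = nn.Sequential(nn.Linear(2 * obj_dim, hidden), nn.ReLU(),
                        nn.Linear(hidden, hidden), nn.ReLU())

# f_phi: produces the final classification from the aggregated relation features.
f_phi = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                      nn.Linear(hidden, 10))

objects = [torch.randn(obj_dim) for _ in range(8)]  # 8 "object" feature vectors

# Evaluate g_theta for all (8 choose 2) = 28 pairs and sum the outputs.
relations = sum(g_theta(torch.cat([o_i, o_j]))
                for o_i, o_j in combinations(objects, 2))

logits = f_phi(relations)                           # final classification
```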
In our opinion, the RN architecture has two main bottlenecks: First, given
\(n\) objects to be compared,
\(g_\theta\) has to be evaluated
\(\binom{n}{2}\) times, so the number of evaluations of
\(g_\theta\) grows as \(O(n^2)\). If relationships between more than two objects should be handled, the number of needed evaluations proliferates. For relationships between
\(r\) objects, the network
\(g_\theta\) has to be evaluated
\(\binom{n}{r}\) times, so the number of evaluations grows as \(O(n^r)\). Therefore, this approach is only practical if the number of “objects” can be kept relatively small. Without an attention mechanism,
Santoro et al. (2017) were not able to directly extract features of objects because there was no information about what part of an image is an object. This is the same problem we already mentioned for LSTMs and supports Hypothesis 1, which states that some form of attention is an essential component of a system for learning relational concepts. The authors decided to extract features from all positions on a grid over the whole image and handle each position as an object. Since this method lacks attention, the number of “objects” to be compared grows quadratically with the image’s resolution. Also, this increase in object pairs results in more and more relation features that have to be integrated, increasing the likelihood that irrelevant relationships between other object pairs wash out helpful information. Second, given two object features, the network
\(g_\theta\) has to recognize the relationship from the information contained in those features alone. If the relationship to be detected is “similarity,” the representations have to contain all the information needed to reconstruct the objects from them. With more complex objects, these features will become very complex, and a large amount of information must be passed along to
\(f_\phi\). This bottleneck could be circumvented by iterative processing since the comparison could be made in multiple iterations, and in each iteration, only a tiny part of the whole information from both entities has to be compared.
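Returning to the first bottleneck, a short calculation (plain Python; the 8 × 8 grid is a hypothetical but typical CNN feature-map size) makes the growth in the number of \(g_\theta\) evaluations concrete:

```python
from math import comb

# Treat every cell of an 8x8 feature grid as one "object".
n = 8 * 8                # 64 objects
print(comb(n, 2))        # 2016 evaluations of g_theta for pairwise relations
print(comb(n, 3))        # 41664 evaluations for relations between three objects

# Doubling the grid resolution quadruples the number of objects ...
n = 16 * 16              # 256 objects
print(comb(n, 2))        # ... and yields 32640 pairs
```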
Although the results of RNs on the CLEVR dataset seem quite promising, the actual variance encoded in a scene is surprisingly small. There are only 96 different combinations of shape, size, material, and color. In essence, this means an object in the CLEVR dataset contains fewer than 7 bits of relevant information (\(\log_2 96 \approx 6.6\)). Some form of positional information, putting the objects in spatial relation to each other, is also needed to solve some of the questions (e.g., “left of,” “behind”) contained in the dataset. Still, this will likely not increase the amount of information needed to encode a complete scene by a considerable amount.
Therefore, it is not clear how well the results of RNs on the CLEVR dataset transfer to real-world tasks. Results with different datasets, which will be presented over the rest of this article, indicate that the performance of RNs decreases for more complex datasets.