Abstract

Feature-product networks (FP-nets) are inspired by end-stopped cortical cells and contain FP-units that multiply the outputs of two filters. We enhance state-of-the-art deep networks, such as the ResNet and the MobileNet, with FP-units and show that the resulting FP-nets perform better on the Cifar-10 and ImageNet benchmarks. Moreover, we analyze the hyperselectivity of the FP-net model neurons and show that this property makes FP-nets less sensitive to adversarial attacks and JPEG artifacts. We then show that the learned model neurons are end-stopped to different degrees and that they provide sparse representations with an entropy that decreases with hyperselectivity.

Introduction

For machine learning to work, one needs appropriate biases to constrain the solution for the problem at hand. Deep convolutional neural networks (CNNs), for example, are successful due to two constraints that specialize them relative to more general networks such as the multilayer perceptron (MLP): sparse connections and shared weights. It is well known that biases cannot be learned from the data or derived by logical deduction (Watanabe, 1985). In computer vision, appropriate biases can be obtained, as in the case of the CNNs, by studying biological vision (LeCun et al., 2015; Majaj & Pelli, 2018). Besides inspiring the use of localized (oriented) filters (the two CNN biases above) followed by a pointwise nonlinearity, biological vision can provide additional insight, an issue that currently receives somewhat limited attention in the deep-learning community (Majaj & Pelli, 2018; Paiton et al., 2020).

We here focus on the principle of efficient coding (Barlow, 1961; Simoncelli & Olshausen, 2001) and the related neural phenomenon of end-stopping (Hubel & Wiesel, 1965). Statistical analysis shows that oriented linear filters reduce the entropy of natural images by encoding oriented straight patterns (one-dimensional [1*D*] regions) such as vertical and horizontal edges (Zetzsche et al., 1993). In cortical area V2, however, the majority of cells are end-stopped to different degrees (Hubel & Wiesel, 1965). End-stopped cells are thought to detect two-dimensional (2*D*) regions such as junctions and corners. Since 2*D* regions are unique and sparse in natural images (Barth & Watson, 2000; Mota & Barth, 2000; Zetzsche et al., 1993), they represent images efficiently, that is, with a high degree of sparseness and minimal information loss. A standard way of modeling end-stopped cells is to multiply the outputs of orientation-selective cells, resulting in an AND-combination of simple-cell outputs (Zetzsche & Barth, 1990). For example, a corner can be detected by the logical combination of “horizontal edge AND vertical edge.” In Paiton et al. (2020), the authors argue convincingly that principles adopted from vision should be beneficial for deep networks and that the exploitation of multiplicative interactions between neurons has not been sufficiently explored in this specific context. There is, nevertheless, a vast literature on sigma-pi networks in general (e.g., Mel & Koch, 1990; Rumelhart et al., 1986), which is not surprising since such networks define a large class of possible systems. It has been shown that end-stopping can emerge from the principle of predictive coding based on recursive connections (Rao & Ballard, 1999); the latter has also been observed in Barth and Zetzsche (1998). Note that in Rao and Ballard (1999), end-stopping emerges from unsupervised learning with natural images, whereas in our case it emerges from task-driven supervised learning in a natural vision task.

Feature-product networks (FP-nets) implement a network architecture that contains explicit multiplications of the feature maps obtained with pairs of linear filters. The main feature of these networks is that they learn the appropriate filter pairs to be multiplied based on the task at hand. An early FP-net architecture has been presented as a preprint (Grüning et al., 2020b), and it has been shown in Grüning and Barth (2021a) that a similar network can predict subjective image quality well. Of course, we do not assume that neurons compute ideal multiplications; the AND terms could be created in alternative ways, for example, by using logarithms (Grüning et al., 2020b) or the minimum operation (Grüning & Barth, 2021a) instead of multiplications. AND terms could also be generated by traditional CNNs with linear filters followed by simple ReLU nonlinearities (Barth & Zetzsche, 1998), but this would require larger networks and would be limited in terms of the possible tuning properties of the resulting nonlinear functions (see also Paiton et al., 2020, regarding the limits of pointwise nonlinearities). Here, we present a novel FP-net architecture that is closer to vision models than the ones introduced previously in Grüning and Barth (2021b) and Grüning et al. (2020b). We first demonstrate its performance and then analyze the learned units by relating them to biological vision.

Regarding the use of multiplicative terms in CNNs, Zoumpourlis et al. (2017) have shown that quadratic forms added to the first layer of a CNN can improve generalization. An FP-net can be interpreted as a special case of a network with an additional second-order Volterra kernel, but it has far fewer parameters. However, CNNs are also special cases of MLPs, and, as we have argued above, the challenge is to find the right biases that take us from the general to the more special case. For more comprehensive overviews of how FP-nets relate to various deep-network architectures, especially to bilinear CNNs (Li et al., 2017), see Grüning et al. (2020a) and Grüning and Barth (2021b). In addition, we would like to mention recent work by Chrysos et al. (2020), which illustrates that the Hadamard product of layers in deep networks and the resulting higher-order polynomial representation can improve classification performance. Finally, in recurrent networks, multiplications are used to implement useful gating mechanisms (Collins et al., 2016).

FP-nets as competitive deep networks

With FP-nets, we denote a deep-network architecture that contains one or several FP-blocks. A *block* of a deep network implements a sequence of layers and operations that transforms an input tensor \(\mathbf {T}_{0} \in \mathbb {R}^{h \times w \times d_{in}}\) into an output tensor \(\mathbf {T}_{out} \in \mathbb {R}^{\frac{h}{s} \times \frac{w}{s} \times d_{out}}\). A tensor consists of a number (e.g., \(d_{in}\), \(d_{out}\)) of feature maps, each with spatial width \(w\) and height \(h\) that may be altered by a factor \(s\). The typical input tensor for a CNN is an image, the three color channels being the feature maps. The sequence of operations in an FP-block is shown in Figure 1 and consists of three steps: (a) a first linear combination, (b) the feature product, and (c) a second linear combination.

In the first step, the \(d_{in}\) feature maps of the input tensor \(\mathbf {T}_{0}\) are linearly combined, followed by a ReLU, to yield the tensor \(\mathbf {T}_{1}\) with \(q d_{out}\) feature maps:

\begin{eqnarray}
&&\mathbf {T}_{1}[i,j,m] = ReLU\left(\sum _{n=1}^{d_{in}} w_{m}^{n} \mathbf {T}_{0}[i,j,n]\right);\nonumber\\
&& m= 1,..., q d_{out}.
\end{eqnarray}

(1)

Here, \(\mathbf {T}_{1}[i,j,m]\) is the value of \(\mathbf {T}_{1}\) at pixel position \((i, j)\) and feature map \(m\); \(w_{m}^{n}\) are learned weights, and \(q\) is an expansion factor that controls the block size. By \(\mathbf {T}_{1}^{m} \in \mathbb {R}^{h \times w}\), we denote the \(m\)th feature map of \(\mathbf {T}_{1}\).

The second step is the computation of feature products, the centerpiece of the FP-block. Each feature map \(\mathbf {T}_{1}^{m}, m=1,..., q d_{out}\), is convolved with two learned filters \(\mathbf {V}^{m}, \mathbf {G}^{m} \in \mathbb {R}^{k \times k}\). Filtering is followed by instance normalization (IN) (Ulyanov et al., 2016) and a ReLU nonlinearity, yielding two new feature maps. Subsequently, the product of the two filter outputs is computed. For any particular image patch \(\mathbf {X} \in \mathbb {R}^{k \times k}\) of a particular feature map \(\mathbf {T}_{1}^{m}\), with center pixel \((i,j)\), the filter operation for the vectorized image patch \(\mathbf {x}= vect(\mathbf {X})\in \mathbb {R}^{k^{2}}\) is the scalar product with the vectorized filters \(\mathbf {v} = vect(\mathbf {V}^{m})\) and \(\mathbf {g} = vect(\mathbf {G}^{m})\):

\begin{eqnarray}
\mathbf {T}_{2}[i,j,m] = \frac{1}{\sigma _{v}\sigma _{g}}\, ReLU(\mathbf {x}^{T}\mathbf {v} - \mu _{v})\, ReLU(\mathbf {g}^{T}\mathbf {x} - \mu _{g}).
\end{eqnarray}

(2)

\(\mathbf {T}_{2}\in \mathbb {R}^{\frac{h}{s} \times \frac{w}{s} \times q d_{out}}\) is the resulting tensor, and \(s\) is the stride of the filter operation. If \(s\) is greater than 1, \(\mathbf {T}_{2}\)’s width and height are subsampled. \(\mu\) and \(\sigma\) are the mean value and standard deviation of \(\mathbf {T}_{1}^{m}\) after convolution with either \(\mathbf {V}^{m}\) or \(\mathbf {G}^{m}\):

\begin{equation}
\mu _{v} = \frac{s^{2}}{hw} \sum _{i, j}{(\mathbf {T}_{1}^{m} * \mathbf {V}^{m})[i, j]},
\end{equation}

(3)

\begin{equation}
\sigma _{v}^{2} = \frac{s^{2}}{hw} \sum _{i, j}{(\mathbf {T}_{1}^{m} * \mathbf {V}^{m} - \mu _{v})^{2}[i, j]},
\end{equation}

(4)

with \((\mathbf {T}_{1}^{m} * \mathbf {V}^{m})[i, j]\) being the \((i,j)\)th pixel of the filter result.

In the third step, a second linear combination transforms \(\mathbf {T}_{2} \in \mathbb {R}^{ \frac{h}{s} \times \frac{w}{s} \times qd_{out}}\) into \(\mathbf {T}_{3} \in \mathbb {R}^{\frac{h}{s} \times \frac{w}{s} \times d_{out}}\). To comply with the baseline architectures ResNet and MobileNet, a residual connection defines the final output:

\begin{equation}
\mathbf {T}_{out} = \mathbf {T}_{0} + \mathbf {T}_{3}.
\end{equation}

(5)
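To make Equation 2 concrete, here is a minimal pure-Python sketch of a single FP-unit applied to one patch. The function name and the convention of passing the instance-normalization statistics (Equations 3 and 4) as precomputed arguments are our illustrative choices, not part of the paper's implementation:

```python
def relu(a):
    return max(a, 0.0)

def fp_unit(patch, v, g, mu_v, sigma_v, mu_g, sigma_g):
    """FP-unit response for one vectorized k x k patch (Equation 2, sketch).

    `patch`, `v`, and `g` are flat lists of length k*k; mu_* and sigma_* are
    the instance-normalization statistics of Equations 3 and 4, assumed to
    have been precomputed over the full feature map.
    """
    xv = sum(x * w for x, w in zip(patch, v))   # x^T v
    gx = sum(x * w for x, w in zip(patch, g))   # g^T x
    # multiplicative AND-combination of the two normalized filter responses
    return relu(xv - mu_v) * relu(gx - mu_g) / (sigma_v * sigma_g)
```

For example, a constant 3 × 3 patch of ones with two all-ones filters and trivial statistics (μ = 0, σ = 1) yields 9 · 9 = 81, whereas flipping the sign of one filter silences the product entirely.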

Figure 1.

Using the above FP-block, we designed four different FP-nets based on different baseline architectures: an FP-net based on (a) the original ResNet and (b) the PyrBlockNet, both trained on Cifar-10, and on (c) a ResNet-50 and (d) a MobileNet-V2, both trained on ImageNet.

A *stack* is a larger segment of the network, consisting of several *blocks*. Except for the first stack, which may have a stride of 1, each new stack starts with a block with a stride of 2 that reduces the size of each feature map. Within a stack, all blocks operate on feature maps of the same size. Different network architectures may have different numbers and types of blocks. In our case, basic blocks, pyramid blocks, bottleneck blocks, and inverted residual blocks define the ResNet-Cifar, PyrBlockNet, ResNet-50, and MobileNet-V2 architectures, respectively. The block is the core module of an architecture and contains several *layers*. Layers are the smallest network building units, such as convolution layers and max-pooling layers. Figure 2 shows an example of a ResNet-Cifar architecture that has three stacks with five blocks each. The first block of the second and third stacks contains a convolution layer with stride \(s=2\) that downsamples the input. The two other architectures that we used are similar: The ResNet-50 has four stacks with varying numbers of bottleneck blocks, and the MobileNet-V2 has six stacks consisting of inverted-residual blocks.

Figure 2.

We transform the four baseline architectures defined above into FP-nets using a simple design rule: Substitute each stack’s first block with an FP-block. The input and output dimensions of the block are kept equal; only the internal operations differ.
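The design rule itself fits in a few lines. The following sketch treats an architecture simply as a list of stacks, each a list of block names; this representation is ours, for illustration only:

```python
def to_fp_net(stacks):
    """Apply the FP-net design rule (sketch): substitute each stack's first
    block with an FP-block; all remaining blocks are kept unchanged."""
    return [["FPBlock"] + stack[1:] for stack in stacks]

# a ResNet-20-like network: three stacks of three basic blocks each
resnet20 = [["Basic"] * 3 for _ in range(3)]
fp_net = to_fp_net(resnet20)
```

The input and output dimensions of each replaced block stay the same, so the rest of the network is untouched.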

We developed this design rule to improve upon already well-established architectures, which makes FP-nets practical, since only a few changes are needed to create an FP-net. To be compatible with state-of-the-art architectures, the FP-block has a structure similar to the MobileNet-V2 block (Sandler et al., 2018). We found that combinations of convolution blocks and FP-blocks work best and that larger kernel sizes do not improve performance. One way to view a stack is that it constitutes a visual processing chain for a specific image scale. One would expect end-stopping to be more useful at the beginning of this chain. Thus, we replaced the first block of each stack. Note, however, that later stacks, for example, the second and third stacks in the Cifar-10 networks, already work with highly processed inputs coming from the previous stacks. Therefore, one would expect a lower necessity of extracting 2*D* regions in later stacks. Indeed, we will show, when analyzing the \(\gamma\) values of FP-blocks, that highly selective neurons are more common in earlier stacks.

We train and test several FP-nets on the two well-known benchmarks Cifar-10 (Krizhevsky et al., 2021) and ImageNet (Deng et al., 2009).

Due to the moderate size of the data set, Cifar-10 is often used to evaluate the potential of new architectures and designs. For our experiments on this data set, we used ResNets (He et al., 2016) as baselines; see Figure 2 for an example. These networks have three stacks, each consisting of \(N\) blocks. We evaluated two variants each of the ResNet-20, ResNet-32, ResNet-44, and ResNet-56, with \(N = 3\), 5, 7, and 9 blocks, respectively (the numbers after the names indicate the number of convolution or linear layers). Since the first publication of the ResNet architecture, several additional blocks have been proposed; see Han et al. (2017) for an overview. As the two baselines on Cifar-10, we used the original ResNet and a variant using the pyramid block, which we denote PyrBlockNet. For both variants, we created FP-nets by replacing baseline blocks with FP-blocks according to our design rule. We used the same number of blocks, but note that an FP-block contains one additional convolution layer. The FP-net-23, FP-net-35, FP-net-47, and FP-net-59 are based on the PyrBlockNet: Each stack’s first block is an FP-block, and all other blocks are pyramid blocks. Analogously, *FP-net (basic)* denotes an FP-net based on the original ResNet: Each stack’s first block is an FP-block, and the remaining blocks are basic blocks.

Next, we evaluated the performance of FP-nets on the larger ImageNet data set, which contains over 1.2 million training examples and 50,000 validation examples (we tested on the publicly available validation set). With an input size of at least \(224 \times 224\) pixels and 1,000 classes, ImageNet poses a greater challenge than Cifar-10. We compared the ResNet-50 to two FP-net-50 variants: a smaller net with an expansion factor \(q=0.8\) and a slightly larger network with \(q=1\). In both cases, for each of the four stacks of the ResNet-50, the first block was replaced by an FP-block to obtain the FP-net-50. Note that, if not explicitly mentioned, the term FP-net-50 refers to the \(q=1\) variant.

To further validate our approach, we evaluated an FP-net based on the popular MobileNet-V2 architecture. As with the ResNet, we replaced the first block of each stack with an FP-block, using \(q=3\).

The results of the Cifar-10 experiments are shown in Figure 3: The left side compares the original ResNet to the FP-net (basic), and the right side compares the PyrBlockNet to the FP-net. Each point of the two curves shows the best test error over all training epochs, averaged over five runs, for one particular network (i.e., one particular number of blocks). The black line shows the baseline network, the green line the FP-net that results from substituting the first block of each of the baseline’s stacks. The \(x\)-axis displays the number of parameters, which increases with the number of blocks. Note, however, that the inclusion of FP-blocks reduces the number of parameters. Overall, the FP-nets are more compact and perform better, with a lower test error and only a small overlap in the standard deviations.

Figure 3.

Table 1 shows the results on ImageNet. Note that the FP-net (\(q=1\)) performs better than the baseline ResNet-50, with the validation error reduced by almost 0.4 percentage points. For the already compact MobileNet architecture, the FP-net again performs better, with the error decreased by 0.2 percentage points. We trained the MobileNet-V2 baseline network ourselves to obtain its validation error; for the ResNet-50, we report the value from the Tensorpack repository (Wu, 2016). The performance as a function of the number of parameters for the ResNet and the FP-variants is illustrated in Figure 4.

Figure 4.

Table 1.

FP-nets and visual coding

Hyperselectivity of FP-units

Vilankar and Field (2017) used the term *hyperselectivity* to quantify how strongly a neuron is tuned to its optimal stimulus, that is, how quickly the response drops when the optimal stimulus changes. In the context of deep learning, hyperselectivity is relevant because it can increase robustness, for example, against adversarial attacks (Paiton et al., 2020). One way to quantify hyperselectivity is to measure the *curvature* of *iso-response contours*. Given an \(n\)-dimensional input to a function \(f\), an \((n-1)\)-dimensional surface may exist such that for all points \(\mathbf {s}\) on the surface, the output \(f(\mathbf {s})\) is a constant. Since \(n\) can be large, 2*D* projections are used to analyze such iso-surfaces, which in two dimensions become iso-response contours \(\mathbf {s} = \phi (t), t \in \mathbb {R}\).

The typical linear-nonlinear (LN) model neuron used in CNNs is a function \(f_{LN}(\mathbf {x})\) that involves a linear projection onto a weight vector \(\mathbf {w} \in \mathbb {R}^{n}\) followed by a pointwise nonlinearity \(\rho (x)\). To analyze the iso-response contour of such a neuron, one first projects the input onto \(\mathbf {w}\), the axis corresponding to the optimal stimulus \(\mathbf {x}_{opt}\). To find a second axis, one searches for a vector orthogonal to \(\mathbf {x}_{opt}\), for example, by picking \(n\) random values and using the Gram–Schmidt process (see Equation 16) to transform the random vector into one that is orthogonal to \(\mathbf {x}_{opt}\). When looking at the output of an LN-neuron for \(\mathbf {x}_{opt}\) perturbed by any orthogonal vector \(\mathbf {z}\) with \( \mathbf {x}_{opt}^{T} \mathbf {z} = \mathbf {w}^{T} \mathbf {z} = 0\), the iso-response contour is always a straight line parallel to \(\mathbf {z}\), because \(f_{LN}(\mathbf {x}_{opt} + \mathbf {z}) = \rho (\mathbf {w}^{T}(\mathbf {x}_{opt} + \mathbf {z})) = \rho (\mathbf {w}^{T} \mathbf {x}_{opt}) = f_{LN}(\mathbf {x}_{opt})\). Thus, for LN-neurons, the iso-response contours have zero curvature.

For hyperselective neurons \(f_{HS}(\mathbf {x})\), there exist vectors \(\mathbf {z}\) that are orthogonal to \(\mathbf {x}_{opt}\) and decrease the neuron’s optimal response such that \(f_{HS}(\mathbf {x}_{opt} + \mathbf {z}) \lt f_{HS}(\mathbf {x}_{opt})\). In this case, the exo-origin iso-response contour bends away from the origin of the basis defined by \(\mathbf {x}_{opt}\) and \(\mathbf {z}\). A higher curvature of this bend indicates a stronger activation dropoff in regions that differ from the optimal stimulus (i.e., a greater hyperselectivity). One way to quantify the curvature is to use the coefficient of the quadratic term obtained by fitting a second-order polynomial to the iso-response contour.

FP-nets contain FP-blocks that consist of FP-units, or *FP-neurons*, which yield the feature-product output for a pixel \((i,j)\) in a feature map \(m\) as defined by Equation 2. As shown in the Appendix, FP-neurons exhibit curved exo-origin iso-response contours with a curvature that depends on the angle \(\gamma = \measuredangle (\mathbf {v}, \mathbf {g})\). Iso-response contours are shown in Figure 5 for different values of \(\gamma\). Note that curvature, and thus hyperselectivity, increases with \(\gamma\). Accordingly, a large \(\gamma\) leads to a lower entropy of the resulting feature maps; see Figure 6.

Figure 5.

Figure 6.
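The difference between the two neuron types is easy to check numerically. The following toy example uses hand-picked 2D filters (our choice, with γ = 90°) and shows that a perturbation orthogonal to the optimal stimulus leaves an LN-neuron's response unchanged but reduces the response of a multiplicative neuron:

```python
def relu(a):
    return max(a, 0.0)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def ln_neuron(x, w):
    """Linear-nonlinear model neuron: pointwise nonlinearity on a projection."""
    return relu(dot(w, x))

def fp_neuron(x, v, g):
    """Simplified multiplicative (FP) model neuron: AND of two projections."""
    return relu(dot(v, x)) * relu(dot(g, x))

v, g = [1.0, 0.0], [0.0, 1.0]      # filter pair with gamma = 90 degrees
x_opt = [0.7071, 0.7071]           # optimal stimulus: the bisector of v and g
z = [-0.7071, 0.7071]              # perturbation orthogonal to x_opt
x_pert = [a + b for a, b in zip(x_opt, z)]
```

With w = x_opt, the LN response to x_pert equals the response to x_opt (zero curvature along z), whereas the FP response drops, here all the way to zero, because the perturbation silences one of the two AND-combined filters.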

Entropy and degree of end-stopping

To further support the view that FP-neurons are hyperselective depending on \(\gamma\), we analyzed the entropy of the feature maps generated by different FP-neurons. The results in Figure 6 show that the learned filters tend to have a \(\gamma\) larger than zero (i.e., the majority of FP-neurons are hyperselective) and that a high \(\gamma\)-value leads to a lower entropy. Details of how the entropy is computed are given in the Appendix.
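As a rough sketch of such a measure (the paper's exact procedure is described in its Appendix; the simple equal-width binning here is our simplification), the Shannon entropy of a feature map's value histogram can be computed as:

```python
import math
from collections import Counter

def feature_map_entropy(values, bins=16):
    """Shannon entropy (in bits) of the value histogram of a feature map,
    given as a flat list of activations; lower entropy means a sparser,
    more concentrated distribution of values."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0          # avoid zero width for flat maps
    counts = Counter(min(int((x - lo) / width), bins - 1) for x in values)
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts.values())
```

A constant map has entropy 0, while a map whose values spread uniformly over all bins attains the maximum of log2(bins) bits.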

In order to analyze the end-stopping behavior of the model neurons learned in the FP-nets trained on Cifar-10 and ImageNet, we needed to quantify the degree of end-stopping. To relate to physiological measurements, we started by analyzing the response of FP-neurons to straight lines and line ends, but this turned out to be problematic because the FP-nets use small \(3 \times 3\) filters and subsample the input. To keep the analogy, but with a more robust measure, we used a square as input and quantified the average responses to the uniform zero-dimensional (0*D*) regions, the straight 1*D* edges, and the 2*D* corners. The degree of end-stopping is then defined by the relation between the 1*D* and 2*D* responses. To account for ON/OFF-type responses, we used both a bright and a dark square. The results are shown in Figure 7, and the details of the algorithm are given in the Appendix.

Figure 7.
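The logic of this measurement can be illustrated with a hand-built filter pair (our construction, not a learned one): a horizontal-edge filter AND-combined with a vertical-edge filter responds at the square's corners (2*D*) but not on its straight edges (1*D*) or in uniform regions (0*D*):

```python
def relu(a):
    return max(a, 0.0)

def conv_at(img, kern, r, c):
    """3x3 correlation of `img` with `kern`, centered at pixel (r, c)."""
    return sum(img[r + dr][c + dc] * kern[dr + 1][dc + 1]
               for dr in (-1, 0, 1) for dc in (-1, 0, 1))

# bright 8x8 square on a dark 16x16 background
img = [[1.0 if 4 <= r < 12 and 4 <= c < 12 else 0.0 for c in range(16)]
       for r in range(16)]

V = [[-1, -1, -1], [0, 0, 0], [1, 1, 1]]   # horizontal-edge filter
G = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]   # vertical-edge filter

def fp_response(r, c):
    """'Horizontal edge AND vertical edge', i.e., a corner detector."""
    return relu(conv_at(img, V, r, c)) * relu(conv_at(img, G, r, c))
```

Probing the top-left corner (4, 4), the middle of the top edge (4, 8), and the uniform interior (8, 8) gives a positive response only at the corner.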

Note that, like the real neurons in cortical areas V1 and V2, the model neurons in the FP-net are end-stopped to different degrees. Thus, end-stopping seems to be beneficial for both the ImageNet and Cifar-10 tasks, since the emergence of end-stopping is here driven solely by the classification error. As expected, the multiplication in the FP-block shifts the distribution toward a higher degree of end-stopping. However, the network could have learned filter pairs that do not lead to end-stopped FP-neurons. The bias that we introduce (i.e., the multiplication) just makes it easier for the network to learn end-stopped representations.

The angle distributions in Figure 8 show that linear FP-neurons are indeed learned as well, since more than 15% of FP-neurons have a \(\gamma\)-value near zero. With increasing network depth, the number of linear FP-neurons increases, indicating that hyperselectivity, and especially end-stopping, is more frequent in earlier stages of the visual processing chain.

Figure 8.

FP-neurons are more robust against adversarial attacks

Although outperforming almost all alternative approaches on many vision tasks, CNNs are surprisingly sensitive to barely visible perturbations of the input images (Szegedy et al., 2013). An adversarial attack on a classifier function \(f\) adds a noise pattern \(\mathbf {\eta }\) to an input image \(\mathbf {x}\) so that \(f(\mathbf {x} + \mathbf {\eta })\) does not return the correct class \(y=f(\mathbf {x})\). Furthermore, the attacker ensures that some \(p\)-norm of \(\mathbf {\eta }\) does not exceed \(\epsilon\). In many cases, including this work, the infinity-norm is chosen, and the \(\epsilon\) values are in the set \(\lbrace {1}/{255}, {2}/{255}, ...\rbrace\). Thus, for example, for \(\epsilon = {1}/{255}\), each 8-bit pixel value is altered by at most adding or subtracting the value 1. Goodfellow et al. (2014) argue that the main reason for the sensitivity to adversarial examples is the linearity of CNNs: With a high-dimensional input, one can substantially change a linear neuron’s output even with small perturbations. Consider the output of an LN-neuron for an input \(\mathbf {x}\) with dimension \(n\) perturbed by \(\mathbf {\eta }\). We choose \(\mathbf {\eta }\) to be the sign function of the weight vector multiplied by \(\epsilon\): \(\mathbf {\eta }=sign(\mathbf {w}) \cdot \epsilon\). Thus, \(\mathbf {\eta }\) roughly points in the direction of the optimal stimulus (which is also the gradient), but its infinity-norm does not exceed \(\epsilon\). Assuming that the mean absolute value of \(\mathbf {w}\) is \(m\), \(f_{LN}(\mathbf {\eta })\) is approximately equal to \(\epsilon n m\). Accordingly, a significant change of the LN-neuron’s output can be achieved with a small \(\epsilon\) value if the input dimension \(n\) is large, which is the case for many vision-related tasks. This gradient-ascent method can also be applied to nonlinear neurons.
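The \(\epsilon n m\) argument is easy to verify numerically. The following check uses random Gaussian weights (purely illustrative) and shows that the sign perturbation shifts the linear response by exactly \(\epsilon\) times the sum of the absolute weights, which is already large for a Cifar-10-sized input:

```python
import random

random.seed(0)
n = 3072                                         # dimension of a 32x32x3 input
w = [random.gauss(0.0, 1.0) for _ in range(n)]   # illustrative LN weights
eps = 1 / 255                                    # barely visible perturbation

# adversarial direction: sign of the weights, scaled to the eps ball
eta = [eps if wi >= 0 else -eps for wi in w]

response = sum(wi * ei for wi, ei in zip(w, eta))  # w^T eta = eps * sum(|w_i|)
m = sum(abs(wi) for wi in w) / n                   # mean absolute weight
# response equals eps * n * m: the output shift grows linearly with n
```

Even though no single pixel changes by more than 1/255, the projection onto the weights accumulates thousands of small contributions of consistent sign.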
Within a local region, the output of almost any function \(f\) can be approximated by a linear function. To optimally increase the output, the input needs to be moved along the gradient direction. The fast gradient sign method (FGSM; Goodfellow et al., 2014) perturbs the original input image \(\mathbf {x}\) by adding \(\mathbf {\eta }=\epsilon \, sign(\nabla f(\mathbf {x}))\). Another approach is to define \(\mathbf {\eta }\) to be the gradient times a positive step size \(\tau\), followed by clipping to \(\mathbf {\eta } \in [-\epsilon , +\epsilon ]^{n}\). The clipped iterative gradient ascent (CIGA) greedily moves along the direction of the highest linear increase:

\begin{eqnarray}
\begin{array}{@{}r@{\;}c@{\;}l@{}}
\mathbf {\eta }_{0} &=& \mathbf {0}; \quad \tau \gt 0\\
\mathbf {q}_{i+1} &=& \mathbf {\eta }_{i} + \tau \nabla f(\mathbf {x} + \mathbf {\eta }_{i})\\
\mathbf {\eta }_{i+1}^{j} &=& \min (\max (q_{i+1}^{j}, -\epsilon ), \epsilon ),
\end{array}
\end{eqnarray}

(6)

with \(q_{i}^{j}\) being the *j*th entry of the unbounded result \(\mathbf {q}_{i}\) at the *i*th iteration step. In the following, we use CIGA in our illustrations of the principle; in our experiments, we employ FGSM, as it is a widely recognized adversarial attack method.

When regarding an iso-response contour plot, one can easily spot the direction of the gradient, which is orthogonal to an iso-response contour (Paiton et al., 2020). In Figure 9 on the left, the gradient for an LN-neuron is parallel to the optimal stimulus (black line). As long as the initial input yields a nonzero gradient, each step of CIGA maximally increases the LN-neuron output. Thus, the algorithm’s effectiveness is only bounded by \(\epsilon\) but widely independent of the initial input \(\mathbf {x}\). For a step size larger than \(\epsilon\), CIGA finds the optimal solution in one step.

We now investigate the effects of CIGA on a simplified version of an FP-neuron:

\begin{equation}
F(\mathbf {x}) = \mathbf {x}^T \mathbf {v}\, \mathbf {g}^T \mathbf {x}.
\end{equation}

(7)

Note that in this particular example, the input is chosen to yield nonnegative projections on \(\mathbf {v}\) and \(\mathbf {g}\); thus, we can remove the ReLUs. The resulting gradient is

\begin{equation}
\nabla F(\mathbf {x}) = (\mathbf {v}^{T}\mathbf {x})\, \mathbf {g} + (\mathbf {g}^{T}\mathbf {x})\, \mathbf {v}.
\end{equation}

(8)

The effectiveness of an iteration step strongly depends on the current position. The highest possible increase would be obtained along the line defined by the optimal stimulus; in Figure 9 on the right, this is the black line. If the initial input is located on this line, any step in the gradient direction yields an optimal increase of the FP-neuron output. However, for any other position with a nonzero gradient, an unbounded iteration step would move toward the optimal-stimulus line. The blue curve in Figure 9 shows the path for several iterations of CIGA: Starting above the optimal-stimulus line, each step slowly converges toward this line, eventually moving almost parallel to it. Once the \(\epsilon\) threshold of 1 is reached in the horizontal dimension, the (now bounded) path runs parallel to the vertical dimension to increase the neuron output further. The optimal solution is found once the \(\epsilon\) bound is also reached in the vertical dimension.

The important difference when comparing with LN-neurons is that there are numerous conditions (depending on \(\tau\), \(\mathbf {x}\), \(\gamma\), and \(\epsilon\)) under which CIGA would need several steps to find an optimal solution. This reduced effectiveness of the gradient ascent illustrates why hyperselective neurons are more robust against adversarial attacks; for example, if \(\epsilon\) is too small, \(\tau\) is chosen poorly, or too few iterations are used, an attack might not increase the FP-neuron output by much. Note that single neurons are usually not the target of adversarial attacks; instead, the gradient is determined on the classification loss function. Still, the argument holds that hyperselective neurons are harder to activate than LN-neurons, resulting in increased robustness.
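To illustrate Equations 6 to 8, here is a pure-Python sketch of CIGA applied to the simplified FP-neuron; the concrete filters, input, and step size below are our illustrative choices:

```python
def clip(a, lo, hi):
    return max(lo, min(a, hi))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def F(x, v, g):
    """Simplified FP-neuron of Equation 7: F(x) = (x^T v)(g^T x)."""
    return dot(x, v) * dot(g, x)

def grad_F(x, v, g):
    """Gradient of Equation 8: (v^T x) g + (g^T x) v."""
    vx, gx = dot(v, x), dot(g, x)
    return [vx * gi + gx * vi for vi, gi in zip(v, g)]

def ciga(x, v, g, eps, tau, steps):
    """Clipped iterative gradient ascent (Equation 6): repeated gradient
    steps of size tau, with the perturbation clipped to [-eps, eps]."""
    eta = [0.0] * len(x)
    for _ in range(steps):
        grad = grad_F([a + b for a, b in zip(x, eta)], v, g)
        eta = [clip(e + tau * gr, -eps, eps) for e, gr in zip(eta, grad)]
    return eta
```

For v = (1, 0), g = (0, 1), and x = (2, 1), the perturbation reaches the corner (ε, ε) of the ε-ball only after several iterations, raising F from 2 to 6 for ε = 1; a single step with a poorly chosen τ falls short of this optimum.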

Figure 9.

To test this hypothesis, we created new Cifar-10 test sets \(\mathcal {S}_{\epsilon _{i}} = \lbrace FGSM(\mathbf {x}, \epsilon _{i}): \mathbf {x} \in \mathcal {X}_{C10} \rbrace\) derived from the original test set \(\mathcal {X}_{C10}\). Here, we focused on the most subtle adversarial attacks: We created one test set \(\mathcal {S}_{{1}/{255}}\), where each test image was perturbed by using FGSM with \(\epsilon ={1}/{255}\). Results for larger \(\epsilon\)-values are shown in the Appendix (see Table 2 and Table 3). To exclude the hypothesis that the better accuracy (with perturbations) is due to the fact that the FP-nets already generalize better, we measure the percentage of changed predictions of the classifier \(f\):

\begin{eqnarray}
&&Perc.\ of\ changed\ predictions\,(f, \Gamma , \theta )\nonumber\\
&&\qquad = \frac{1}{ |\mathcal {X}_{C10}| }\sum _{\mathbf {x} \in \mathcal {X}_{C10}} \mathbb {1}( f(\mathbf {x}) \ne f(\Gamma (\mathbf {x}, \theta ) )).
\end{eqnarray}

(9)

\(\mathbb {1}\) is the indicator function, returning 1 for a true statement and 0 otherwise. \(\Gamma\) is some function (here, FGSM) that perturbs the original image \(\mathbf {x}\) based on some parameter \(\theta\). We evaluated this metric for each of the four architectures that we trained on the original Cifar-10 training set (see Section “FP-nets as competitive deep networks”); no additional adversarial training scheme was employed. As shown in Figure 10, 40% to 50% of the predictions changed. However, for both baseline models, substituting some of the LN-neurons with FP-neurons increased the robustness against FGSM attacks.
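Equation 9 is straightforward to implement; here is a sketch with a toy classifier and a toy perturbation (both ours, for illustration only):

```python
def changed_predictions(f, images, perturb, theta):
    """Fraction of inputs whose predicted class flips under a perturbation
    (Equation 9); `perturb` plays the role of Gamma, e.g., FGSM or JPEG."""
    flips = sum(1 for x in images if f(x) != f(perturb(x, theta)))
    return flips / len(images)

# toy example: a scalar threshold "classifier" and an additive perturbation
f = lambda x: x > 0.0
perturb = lambda x, theta: x + theta
```

For the inputs (-0.5, -2.0, 3.0), only the first one crosses the decision threshold under θ = 1, so one of three predictions changes.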

Figure 10.

The results reiterate that CNN predictions can be significantly altered by deliberate and subtle attacks (we show some example images in the Appendix). Unfortunately, this lack of robustness creates problems of practical relevance beyond such attacks. For example, JPEG compression can create artifacts that have similar effects. To evaluate robustness against JPEG artifacts, we created the Cifar-10 test sets \(\mathcal {S}_{Q_i} = \lbrace JPEG(\mathbf {x}, Q_i): \mathbf {x} \in \mathcal {X}_{C10} \rbrace\), with \(JPEG(\mathbf {x}, Q)\) being the JPEG-compressed version of the original image \(\mathbf {x}\) with a quality rate \(Q \in \lbrace 1,2,...,100 \rbrace\), where \(Q=100\) corresponds to the original image. A low quality rate implies a high compression with stronger artifacts (example images are given in the Appendix). Figure 11 shows the results for the low-compression test set \(\mathcal {S}_{90}\); further results are given in the Appendix (see Tables 4 and 5).

Figure 11.

Again, using FP-neurons increased the robustness against artifacts. However, even a moderate compression alters up to 10% of the CNNs’ predictions.

Example FP-unit

As shown above, the learned FP-neurons are hyperselective and end-stopped to different degrees. However, these two properties do not fully specify an FP-neuron. When analyzing the individual FP-neurons in more detail, it is difficult to further specify them according to simple properties such as orientation or phase. Nevertheless, some FP-neurons look as if they were taken from a textbook on “how to model end-stopped neurons,” and we show one example in Figure 12.

Figure 12.

Discussion and conclusions

We have presented a novel FP-net architecture and have demonstrated its competitive performance. To do so, we have designed experiments with state-of-the-art deep networks and showed that we could improve their performance by substituting original blocks in the network architecture with FP-blocks that implement an explicit multiplication of feature maps. Given this simple design rule, we can expect our approach to be of practical use, since any traditional network can easily be transformed into an FP-net that will most likely perform better. We did not employ any hyperparameter tuning specific to the FP-nets but simply used the hyperparameters of the original networks; one may thus expect even better performance with additional tuning. We believe that the improvement that comes with FP-nets is due to an appropriate bias, which allows the network to learn efficient representations based on units (model neurons) that are end-stopped to different degrees. The multiplications that we introduce allow for AND rather than OR combinations and thus make the resulting units more selective than linear filters with pointwise nonlinearities. Note that the key feature of FP-nets is that one learns pairs of linear filters, which are then AND combined. In the case of FP-nets, the AND is implemented by multiplication. We have, however, shown that logarithms (Grüning et al., 2020b) and the minimum operation (Grüning & Barth, 2021a) can also serve as AND operations. We consider the improvements that bio-inspired FP-nets achieve over the baseline networks to be the main contribution of our article.

Moreover, we have analyzed the selectivity of the FP-units in an attempt to relate them to what is known about visual neurons. We could show that FP-units are indeed end-stopped to different degrees. The emergence of end-stopping in a network that learns based on only the classification error demonstrates that end-stopping is beneficial for the task of object recognition. This finding is supported by previously known mathematical results, according to which (a) 2*D* features such as corners and junctions are statistically rare in natural images, leading to sparse representations (Zetzsche et al., 1993), and (b) 2*D* features are still unique since there exists a mathematical proof that 0*D* (uniform) and 1*D* (straight) regions in images are redundant (Mota & Barth, 2000), although being statistically frequent.

Of course, the considerations above cannot be taken to imply that biological vision implements an FP-net architecture, especially as the FP-nets implement additional and typical deep-network operations such as linear recombinations that increase the entropy of the representation. In other words, much of what well-performing deep networks do is not something one would necessarily consider to be optimal.

It is known that sparse-coding units are more selective than typical CNN units, that is, than linear neurons with pointwise nonlinearities (Paiton et al., 2020), and thus less prone to certain adversarial attacks. This increased selectivity has been quantified with the curvature of the iso-response contours. We could show that the iso-response contours of the FP-units are curved, with the degree of curvature depending on the angle between the multiplied feature vectors, and that a large number of hyperselective units emerge in FP-nets trained for object recognition. Furthermore, our results show that FP-nets are indeed more robust against adversarial attacks and compression artifacts, and this is, again, due to the vision-inspired FP-units.

Acknowledgments

Commercial relationships: none.

Corresponding author: Philipp Grüning.

Email: gruening@inb.uni-luebeck.de.

Address: Institute for Neuro- and Bioinformatics, University of Lübeck, Germany.

References

Barlow, H. (1961). Possible principles underlying the transformation of sensory messages. *Sensory Communication*, 1(1), 217–234.

Barth, E., & Watson, A. B. (2000). A geometric framework for nonlinear visual coding. *Optics Express*, 7(4), 155–165. Available from http://webmail.inb.uni-luebeck.de/inb-publications/pdfs/BaWa00.pdf.

Barth, E., & Zetzsche, C. (1998). Endstopped operators based on iterated nonlinear center-surround inhibition. In Rogowitz, B. E., & Pappas, T. N. (Eds.), *Human vision and electronic imaging* (Vol. 3299, pp. 67–78). Bellingham, WA: Optical Society of America. Available from http://webmail.inb.uni-luebeck.de/~barth/papers/spie98.fm4.pdf.

Bradski, G. (2000). The OpenCV library. *Dr. Dobb's Journal: Software Tools for the Professional Programmer*, 25(11), 120–123.

Chrysos, G., et al. (2020). P-nets: Deep polynomial neural networks. In *2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)*, Seattle, WA, USA, pp. 7323–7333, doi:10.1109/CVPR42600.2020.00735.

Collins, J., Sohl-Dickstein, J., & Sussillo, D. (2016). Capacity and trainability in recurrent neural networks. *Stat*, 1050, 29.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In *2009 IEEE Conference on Computer Vision and Pattern Recognition*, pp. 248–255, doi:10.1109/CVPR.2009.5206848.

Goodfellow, I. J., Shlens, J., & Szegedy, C. (2015). Explaining and harnessing adversarial examples. In Bengio, Y., & LeCun, Y. (Eds.), *3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, Conference Track Proceedings*. Retrieved from http://arxiv.org/abs/1412.6572.

Gray, G. (2017). *Sequential-imagenet-dataloader*. Retrieved February 20, 2021, from https://github.com/BayesWatch/sequential-imagenet-dataloader.

Grüning, P., & Barth, E. (2021a). Bio-inspired min-nets improve the performance and robustness of deep networks. In *SVRHM 2021 Workshop @ NeurIPS*, https://openreview.net/forum?id=zxxdFLB8F24.

Grüning, P., & Barth, E. (2021b). FP-nets for blind image quality assessment. *Journal of Perceptual Imaging*, 4(1), 10402-1–10402-13.

Grüning, P., Martinetz, T., & Barth, E. (2020a). Feature products yield efficient networks. *arXiv preprint arXiv:2008.07930*.

Grüning, P., Martinetz, T., & Barth, E. (2020b). Log-nets: Logarithmic feature-product layers yield more compact networks. In Farkaš, I., Masulli, P., & Wermter, S. (Eds.), *Artificial Neural Networks and Machine Learning – ICANN 2020* (pp. 79–91). Cham, Switzerland: Springer International Publishing.

Han, D., Kim, J., & Kim, J. (2017). Deep pyramidal residual networks. In *2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*, Honolulu, HI, USA, pp. 6307–6315, doi:10.1109/CVPR.2017.668.

He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In *2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*, pp. 770–778, doi:10.1109/CVPR.2016.90.

Howard, J. (2018). *Imagenet-fast*. Retrieved February 20, 2021, from https://github.com/fastai/imagenet-fast.

Hubel, D. H., & Wiesel, T. N. (1965). Receptive fields and functional architecture in two nonstriate visual areas (18 and 19) of the cat. *Journal of Neurophysiology*, 28(2), 229–289.

Kim, H. (2020). Torchattacks: A PyTorch repository for adversarial attacks. *arXiv preprint arXiv:2010.01950*.

Krizhevsky, A., Nair, V., & Hinton, G. (2021). *CIFAR-10 (Canadian Institute for Advanced Research)*.

Li, D., Zhou, A., & Yao, A. (2021). *Mobilenetv2.pytorch*. Retrieved February 20, 2021, from https://github.com/d-li14/mobilenetv2.pytorch.

Li, Y., Wang, N., Liu, J., & Hou, X. (2017). Factorized bilinear models for image recognition. In *2017 IEEE International Conference on Computer Vision (ICCV)*, Venice, Italy, pp. 2098–2106, doi:10.1109/ICCV.2017.229.

Lu, L. (2020). Dying ReLU and initialization: Theory and numerical examples. *Communications in Computational Physics*, 28(5), 1671–1706, doi:10.4208/cicp.OA-2020-0165.

Majaj, N. J., & Pelli, D. G. (2018). Deep learning—Using machine learning to study biological vision. *Journal of Vision*, 18(13), 2.

Mel, B. W., & Koch, C. (1990). Sigma-pi learning: On radial basis functions and cortical associative learning. In Touretzky, D. (Ed.), *Advances in Neural Information Processing Systems*, 2.

Mota, C., & Barth, E. (2000). On the uniqueness of curvature features. *Dynamische Perzeption*, 9, 175–178. Available from https://webmail.inb.uni-luebeck.de/inb-publications/htmls/ulm2000.html.

Paiton, D. M., Frye, C. G., Lundquist, S. Y., Bowen, J. D., Zarcone, R., & Olshausen, B. A. (2020). Selectivity and robustness of sparse coding networks. *Journal of Vision*, 20(12), 10, https://doi.org/10.1167/jov.20.12.10.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., et al. (2019). PyTorch: An imperative style, high-performance deep learning library. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., & Garnett, R. (Eds.), *Advances in Neural Information Processing Systems* (Vol. 32, pp. 8026–8037). Red Hook, NY: Curran Associates, Inc.

Rao, R. P., & Ballard, D. H. (1999). Predictive coding in the visual cortex: A functional interpretation of some extra-classical receptive-field effects. *Nature Neuroscience*, 2(1), 79–87.

Rumelhart, D. E., Hinton, G. E., & McClelland, J. L. (1986). A general framework for parallel distributed processing. *Parallel Distributed Processing: Explorations in the Microstructure of Cognition*, 1(26), 45–76.

Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., & Chen, L. (2018). MobileNetV2: Inverted residuals and linear bottlenecks. In *2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)*, Salt Lake City, UT, USA, pp. 4510–4520, doi:10.1109/CVPR.2018.00474.

Simoncelli, E. P., & Olshausen, B. A. (2001). Natural image statistics and neural representation. *Annual Review of Neuroscience*, 24(1), 1193–1216.

Srivastava, R. K., Greff, K., & Schmidhuber, J. (2015). Training very deep networks. In *Proceedings of the 28th International Conference on Neural Information Processing Systems – Volume 2*, pp. 2377–2385. Montreal, Canada. Cambridge, MA: MIT Press.

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., et al. (2014). *Going deeper with convolutions.*

Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., & Fergus, R. (2014). *Intriguing properties of neural networks.* Paper presented at 2nd International Conference on Learning Representations, ICLR 2014, Banff, Canada.

Ulyanov, D., Vedaldi, A., & Lempitsky, V. (2016). Instance normalization: The missing ingredient for fast stylization. *arXiv preprint arXiv:1607.08022*.

Veit, A., Wilber, M. J., & Belongie, S. (2016). Residual networks behave like ensembles of relatively shallow networks. In Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., & Garnett, R. (Eds.), *Advances in Neural Information Processing Systems* (Vol. 29). Curran Associates, Inc.

Vilankar, K. P., & Field, D. J. (2017). Selectivity, hyperselectivity, and the tuning of V1 neurons. *Journal of Vision*, 17(9), 9, https://doi.org/10.1167/17.9.9.

Watanabe, S. (1985). *Pattern recognition: Human and mechanical*. Hoboken, NJ: Wiley-Interscience.

Wu, Y. (2016). *Tensorpack*. Retrieved February 20, 2021, from https://github.com/tensorpack/tensorpack/tree/master/examples/ResNet.

Zetzsche, C., & Barth, E. (1990). Fundamental limits of linear filters in the visual processing of two-dimensional signals. *Vision Research*, 30, 1111–1117. Available from http://webmail.inb.uni-luebeck.de/inb-publications/pdfs/ZeBa90a.pdf.

Zetzsche, C., Barth, E., & Wegmann, B. (1993). The importance of intrinsically two-dimensional image features in biological vision and picture coding. In Watson, A. B. (Ed.), *Digital images and human vision* (pp. 109–138). Cambridge, MA: MIT Press. Available from http://webmail.inb.uni-luebeck.de/inb-publications/htmls/ZeBaWe93a.html.

Zoumpourlis, G., Doumanoglou, A., Vretos, N., & Daras, P. (2017). Non-linear convolution filters for CNN-based learning. In *2017 IEEE International Conference on Computer Vision (ICCV)*, Venice, Italy, pp. 4771–4779, doi:10.1109/ICCV.2017.510.

Appendix

Details on network design and training procedure

All experiments were conducted using the PyTorch deep-learning framework (Paszke et al., 2019). Note that in all cases, for Equation 1, the output of the weighted sum has been normalized via batch normalization before applying the ReLU nonlinearity.
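The ordering "weighted sum → batch normalization → ReLU" can be sketched as follows (a minimal numpy sketch using training-mode batch statistics; the scalar `gamma` and `beta` stand in for batch normalization's learned per-channel scale and shift):

```python
import numpy as np

def bn_relu(z, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize the weighted-sum output z over the batch axis
    (training-mode batch normalization), then apply the ReLU."""
    z = np.asarray(z, dtype=np.float64)
    z_hat = (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)
    return np.maximum(gamma * z_hat + beta, 0.0)

out = bn_relu(np.array([[1.0, 2.0], [3.0, 4.0]]))
print(out.min() >= 0.0)  # True: negative normalized values are clipped
```

At test time, PyTorch's batch normalization would instead use running statistics accumulated during training; the sketch above only illustrates the order of operations.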

Residual connections

For the residual connections in Equation 5, some additional computations are needed if the dimensions of \(\mathbf {T}_{0}\) and \(\mathbf {T}_{3}\) differ. If \(d_{out}\) is greater than \(d_{in}\), zero padding is used to match the dimension of the feature maps. If \(d_{in}\) is greater than \(d_{out}\), an additional linear combination is learned to reduce the number of feature maps. If the FP-block’s stride \(s\) is greater than 1, \(\mathbf {T}_{0}\) is subsampled by average pooling. For more implementation details regarding residual connections, see Han et al. (2017). Residual connections enable a more stable gradient flow during training, make it easier to model identity functions (He et al., 2016), and enable CNNs to behave like ensembles of shallower networks (Veit et al., 2016).
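The shortcut handling can be sketched as follows (a minimal numpy sketch under our reading of the text, with channels-first tensors; the learned linear combination for \(d_{in} > d_{out}\) is omitted for brevity):

```python
import numpy as np

def shortcut(t0, d_out, stride=1):
    """Residual-shortcut handling as described above (channels-first
    array; the learned projection for d_in > d_out is omitted)."""
    d_in, h, w = t0.shape
    if stride > 1:  # subsample T0 by average pooling
        t0 = t0[:, :h - h % stride, :w - w % stride]
        t0 = t0.reshape(d_in, h // stride, stride,
                        w // stride, stride).mean(axis=(2, 4))
    if d_out > d_in:  # zero-pad the channel dimension
        t0 = np.pad(t0, ((0, d_out - d_in), (0, 0), (0, 0)))
    return t0

print(shortcut(np.ones((2, 4, 4)), d_out=3, stride=2).shape)  # (3, 2, 2)
```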

Cifar-10 experiments

Cifar-10 contains 50,000 training and 10,000 test images (RGB, with height and width 32) of 10 different commonplace objects, such as airplane, bird, cat, and ship. For each FP-net and each PyrBlockNet, five experiments were conducted with five different random seeds that control the initialization of each network’s random weights and the random mini-batch collection during training. The networks were trained for 200 epochs, using stochastic gradient descent (SGD), with a learning rate of 0.1 that was reduced to 0.01 and 0.001 after the 100th and 150th epoch, respectively. We used a momentum of 0.9, a weight decay of 0.0001, and a batch size of 128. For data augmentation during training, each input image was flipped horizontally with a probability of \(50\%\). Subsequently, all images were padded with 4 pixels, and then a random crop of \(32 \times 32\) was used. Furthermore, the RGB crop was first divided by 255 and then normalized with the ImageNet mean \(\mu _{imNet} = (0.485, 0.456, 0.406)\) and standard deviation \(\sigma _{imNet} = (0.229, 0.224, 0.225)\) for the three input channels, respectively. When computing the test scores, no random cropping and no horizontal flipping were used. Each FP-block’s expansion factor \(q\) was set to 2. Based on the work of Srivastava et al. (2015), the best test error was reported to better reflect the variance of the results due to different network initializations.
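The step schedule above can be sketched as follows (minimal Python; zero-based epoch indexing is our assumption):

```python
def lr_at_epoch(epoch, base_lr=0.1):
    """Step schedule used for the Cifar-10 runs: base_lr for the first
    100 epochs, divided by 10 after epoch 100 and by 100 after
    epoch 150 (200 epochs in total)."""
    if epoch < 100:
        return base_lr
    if epoch < 150:
        return base_lr / 10.0
    return base_lr / 100.0

print(lr_at_epoch(0), lr_at_epoch(120), lr_at_epoch(180))
```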

FP-ResNet on ImageNet

The FP-net-50 was trained for 100 epochs with randomly initialized weights using SGD on \(224 \times 224\) crops with a batch size of 512. After one third, and again after two thirds, of the training time, the initial learning rate of 0.1 was decreased by a factor of 10. The weight decay was 0.0001 and the momentum 0.9. For data augmentation, we used the code from the sequential-imagenet-dataloader repository (Gray, 2017); during training, crops of various random sizes, ranging from \(8\%\) to \(100\%\) of the original image size, were passed to the network. The aspect ratio was chosen randomly between 3/4 and 4/3. Furthermore, different photometric distortions (e.g., random contrast changes) were applied as described in Szegedy et al. (2014) and the Tensorpack repository (Wu, 2016). When computing the test scores, each input image is first resized such that the shortest edge has length 256. Next, the image is cropped at the center to size \(224 \times 224\), divided by 255, shifted by \(-0.5\), and divided by 0.5.
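The test-time scaling at the end of this paragraph can be sketched as (minimal numpy sketch):

```python
import numpy as np

def normalize_test_crop(crop):
    """Test-time scaling described above: divide by 255, subtract 0.5,
    divide by 0.5 (maps pixel values 0..255 to the range -1..1)."""
    return (np.asarray(crop, dtype=np.float64) / 255.0 - 0.5) / 0.5

print(normalize_test_crop([0, 255]))  # [-1.  1.]
```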

MobileNet-V2 and FP-MobileNet

The FP-MobileNet was trained from scratch with SGD for 150 epochs and with a batch size of 256. The initial learning rate of 0.05 was decreased according to a cosine scheduling; see Li et al.’s repository (Li et al., 2021). The training data augmentation included random resizing and cropping, random horizontal flips, color jitters, division by the maximum value, and normalization by \(\mu _{imNet} (0.485, 0.456, 0.406)\) and \(\sigma _{imNet} (0.229, 0.224, 0.225)\). During testing, the input images were first resized to \(255 \times 255\) and then a center crop of size \(224 \times 224\) was computed. Subsequently, the crop was normalized as described above. For more information, see Fastai’s repository (Howard, 2018).
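The cosine scheduling can be sketched as follows (our reading of the referenced repository; the exact endpoint conventions may differ there):

```python
import math

def cosine_lr(epoch, total_epochs=150, base_lr=0.05):
    """Cosine annealing of the learning rate from base_lr toward 0
    over the course of training."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))

print(cosine_lr(0), cosine_lr(150))  # 0.05 0.0
```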

Entropy

We analyzed the entropy of all FP-neurons \(\mathbf {T}_{2}\) for the FP-ResNet-50 (ImageNet) and the FP-ResNet-59 (Cifar-10). One hundred randomly sampled images from the respective test set (in case of ImageNet, the validation set) were passed to each network. For each input, we computed the corresponding feature maps for every FP-block, one tensor \(\mathbf {T}_{2}\) for every block. We normalized each feature map \(\mathbf {T}_{2}^{m}\) from \(\mathbb {R}^{+}\) to \(\lbrace 0,1,...,255\rbrace\) and computed the entropy of the pixel distribution over the 256 integer values. For the 100 input images, we obtained 100 entropy values for each feature map. We averaged these 100 values resulting in the mean entropy for each feature map (i.e., each FP-neuron).
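The per-feature-map entropy can be sketched as follows (a minimal numpy sketch; the exact mapping from \(\mathbb {R}^{+}\) to \(\lbrace 0,1,...,255\rbrace\) is our reading, here a division by the map's maximum, and a nonzero feature map is assumed):

```python
import numpy as np

def feature_map_entropy(fmap):
    """Quantize a non-negative feature map to {0, ..., 255} and return
    the Shannon entropy (in bits) of the 256-bin pixel histogram."""
    fmap = np.asarray(fmap, dtype=np.float64)
    scaled = np.round(255.0 * fmap / fmap.max()).astype(int)
    counts = np.bincount(scaled.ravel(), minlength=256)
    p = counts / counts.sum()
    p = p[p > 0]
    return float(0.0 - np.sum(p * np.log2(p)))

# A constant map carries no information; a two-valued map carries 1 bit.
print(feature_map_entropy(np.ones((4, 4))))             # 0.0
print(feature_map_entropy(np.array([[0.0, 1.0]] * 2)))  # 1.0
```

Averaging this value over the 100 input images then yields the mean entropy of one FP-neuron.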

We observed that some of the feature maps \(\mathbf {T}_{1}^{m}\) had all pixel values equal to zero (so-called dying ReLUs; Lu, 2020). The corresponding FP-neurons were removed from the analysis. For the FP-ResNet-50, the percentage of dying ReLUs was \(23 \%\), \(0.002 \%\), \(7 \%\), and \(18 \%\) for the first, second, third, and fourth FP-blocks, respectively. For the FP-ResNet-59, only the third FP-block had \(5 \%\) dying ReLUs. We tested different weight initializations and alternative nonlinearities, such as the leaky ReLU. Although using leaky ReLUs stopped the emergence of dying ReLUs, we noticed only a small gain in performance.

Degree of end-stopping

To measure the degree of end-stopping, we used two input images \(I_0\), \(I_1\), one with a bright and one with a dark square: Pixels belonging to the square had a value of +1 or −1, respectively; all other pixels were zero. Each image was normalized to have zero mean and a standard deviation of 1. We computed the intermediate outputs \(\mathbf {T}_{2}(I_0)\) and squared them to obtain the activation energy. For the PyrResNet, we used the ReLU after the first convolution as intermediate output. \(\mathbf {T}_{i}(I_0)\) is the \(i\)th tensor that is computed using the image \(I_0\) as input. We then normalized each tensor \(\mathbf {T}_{n}\) from \(\mathbb {R}^{+}\) to \([0, 1]\) by dividing it by the mean plus three times the standard deviation and clipped any values greater than 1 to make the normalization less susceptible to possible outliers. The percentage of outliers never exceeded 10%. For each feature map, we determined the values \({0D}\), \(1D\), and \(2D\) by summing the feature map pixel values (i.e., the activations) over specific regions of interest that were either homogeneous areas, straight edges, or corners in the input image:

\begin{equation}
{\psi }D(\mathbf {T}_{n}^{m}) = \sum _{i,j}{\mathbf {T}_{n}[i,j,m] \mathbf {W}_{\psi }[i,j]}.
\end{equation}

(10)

\(\mathbf {W}_{\psi }\) is a binary matrix used to compute the \({\psi }D\) value: All pixels within the region of interest are 1; the others are zero. The weighted areas are shown in the right panel of Figure 13: The square in the middle is the region of interest for \(0D\), the four small squares along the straight edges of the input square measure \(1D\), and the four small squares at the corners measure \(2D\). Note that the three different regions of interest have the same total area. The left panel shows the input image \(I_0\). The \({\psi }D(\mathbf {T}_{n}^{m})\) for both input images is the sum \({\psi }D(\mathbf {T}_{n}^{m}) = {\psi }D(\mathbf {T}_{n}^{m}(I_0))+{\psi }D(\mathbf {T}_{n}^{m}(I_1))\). The degree of end-stopping of a feature map is then defined as

\begin{equation}
\phi (\mathbf {T}_{n}^{m}) = 1 - \frac{1D(\mathbf {T}_{n}^{m})}{2D(\mathbf {T}_{n}^{m}) + \epsilon }
\end{equation}

(11)

with \(\epsilon = 0.1\). Note that the degree of end-stopping is high (close to 1) if the \(2D\) activation is high and the \(1D\) activation is low. However, two special cases were considered: (a) a feature map is “silent” if all values are very small (i.e., \(0D + 1D + 2D \lt 0.1\)); (b) a feature map is “\(0D\)” if the \(0D\) and \(1D\) activations are similar:

\begin{equation}
\mathbf {T}_{n}^{m} \;is\; {}^{\prime }0D^{\prime } \Leftrightarrow 1 - \frac{0D(\mathbf {T}_{n}^{m})}{1D(\mathbf {T}_{n}^{m}) + \epsilon } \lt 0.1.
\end{equation}

(12)

For these two special cases, Equation 11 would no longer quantify the degree of end-stopping. Therefore, the degree of end-stopping was not evaluated for silent and \(0D\) feature maps. The plots in Figure 7 show the normalized histograms for the degree of end-stopping. All bars have a bin width of 0.1, and their heights sum up to 1.
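Equations 10 and 11 can be sketched as follows (a minimal numpy sketch with a hypothetical toy feature map whose activation is concentrated at the corners):

```python
import numpy as np

def psi_d(fmap, mask):
    """Equation 10: feature-map activation summed over a binary
    region-of-interest mask W_psi."""
    return float(np.sum(fmap * mask))

def degree_of_end_stopping(fmap, mask_1d, mask_2d, eps=0.1):
    """Equation 11: close to 1 when the 2D (corner) activation is high
    and the 1D (edge) activation is low."""
    return 1.0 - psi_d(fmap, mask_1d) / (psi_d(fmap, mask_2d) + eps)

# Hypothetical toy map: all activation sits at the four corners.
fmap = np.zeros((4, 4))
mask_2d = np.zeros((4, 4))
for i, j in [(0, 0), (0, 3), (3, 0), (3, 3)]:
    fmap[i, j] = mask_2d[i, j] = 1.0
mask_1d = np.zeros((4, 4))
for i, j in [(0, 1), (1, 0), (2, 3), (3, 2)]:
    mask_1d[i, j] = 1.0

print(degree_of_end_stopping(fmap, mask_1d, mask_2d))  # 1.0
```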

Figure 13.

Iso-response contours

In this section, we derive the analytical expression for the iso-response contours of FP-neurons. We follow a geometric approach in order to show explicitly how the exo-origin curvature depends on the angle \(\gamma = \measuredangle (\mathbf {v}, \mathbf {g})\). An alternative approach would be to work with the eigenvector of the symmetric matrix \(\frac{1}{2}(\mathbf {v} \mathbf {g}^T + \mathbf {g} \mathbf {v}^T)\).

In the two-dimensional subspace defined by \(\mathbf {v}\) and \(\mathbf {g}\), and for a specific constant \(z \in \mathbb {R}^{+}\), we can derive the coordinates of the iso-response contours analytically by using a simplified version of Equation 2 (see Equation 7). \(F(\mathbf {x})\) is the output of the FP-neuron, that is, the product of the outputs of two linear filters \(\mathbf {v}\) and \(\mathbf {g} \in \mathbb {R}^{n}, n=k^{2}\). For simplicity, we disregard the instance normalization. Thus, we assume that the mean values are zero (\(\mu _{v} = \mu _{g} = 0\)) and the standard deviations are 1 (\(\sigma _{v} = \sigma _{g} = 1\)); these are the two variables used for instance normalization. Furthermore, we constrain the input space of \(\mathbf {x}\) to \(\mathbb {S} = \lbrace \mathbf {x} \in \mathbb {R}^{k^{2}}: \mathbf {x}^T \mathbf {v} \ge 0 \wedge \mathbf {x}^T \mathbf {g} \ge 0 \rbrace\) to account for the ReLU nonlinearities. Moreover, we restrict \(\gamma\) to \([0, \pi )\) since for \(\gamma = \pi\), both vectors point in opposite directions, and for any point \(\mathbf {x}\), one scalar product is always negative.

The optimal stimulus of \(F(\mathbf {x})\) is not parallel to one of the filters but points in the direction of the bisector of \(\gamma\). This property becomes more obvious when rewriting \(F(\mathbf {x})\) as a function depending on \(\alpha = \measuredangle (\mathbf {v}, \mathbf {x})\) and \(\beta = \measuredangle (\mathbf {g}, \mathbf {x})\):

\begin{equation}
F(\alpha , \beta , \mathbf {x}) = \cos (\alpha ) \cos (\beta ) \Vert \mathbf {v} \Vert \Vert \mathbf {g} \Vert \Vert \mathbf {x} \Vert ^2.
\end{equation}

(13)

To simplify this equation, we assume \(\Vert \mathbf {x} \Vert = 1\) and disregard the vector lengths \(\Vert \mathbf {v} \Vert\) and \(\Vert \mathbf {g} \Vert\), since the arguments \(\alpha\) and \(\beta\), and the argmax of \(F\), do not depend on vector length. With \( \alpha + \beta = \gamma\), we obtain

\begin{eqnarray}
F(\alpha , \beta ) &\;=& \cos (\alpha ) \cos (\gamma - \alpha )\nonumber\\
& \;=& \frac{1}{2} (\cos (2\alpha - \gamma ) + \cos (\gamma )). \qquad
\end{eqnarray}

(14)

Note that for \(\alpha = \frac{1}{2} \gamma\), \(F\) reaches the maximum value \(\frac{1 + {\cos }(\gamma )}{2}\).

The subspace of input vectors that do not alter the FP-neuron’s output is defined by

\begin{eqnarray}
F(\mathbf {x} + \mathbf {p}) = F(\mathbf {x}) \Leftrightarrow \mathbf {p}^{T} \mathbf {v} = \mathbf {p}^{T} \mathbf {g} = 0. \quad
\end{eqnarray}

(15)

For any vector \(\mathbf {p}\) orthogonal to \(\mathbf {v}\) and \(\mathbf {g}\), the iso-response contours are straight, as they are for LN-neurons. However, as we will show in the following, there exists an orthogonal direction \(\mathbf {o}\) relative to which FP-units exhibit curved iso-response contours and, thus, hyperselectivity.

It is important to note that any input vector \(\mathbf {x}\) is projected to the plane defined by the vectors \(\mathbf {v}\) and \(\mathbf {g}\) (see Equation 7); any vector \(\mathbf {p}\) from the subspace of Equation 15 is orthogonal to this plane. We can consider the function \(f(\mathbf {a})\) that operates on only 2*D* input vectors \(\mathbf {a} = (a, b)^{T}\), which are the projections of \(\mathbf {x}\) onto the vectors \(\frac{\mathbf {v}}{\Vert \mathbf {v} \Vert }\) and \(\frac{\mathbf {o}}{\Vert \mathbf {o} \Vert }\), respectively. Unless \(\mathbf {g}\) is parallel to \(\mathbf {v}\), we can derive \(\mathbf {o}\) as the direction orthogonal to \(\mathbf {v}\) by using the Gram–Schmidt process:

\begin{eqnarray}
\mathbf {o} = \mathbf {g} - \frac{\mathbf {v}^T\mathbf {g}}{\Vert \mathbf {v} \Vert ^{2}} \mathbf {v}. \quad
\end{eqnarray}

(16)

If \(\mathbf {g} = \lambda \mathbf {v}, \lambda \in \mathbb {R}\), \(\mathbf {o}\) is simply any vector orthogonal to \(\mathbf {v}\). A point \((a, b)^{T}\) in the two-dimensional projection space can be injected into the original input space \(\mathbb {S}\):

\begin{equation}
\mathbf {x}_{ab} = \frac{a}{\Vert \mathbf {v} \Vert } \mathbf {v} + \frac{b}{\Vert \mathbf {o} \Vert } \mathbf {o}.
\end{equation}

(17)

\(\mathbf {x}_{ab}\) denotes that the vector depends on only the position in the projection space \(\mathbf {a} = (a, b)^{T}\). The relations between the scalar products in the input space and the scalar products in the projection space are given by

\begin{equation}
\mathbf {x}_{ab}^{T} \mathbf {v} = \Vert \mathbf {v} \Vert (a, b) \mathbf {e}_{1} = a \Vert \mathbf {v} \Vert
\end{equation}

(18)

\begin{equation}
\mathbf {x}_{ab}^{T} \mathbf {o} = \Vert \mathbf {o} \Vert (a, b) \mathbf {e}_{2} = b \Vert \mathbf {o} \Vert
\end{equation}

(19)

\begin{eqnarray}
\mathbf {x}_{ab}^{T} \mathbf {g} &\;=& \Vert \mathbf {g} \Vert (a, b) (\cos (\gamma ), \sin (\gamma ))^{T}\nonumber\\
&\;=& \Vert \mathbf {g} \Vert (a \cos (\gamma ) + b \sin (\gamma )), \qquad
\end{eqnarray}

(20)

with \(\mathbf {e}_{1} = (1, 0)^{T}\) and \(\mathbf {e}_{2} = (0, 1)^{T}\). Accordingly, the multiplication of \( \mathbf {x}^{T} \mathbf {v}\) with \( \mathbf {x}^{T} \mathbf {g}\) yields

\begin{eqnarray}
\mathbf {x}^{T} \mathbf {v}\, \mathbf {x}^{T} \mathbf {g} &\;=& (a \Vert \mathbf {v} \Vert )(a \cos (\gamma ) + b \sin (\gamma ))\Vert \mathbf {g} \Vert \nonumber\\
&\;=& \left(\begin{array}{@{}c@{}} a^{2} \\
ab \end{array}\right)^{T} \left(\begin{array}{@{}c@{}} c_1 \cos (\gamma ) \\
c_1 \sin (\gamma ) \end{array}\right) = f(\mathbf {a}),\qquad
\end{eqnarray}

(21)

with \(c_{1} = \Vert \mathbf {v} \Vert \Vert \mathbf {g} \Vert\). In the projection space, the direction vector of the optimal stimulus \(\mathbf {a}_{opt}\) is given by \((\cos (\frac{\gamma }{2}), {\sin }(\frac{\gamma }{2}))^{T}\) (see Equation 14). \(\mathbf {a}_{orth} = (-\sin (\frac{\gamma }{2}), {\cos }(\frac{\gamma }{2}))^{T}\) is orthogonal to it. We aim to find all points \(x, y \in \mathbb {R}\) such that

\begin{equation}
f(x \mathbf {a}_{orth} + y \mathbf {a}_{opt}) = z,
\end{equation}

(22)

with \(z \in \mathbb {R}^{+}\). Substitution and simplification yields

\begin{eqnarray}
z = c_1\left(y^2 \cos ^2\left(\frac{\gamma }{2}\right) - x^2 \sin ^2\left(\frac{\gamma }{2}\right) \right). \quad
\end{eqnarray}

(23)

For a given value \(x\), and \(c=\frac{z}{c_1}\), the \(y\) position of the iso-response contour is given by

\begin{eqnarray}
y(x) = \sqrt{\tan ^2\left(\frac{\gamma }{2}\right)x^{2} + \frac{c}{\cos ^2\left(\frac{\gamma }{2}\right)}}. \quad
\end{eqnarray}

(24)

With this equation, we can estimate the curvature of the exo-origin bend by using the quadratic coefficient of the second-order Taylor approximation around \(x=0\) to obtain

\begin{eqnarray}
\frac{1}{2} \left[ \frac{d^2}{dx^2}(y) \right] (0) = \frac{ \tan ^2( \frac{\gamma }{2} )}{2\sqrt{\frac{c}{ \cos ^2( \frac{\gamma }{2}) }}}. \quad
\end{eqnarray}

(25)

For \(x=0\), \(y(0)\) is the position along the optimal stimulus, where \(f(y(0) \mathbf {a}_{opt}) = z\). Keeping \(y(0)\) fixed, the attenuation of \(f\) when moving in a direction orthogonal to the optimal stimulus is quadratic:

\begin{eqnarray}
\Delta z &\;=& f(x \mathbf {a}_{orth} + y(0) \mathbf {a}_{opt}) - f(y(0) \mathbf {a}_{opt})\nonumber\\
&\;=& -c_1 x^2 \sin ^2\left(\frac{\gamma }{2}\right). \qquad
\end{eqnarray}

(26)
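The derivation can be checked numerically: for points constructed via Equation 24 with \(c = z/c_1\) (the value that keeps \(f\) constant along the contour), the response of Equation 21 stays at the target value \(z\). A minimal numpy sketch with toy values:

```python
import numpy as np

# Toy check of Equations 21-24: gamma = 60 degrees, unit-norm filters.
gamma, c1, z = np.pi / 3, 1.0, 0.5

a_opt = np.array([np.cos(gamma / 2), np.sin(gamma / 2)])
a_orth = np.array([-np.sin(gamma / 2), np.cos(gamma / 2)])

def f(a):
    """Equation 21 evaluated in the projection space."""
    return c1 * (a[0] ** 2 * np.cos(gamma) + a[0] * a[1] * np.sin(gamma))

def y_iso(x):
    """Equation 24 with c = z / c1."""
    c = z / c1
    return np.sqrt(np.tan(gamma / 2) ** 2 * x ** 2
                   + c / np.cos(gamma / 2) ** 2)

# Moving along the contour leaves the FP-response unchanged at z.
for x in (0.0, 0.1, 0.3):
    point = x * a_orth + y_iso(x) * a_opt
    assert abs(f(point) - z) < 1e-9
print("iso-response contour verified")
```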

Figure 14 gives a three-dimensional example to illustrate how a 3*D* point \(\mathbf {p} \in \mathbb {R}^{3}\) can be mapped to the plane spanned by \(\mathbf {v}\) and \(\mathbf {g}\). The axes \(a\) and \(b\) of the projection space coincide with \(\mathbf {v}\) and \(\mathbf {o}\). Thus, there is a direct correspondence between \(\mathbf {p}\) and the projected point \(\mathbf {q} = (a, b)^{T}\) (see Equation 17). To estimate the curvature, we rotate the \((a, b)\) coordinate frame clockwise by \(\frac{\pi - \gamma }{2}\) to the frame \((x, y)\). From this perspective, we can measure the change of \(y\) when moving along the \(x\)-axis and away from \(x=0\): Equation 24 shows that for \(\gamma \in (0, \pi )\), \(y(x)\) increases when changing \(x\). Accordingly, the iso-response contour bends away from the origin of the rotated frame \((x, y)\).

Figure 14.

Adversarial attacks

The mean robust errors (i.e., the errors regarding the perturbed images), in percentages, averaged over five runs for different architectures and \(\epsilon\)-values are given in Table 2, and the averaged percentages of changed predictions (see Equation 9) are given in Table 3. We show a selection of adversarial examples in Figure 15. We observed that, for the basic block networks, the FP-net (basic) consistently outperformed the ResNet for all \(N\)s except \(N=3\). For the pyramid block networks, the larger FP-nets (\(N \in \lbrace 7,9\rbrace\)) consistently outperformed the PyrBlockNets. Accordingly, especially for larger CNNs, we increased the robustness against adversarial attacks by using FP-blocks. To compute the FGSM attacks for each test image, we used the code provided by Kim (2020). The RGB test images were first divided by 255, then the FGSM algorithm was applied, and finally, the image was normalized as described above.
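For intuition, the FGSM step (perturbing the input by \(\epsilon\) times the sign of the input gradient of the loss) can be sketched on a toy logistic-regression "classifier" (not the torchattacks code used in the article; all values are hypothetical):

```python
import numpy as np

def fgsm_perturb(x, w, b, y, eps):
    """One FGSM step for a toy logistic-regression classifier:
    x_adv = x + eps * sign(d loss / d x), with loss = -log p(y | x)."""
    p = 1.0 / (1.0 + np.exp(-(x @ w + b)))  # predicted P(class 1)
    grad_x = (p - y) * w                    # input gradient of the loss
    return x + eps * np.sign(grad_x)

# Toy example: the perturbation flips the predicted class.
w, b = np.array([2.0, -1.0]), 0.0
x, y = np.array([0.2, 0.1]), 1
x_adv = fgsm_perturb(x, w, b, y, eps=0.4)
print(x @ w + b > 0, x_adv @ w + b > 0)  # True False
```

For a deep network, the input gradient is obtained by backpropagation instead of the closed form above, but the sign-and-scale step is the same.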

Table 2.

Table 3.

Figure 15.

JPEG-compression

The mean robust error percentages averaged over five runs are given in Table 4, and the averaged percentages of changed predictions are given in Table 5. Examples of JPEG-compressed images at different quality levels are given in Figure 16. To compute the compression, we used the software provided by the OpenCV library (Bradski, 2000). We made the following observations: A decrease in quality by 10 was followed by an error increase of roughly \(3\hbox{--}4 \%\). Analogously, each quality decrease increased the number of changed predictions by \(3\hbox{--}4 \%\). Deeper networks performed better. Networks using the basic block, the ResNet and the FP-net (basic), performed better than networks based on the pyramid block. Except for a quality of 10, the FP-net (basic) outperformed the ResNet, and similarly, the FP-net outperformed the PyrBlockNet. From this, we concluded that using FP-blocks in a CNN increases the robustness against noise coming from JPEG-compression.

Table 4.

Table 5.

Figure 16.
