The VAE-ABS architecture is composed of 10 class-specific variational auto-encoder (VAE) networks (Kingma & Welling, 2013), one for each class. When testing for adversarial robustness, for a given input image they perform inference by minimizing the negative class-specific log-likelihood. Going forward, we will only consider the optimization process for a single class-conditioned VAE, since inference is performed independently for each VAE and the arguments apply equally to all of them. We first rewrite the log-likelihood using notation that is consistent with this article:
\begin{eqnarray}
{L}^{*}_{y}(s) &=& \max_{a}\, \log p_{\theta }(s|a) - \text{D}_{\text{KL}} \left[ {N}(a, \sigma \mathbb {1})\, ||\, {N}(\mathbf {0}, \mathbb {1})\right] \nonumber\\
&=& -\min_{a}\, \bigg[ \frac{1}{2} \sum _{i=1}^{N}\left[s - f_{\text{VAE}}(a;\theta )\right]_{i}^{2}\nonumber\\
&& +\, \frac{1}{2}\sum _{j=1}^{M}\left(a_{j}^{2} + \sigma ^{2} - \log \sigma ^{2} - 1\right)\bigg] + \text{const} \nonumber\\
&=& -\min_{a}\, \bigg[ \frac{1}{2} \sum _{i=1}^{N}r_{i}^{2} + \lambda \sum _{j=1}^{M}C(a_{j})\bigg] + \text{const},
\end{eqnarray}
where
\( {L}^{*}_{y}(s)\) is a lower bound on the log-likelihood conditioned on the class,
\(y\), and the image,
\(s\), evaluated at the maximum a posteriori (MAP) estimate of the latent code;
\(f_{\text{VAE}}(a;\theta )\) is the image generated by the class-specific VAE decoder;
\(r = s - f_{\text{VAE}}(a;\theta )\) is the reconstruction error;
\(\text{D}_{\text{KL}}\) is the KL-divergence;
\(\sigma\) is the conditional Gaussian standard deviation;
\(\lambda = \frac{1}{2}\) is the weight on the latent activation cost;
\(\text{const}\) collects the additive terms that do not depend on \(a\) (including \(\frac{M}{2}(\sigma ^{2} - \log \sigma ^{2} - 1)\) from the KL term) and therefore do not affect the optimization; and
\(C(a_{j}) = a_{j}^{2}\). Additionally,
\(p_{\theta }(s|a)\) is the data likelihood, which is computed by the generative arm (decoder) of the VAE network, parameterized by
\(\theta\). Comparing
Equations (22) and
(2) reveals that the two likelihood expressions differ in the decoder function and in the prior imposed on the latent variables, the latter of which manifests itself in the form of the latent variable activation cost,
\(C(\cdot )\). As is the case with our network, they compute the MAP estimate by descending the gradient of the negative log-likelihood (we write \(L\) for the objective that is minimized over \(a\)):
\begin{eqnarray}
-\frac{\partial {L}}{\partial a_{k}} = \sum _{i=1}^{N}\left[\frac{\partial f(a;\theta )}{\partial a_{k}}\right]_{i}\left[s - f(a)\right]_{i} - \lambda \frac{\partial C(a_{k})}{\partial a_{k}}.
\end{eqnarray}
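To make this inference procedure concrete, the following sketch performs the class-conditional MAP optimization by gradient descent on the latent code, holding the decoder parameters fixed. It is a minimal illustration rather than the authors' implementation: the linear decoder, step size, step count, and initialization are assumptions chosen for brevity, and in VAE-ABS the decoder would be the class-specific convolutional network \(f_{\text{VAE}}(a;\theta )\).
\begin{verbatim}
# Minimal sketch (not the authors' code): MAP inference over the latent code a
# by gradient descent on the objective above, with lambda = 1/2 and C(a_j) = a_j^2.
import jax
import jax.numpy as jnp


def map_inference(s, decoder, a_init, lam=0.5, n_steps=200, step_size=1e-2):
    """Minimize (1/2) sum_i r_i^2 + lam * sum_j a_j^2 over the latent code a."""

    def objective(a):
        r = s - decoder(a)             # reconstruction error, r
        recon = 0.5 * jnp.sum(r ** 2)  # (1/2) sum_i r_i^2
        cost = lam * jnp.sum(a ** 2)   # lambda * sum_j C(a_j), with C(a_j) = a_j^2
        return recon + cost

    grad_fn = jax.grad(objective)      # autodiff through the decoder
    a = a_init
    for _ in range(n_steps):
        a = a - step_size * grad_fn(a)  # descend the negative log-likelihood gradient
    return a


# Toy usage with a random linear decoder standing in for f_VAE(a; theta).
Phi = jax.random.normal(jax.random.PRNGKey(0), (784, 16)) * 0.1
s = jax.random.normal(jax.random.PRNGKey(1), (784,))
a_map = map_inference(s, lambda a: Phi @ a, jnp.zeros(16))
\end{verbatim}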
For our fully-connected LCA network, the decoder,
\(f(\cdot ) := f_{\text{LCA}}(a;\Phi ) = \Phi a\), is linear, so its derivative with respect to an individual latent variable,
\(a_{k}\), is the corresponding column of \(\Phi\),
\(\Phi _{k}\), which does not depend on \(a\). However, the VAE-ABS decoder,
\(f(\cdot ) := f_{\text{VAE}}(a;\theta )\), is a cascade of four convolutional network layers with exponential linear unit (Clevert, Unterthiner, & Hochreiter, 2015) activations, and thus the derivative,
\(\frac{\partial f_{\text{VAE}}}{\partial a_{k}}\), is a product of piecewise linear and exponential functions of the latent code. In both cases, the generated image is a function of the entire latent vector,
\(a\). This means that each latent variable's update step is a function of the other latent variables (those of the same class-specific VAE in the case of VAE-ABS). Therefore, like the LCA network, the VAE-ABS latent encoding is a population nonlinear function of the input.
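This contrast between the two decoders can be checked directly with automatic differentiation. The sketch below is illustrative only: a single fully-connected layer with ELU activations stands in for the four-layer convolutional VAE-ABS decoder, and the weights and shapes are arbitrary. It verifies that the linear decoder's derivative \(\frac{\partial f}{\partial a_{k}}\) is constant, whereas the nonlinear decoder's derivative changes with the rest of the latent vector.
\begin{verbatim}
# Illustrative sketch: decoder Jacobians for a linear (LCA-style) decoder versus
# a nonlinear decoder. A single fully-connected ELU layer stands in for the
# four-layer convolutional VAE-ABS decoder; weights and shapes are arbitrary.
import jax
import jax.numpy as jnp

Phi = jax.random.normal(jax.random.PRNGKey(0), (784, 16))
linear_decoder = lambda a: Phi @ a           # f_LCA(a; Phi) = Phi a
elu_decoder = lambda a: jax.nn.elu(Phi @ a)  # stand-in for f_VAE(a; theta)

a1 = jax.random.normal(jax.random.PRNGKey(1), (16,))
a2 = jax.random.normal(jax.random.PRNGKey(2), (16,))

# Column k of the linear decoder's Jacobian is Phi[:, k] for any latent code.
print(jnp.allclose(jax.jacobian(linear_decoder)(a1),
                   jax.jacobian(linear_decoder)(a2)))  # True

# The nonlinear decoder's Jacobian changes with the latent code, so each latent
# variable's gradient step depends on the rest of the vector a.
print(jnp.allclose(jax.jacobian(elu_decoder)(a1),
                   jax.jacobian(elu_decoder)(a2)))     # False (in general)
\end{verbatim}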