We trained a Wasserstein-GAN (
Arjovsky et al., 2017) on the sixty thousand 32 × 32 images contained in the CIFAR10 data set (
Krizhevsky, 2009) using gradient penalty as proposed by
Gulrajani et al. (2017). In short, a GAN consists of a generator network
G that maps a latent vector
\(\boldsymbol{z}\) to image space and a discriminator network
D that takes an image as input and predicts whether that image is a real image from the training data set or an image generated by mapping a latent vector through the generator network (see
Figure 1). The generator network and the discriminator network were trained in alternation using stochastic gradient descent. Specifically, training alternated between five updates of the discriminator network and one update of the generator network. Updates of the discriminator network were chosen to minimize the loss
\begin{equation*}
\mathbb {E}_{\boldsymbol{z}}[D(G(\boldsymbol{z}))] - \mathbb {E}_{\boldsymbol{y}}[D(\boldsymbol{y})] + \lambda \left(\Vert \nabla _{\tilde{\boldsymbol{y}}}D(\tilde{\boldsymbol{y}})\Vert _2 - 1\right)^2,
\end{equation*}
and updates of the generator were chosen to maximize this loss. Here, the first term quantifies the false alarm rate of the discriminator (i.e., the likelihood that the discriminator
D classifies a fake image
\(G(\boldsymbol{z})\) as real), the second term quantifies the hit rate of the discriminator, and the third term is a penalty term to encourage the discriminator to be 1-Lipschitz (a stronger form of continuity). In accordance with
Gulrajani et al. (2017), we set λ = 10 for discriminator updates and λ = 0 for generator updates. In this equation,
\(\tilde{\boldsymbol{y}}\) denotes a point sampled at random on the line between the generated image
\(\hat{\boldsymbol{y}}=G(\boldsymbol{z})\) and the training image
\(\boldsymbol{y}\). Networks with different numbers of hidden states (parameter
N in
Figure 1) were trained for 200,000 update cycles using an Adam optimizer (
Kingma & Ba, 2015) with learning rate \(10^{-4}\) and \(\beta _1 = 0\), \(\beta _2 = 0.9\). Specifically, we trained networks with
N = 40, 50, 60, 64, 70, 80, 90, 128 (see
Figure 1). Wasserstein-1 error (
Arjovsky et al., 2017) on a validation set (the CIFAR10 test data set) was lowest with
N = 90 in agreement with visual inspection of sample quality, so we chose a network with
N = 90 for all remaining analyses. In Appendix “Naturalism of CIFAR10 images,” we provide evidence that the CIFAR10 data set contains image features that a naive observer would likely use during natural vision.
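As a concrete illustration, the discriminator (critic) objective described above can be sketched in a few lines. Everything below is a toy stand-in, not the trained model: the latent and image dimensions, the linear generator, and the small tanh critic are placeholders for the convolutional networks, and the input gradient is taken by central finite differences rather than the automatic differentiation a real implementation would use.

```python
import numpy as np

rng = np.random.default_rng(0)

DIM_Z, DIM_Y = 4, 8   # toy latent/image dimensions (placeholders, not the paper's)
LAMBDA = 10.0         # gradient-penalty weight used for discriminator updates

# Stand-in networks: a linear generator and a tiny one-hidden-layer critic.
W_g = rng.normal(scale=0.1, size=(DIM_Y, DIM_Z))
W_1 = rng.normal(scale=0.1, size=(16, DIM_Y))
w_2 = rng.normal(scale=0.1, size=16)

def G(z):
    """Generator: maps a latent vector z to 'image' space."""
    return W_g @ z

def D(y):
    """Critic: maps an 'image' to a scalar score."""
    return float(w_2 @ np.tanh(W_1 @ y))

def grad_D(y, eps=1e-5):
    """Gradient of D w.r.t. its input, by central finite differences."""
    g = np.zeros_like(y)
    for i in range(y.size):
        step = np.zeros_like(y)
        step[i] = eps
        g[i] = (D(y + step) - D(y - step)) / (2 * eps)
    return g

def critic_loss(y_real, z, lam=LAMBDA):
    y_fake = G(z)
    # y_tilde: random point on the line between training image and fake image.
    alpha = rng.uniform()
    y_tilde = alpha * y_real + (1.0 - alpha) * y_fake
    # Penalize deviation of the critic's input-gradient norm from 1
    # (encourages D to be 1-Lipschitz).
    penalty = (np.linalg.norm(grad_D(y_tilde)) - 1.0) ** 2
    return D(y_fake) - D(y_real) + lam * penalty

z = rng.normal(size=DIM_Z)
y_real = rng.normal(size=DIM_Y)
loss = critic_loss(y_real, z)              # minimized for discriminator updates
gen_objective = critic_loss(y_real, z, lam=0.0)  # maximized for generator updates
```

In training, this loss would be minimized for five discriminator updates (with λ = 10), then maximized once with respect to the generator's parameters (with λ = 0), and the cycle repeated.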