One of the most interesting use cases of deep learning is generative models. We can generate all kinds of things, including text, images, and sounds. Various chatbots, automatic machine translation, speech-to-text, text-to-speech, image captioning, image quality enhancement, image-to-image transformation, and others are cool applications that have some kind of generative model under the hood. So, I decided to generate some anime girls (waifus, for science and research, of course). I thought that would be a fun way to learn deep learning generative models. The idea is to generate a SmartCat chan, a Smart Cat-Girl :)

In this blog I’m going to present the image generative models that I tried in my quest to generate an imaginary waifu. But before the anime girls, I tested my TensorFlow implementations on the CelebA dataset (whose results I’ll mainly focus on in this blog). In my quest I saw abominations, traveled through a storm of nightmares, and experienced epiphany in psychedelic emptiness and darkness.

Anyway, we are going to look at generative models such as the Generative Adversarial Network (GAN), the Deep Convolutional Generative Adversarial Network (DC-GAN), and an extension to DC-GAN using a Variational Autoencoder. Implementations of the models in TensorFlow (only the models, without training scripts) are in this repo. Note that this repo will surely change in the future. The datasets used are CelebA and anime girls (scraped from the internet).

Generative Adversarial Network - Overview

Generative Adversarial Network (GAN) is a type of neural network that learns to generate data in an unsupervised manner. The idea is to feed the neural network tons of images and, as a result, get newly generated images. Inside the GAN architecture, we can find two separate neural networks: the generator and the discriminator. The discriminator is a neural network that takes an input image and outputs whether this image is real or fake. The task of the generator is to create a totally new image that would fool the discriminator.

We can see where this is going. During training, those two neural networks compete with each other, and they improve with each iteration. At first, the generator throws some garbage to the discriminator, and the discriminator looks at that image and then at the real images. After that the discriminator is like: “dude, this ain’t even real!”. Next time, the generator gets better and fools the discriminator by generating a better image, but after a few iterations the discriminator also gets better at distinguishing between real and fake images. This competition between the generator and the discriminator continues throughout training.
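The back-and-forth above is usually written as two binary cross-entropy losses. Here’s a tiny NumPy sketch with toy discriminator outputs I made up (not the actual TensorFlow models from the repo):

```python
import numpy as np

def bce(p, label):
    # Binary cross-entropy of a batch of probabilities p against a fixed label (0 or 1).
    eps = 1e-8
    return -np.mean(label * np.log(p + eps) + (1 - label) * np.log(1 - p + eps))

# Toy discriminator probabilities: D(x) on real images, D(G(z)) on generated ones.
d_real = np.array([0.9, 0.8, 0.95])   # discriminator is fairly sure these are real
d_fake = np.array([0.1, 0.2, 0.05])   # ...and that these are fake

# The discriminator wants real -> 1 and fake -> 0.
d_loss = bce(d_real, 1) + bce(d_fake, 0)
# The generator wants the discriminator to call its fakes real (non-saturating loss).
g_loss = bce(d_fake, 1)

print(d_loss, g_loss)  # here g_loss is large: the generator is still losing
```

When the generator improves, `d_fake` moves toward 1, `g_loss` drops, and `d_loss` rises, which is the competition in numbers.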

Deep Convolutional Generative Adversarial Network

Simple GANs are unstable to train, and most of the time we don’t get the desired photorealistic images, in other words: abominations. To overcome this problem, the authors of the original Deep Convolutional Generative Adversarial Network paper (I always get bored when I need to write out the full name of DCGAN) proposed several changes to make training more stable.

  1. Instead of a pooling function that reduces dimensionality in the discriminator, use strided convolution layers. In other words, we let the network learn how to reduce dimensionality. In the generator, we use deconvolution (transposed convolution) to upsample the feature maps.
  2. Use batch normalization. We’ve all seen the power of batch normalization.
  3. Make the network fully convolutional, without fully connected layers.
  4. Use ReLU and Leaky ReLU activation functions.
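The first guideline is easy to sanity-check with shape arithmetic. Here’s a quick sketch using the common DCGAN 64x64 layout (the 4x4 kernels, stride 2, and padding 1 are typical choices of mine, not something the list above prescribes):

```python
def conv_out(size, kernel, stride, pad):
    # Output spatial size of a strided convolution with pad pixels on each side.
    return (size + 2 * pad - kernel) // stride + 1

def deconv_out(size, kernel, stride, pad):
    # Output spatial size of a transposed convolution (the generator's upsampling step).
    return (size - 1) * stride - 2 * pad + kernel

# Discriminator: 64x64 -> 32 -> 16 -> 8 -> 4, no pooling, just stride-2 convolutions.
size = 64
for _ in range(4):
    size = conv_out(size, kernel=4, stride=2, pad=1)
print(size)  # 4

# Generator mirrors it: 4x4 -> 8 -> 16 -> 32 -> 64 via transposed convolutions.
size = 4
for _ in range(4):
    size = deconv_out(size, kernel=4, stride=2, pad=1)
print(size)  # 64
```

Each stride-2 convolution halves the feature map, so the network learns its own downsampling instead of relying on a fixed pooling rule.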

We can see the architecture in the image below. Note that the generator starts from a random vector in order to produce a meaningful image at the end. Instead of a random vector, we can use an autoencoder, but more on that later.

Using DCGAN you’ll get better results: not perfect, but better. Here are some samples of generated faces.

Well, it’s ok. It should be better though. I put the uglier ones here on purpose :). The killer squad that you see up here was generated in the middle of training, before the network had fully converged. But hey, I can’t draw either. Even though I didn’t generate a waifu (only foul creatures and pancake people), this is still better than my artwork. I guess my art remained at the level of a 6-year-old; sorry if I offended some 6yo child reading this.

Please don’t ask me to draw an anime girl! One cool thing is that we can take the input vector and do arithmetic operations on it. Crappy generated images aside, the results are amazing. We can add sunglasses to people without sunglasses by simply adding and subtracting vectors. We can also interpolate from one person to another, change gender, and even rotate faces. Here is an example of interpolation between two faces.
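The arithmetic and the interpolation happen entirely on the latent vectors before they hit the generator. A hypothetical NumPy sketch (the vectors here are random placeholders; in practice they’d be averaged latent codes of samples with and without the attribute):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 100  # latent dimensionality, a typical DCGAN choice

# Placeholder latent codes standing in for averaged codes of generated samples.
z_glasses = rng.normal(size=dim)     # "people with sunglasses"
z_no_glasses = rng.normal(size=dim)  # "people without sunglasses"
z_person = rng.normal(size=dim)      # a new face to modify

# Vector arithmetic: add the "sunglasses direction" to the new face's code,
# then feed z_new to the generator to render the face with sunglasses.
z_new = z_person + (z_glasses - z_no_glasses)

# Linear interpolation between two faces: generate an image for each step.
z_a, z_b = rng.normal(size=dim), rng.normal(size=dim)
steps = [z_a + t * (z_b - z_a) for t in np.linspace(0.0, 1.0, 8)]
```

The endpoints of `steps` are exactly `z_a` and `z_b`, so the generated strip morphs smoothly from one face into the other.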

Variational Autoencoder - Overview

Variational autoencoders are a type of autoencoder (an unsupervised learning neural network) that tries to fit data to the normal distribution. We know that autoencoders try to compress data (e.g. images) into a lower-dimensional space. The compression is done in such a way that the autoencoder can reconstruct the data back into the original space.

We can see that the autoencoder “encodes” input information in the encoding space. The values in the encoding space define features of the input, e.g. hair color, eye color, nose size, etc. Autoencoders are great at reconstructing input images, but there is a limiting factor: they can’t easily generate new samples. It is hard because their encoding space can be really messy, so we can’t easily sample from it. This is where the variational autoencoder comes to the rescue. Variational autoencoders force the data to fit the normal distribution. Once we know the distribution of the encoding space, we can easily sample from it and generate new kinds of stuff.

The structure of a variational autoencoder is mostly the same as that of a plain autoencoder, except for the encoding part. After a few encoder layers, we define two separate layers: one represents the mean, and the other the standard deviation. We can then sample our encoding using the mean and std outputs. As the loss function, we take both the reconstruction loss and the KL divergence loss.
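The sampling step is usually implemented with the reparameterization trick, and the KL term has a closed form against the standard normal prior. A minimal NumPy sketch with toy encoder outputs (not the real network):

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_z(mu, log_var):
    # Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, 1),
    # so gradients can flow through mu and log_var during training.
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_loss(mu, log_var):
    # KL divergence between N(mu, sigma^2) and the standard normal prior N(0, 1).
    return -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))

# Toy encoder outputs for a single image.
mu = np.array([0.1, -0.2, 0.05])
log_var = np.array([-0.1, 0.0, 0.2])

z = sample_z(mu, log_var)
print(z.shape, kl_loss(mu, log_var))
```

The KL loss is zero exactly when the encoder outputs the standard normal (mu = 0, log_var = 0), which is what pulls the encoding space toward a distribution we can sample from.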

The images generated by variational autoencoders are blurry, and if not trained well, the network outputs images similar to the mean image. But combining the idea of a variational autoencoder with DCGAN produces better results in generating new images.


We saw how the variational autoencoder tends to fit data to the Gaussian distribution, but it tends to produce rather blurry images. On the other side, deep convolutional generative adversarial networks produce sharper images, but lack photorealism (see the pancake people above). By combining VAE and DCGAN we can overcome some of these problems. The overall architecture is presented in the image.

Now, our generator will produce images for reconstruction based on the input image (from the calculated z), and images based on a random sample z. After that, we define losses for the discriminator and generator. The encoder loss will be the KL divergence loss, as in the variational autoencoder. In addition to these losses, we introduce the perceptual loss. Basically, we measure how similar the content features (in higher layers) are between the encoder and the discriminator. We wire the perceptual loss from the Lth layer into the encoder and generator losses.
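One simple way to realize the perceptual loss is as a mean squared error between intermediate feature maps. This sketch uses random placeholder features rather than actual discriminator activations:

```python
import numpy as np

def perceptual_loss(feat_real, feat_fake):
    # Mean squared difference between intermediate feature maps for a real
    # image and its reconstruction; small when the contents "look alike".
    return np.mean((feat_real - feat_fake) ** 2)

rng = np.random.default_rng(7)
# Hypothetical Lth-layer feature maps with shape (batch, height, width, channels).
feat_real = rng.normal(size=(2, 8, 8, 64))
feat_fake = feat_real + 0.1 * rng.normal(size=(2, 8, 8, 64))  # close reconstruction

print(perceptual_loss(feat_real, feat_fake))   # small: features nearly match
print(perceptual_loss(feat_real, -feat_real))  # large: features disagree
```

Comparing features instead of raw pixels is what lets the generator be rewarded for getting the content right rather than matching every pixel, which is where plain VAE blurriness comes from.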

When training GANs, due to the unhealthy competition, our network can get stuck or even die. We need to be careful when defining losses and pay attention to their values. We can also use already pretrained models (like VGG on ImageNet, or something else) for the encoder and discriminator.

I got much better results when training VAE-DCGAN on celebs (still better than my drawings). Here are some samples from the generator:

Behold, waifus are coming.

Due to the not-so-easy training process, I only generated 64x64 images. Next time I’ll try to generate higher-resolution images with DC-GANs directly.


So in this blog, my quest was to generate anime girls. Although I didn’t exactly succeed in generating full-size, high-resolution images, I did manage to cover the idea behind generative models, in particular GAN, DC-GAN, VAE, and VAE-DCGAN. Each of these methods has its pros and cons, but eventually, VAE-DCGAN achieved the most realistic results.

There is also an image generative method that uses recurrent neural networks, but I’ll cover that in another blog (if I ever write one :) )

One interesting paper about generating large-scale images that I didn’t cover in this blog is: Progressive Growing of GANs for Improved Quality, Stability, and Variation. The idea is to train the GAN layer by layer, and with each layer we increase the input and output image sizes. The results are promising, but like everything else, I’d like to try it on my own and fail miserably a few times until I get it right.

In the next blog(s) I will continue exploring generative models, and dive deeper into super resolution problems.

There are some papers that you may find interesting:

  1. Generative Adversarial Nets
  2. Unsupervised Representation Learning With Deep Convolutional Generative Adversarial Networks
  3. Auto-Encoding Variational Bayes
  4. Variational Approaches for Auto-Encoding Generative Adversarial Networks
  5. Adversarial Variational Bayes: Unifying Variational Autoencoders and Generative Adversarial Networks