From ea2937b9164cbb3b4dc8d3161deaa12b7ddcd90a Mon Sep 17 00:00:00 2001 From: Mirko Birbaumer <mirko.birbaumer@hslu.ch> Date: Sun, 3 Apr 2022 20:34:17 +0000 Subject: [PATCH] Adaptation of VAE to tensorflow 2.7 --- ...am, Neural Style Transfer, and GAN's.ipynb | 1385 +++++------------ 1 file changed, 389 insertions(+), 996 deletions(-) diff --git a/notebooks/Block_7/Jupyter Notebook Block 7 - Generative Models - DeepDream, Neural Style Transfer, and GAN's.ipynb b/notebooks/Block_7/Jupyter Notebook Block 7 - Generative Models - DeepDream, Neural Style Transfer, and GAN's.ipynb index 66cd9bc..7e1f590 100644 --- a/notebooks/Block_7/Jupyter Notebook Block 7 - Generative Models - DeepDream, Neural Style Transfer, and GAN's.ipynb +++ b/notebooks/Block_7/Jupyter Notebook Block 7 - Generative Models - DeepDream, Neural Style Transfer, and GAN's.ipynb @@ -11,15 +11,18 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Sampling from a latent space of images to create entirely new images or edit existing\n", - "ones is currently the most popular and successful application of creative AI. In this section\n", - "and the next, we’ll review some high-level concepts pertaining to image generation,\n", - "alongside implementations details relative to the two main techniques in this\n", - "domain: _variational autoencoders_ (VAEs) and _generative adversarial networks_ (GANs). \n", + "The most popular and successful application of creative AI today is image generation:\n", + "learning latent visual spaces and sampling from them to create entirely new pictures\n", + "interpolated from real ones — pictures of imaginary people, imaginary places, imaginary\n", + "cats and dogs, and so on.\n", "\n", - "The techniques we present here aren’t specific to images—you could develop latent spaces\n", - "of sound, music, or even text, using GANs and VAEs—but in practice, the most interesting\n", - "results have been obtained with pictures, and that’s what we focus on here." + "In this section and the next, we’ll review some high-level concepts pertaining to\n", + "image generation, alongside implementation details relative to the two main techniques\n", + "in this domain: _variational autoencoders_ (VAEs) and _generative adversarial networks_\n", + "(GANs). Note that the techniques we will discuss here aren’t specific to images — you\n", + "could develop latent spaces of sound, music, or even text, using GANs and VAEs — but\n", + "in practice, the most interesting results have been obtained with pictures, and that’s\n", + "what we’ll focus on here." ] }, { @@ -33,14 +36,16 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The key idea of image generation is to develop a low-dimensional latent space of representations\n", - "(which naturally is a vector space) where any point can be mapped to a\n", - "realistic-looking image. The module capable of realizing this mapping, taking as input\n", - "a latent point and outputting an image (a grid of pixels), is called a _generator_ (in the\n", - "case of GANs) or a _decoder_ (in the case of VAEs). \n", + "The key idea of image generation is to develop a low-dimensional _latent space_ of representations\n", + "(which, like everything else in deep learning, is a vector space), where any\n", + "point can be mapped to a “valid†image: an image that looks like the real thing. 
The\n", + "module capable of realizing this mapping, taking as input a latent point and outputting\n", + "an image (a grid of pixels), is called a _generator_ (in the case of GANs) or a _decoder_\n", + "(in the case of VAEs). Once such a latent space has been learned, you can sample\n", + "points from it, and, by mapping them back to image space, generate images that have\n", + "never been seen before (see the figure below). These new images are the in-betweens of\n", + "the training images.\n", "\n", - "Once such a latent space has been developed, you can sample points from it, either deliberately \n", - "or at random, and, by mapping them to image space, generate images that have never been seen before.\n", "\n", "<img src='./Bilder/latent_space.jpg'>" ] @@ -49,11 +54,12 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "GANs and VAEs are two different strategies for learning such latent spaces of image\n", - "representations, each with its own characteristics. VAEs are great for learning latent\n", - "spaces that are well structured, where specific directions encode a meaningful axis of\n", - "variation in the data. GANs generate images that can potentially be highly realistic, but\n", - "the latent space they come from may not have as much structure and continuity." + "GANs and VAEs are two different strategies for learning such latent spaces of\n", + "image representations, each with its own characteristics. VAEs are great for learning\n", + "latent spaces that are well structured, where specific directions encode a meaningful\n", + "axis of variation in the data (see the figure below). GANs generate images that can potentially\n", + "be highly realistic, but the latent space they come from may not have as much\n", + "structure and continuity." ] }, { @@ -62,6 +68,7 @@ "source": [ "## Concept vectors for image editing\n", "\n", + "\n", "The idea of concept vectors is the following : given a latent space of representations, or an\n", "embedding space, certain directions in the space may encode interesting axes of variation\n", "in the original data. \n", @@ -97,8 +104,8 @@ "Variational autoencoders, simultaneously discovered by Kingma and Welling in\n", "December 2013 and Rezende, Mohamed, and Wierstra in January 2014 are a kind\n", "of generative model that’s especially appropriate for the task of image editing via concept\n", - "vectors. They’re a modern take on autoencoders — a type of network that aims to\n", - "encode an input to a low-dimensional latent space and then decode it back—that\n", + "vectors. They’re a modern take on autoencoders - a type of network that aims to\n", + "encode an input to a low-dimensional latent space and then decode it back - that\n", "mixes ideas from deep learning with Bayesian inference.\n", "\n", "\n", @@ -129,25 +136,25 @@ "with a little bit of statistical magic that forces them to learn continuous, highly structured\n", "latent spaces. 
They have turned out to be a powerful tool for image generation.\n", "\n", - "\n", - "A VAE, instead of compressing its input image into a fixed code in the latent space,\n", - "turns the image into the parameters of a statistical distribution: a mean and a variance.\n", - "Essentially, this means you’re assuming the input image has been generated by a\n", - "statistical process, and that the randomness of this process should be taken into\n", + "A VAE, instead of compressing its input image into a fixed code in the latent\n", + "space, turns the image into the parameters of a statistical distribution: a mean and a\n", + "variance. Essentially, this means we’re assuming the input image has been generated\n", + "by a statistical process, and that the randomness of this process should be taken into\n", "account during encoding and decoding. The VAE then uses the mean and variance\n", - "parameters to randomly sample one element of the distribution, and decodes that element\n", - "back to the original input, see the Figure below: \n", - "\n", + "parameters to randomly sample one element of the distribution, and decodes that\n", + "element back to the original input , see the Figure below: \n", "\n", "<img src='./Bilder/vae_illustration.jpg'>\n", "\n", "A VAE maps an image to two vectors, `z_mean` and `z_log_sigma`, which define\n", "a probability distribution over the latent space, used to sample a latent point to decode.\n", "\n", + "The stochasticity of this process improves robustness and forces the latent space to encode meaningful \n", + "representations everywhere: every point sampled in the latent space is decoded to a valid\n", + "output.\n", + "\n", "\n", - "The stochasticity of this process\n", - "improves robustness and forces the latent space to encode meaningful representations\n", - "everywhere: every point sampled in the latent space is decoded to a valid output." + "\n" ] }, { @@ -156,7 +163,7 @@ "source": [ "In technical terms, here’s how a VAE works:\n", "\n", - "1. An encoder module turns the input samples input_img into two parameters in\n", + "1. An encoder module turns the input samples `input_img` into two parameters in\n", "a latent space of representations, `z_mean` and `z_log_variance`\n", "\n", "2. You randomly sample a point z from the latent normal distribution that’s\n", @@ -209,9 +216,352 @@ "metadata": {}, "source": [ "You can then train the model using the reconstruction loss and the regularization loss.\n", - "The following listing shows the encoder network you’ll use, mapping images to the\n", + "\n", + "For the regularization loss, we typically use an expression (the Kullback–Leibler divergence)\n", + "meant to nudge the distribution of the encoder output toward a well-rounded\n", + "normal distribution centered around 0. This provides the encoder with a sensible\n", + "assumption about the structure of the latent space it’s modeling." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Implementing a VAE with Keras\n", + "\n", + "We’re going to be implementing a VAE that can generate MNIST digits. It’s going to\n", + "have three parts:\n", + "\n", + "1. An _encoder network_ that turns a real image into a mean and a variance in the latent space\n", + "\n", + "2. A _sampling layer_ that takes such a mean and variance, and uses them to sample a random point from the latent space\n", + "\n", + "3. 
A _decoder network_ that turns points from the latent space back into images\n", + "\n", + "The following listing shows the encoder network we’ll use, mapping images to the\n", "parameters of a probability distribution over the latent space. It’s a simple convnet\n", - "that maps the input image x to two vectors, `z_mean` and `z_log_var`." + "that maps the input image `x` to two vectors, `z_mean` and `z_log_var`. One important\n", + "detail is that we use strides for downsampling feature maps instead of max pooling.\n", + "(remember the U-Net in the image segmentation section). Recall\n", + "that, in general, strides are preferable to max pooling for any model that cares about\n", + "information location — that is to say, where stuff is in the image — and this one does, since\n", + "it will have to produce an image encoding that can be used to reconstruct a valid\n", + "image." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### VAE encoder network" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "from tensorflow import keras\n", + "from tensorflow.keras import layers\n", + "\n", + "# Dimensionality of the latent space: a 2D plane\n", + "latent_dim = 2\n", + "encoder_inputs = keras.Input(shape=(28, 28, 1))\n", + "x = layers.Conv2D(32, 3, activation=\"relu\", strides=2, padding=\"same\")(encoder_inputs)\n", + "x = layers.Conv2D(64, 3, activation=\"relu\", strides=2, padding=\"same\")(x)\n", + "x = layers.Flatten()(x)\n", + "x = layers.Dense(16, activation=\"relu\")(x)\n", + "# The input image ends up being encoded into these\n", + "# two parameters\n", + "z_mean = layers.Dense(latent_dim, name=\"z_mean\")(x)\n", + "z_log_var = layers.Dense(latent_dim, name=\"z_log_var\")(x)\n", + "encoder = keras.Model(encoder_inputs, [z_mean, z_log_var], name=\"encoder\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Its summary looks like this:" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Model: \"encoder\"\n", + "__________________________________________________________________________________________________\n", + " Layer (type) Output Shape Param # Connected to \n", + "==================================================================================================\n", + " input_2 (InputLayer) [(None, 28, 28, 1)] 0 [] \n", + " \n", + " conv2d_2 (Conv2D) (None, 14, 14, 32) 320 ['input_2[0][0]'] \n", + " \n", + " conv2d_3 (Conv2D) (None, 7, 7, 64) 18496 ['conv2d_2[0][0]'] \n", + " \n", + " flatten_1 (Flatten) (None, 3136) 0 ['conv2d_3[0][0]'] \n", + " \n", + " dense_1 (Dense) (None, 16) 50192 ['flatten_1[0][0]'] \n", + " \n", + " z_mean (Dense) (None, 2) 34 ['dense_1[0][0]'] \n", + " \n", + " z_log_var (Dense) (None, 2) 34 ['dense_1[0][0]'] \n", + " \n", + "==================================================================================================\n", + "Total params: 69,076\n", + "Trainable params: 69,076\n", + "Non-trainable params: 0\n", + "__________________________________________________________________________________________________\n" + ] + } + ], + "source": [ + "encoder.summary()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Next is the code for using `z_mean` and `z_log_var`, the parameters of the statistical distribution\n", + "assumed to have produced input_img, to generate a latent space point `z`." 
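+    "\n",
+    "\n",
+    "Concretely, the sampling layer below computes `z = z_mean + exp(0.5 * z_log_var) * epsilon`,\n",
+    "with `epsilon` drawn from a standard normal distribution; this is the sampling step described\n",
+    "earlier. The factor 0.5 appears because `z_log_var` is the logarithm of the variance, so\n",
+    "`exp(0.5 * z_log_var)` is the standard deviation."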
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Latent-space-sampling layer" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "import tensorflow as tf\n", + "class Sampler(layers.Layer):\n", + " def call(self, z_mean, z_log_var):\n", + " batch_size = tf.shape(z_mean)[0]\n", + " z_size = tf.shape(z_mean)[1]\n", + " # Draw a batch of random normal\n", + " # Apply the VAE vectors.\n", + " epsilon = tf.random.normal(shape=(batch_size, z_size))\n", + " return z_mean + tf.exp(0.5 * z_log_var) * epsilon" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The following listing shows the decoder implementation. We reshape the vector `z` to\n", + "the dimensions of an image and then use a few convolution layers to obtain a final\n", + "image output that has the same dimensions as the original input_img." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [], + "source": [ + "# Input where we’ll feed z\n", + "latent_inputs = keras.Input(shape=(latent_dim,))\n", + "# Produce the same number of coefficients that we\n", + "# had at the level of the Flatten layer in the encoder\n", + "x = layers.Dense(7 * 7 * 64, activation=\"relu\")(latent_inputs)\n", + "# Revert the Flatten layer of the encoder\n", + "x = layers.Reshape((7, 7, 64))(x)\n", + "# Revert the Conv2D layers of the encoder\n", + "x = layers.Conv2DTranspose(64, 3, activation=\"relu\", strides=2, padding=\"same\")(x)\n", + "x = layers.Conv2DTranspose(32, 3, activation=\"relu\", strides=2, padding=\"same\")(x)\n", + "# The output ends up with shape (28, 28, 1)\n", + "decoder_outputs = layers.Conv2D(1, 3, activation=\"sigmoid\", padding=\"same\")(x)\n", + "decoder = keras.Model(latent_inputs, decoder_outputs, name=\"decoder\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Its summary looks like this:" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Model: \"decoder\"\n", + "_________________________________________________________________\n", + " Layer (type) Output Shape Param # \n", + "=================================================================\n", + " input_3 (InputLayer) [(None, 2)] 0 \n", + " \n", + " dense_2 (Dense) (None, 3136) 9408 \n", + " \n", + " reshape (Reshape) (None, 7, 7, 64) 0 \n", + " \n", + " conv2d_transpose (Conv2DTra (None, 14, 14, 64) 36928 \n", + " nspose) \n", + " \n", + " conv2d_transpose_1 (Conv2DT (None, 28, 28, 32) 18464 \n", + " ranspose) \n", + " \n", + " conv2d_4 (Conv2D) (None, 28, 28, 1) 289 \n", + " \n", + "=================================================================\n", + "Total params: 65,089\n", + "Trainable params: 65,089\n", + "Non-trainable params: 0\n", + "_________________________________________________________________\n" + ] + } + ], + "source": [ + "decoder.summary()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now let’s create the VAE model itself. This is your first example of a model that isn’t\n", + "doing supervised learning (an autoencoder is an example of _self-supervised learning_,\n", + "because it uses its inputs as targets). Whenever you depart from classic supervised\n", + "learning, it’s common to subclass the `Model` class and implement a custom `train_\n", + "step()` to specify the new training logic.\n", + "That’s what we’ll do here." 
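+    "\n",
+    "\n",
+    "As a reading aid for the listing below: the custom `train_step()` combines the two losses\n",
+    "discussed above. The reconstruction loss is the binary cross-entropy between the input batch\n",
+    "and its reconstruction, summed over the spatial dimensions, and the regularization term is the\n",
+    "Kullback–Leibler divergence of the encoded Gaussian from a standard normal distribution,\n",
+    "`kl_loss = -0.5 * (1 + z_log_var - z_mean**2 - exp(z_log_var))`, whose mean is added to the\n",
+    "reconstruction loss."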
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "class VAE(keras.Model):\n",
+    "    def __init__(self, encoder, decoder, **kwargs):\n",
+    "        super().__init__(**kwargs)\n",
+    "        self.encoder = encoder\n",
+    "        self.decoder = decoder\n",
+    "        self.sampler = Sampler()\n",
+    "        # We use these metrics to keep track of the loss averages\n",
+    "        # over each epoch.\n",
+    "        self.total_loss_tracker = keras.metrics.Mean(name=\"total_loss\")\n",
+    "        self.reconstruction_loss_tracker = keras.metrics.Mean(\n",
+    "            name=\"reconstruction_loss\")\n",
+    "        self.kl_loss_tracker = keras.metrics.Mean(name=\"kl_loss\")\n",
+    "\n",
+    "    # Implementing call() lets Keras build the subclassed model; without it,\n",
+    "    # fit() raises a NotImplementedError under TensorFlow 2.7.\n",
+    "    def call(self, inputs):\n",
+    "        z_mean, z_log_var = self.encoder(inputs)\n",
+    "        z = self.sampler(z_mean, z_log_var)\n",
+    "        return self.decoder(z)\n",
+    "\n",
+    "    # We list the metrics in the metrics\n",
+    "    # property to enable the model to reset\n",
+    "    # them after each epoch (or between\n",
+    "    # multiple calls to fit()/evaluate())\n",
+    "    @property\n",
+    "    def metrics(self):\n",
+    "        return [self.total_loss_tracker,\n",
+    "                self.reconstruction_loss_tracker,\n",
+    "                self.kl_loss_tracker]\n",
+    "\n",
+    "    def train_step(self, data):\n",
+    "        with tf.GradientTape() as tape:\n",
+    "            z_mean, z_log_var = self.encoder(data)\n",
+    "            z = self.sampler(z_mean, z_log_var)\n",
+    "            reconstruction = self.decoder(z)\n",
+    "            # We sum the reconstruction loss over the spatial\n",
+    "            # dimensions (axes 1 and 2) and take its mean over the\n",
+    "            # batch dimension.\n",
+    "            reconstruction_loss = tf.reduce_mean(\n",
+    "                tf.reduce_sum(keras.losses.binary_crossentropy(data, reconstruction), axis=(1, 2))\n",
+    "            )\n",
+    "            # Add the regularization term (Kullback–Leibler divergence)\n",
+    "            kl_loss = -0.5 * (1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var))\n",
+    "            total_loss = reconstruction_loss + tf.reduce_mean(kl_loss)\n",
+    "\n",
+    "        grads = tape.gradient(total_loss, self.trainable_weights)\n",
+    "        self.optimizer.apply_gradients(zip(grads, self.trainable_weights))\n",
+    "        self.total_loss_tracker.update_state(total_loss)\n",
+    "        self.reconstruction_loss_tracker.update_state(reconstruction_loss)\n",
+    "        self.kl_loss_tracker.update_state(kl_loss)\n",
+    "        return {\n",
+    "            \"total_loss\": self.total_loss_tracker.result(),\n",
+    "            \"reconstruction_loss\": self.reconstruction_loss_tracker.result(),\n",
+    "            \"kl_loss\": self.kl_loss_tracker.result(),\n",
+    "        }"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Finally, we’re ready to instantiate and train the model on MNIST digits. Because the\n",
+    "loss is taken care of in the custom `train_step()`, we don’t specify an external loss at compile\n",
+    "time (`loss=None`), which in turn means we won’t pass target data during training; as\n",
+    "you can see, we only pass `x_train` to the model in `fit()`."
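+    "\n",
+    "\n",
+    "Once training has finished, the decoder can be used on its own to generate new digits.\n",
+    "The following is a minimal sketch (not part of the original notebook) that decodes a few\n",
+    "random latent points; it only assumes the trained `vae` object and `latent_dim` defined above:\n",
+    "\n",
+    "```python\n",
+    "import numpy as np\n",
+    "\n",
+    "# Sample 10 random points from the 2D latent space and decode them\n",
+    "# into 28 x 28 grayscale images with the trained decoder.\n",
+    "random_points = np.random.normal(size=(10, latent_dim))\n",
+    "generated_digits = vae.decoder.predict(random_points)  # shape (10, 28, 28, 1)\n",
+    "```"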
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Training the VAE" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Epoch 1/30\n" + ] + }, + { + "ename": "NotImplementedError", + "evalue": "Exception encountered when calling layer \"vae_1\" (type VAE).\n\nWhen subclassing the `Model` class, you should implement a `call()` method.\n\nCall arguments received:\n • inputs=tf.Tensor(shape=(128, 28, 28, 1), dtype=float32)\n • training=True\n • mask=None", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mNotImplementedError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m<ipython-input-11-97ae65a7d644>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m 11\u001b[0m \u001b[0;31m# Note that we don’t pass targets in fit(), since train_step()\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 12\u001b[0m \u001b[0;31m# doesn’t expect any\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 13\u001b[0;31m \u001b[0mvae\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmnist_digits\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mepochs\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m30\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mbatch_size\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m128\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;32m/opt/conda/lib/python3.7/site-packages/keras/utils/traceback_utils.py\u001b[0m in \u001b[0;36merror_handler\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m 65\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mException\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;31m# pylint: disable=broad-except\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 66\u001b[0m \u001b[0mfiltered_tb\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0m_process_traceback_frames\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0me\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__traceback__\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 67\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mwith_traceback\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfiltered_tb\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 68\u001b[0m \u001b[0;32mfinally\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 69\u001b[0m \u001b[0;32mdel\u001b[0m \u001b[0mfiltered_tb\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m/opt/conda/lib/python3.7/site-packages/keras/engine/training.py\u001b[0m in \u001b[0;36mcall\u001b[0;34m(self, inputs, training, mask)\u001b[0m\n\u001b[1;32m 473\u001b[0m \u001b[0ma\u001b[0m \u001b[0mlist\u001b[0m \u001b[0mof\u001b[0m \u001b[0mtensors\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mthere\u001b[0m \u001b[0mare\u001b[0m \u001b[0mmore\u001b[0m \u001b[0mthan\u001b[0m \u001b[0mone\u001b[0m \u001b[0moutputs\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 474\u001b[0m \"\"\"\n\u001b[0;32m--> 475\u001b[0;31m raise 
NotImplementedError('When subclassing the `Model` class, you should '\n\u001b[0m\u001b[1;32m 476\u001b[0m 'implement a `call()` method.')\n\u001b[1;32m 477\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;31mNotImplementedError\u001b[0m: Exception encountered when calling layer \"vae_1\" (type VAE).\n\nWhen subclassing the `Model` class, you should implement a `call()` method.\n\nCall arguments received:\n • inputs=tf.Tensor(shape=(128, 28, 28, 1), dtype=float32)\n • training=True\n • mask=None" + ] + } + ], + "source": [ + "import numpy as np\n", + "(x_train, _), (x_test, _) = keras.datasets.mnist.load_data()\n", + "# We train on all MNIST digits, so we concatenate\n", + "# the training and test samples\n", + "mnist_digits = np.concatenate([x_train, x_test], axis=0)\n", + "mnist_digits = np.expand_dims(mnist_digits, -1).astype(\"float32\") / 255\n", + "vae = VAE(encoder, decoder)\n", + "# Note that we don’t pass a loss argument in compile(), since the loss\n", + "# is already part of the train_step().\n", + "vae.compile(optimizer=keras.optimizers.Adam(), run_eagerly=True)\n", + "# Note that we don’t pass targets in fit(), since train_step()\n", + "# doesn’t expect any\n", + "vae.fit(mnist_digits, epochs=30, batch_size=128)" ] }, { @@ -535,7 +885,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Part IV : Adverserial Networks" + "# Part II : Adverserial Networks" ] }, { @@ -961,7 +1311,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Part I : Deep Dreams\n", + "# Part III : Deep Dreams\n", "\n", "Unzip the `Bilder.zip` file in the same directory where you run this notebook." ] @@ -1399,7 +1749,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Part II : Neural style transfer\n", + "# Part IV : Neural style transfer\n", "\n", "In addition to DeepDream, another major development in deep-learning-driven\n", "image modification is __neural style transfer__, introduced by Leon Gatys et al. in the summer\n", @@ -2012,963 +2362,6 @@ "\n", "3. Give more weight on content image or style image." ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Part III : Generating Images with Variational Autoencoders (VAE)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Sampling from a latent space of images to create entirely new images or edit existing\n", - "ones is currently the most popular and successful application of creative AI. In this section\n", - "and the next, we’ll review some high-level concepts pertaining to image generation,\n", - "alongside implementations details relative to the two main techniques in this\n", - "domain: _variational autoencoders_ (VAEs) and _generative adversarial networks_ (GANs). \n", - "\n", - "The techniques we present here aren’t specific to images—you could develop latent spaces\n", - "of sound, music, or even text, using GANs and VAEs—but in practice, the most interesting\n", - "results have been obtained with pictures, and that’s what we focus on here." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Sampling from Latent Spaces of Images" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The key idea of image generation is to develop a low-dimensional latent space of representations\n", - "(which naturally is a vector space) where any point can be mapped to a\n", - "realistic-looking image. 
The module capable of realizing this mapping, taking as input\n", - "a latent point and outputting an image (a grid of pixels), is called a _generator_ (in the\n", - "case of GANs) or a _decoder_ (in the case of VAEs). \n", - "\n", - "Once such a latent space has been developed, you can sample points from it, either deliberately \n", - "or at random, and, by mapping them to image space, generate images that have never been seen before.\n", - "\n", - "<img src='./Bilder/latent_space.jpg'>" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "GANs and VAEs are two different strategies for learning such latent spaces of image\n", - "representations, each with its own characteristics. VAEs are great for learning latent\n", - "spaces that are well structured, where specific directions encode a meaningful axis of\n", - "variation in the data. GANs generate images that can potentially be highly realistic, but\n", - "the latent space they come from may not have as much structure and continuity." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Concept vectors for image editing\n", - "\n", - "The idea of concept vectors is the following : given a latent space of representations, or an\n", - "embedding space, certain directions in the space may encode interesting axes of variation\n", - "in the original data. \n", - "\n", - "In a latent space of images of faces, for instance, there may\n", - "be a smile vector $s$, such that if latent point $z$ is the embedded representation of a certain\n", - "face, then latent point $z + s$ is the embedded representation of the same face,\n", - "smiling. \n", - "\n", - "\n", - "Once you’ve identified such a vector, it then becomes possible to edit images\n", - "by projecting them into the latent space, moving their representation in a meaningful\n", - "way, and then decoding them back to image space. There are concept vectors for\n", - "essentially any independent dimension of variation in image space—in the case of\n", - "faces, you may discover vectors for adding sunglasses to a face, removing glasses, turning\n", - "a male face into a female face, and so on. \n", - "\n", - "\n", - "The Figure below is an example of a smile vector,\n", - "a concept vector discovered by Tom White from the Victoria University School of\n", - "Design in New Zealand, using VAEs trained on a dataset of faces of celebrities (the\n", - "CelebA dataset).\n", - "\n", - "<img src='./Bilder/smile_vector.jpg'>\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Variational autoencoders\n", - "\n", - "Variational autoencoders, simultaneously discovered by Kingma and Welling in\n", - "December 2013 and Rezende, Mohamed, and Wierstra in January 2014 are a kind\n", - "of generative model that’s especially appropriate for the task of image editing via concept\n", - "vectors. 
They’re a modern take on autoencoders — a type of network that aims to\n", - "encode an input to a low-dimensional latent space and then decode it back—that\n", - "mixes ideas from deep learning with Bayesian inference.\n", - "\n", - "\n", - "A classical image autoencoder takes an image, maps it to a latent vector space via\n", - "an encoder module, and then decodes it back to an output with the same dimensions\n", - "as the original image, via a decoder module, see the figure below: \n", - "\n", - "<img src='./Bilder/autoencoder.jpg'>\n", - "\n", - "It’s then trained by\n", - "using as target data the same images as the input images, meaning the autoencoder\n", - "learns to reconstruct the original inputs. By imposing various constraints on the code\n", - "(the output of the encoder), you can get the autoencoder to learn more-or-less interesting\n", - "latent representations of the data. \n", - "\n", - "Most commonly, you’ll constrain the code to\n", - "be low-dimensional and sparse (mostly zeros), in which case the encoder acts as a way\n", - "to compress the input data into fewer bits of information." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "In practice, such classical autoencoders don’t lead to particularly useful or nicely\n", - "structured latent spaces. They’re not much good at compression, either. For these reasons,\n", - "they have largely fallen out of fashion. VAEs, however, augment autoencoders\n", - "with a little bit of statistical magic that forces them to learn continuous, highly structured\n", - "latent spaces. They have turned out to be a powerful tool for image generation.\n", - "\n", - "\n", - "A VAE, instead of compressing its input image into a fixed code in the latent space,\n", - "turns the image into the parameters of a statistical distribution: a mean and a variance.\n", - "Essentially, this means you’re assuming the input image has been generated by a\n", - "statistical process, and that the randomness of this process should be taken into\n", - "account during encoding and decoding. The VAE then uses the mean and variance\n", - "parameters to randomly sample one element of the distribution, and decodes that element\n", - "back to the original input, see the Figure below: \n", - "\n", - "\n", - "<img src='./Bilder/vae_illustration.jpg'>\n", - "\n", - "A VAE maps an image to two vectors, `z_mean` and `z_log_sigma`, which define\n", - "a probability distribution over the latent space, used to sample a latent point to decode.\n", - "\n", - "\n", - "The stochasticity of this process\n", - "improves robustness and forces the latent space to encode meaningful representations\n", - "everywhere: every point sampled in the latent space is decoded to a valid output." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "In technical terms, here’s how a VAE works:\n", - "\n", - "1. An encoder module turns the input samples input_img into two parameters in\n", - "a latent space of representations, `z_mean` and `z_log_variance`\n", - "\n", - "2. You randomly sample a point z from the latent normal distribution that’s\n", - "assumed to generate the input image, via\n", - "`z = z_mean + exp(z_log_variance) * epsilon`\n", - "\n", - "where epsilon is a random tensor of small values.\n", - "\n", - "3. 
A decoder module maps this point in the latent space back to the original input\n", - "image.\n", - "\n", - "\n", - "\n", - "Because `epsilon` is random, the process ensures that every point that’s close to the latent location\n", - "where you encoded `input_img` (z-mean) can be decoded to something similar to\n", - "`input_img`, thus forcing the latent space to be continuously meaningful. Any two close points\n", - "in the latent space will decode to highly similar images. Continuity, combined with the low\n", - "dimensionality of the latent space, forces every direction in the latent space to encode a meaningful\n", - "axis of variation of the data, making the latent space very structured and thus highly suitable\n", - "to manipulation via concept vectors.\n", - "\n", - "\n", - "\n", - "The parameters of a VAE are trained via two loss functions: a _reconstruction loss_ that\n", - "forces the decoded samples to match the initial inputs, and a _regularization loss_ that\n", - "helps learn well-formed latent spaces and reduce overfitting to the training data. Let’s\n", - "quickly go over a Keras implementation of a VAE. Schematically, it looks like this:" - ] - }, - { - "cell_type": "raw", - "metadata": {}, - "source": [ - "# Encodes the input into a mean and variance parameter\n", - "z_mean, z_log_variance = encoder(input_img)\n", - "\n", - "# Draws a latent point using a small random epsilon\n", - "z = z_mean + exp(z_log_variance) * epsilon\n", - "\n", - "# Decodes z back to an image\n", - "reconstructed_img = decoder(z)\n", - "\n", - "# Instantiates the autoencoder model, which maps an \n", - "# input image to its reconstruction\n", - "model = Model(input_img, reconstructed_img)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "You can then train the model using the reconstruction loss and the regularization loss.\n", - "The following listing shows the encoder network you’ll use, mapping images to the\n", - "parameters of a probability distribution over the latent space. It’s a simple convnet\n", - "that maps the input image x to two vectors, `z_mean` and `z_log_var`." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Latent-space-sampling function" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import numpy as np\n", - "import tensorflow as tf\n", - "from tensorflow import keras\n", - "from tensorflow.keras import layers\n", - "\n", - "class Sampling(layers.Layer):\n", - " \"\"\"Uses (z_mean, z_log_var) to sample z, the vector encoding a digit.\"\"\"\n", - "\n", - " def call(self, inputs):\n", - " z_mean, z_log_var = inputs\n", - " batch = tf.shape(z_mean)[0]\n", - " dim = tf.shape(z_mean)[1]\n", - " epsilon = tf.keras.backend.random_normal(shape=(batch, dim))\n", - " return z_mean + tf.exp(0.5 * z_log_var) * epsilon\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### VAE encoder network" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "latent_dim = 2\n", - "\n", - "encoder_inputs = keras.Input(shape=(28, 28, 1))\n", - "x = layers.Conv2D(32, 3, activation=\"relu\", strides=2, padding=\"same\")(encoder_inputs)\n", - "x = layers.Conv2D(64, 3, activation=\"relu\", strides=2, padding=\"same\")(x)\n", - "x = layers.Flatten()(x)\n", - "x = layers.Dense(16, activation=\"relu\")(x)\n", - "z_mean = layers.Dense(latent_dim, name=\"z_mean\")(x)\n", - "z_log_var = layers.Dense(latent_dim, name=\"z_log_var\")(x)\n", - "z = Sampling()([z_mean, z_log_var])\n", - "encoder = keras.Model(encoder_inputs, [z_mean, z_log_var, z], name=\"encoder\")\n", - "encoder.summary()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Next is the code for using `z_mean` and `z_log_var`, the parameters of the statistical distribution\n", - "assumed to have produced `input_img`, to generate a latent space point z.\n", - "Here, you wrap some arbitrary code (built on top of Keras backend primitives) into a\n", - "`Lambda` layer. In Keras, everything needs to be a layer, so code that isn’t part of a builtin\n", - "layer should be wrapped in a `Lambda` (or in a custom layer)." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### VAE decoder network, mapping latent space points to images\n", - "\n", - "The following listing shows the decoder implementation. You reshape the vector z to\n", - "the dimensions of an image and then use a few convolution layers to obtain a final\n", - "image output that has the same dimensions as the original `input_img`." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "latent_inputs = keras.Input(shape=(latent_dim,))\n", - "x = layers.Dense(7 * 7 * 64, activation=\"relu\")(latent_inputs)\n", - "x = layers.Reshape((7, 7, 64))(x)\n", - "x = layers.Conv2DTranspose(64, 3, activation=\"relu\", strides=2, padding=\"same\")(x)\n", - "x = layers.Conv2DTranspose(32, 3, activation=\"relu\", strides=2, padding=\"same\")(x)\n", - "decoder_outputs = layers.Conv2DTranspose(1, 3, activation=\"sigmoid\", padding=\"same\")(x)\n", - "decoder = keras.Model(latent_inputs, decoder_outputs, name=\"decoder\")\n", - "decoder.summary()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The dual loss of a VAE doesn’t fit the traditional expectation of a sample-wise function\n", - "of the form `loss(input, target)`. 
Thus, you’ll set up the loss by writing a custom\n", - "layer that internally uses the built-in `add_loss` layer method to create an arbitrary loss." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Custom layer used to compute the VAE loss" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "class VAE(keras.Model):\n", - " def __init__(self, encoder, decoder, **kwargs):\n", - " super(VAE, self).__init__(**kwargs)\n", - " self.encoder = encoder\n", - " self.decoder = decoder\n", - "\n", - " def train_step(self, data):\n", - " if isinstance(data, tuple):\n", - " data = data[0]\n", - " with tf.GradientTape() as tape:\n", - " z_mean, z_log_var, z = encoder(data)\n", - " reconstruction = decoder(z)\n", - " reconstruction_loss = tf.reduce_mean(\n", - " keras.losses.binary_crossentropy(data, reconstruction)\n", - " )\n", - " reconstruction_loss *= 28 * 28\n", - " kl_loss = 1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var)\n", - " kl_loss = tf.reduce_mean(kl_loss)\n", - " kl_loss *= -0.5\n", - " total_loss = reconstruction_loss + kl_loss\n", - " grads = tape.gradient(total_loss, self.trainable_weights)\n", - " self.optimizer.apply_gradients(zip(grads, self.trainable_weights))\n", - " return {\n", - " \"loss\": total_loss,\n", - " \"reconstruction_loss\": reconstruction_loss,\n", - " \"kl_loss\": kl_loss,\n", - " }" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Finally, you’re ready to instantiate and train the model. Because the loss is taken care\n", - "of in the custom layer, you don’t specify an external loss at compile time (`loss=None`),\n", - "which in turn means you won’t pass target data during training (as you can see, you\n", - "only pass `x_train` to the model in `fit`)." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Training the VAE" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "(x_train, _), (x_test, _) = keras.datasets.mnist.load_data()\n", - "mnist_digits = np.concatenate([x_train, x_test], axis=0)\n", - "mnist_digits = np.expand_dims(mnist_digits, -1).astype(\"float32\") / 255\n", - "\n", - "vae = VAE(encoder, decoder)\n", - "vae.compile(optimizer=keras.optimizers.Adam())\n", - "vae.fit(mnist_digits, epochs=30, batch_size=128)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Once such a model is trained—on MNIST, in this case—you can use the decoder network\n", - "to turn arbitrary latent space vectors into images." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Sampling a grid of points from the 2D latent space and decoding them to images" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import matplotlib.pyplot as plt\n", - "\n", - "\n", - "def plot_latent_space(vae, n=30, figsize=15):\n", - " # display a n*n 2D manifold of digits\n", - " digit_size = 28\n", - " scale = 1.0\n", - " figure = np.zeros((digit_size * n, digit_size * n))\n", - " # linearly spaced coordinates corresponding to the 2D plot\n", - " # of digit classes in the latent space\n", - " grid_x = np.linspace(-scale, scale, n)\n", - " grid_y = np.linspace(-scale, scale, n)[::-1]\n", - "\n", - " for i, yi in enumerate(grid_y):\n", - " for j, xi in enumerate(grid_x):\n", - " z_sample = np.array([[xi, yi]])\n", - " x_decoded = vae.decoder.predict(z_sample)\n", - " digit = x_decoded[0].reshape(digit_size, digit_size)\n", - " figure[\n", - " i * digit_size : (i + 1) * digit_size,\n", - " j * digit_size : (j + 1) * digit_size,\n", - " ] = digit\n", - "\n", - " plt.figure(figsize=(figsize, figsize))\n", - " start_range = digit_size // 2\n", - " end_range = n * digit_size + start_range\n", - " pixel_range = np.arange(start_range, end_range, digit_size)\n", - " sample_range_x = np.round(grid_x, 1)\n", - " sample_range_y = np.round(grid_y, 1)\n", - " plt.xticks(pixel_range, sample_range_x)\n", - " plt.yticks(pixel_range, sample_range_y)\n", - " plt.xlabel(\"z[0]\")\n", - " plt.ylabel(\"z[1]\")\n", - " plt.imshow(figure, cmap=\"Greys_r\")\n", - " plt.show()\n", - "\n", - "\n", - "plot_latent_space(vae)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The grid of sampled digits (see the Figure above) shows a completely continuous\n", - "distribution of the different digit classes, with one digit morphing\n", - "into another as you follow a path through latent space. Specific directions\n", - "in this space have a meaning: for example, there’s a direction for\n", - "“four-ness,†“one-ness,†and so on.\n", - "\n", - "\n", - "In the next section, we’ll cover in detail the other major tool for generating\n", - "artificial images: generative adversarial networks (GANs)." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Display how the latent space clusters different digit classes" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "def plot_label_clusters(vae, data, labels):\n", - " # display a 2D plot of the digit classes in the latent space\n", - " z_mean, _, _ = vae.encoder.predict(data)\n", - " plt.figure(figsize=(12, 10))\n", - " plt.scatter(z_mean[:, 0], z_mean[:, 1], c=labels)\n", - " plt.colorbar()\n", - " plt.xlabel(\"z[0]\")\n", - " plt.ylabel(\"z[1]\")\n", - " plt.show()\n", - "\n", - "\n", - "(x_train, y_train), _ = keras.datasets.mnist.load_data()\n", - "x_train = np.expand_dims(x_train, -1).astype(\"float32\") / 255\n", - "\n", - "plot_label_clusters(vae, x_train, y_train)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Wrapping up \n", - "\n", - "Wrapping up\n", - "- Image generation with deep learning is done by learning latent spaces that capture\n", - "statistical information about a dataset of images. 
By sampling and decoding\n", - "points from the latent space, you can generate never-before-seen images.\n", - "There are two major tools to do this: VAEs and GANs.\n", - "\n", - "- VAEs result in highly structured, continuous latent representations. For this reason,\n", - "they work well for doing all sorts of image editing in latent space: face\n", - "swapping, turning a frowning face into a smiling face, and so on. They also work\n", - "nicely for doing latent-space-based animations, such as animating a walk along a\n", - "cross section of the latent space, showing a starting image slowly morphing into\n", - "different images in a continuous way.\n", - "\n", - "- GANs enable the generation of realistic single-frame images but may not induce\n", - "latent spaces with solid structure and high continuity.\n", - "Most successful practical applications I have seen with images rely on VAEs, but GANs\n", - "are extremely popular in the world of academic research—at least, circa 2016–2017.\n", - "You’ll find out how they work and how to implement one in the next section.\n", - "\n", - "## Extensions for VAE\n", - "\n", - "\n", - "To play further with image generation, I suggest working with the [Largescale\n", - "Celeb Faces Attributes](http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html) (CelebA) dataset. It’s a free-to-download image\n", - "dataset containing more than 200,000 celebrity portraits. It’s great for experimenting\n", - "with concept vectors in particular—it definitely beats MNIST." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Part IV : Adverserial Networks" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Generative adversarial networks (GANs), introduced in 2014 by \n", - "[Goodfellow et al.,](https://arxiv.org/abs/1406.2661) are\n", - "an alternative to VAEs for learning latent spaces of images. They enable the generation\n", - "of fairly realistic synthetic images by forcing the generated images to be statistically\n", - "almost indistinguishable from real ones.\n", - "\n", - "\n", - "An intuitive way to understand GANs is to imagine a forger trying to create a fake\n", - "Picasso painting. At first, the forger is pretty bad at the task. He mixes some of his\n", - "fakes with authentic Picassos and shows them all to an art dealer. The art dealer makes\n", - "an authenticity assessment for each painting and gives the forger feedback about what\n", - "makes a Picasso look like a Picasso. The forger goes back to his studio to prepare some\n", - "new fakes. As times goes on, the forger becomes increasingly competent at imitating\n", - "the style of Picasso, and the art dealer becomes increasingly expert at spotting fakes.\n", - "In the end, they have on their hands some excellent fake Picassos.\n", - "\n", - "\n", - "That’s what a GAN is: a forger network and an expert network, each being trained\n", - "to best the other. 
As such, a GAN is made of two parts:\n", - "\n", - "- _Generator network_ — Takes as input a random vector (a random point in the\n", - "latent space), and decodes it into a synthetic image\n", - "- _Discriminator network_ (or adversary) — Takes as input an image (real or synthetic),\n", - "and predicts whether the image came from the training set or was created by\n", - "the generator network.\n", - "\n", - "The generator network is trained to be able to fool the discriminator network, and\n", - "thus it evolves toward generating increasingly realistic images as training goes on: artificial\n", - "images that look indistinguishable from real ones, to the extent that it’s impossible\n", - "for the discriminator network to tell the two apart (see figure 8.15). Meanwhile,\n", - "the discriminator is constantly adapting to the gradually improving capabilities of the\n", - "generator, setting a high bar of realism for the generated images. Once training is\n", - "over, the generator is capable of turning any point in its input space into a believable\n", - "image. Unlike VAEs, this latent space has fewer explicit guarantees of meaningful\n", - "structure; in particular, it isn’t continuous.\n", - "\n", - "\n", - "<img src='./Bilder/gan_illustration.jpg'>\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Remarkably, a GAN is a system where the optimization minimum isn’t fixed, unlike in\n", - "any other training setup you’ve encountered in this book. Normally, gradient descent\n", - "consists of rolling down hills in a static loss landscape. But with a GAN, every step\n", - "taken down the hill changes the entire landscape a little. \n", - "\n", - "It’s a dynamic system where\n", - "the optimization process is seeking not a minimum, but an equilibrium between two\n", - "forces. For this reason, GANs are notoriously difficult to train—getting a GAN to work\n", - "requires lots of careful tuning of the model architecture and training parameters" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### A schematic GAN implementation\n", - "\n", - "In this section, we’ll explain how to implement a GAN in Keras, in its barest form—\n", - "because GANs are advanced, diving deeply into the technical details would be out of\n", - "scope for this chapter. The specific implementation is a _deep convolutional GAN_ (DCGAN):\n", - "a GAN where the generator and discriminator are deep convnets. In particular, it uses\n", - "a `Conv2DTranspose` layer for image upsampling in the generator.\n", - "You’ll train the GAN on images from CIFAR10, a dataset of 50,000 32 × 32 RGB\n", - "images belonging to 10 classes (5,000 images per class). To make things easier, you’ll\n", - "only use images belonging to the class “frog.â€\n", - "Schematically, the GAN looks like this:\n", - "\n", - "1. A generator network maps vectors of shape (`latent_dim`) to images of shape\n", - "$(32, 32, 3)$.\n", - "\n", - "2. A discriminator network maps images of shape $(32, 32, 3)$ to a binary score\n", - "estimating the probability that the image is real.\n", - "\n", - "3. A gan network chains the generator and the discriminator together: \n", - "`gan(x) = discriminator(generator(x))`. Thus this gan network maps latent space vectors\n", - "to the discriminator’s assessment of the realism of these latent vectors as\n", - "decoded by the generator.\n", - "\n", - "\n", - "4. 
You train the discriminator using examples of real and fake images along with\n", - "“realâ€/“fake†labels, just as you train any regular image-classification model.\n", - "\n", - "5. To train the generator, you use the gradients of the generator’s weights with\n", - "regard to the loss of the gan model. This means, at every step, you move the\n", - "weights of the generator in a direction that makes the discriminator more likely\n", - "to classify as “real†the images decoded by the generator. In other words, you\n", - "train the generator to fool the discriminator." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### A bag of tricks\n", - "\n", - "The process of training GANs and tuning GAN implementations is notoriously difficult.\n", - "There are a number of known tricks you should keep in mind. Like most things\n", - "in deep learning, it’s more alchemy than science: these tricks are heuristics, not\n", - "theory-backed guidelines. They’re supported by a level of intuitive understanding of\n", - "the phenomenon at hand, and they’re known to work well empirically, although not\n", - "necessarily in every context.\n", - "Here are a few of the tricks used in the implementation of the GAN generator and\n", - "discriminator in this section. It isn’t an exhaustive list of GAN-related tips; you’ll find\n", - "many more across the GAN literature:\n", - "\n", - "- We use `tanh` as the last activation in the generator, instead of `sigmoid`, which is\n", - "more commonly found in other types of models.\n", - "\n", - "- We sample points from the latent space using a _normal distribution_ (Gaussian distribution),\n", - "not a uniform distribution.\n", - "\n", - "- Stochasticity is good to induce robustness. Because GAN training results in a\n", - "dynamic equilibrium, GANs are likely to get stuck in all sorts of ways. Introducing\n", - "randomness during training helps prevent this. We introduce randomness\n", - "in two ways: by using dropout in the discriminator and by adding random noise\n", - "to the labels for the discriminator.\n", - "\n", - "- Sparse gradients can hinder GAN training. In deep learning, sparsity is often a\n", - "desirable property, but not in GANs. Two things can induce gradient sparsity:\n", - "max pooling operations and `ReLU` activations. Instead of max pooling, we recommend\n", - "using strided convolutions for downsampling, and we recommend\n", - "using a `LeakyReLU` layer instead of a ReLU activation. It’s similar to `ReLU`, but it\n", - "relaxes sparsity constraints by allowing small negative activation values.\n", - "\n", - "- In generated images, it’s common to see checkerboard artifacts caused by\n", - "unequal coverage of the pixel space in the generator (see figure 8.17). To fix\n", - "this, we use a kernel size that’s divisible by the stride size whenever we use a\n", - "strided `Conv2DTranpose` or `Conv2D` in both the generator and the discriminator." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## The generator\n", - "\n", - "First, let’s develop a `generator` model that turns a vector (from the latent space—\n", - "during training it will be sampled at random) into a candidate image. One of the\n", - "many issues that commonly arise with GANs is that the generator gets stuck with generated\n", - "images that look like noise. A possible solution is to use dropout on both the discriminator\n", - "and the generator." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### GAN generator network" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import tensorflow\n", - "from tensorflow import keras\n", - "from tensorflow.keras import layers\n", - "import numpy as np\n", - "latent_dim = 32\n", - "height = 32\n", - "width = 32\n", - "channels = 3\n", - "\n", - "generator_input = keras.Input(shape=(latent_dim,))\n", - "# Transforms the input into a 16 × 16 128-channel feature map\n", - "x = layers.Dense(128 * 16 * 16)(generator_input)\n", - "x = layers.LeakyReLU()(x)\n", - "x = layers.Reshape((16, 16, 128))(x)\n", - "x = layers.Conv2D(256, 5, padding='same')(x)\n", - "x = layers.LeakyReLU()(x)\n", - "# Upsamples to 32 × 32\n", - "x = layers.Conv2DTranspose(256, 4, strides=2, padding='same')(x)\n", - "x = layers.LeakyReLU()(x)\n", - "x = layers.Conv2D(256, 5, padding='same')(x)\n", - "x = layers.LeakyReLU()(x)\n", - "x = layers.Conv2D(256, 5, padding='same')(x)\n", - "x = layers.LeakyReLU()(x)\n", - "\n", - "# Produces a 32 × 32 1-channel feature map (shape of a CIFAR10 image)\n", - "x = layers.Conv2D(channels, 7, activation='tanh', padding='same')(x)\n", - "# Instantiates the generator model, which maps the input\n", - "# of shape (latent_dim,) into an image of shape (32, 32, 3)\n", - "generator = keras.models.Model(generator_input, x)\n", - "generator.summary()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### The discriminator\n", - "\n", - "Next, you’ll develop a discriminator model that takes as input a candidate image\n", - "(real or synthetic) and classifies it into one of two classes: “generated image†or “real\n", - "image that comes from the training set.â€" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### The GAN discriminator network" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "discriminator_input = layers.Input(shape=(height, width, channels))\n", - "x = layers.Conv2D(128, 3)(discriminator_input)\n", - "x = layers.LeakyReLU()(x)\n", - "x = layers.Conv2D(128, 4, strides=2)(x)\n", - "x = layers.LeakyReLU()(x)\n", - "x = layers.Conv2D(128, 4, strides=2)(x)\n", - "x = layers.LeakyReLU()(x)\n", - "x = layers.Conv2D(128, 4, strides=2)(x)\n", - "x = layers.LeakyReLU()(x)\n", - "x = layers.Flatten()(x)\n", - "# One dropout layer: an important trick!\n", - "x = layers.Dropout(0.4)(x)\n", - "# Classification layer\n", - "x = layers.Dense(1, activation='sigmoid')(x)\n", - "# Instantiates the discriminator model, which turns\n", - "# a (32, 32, 3) input into a binary classification\n", - "# decision (fake/real)\n", - "discriminator = tensorflow.keras.models.Model(discriminator_input, x)\n", - "discriminator.summary()\n", - "discriminator_optimizer = tensorflow.keras.optimizers.RMSprop(\n", - "lr=0.0008,\n", - " # Uses gradient clipping (by value) in the optimizer\n", - "clipvalue=1.0,\n", - " # To stabilize training, uses learning-rate decay\n", - "decay=1e-8)\n", - "discriminator.compile(optimizer=discriminator_optimizer,\n", - "loss='binary_crossentropy')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## The adversarial network\n", - "\n", - "Finally, you’ll set up the GAN, which chains the generator and the discriminator.\n", - "When trained, this model will move the generator in a direction that improves its ability\n", - "to fool the discriminator. 
- "This model turns latent-space points into a classification\n",
- "decision—“fake” or “real”—and it’s meant to be trained with labels that are always\n",
- "“these are real images.” So, training `gan` will update the weights of `generator` in a way\n",
- "that makes `discriminator` more likely to predict “real” when looking at fake images.\n",
- "It’s very important to note that you set the `discriminator` to be frozen during training\n",
- "(non-trainable): its weights won’t be updated when training `gan`. If the discriminator\n",
- "weights could be updated during this process, then you’d be training the discriminator\n",
- "to always predict “real,” which isn’t what you want!"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Adversarial network"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "discriminator.trainable = False\n",
- "gan_input = tensorflow.keras.Input(shape=(latent_dim,))\n",
- "gan_output = discriminator(generator(gan_input))\n",
- "gan = tensorflow.keras.models.Model(gan_input, gan_output)\n",
- "gan_optimizer = tensorflow.keras.optimizers.RMSprop(learning_rate=0.0004, clipvalue=1.0, decay=1e-8)\n",
- "gan.compile(optimizer=gan_optimizer, loss='binary_crossentropy')"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## How to train your DCGAN\n",
- "\n",
- "Now you can begin training. To recapitulate, this is what the training loop looks like\n",
- "schematically. For each training step, you do the following:\n",
- "\n",
- "1. Draw random points in the latent space (random noise).\n",
- "\n",
- "2. Generate images with `generator` using this random noise.\n",
- "\n",
- "3. Mix the generated images with real ones.\n",
- "\n",
- "4. Train `discriminator` using these mixed images, with corresponding targets:\n",
- "either “real” (for the real images) or “fake” (for the generated images).\n",
- "\n",
- "5. Draw new random points in the latent space.\n",
- "\n",
- "6. Train `gan` using these random vectors, with targets that all say “these are real\n",
- "images.” This updates the weights of the generator (only, because the discriminator\n",
- "is frozen inside `gan`) to move them toward getting the discriminator to\n",
- "predict “these are real images” for generated images: this trains the generator\n",
- "to fool the discriminator.\n",
- "\n",
- "Let’s implement it.\n",
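- "\n",
- "One detail worth flagging before you read the listing: in this implementation, the\n",
- "discriminator’s targets are 1 for generated (fake) images and 0 for real ones, and the\n",
- "misleading targets passed to `gan` are therefore all zeros (“real”). As a minimal sketch\n",
- "of the two target arrays, assuming the batch size of 20 used below:\n",
- "\n",
- "```python\n",
- "import numpy as np\n",
- "\n",
- "batch_size = 20  # same value as in the listing below\n",
- "# Discriminator targets: 1 = generated (fake), 0 = real,\n",
- "# matching the order [generated_images, real_images]\n",
- "labels = np.concatenate([np.ones((batch_size, 1)),\n",
- "                         np.zeros((batch_size, 1))])\n",
- "# Targets for the gan training step: all zeros, i.e. claim every image is real\n",
- "misleading_targets = np.zeros((batch_size, 1))\n",
- "```"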
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Implementing GAN training"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import os\n",
- "from tensorflow.keras.preprocessing import image\n",
- "\n",
- "# Loads CIFAR10 data\n",
- "(x_train, y_train), (_, _) = keras.datasets.cifar10.load_data()\n",
- "\n",
- "# Selects frog images (class 6)\n",
- "x_train = x_train[y_train.flatten() == 6]\n",
- "\n",
- "# Normalizes data\n",
- "x_train = x_train.reshape((x_train.shape[0],) + (height, width, channels)).astype('float32') / 255.\n",
- "\n",
- "iterations = 10000\n",
- "batch_size = 20\n",
- "\n",
- "# Specifies where you want to save generated images\n",
- "save_dir = './data/'\n",
- "# Makes sure the output directory exists\n",
- "os.makedirs(save_dir, exist_ok=True)\n",
- "\n",
- "start = 0\n",
- "for step in range(iterations):\n",
- "    # Samples random points in the latent space\n",
- "    random_latent_vectors = np.random.normal(size=(batch_size, latent_dim))\n",
- "    # Decodes them to fake images\n",
- "    generated_images = generator.predict(random_latent_vectors)\n",
- "    # Combines them with real images\n",
- "    stop = start + batch_size\n",
- "    real_images = x_train[start: stop]\n",
- "    combined_images = np.concatenate([generated_images, real_images])\n",
- "    # Assembles labels, discriminating real from fake images\n",
- "    labels = np.concatenate([np.ones((batch_size, 1)), np.zeros((batch_size, 1))])\n",
- "    # Adds random noise to the labels—an important trick!\n",
- "    labels += 0.05 * np.random.random(labels.shape)\n",
- "    d_loss = discriminator.train_on_batch(combined_images, labels)\n",
- "    # Samples random points in the latent space\n",
- "    random_latent_vectors = np.random.normal(size=(batch_size, latent_dim))\n",
- "    # Assembles labels that say “these are all real images” (it’s a lie!)\n",
- "    misleading_targets = np.zeros((batch_size, 1))\n",
- "    # Trains the generator (via the gan model, where the discriminator weights are frozen)\n",
- "    a_loss = gan.train_on_batch(random_latent_vectors, misleading_targets)\n",
- "\n",
- "    start += batch_size\n",
- "    if start > len(x_train) - batch_size:\n",
- "        start = 0\n",
- "\n",
- "    # Occasionally saves images and prints metrics (every 100 steps)\n",
- "    if step % 100 == 0:\n",
- "        # Saves model weights\n",
- "        gan.save_weights('gan.h5')\n",
- "        # Prints metrics\n",
- "        print('discriminator loss:', d_loss)\n",
- "        print('adversarial loss:', a_loss)\n",
- "        # Saves one generated image\n",
- "        img = image.array_to_img(generated_images[0] * 255., scale=False)\n",
- "        img.save(os.path.join(save_dir, 'generated_frog' + str(step) + '.png'))\n",
- "        # Saves one real image for comparison\n",
- "        img = image.array_to_img(real_images[0] * 255., scale=False)\n",
- "        img.save(os.path.join(save_dir, 'real_frog' + str(step) + '.png'))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "When training, you may see the adversarial loss begin to increase considerably, while\n",
- "the discriminator loss tends to zero—the discriminator may end up dominating the\n",
- "generator. If that’s the case, try reducing the discriminator learning rate and increasing\n",
- "the dropout rate of the discriminator.\n",
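- "\n",
- "As a minimal, hedged sketch of that kind of re-tuning (the exact values are assumptions,\n",
- "not prescriptions): make the discriminator trainable again on its own, recompile it with\n",
- "a smaller learning rate, and consider raising its `Dropout` rate (which requires\n",
- "rebuilding the discriminator model):\n",
- "\n",
- "```python\n",
- "# The discriminator was set to non-trainable before compiling gan;\n",
- "# re-enable training for the standalone discriminator before recompiling it\n",
- "discriminator.trainable = True\n",
- "discriminator_optimizer = tensorflow.keras.optimizers.RMSprop(\n",
- "    learning_rate=0.0004,  # halved from 0.0008 (assumed value)\n",
- "    clipvalue=1.0,\n",
- "    decay=1e-8)\n",
- "discriminator.compile(optimizer=discriminator_optimizer,\n",
- "                      loss='binary_crossentropy')\n",
- "# In the discriminator definition itself, a higher rate such as\n",
- "# layers.Dropout(0.5) instead of layers.Dropout(0.4) would also help\n",
- "```"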
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Wrapping up\n",
- "\n",
- "- A GAN consists of a generator network coupled with a discriminator network.\n",
- "The discriminator is trained to differentiate between the output of the generator\n",
- "and real images from a training dataset, and the generator is trained to fool the\n",
- "discriminator. Remarkably, the generator never sees images from the training\n",
- "set directly; the information it has about the data comes from the discriminator.\n",
- "\n",
- "- GANs are difficult to train because training a GAN is a dynamic process rather\n",
- "than a simple gradient-descent process with a fixed loss landscape. Getting a\n",
- "GAN to train correctly requires a number of heuristic tricks, as well as\n",
- "extensive tuning.\n",
- "\n",
- "- GANs can potentially produce highly realistic images. But unlike VAEs, the\n",
- "latent space they learn doesn’t have a neat continuous structure and thus may\n",
- "not be suited for certain practical applications, such as image editing via latent-space\n",
- "concept vectors."
- ]
 }
 ],
 "metadata": {
-- 
GitLab