diff --git a/notebooks/Block_5/Jupyter Notebook Block 5 - Object Detection and Segmentation.ipynb b/notebooks/Block_5/Jupyter Notebook Block 5 - Object Detection and Segmentation.ipynb
index f1e1db51e4082ccf66edaa07b84633425cf0c346..9514432b9a915cf5e25ced336f6599301156694d 100644
--- a/notebooks/Block_5/Jupyter Notebook Block 5 - Object Detection and Segmentation.ipynb
+++ b/notebooks/Block_5/Jupyter Notebook Block 5 - Object Detection and Segmentation.ipynb
@@ -2326,7 +2326,9 @@
     "        x = layers.add([x, residual])  # Add back residual\n",
     "        previous_block_activation = x  # Set aside next residual\n",
     "\n",
-    "    # Add a per-pixel classification layer\n",
+    "    # We end the model with a per-pixel three-way\n",
+    "    # softmax to classify each output pixel into one of\n",
+    "    # our three categories\n",
     "    outputs = layers.Conv2D(num_classes, 3, activation=\"softmax\", padding=\"same\")(x)\n",
     "\n",
     "    # Define the model\n",
@@ -2338,37 +2340,140 @@
     "keras.backend.clear_session()\n",
     "\n",
     "# Build model\n",
-    "model = get_model(img_size, num_classes)\n",
+    "model = get_model(img_size, num_classes=3)\n",
     "model.summary()"
    ]
   },
   {
-   "cell_type": "code",
-   "execution_count": null,
+   "cell_type": "markdown",
    "metadata": {},
-   "outputs": [],
-   "source": []
+   "source": [
+    "The first half of the model closely resembles the kind of convnet you’d use for image\n",
+    "classification: a stack of `Conv2D` layers with a gradually increasing number of filters. We downsample\n",
+    "our images three times by a factor of two each, ending up with activations of size\n",
+    "$(25, 25, 256)$. The purpose of this first half is to encode the images into smaller feature\n",
+    "maps, where each spatial location (or pixel) contains information about a large spatial\n",
+    "chunk of the original image. You can understand it as a kind of compression.\n",
+    "\n",
+    "One important difference between the first half of this model and the classification\n",
+    "models you’ve seen before is the way we do downsampling: in the classification\n",
+    "ConvNets from the last chapter, we used `MaxPooling2D` layers to downsample feature\n",
+    "maps. Here, we downsample by adding _strides_ to every other convolution layer. We do\n",
+    "this because, in the case of image segmentation, we care a lot about the _spatial location_ of information in the image, since we need to produce per-pixel target masks as output of the\n",
+    "model. When you do $2\\times 2$ max pooling, you are completely destroying location information within each pooling window: you return one scalar value per window, with zero knowledge of which of the four locations in the window the value came from. So while max pooling layers perform\n",
+    "well for classification tasks, they would hurt us quite a bit for a segmentation\n",
+    "task. Meanwhile, strided convolutions do a better job at downsampling feature maps\n",
+    "while retaining location information. Throughout this book, you’ll notice that we\n",
+    "tend to use strides instead of max pooling in any model that cares about feature location,\n",
+    "such as generative models.\n",
+    "\n",
+    "The second half of the model is a stack of `Conv2DTranspose` layers. What are those?\n",
+    "Well, the output of the first half of the model is a feature map of shape $(25, 25, 256)$,\n",
+    "but we want our final output to have the same shape as the target masks, $(200, 200, 3)$.\n",
+    "Therefore, we need to apply a kind of _inverse_ of the transformations we’ve applied\n",
+    "so far — something that will _upsample_ the feature maps instead of downsampling them.\n",
+    "That’s the purpose of the `Conv2DTranspose` layer: you can think of it as a kind of convolution\n",
+    "layer that _learns to upsample_. If you have an input of shape $(100, 100, 64)$, and you\n",
+    "run it through the layer `Conv2D(128, 3, strides=2, padding=\"same\")`, you get an\n",
+    "output of shape $(50, 50, 128)$. If you run this output through the layer\n",
+    "`Conv2DTranspose(64, 3, strides=2, padding=\"same\")`, you get back an output of shape $(100,\n",
+    "100, 64)$, the same as the original. So after compressing our inputs into feature maps of\n",
+    "shape $(25, 25, 256)$ via a stack of `Conv2D` layers, we can simply apply the corresponding\n",
+    "sequence of `Conv2DTranspose` layers to get back to images of shape $(200, 200, 3)$.\n",
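+    "\n",
+    "As a quick sanity check (a small illustrative sketch, not part of the original notebook), you\n",
+    "can verify this shape round trip directly with standalone Keras layers:\n",
+    "\n",
+    "```python\n",
+    "from tensorflow import keras\n",
+    "from tensorflow.keras import layers\n",
+    "\n",
+    "# A stride-2 convolution halves the spatial dimensions...\n",
+    "inputs = keras.Input(shape=(100, 100, 64))\n",
+    "down = layers.Conv2D(128, 3, strides=2, padding=\"same\")(inputs)\n",
+    "print(down.shape)  # (None, 50, 50, 128)\n",
+    "\n",
+    "# ...and the matching stride-2 Conv2DTranspose restores them.\n",
+    "up = layers.Conv2DTranspose(64, 3, strides=2, padding=\"same\")(down)\n",
+    "print(up.shape)  # (None, 100, 100, 64)\n",
+    "```\n",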
+    "\n",
+    "We can now compile and fit our model:"
+   ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
-   "source": []
+   "source": [
+    "model.compile(optimizer=\"rmsprop\", loss=\"sparse_categorical_crossentropy\", metrics=[\"accuracy\"])\n",
+    "\n",
+    "logdir = os.path.join(\"logs\", datetime.datetime.now().strftime(\"%Y%m%d-%H%M%S\"))\n",
+    "callbacks = [\n",
+    "    keras.callbacks.ModelCheckpoint(filepath=\"oxford_segmentation.keras\", save_best_only=True, monitor=\"val_loss\"),\n",
+    "    tf.keras.callbacks.TensorBoard(logdir, histogram_freq=1)\n",
+    "]\n",
+    "\n",
+    "history = model.fit(train_input_imgs, train_targets,\n",
+    "                    epochs=50,\n",
+    "                    callbacks=callbacks,\n",
+    "                    batch_size=64,\n",
+    "                    validation_data=(val_input_imgs, val_targets))"
+   ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let’s display our training and validation accuracy and loss curves:"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
-   "source": []
+   "source": [
+    "plt.plot(history.history['accuracy'])\n",
+    "plt.plot(history.history['val_accuracy'])\n",
+    "plt.title('model accuracy')\n",
+    "plt.ylabel('accuracy')\n",
+    "plt.xlabel('epoch')\n",
+    "plt.legend(['train', 'valid'], loc='lower right')\n",
+    "plt.show()\n",
+    "plt.plot(history.history['loss'])\n",
+    "plt.plot(history.history['val_loss'])\n",
+    "plt.title('model loss')\n",
+    "plt.ylabel('loss')\n",
+    "plt.xlabel('epoch')\n",
+    "plt.legend(['train', 'valid'], loc='upper right')\n",
+    "plt.show()"
+   ]
   },
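+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "If you would rather not read the turning point off the plot by eye, here is a small sketch (not\n",
+    "part of the original notebook) that reports the epoch with the lowest validation loss, using the\n",
+    "`history` object returned by `fit()` above:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "\n",
+    "# Epoch (1-based) at which the validation loss bottomed out; training beyond\n",
+    "# this point is where the model starts to overfit.\n",
+    "best_epoch = int(np.argmin(history.history[\"val_loss\"])) + 1\n",
+    "print(f\"Lowest validation loss at epoch {best_epoch}\")"
+   ]
+  },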
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "You can see that we start overfitting midway, around epoch 25. Let’s reload our best\n",
+    "performing model according to the validation loss, and demonstrate how to use it to\n",
+    "predict a segmentation mask:"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
-   "source": []
+   "source": [
+    "from tensorflow.keras.utils import array_to_img\n",
+    "model = keras.models.load_model(\"oxford_segmentation.keras\")\n",
+    "i = 4\n",
+    "test_image = val_input_imgs[i]\n",
+    "plt.axis(\"off\")\n",
+    "plt.imshow(array_to_img(test_image))\n",
+    "mask = model.predict(np.expand_dims(test_image, 0))[0]\n",
+    "\n",
+    "# Utility to display a model’s prediction\n",
+    "def display_mask(pred):\n",
+    "    mask = np.argmax(pred, axis=-1)\n",
+    "    # Scale the class indices {0, 1, 2} to {0, 127, 254} so the mask shows up as an image\n",
+    "    mask *= 127\n",
+    "    plt.axis(\"off\")\n",
+    "    plt.imshow(mask)\n",
+    "\n",
+    "display_mask(mask)"
+   ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "There are a couple of small artifacts in our predicted mask, caused by geometric shapes\n",
+    "in the foreground and background. Nevertheless, our model appears to work nicely."
+   ]
+  },
   {
    "cell_type": "markdown",