From 90a1977c2625ce354ad4f0895e306c40ed795fbe Mon Sep 17 00:00:00 2001
From: Mirko Birbaumer <mirko.birbaumer@hslu.ch>
Date: Wed, 23 Mar 2022 23:46:00 +0000
Subject: [PATCH] Added semantic segmentation section

---
 ... - Object Detection and Segmentation.ipynb | 123 ++++++++++++++++--
 1 file changed, 114 insertions(+), 9 deletions(-)

diff --git a/notebooks/Block_5/Jupyter Notebook Block 5 - Object Detection and Segmentation.ipynb b/notebooks/Block_5/Jupyter Notebook Block 5 - Object Detection and Segmentation.ipynb
index f1e1db5..9514432 100644
--- a/notebooks/Block_5/Jupyter Notebook Block 5 - Object Detection and Segmentation.ipynb	
+++ b/notebooks/Block_5/Jupyter Notebook Block 5 - Object Detection and Segmentation.ipynb	
@@ -2326,7 +2326,9 @@
     "        x = layers.add([x, residual])  # Add back residual\n",
     "        previous_block_activation = x  # Set aside next residual\n",
     "\n",
-    "    # Add a per-pixel classification layer\n",
+    "    # We end the model with a per-pixel three-way\n",
+    "    # softmax to classify each output pixel into one of\n",
+    "    # our three categories\n",
     "    outputs = layers.Conv2D(num_classes, 3, activation=\"softmax\", padding=\"same\")(x)\n",
     "\n",
     "    # Define the model\n",
@@ -2338,37 +2340,140 @@
     "keras.backend.clear_session()\n",
     "\n",
     "# Build model\n",
-    "model = get_model(img_size, num_classes)\n",
+    "model = get_model(img_size, num_classes=3)\n",
     "model.summary()"
    ]
   },
   {
-   "cell_type": "code",
-   "execution_count": null,
+   "cell_type": "markdown",
    "metadata": {},
-   "outputs": [],
-   "source": []
+   "source": [
+    "The first half of the model closely resembles the kind of convnet you’d use for image\n",
+    "classification: a stack of `Conv2D` layers, with gradually increasing filter sizes. We downsample\n",
+    "our images three times by a factor of two each, ending up with activations of size\n",
+    "$(25, 25, 256)$. The purpose of this first half is to encode the images into smaller feature\n",
+    "maps, where each spatial location (or pixel) contains information about a large spatial\n",
+    "chunk of the original image. You can understand it as a kind of compression.\n",
+    "\n",
+    "\n",
+    "One important difference between the first half of this model and the classification\n",
+    "models you’ve seen before is the way we do downsampling: in the classification\n",
+    "ConvNets from the last chapter, we used `MaxPooling2D` layers to downsample feature\n",
+    "maps. Here, we downsample by adding _strides_ to every other convolution layer. We do \n",
+    "this because, in the case of image segmentation, we care a lot about the _spatial location_ of information in the image, since we need to produce per-pixel target masks as output of the \n",
+    "model. When you do $2\\times 2$ max pooling, you are completely destroying location information within each pooling window: you return one scalar value per window, with zero knowledge of which of the four locations in the windows the value came from. So while max pooling layers perform\n",
+    "well for classification tasks, they would hurt us quite a bit for a segmentation\n",
+    "task. Meanwhile, strided convolutions do a better job at downsampling feature maps\n",
+    "while retaining location information. Throughout this book, you’ll notice that we\n",
+    "tend to use strides instead of max pooling in any model that cares about feature location,\n",
+    "such as generative models.\n",
+    "\n",
+    "The second half of the model is a stack of `Conv2DTranspose` layers. What are those?\n",
+    "Well, the output of the first half of the model is a feature map of shape $(25, 25, 256)$, \n",
+    "but we want our final output to have the same shape as the target masks, $(200, 200,3)$. Therefore, we need to apply a kind of _inverse_ of the transformations we’ve applied\n",
+    "so far — something that will _upsample_ the feature maps instead of downsampling them.\n",
+    "That’s the purpose of the `Conv2DTranspose` layer: you can think of it as a kind of convolution\n",
+    "layer that _learns to upsample_. If you have an input of shape $(100, 100, 64)$, and you\n",
+    "run it through the layer `Conv2D(128, 3, strides=2, padding=\"same\")`, you get an\n",
+    "output of shape $(50, 50, 128)$. If you run this output through the layer \n",
+    "`Conv2DTranspose(64, 3, strides=2, padding=\"same\")`, you get back an output of shape $(100,\n",
+    "100, 64)$, the same as the original. So after compressing our inputs into feature maps of\n",
+    "shape $(25, 25, 256)$ via a stack of `Conv2D` layers, we can simply apply the corresponding\n",
+    "sequence of `Conv2DTranspose` layers to get back to images of shape $(200, 200, 3)$.\n",
+    "\n",
+    "We can now compile and fit our model:"
+   ]
   },
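+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Before moving on, here is a quick sanity check of the shape arithmetic described above. This is\n",
+    "only an illustrative sketch, not part of our segmentation model: the $(100, 100, 64)$ input is\n",
+    "arbitrary, and the cell assumes the `keras` and `layers` imports used earlier in this notebook.\n",
+    "It confirms that a strided `Conv2D` halves the spatial resolution and that the corresponding\n",
+    "strided `Conv2DTranspose` restores it."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Illustrative shape check (not part of the segmentation model):\n",
+    "# a strided Conv2D halves the spatial size, a strided Conv2DTranspose doubles it back\n",
+    "x = keras.Input(shape=(100, 100, 64))\n",
+    "down = layers.Conv2D(128, 3, strides=2, padding=\"same\")(x)\n",
+    "up = layers.Conv2DTranspose(64, 3, strides=2, padding=\"same\")(down)\n",
+    "print(down.shape)  # (None, 50, 50, 128)\n",
+    "print(up.shape)    # (None, 100, 100, 64)"
+   ]
+  },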
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
-   "source": []
+   "source": [
+    "model.compile(optimizer=\"rmsprop\", loss=\"sparse_categorical_crossentropy\")\n",
+    "\n",
+    "logdir = os.path.join(\"logs\", datetime.datetime.now().strftime(\"%Y%m%d-%H%M%S\"))\n",
+    "callbacks = [\n",
+    "    keras.callbacks.ModelCheckpoint(filepath=\"fine_tuning.keras\", save_best_only=True, monitor=\"val_loss\"),\n",
+    "    tf.keras.callbacks.TensorBoard(logdir, histogram_freq=1)\n",
+    "] \n",
+    "\n",
+    "history = model.fit(train_input_imgs, train_targets,\n",
+    "    epochs=50,\n",
+    "    callbacks=callbacks,\n",
+    "    batch_size=64,\n",
+    "    validation_data=(val_input_imgs, val_targets))"
+   ]
+  },
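+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Since we log the run with a `TensorBoard` callback, you can optionally monitor training from\n",
+    "inside the notebook. A minimal sketch, assuming the `tensorboard` Jupyter extension is available\n",
+    "and pointing at the `logs/` directory created above:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Optional: inspect the logs written by the TensorBoard callback above\n",
+    "%load_ext tensorboard\n",
+    "%tensorboard --logdir logs"
+   ]
+  },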
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let’s display our training and validation loss:"
+   ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
-   "source": []
+   "source": [
+    "plt.plot(history.history['accuracy'])\n",
+    "plt.plot(history.history['val_accuracy'])\n",
+    "plt.title('model accuracy')\n",
+    "plt.ylabel('accuracy')\n",
+    "plt.xlabel('epoch')\n",
+    "plt.legend(['train', 'valid'], loc='lower right')\n",
+    "plt.show()\n",
+    "plt.plot(history.history['loss'])\n",
+    "plt.plot(history.history['val_loss'])\n",
+    "plt.title('model loss')\n",
+    "plt.ylabel('loss')\n",
+    "plt.xlabel('epoch')\n",
+    "plt.legend(['train', 'valid'], loc='upper right')\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "You can see that we start overfitting midway, around epoch 25. Let’s reload our best\n",
+    "performing model according to the validation loss, and demonstrate how to use it to\n",
+    "predict a segmentation mask"
+   ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
-   "source": []
+   "source": [
+    "from tensorflow.keras.utils import array_to_img\n",
+    "model = keras.models.load_model(\"oxford_segmentation.keras\")\n",
+    "i = 4\n",
+    "test_image = val_input_imgs[i]\n",
+    "plt.axis(\"off\")\n",
+    "plt.imshow(array_to_img(test_image))\n",
+    "mask = model.predict(np.expand_dims(test_image, 0))[0]\n",
+    "\n",
+    "# Utility to display a model’s prediction\n",
+    "def display_mask(pred):\n",
+    "    mask = np.argmax(pred, axis=-1)\n",
+    "    mask *= 127\n",
+    "    plt.axis(\"off\")\n",
+    "    plt.imshow(mask)\n",
+    "    \n",
+    "display_mask(mask)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "There are a couple of small artifacts in our predicted mask, caused by geometric shapes\n",
+    "in the foreground and background. Nevertheless, our model appears to work nicely."
+   ]
   },
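+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To go beyond a visual spot check, we can also quantify performance on part of the validation set.\n",
+    "The cell below is only an illustrative sketch: it assumes `val_input_imgs` and `val_targets` are the\n",
+    "arrays prepared earlier, with zero-based integer class labels as used for training, and it computes\n",
+    "pixel accuracy and mean IoU on a small subset of validation images."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Rough quantitative check on a small validation subset (illustrative sketch only)\n",
+    "n = 32  # keep the subset small so this runs quickly\n",
+    "preds = model.predict(val_input_imgs[:n], batch_size=8)\n",
+    "pred_labels = np.argmax(preds, axis=-1)                    # (n, 200, 200) class indices\n",
+    "true_labels = np.squeeze(val_targets[:n]).astype(\"int32\")  # drop the channel axis if present\n",
+    "\n",
+    "pixel_acc = np.mean(pred_labels == true_labels)\n",
+    "miou = keras.metrics.MeanIoU(num_classes=3)\n",
+    "miou.update_state(true_labels, pred_labels)\n",
+    "print(\"pixel accuracy:\", round(float(pixel_acc), 3))\n",
+    "print(\"mean IoU:\", round(float(miou.result()), 3))"
+   ]
+  },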
   {
    "cell_type": "markdown",
-- 
GitLab