added background for o-rings exercise

Commit de9dbafb authored 2 years ago by Mirko Birbaumer
--- a/notebooks/Block_2/Exercises Block 2 - Neural Networks.ipynb
+++ b/notebooks/Block_2/Exercises Block 2 - Neural Networks.ipynb
@@ -432,9 +432,24 @@
   "source": [
    "This notebook calculates a logistic regression using Keras. It's basically meant to show the principles of Keras.\n",
    "\n",
+    "### Background\n",
+    "\n",
+    "The Space Shuttle Challenger exploded 73 seconds after liftoff on January 28th, 1986. The disaster claimed the lives of all seven astronauts on board, including school teacher Christa McAuliffe.\n",
+    "\n",
+    "The details surrounding this disaster were very involved. For the purposes of this analysis, it is sufficient to point out that the engineers who manufactured the large boosters that launched the shuttle were aware of the failures that could happen in cold temperatures. They tried to prevent the launch but were ultimately ignored, and disaster ensued.\n",
+    "\n",
+    "The main concern of the engineers in launching the Challenger was the evidence that the large O-rings sealing the sections of the boosters could fail in cold temperatures.\n",
+    "\n",
+    "\n",
    "### Dataset\n",
    "\n",
-    "We investigate the data set of the challenger flight with broken O-rings (`Y=1`) versus start temperature."
+    "The lowest temperature of any of the 23 prior launches (before the Challenger explosion) was 53°F. This is evident in the data set shown below. Prior to the Challenger launch, engineers suggested that the launch not be attempted below 53°F. The “evidence” that the O-rings could fail below 53°F was based on a simple conclusion: since the launch at 53°F experienced two O-ring failures, it seemed unwise to launch below that temperature. In the following analysis we demonstrate more fully how dangerous it was to launch on this specific day, when the outside temperature at the time of launch was 31°F.\n",
+    "\n",
+    "The `Broken O-rings` column in the data set below records whether O-rings experienced failures during that particular launch - if they did, the value for `Broken O-rings` is 1. The `Temperature [F]` column lists the outside temperature at the time of launch.\n",
+    "\n",
+    "### Goal of Analysis\n",
+    "\n",
+    "We investigate the data set of shuttle launches prior to the Challenger flight, relating broken O-rings (`Y=1`) to the temperature at launch."
   ]
  },
  {

 %% Cell type:markdown id: tags:

 # Exercise 1 : Conversion from Celsius to Fahrenheit (Simple Regression Analysis)

 %% Cell type:markdown id: tags:

 The problem we will solve is to convert from Celsius to Fahrenheit, where the formula is:

 $$ f = c \times 1.8 + 32 $$


 Of course, it would be simple enough to create a conventional Python function that directly performs this calculation, but that wouldn't be machine learning.


 Instead, we will give TensorFlow some sample Celsius values (-40, -10, 0, 8, 15, 22, 38) and their corresponding Fahrenheit values, rounded to whole degrees (-40, 14, 32, 46, 59, 72, 100).
 Then, we will train a model that figures out the above formula through the training process. This is a _simple regression analysis_ problem.
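
For comparison, the conventional function mentioned above might look like the following sketch. The name `celsius_to_fahrenheit` is an illustrative choice and not part of the notebook. The sample pairs are the ones used as training data in this exercise; comparing them with the exact formula shows that some of the Fahrenheit values are rounded.

```python
def celsius_to_fahrenheit(c):
    """Convert degrees Celsius to degrees Fahrenheit with the exact formula."""
    return c * 1.8 + 32


# Training pairs from this exercise; some Fahrenheit values are rounded.
samples = [(-40, -40), (-10, 14), (0, 32), (8, 46), (15, 59), (22, 72), (38, 100)]

# Largest deviation between the exact formula and the sample values.
max_rounding_error = max(abs(celsius_to_fahrenheit(c) - f) for c, f in samples)

for c, f in samples:
    print(f"{c} C -> exact {celsius_to_fahrenheit(c):.1f} F, sample value {f} F")
```

Because of this rounding, a trained model can only be expected to recover values close to 1.8 and 32, not the exact constants.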

 %% Cell type:markdown id: tags:

 ## Import dependencies

 First, import TensorFlow. Here, we're calling it `tf` for ease of use. We also tell it to only display errors.

 Next, import [NumPy](http://www.numpy.org/) as `np`. NumPy lets us represent our data as efficient arrays.

 %% Cell type:code id: tags:

 ``` python
 from __future__ import absolute_import, division, print_function, unicode_literals
 import tensorflow as tf

 import numpy as np
 ```

 %% Cell type:code id: tags:

 ``` python
 import logging
 logger = tf.get_logger()
 logger.setLevel(logging.ERROR)
 ```

 %% Cell type:markdown id: tags:

 ## Set up training data

 As we saw before, supervised Machine Learning is all about figuring out an algorithm given a set of inputs and outputs. Since the task in this exercise is to create a model that can give the temperature in Fahrenheit when given the degrees in Celsius, we create two arrays, `celsius_q` and `fahrenheit_a`, that we can use to train our model.

 %% Cell type:code id: tags:

 ``` python
 celsius_q    = np.array([-40, -10,  0,  8, 15, 22,  38],  dtype=float)
 fahrenheit_a = np.array([-40,  14, 32, 46, 59, 72, 100],  dtype=float)

 for i,c in enumerate(celsius_q):
  print("{} degrees Celsius = {} degrees Fahrenheit".format(c, fahrenheit_a[i]))
 ```

 %% Output

    -40.0 degrees Celsius = -40.0 degrees Fahrenheit
    -10.0 degrees Celsius = 14.0 degrees Fahrenheit
    0.0 degrees Celsius = 32.0 degrees Fahrenheit
    8.0 degrees Celsius = 46.0 degrees Fahrenheit
    15.0 degrees Celsius = 59.0 degrees Fahrenheit
    22.0 degrees Celsius = 72.0 degrees Fahrenheit
    38.0 degrees Celsius = 100.0 degrees Fahrenheit

 %% Cell type:markdown id: tags:

 ### Some Machine Learning terminology

 - **Feature** — The input(s) to our model. In this case, a single value — the degrees in Celsius.

 - **Labels/response variable** — The output our model predicts. In this case, a single value — the degrees in Fahrenheit. In a classification setting, we would predict labels (discrete classes); in a regression setting, we predict a continuous response variable, such as degrees Fahrenheit.

 - **Example** — A pair of inputs/outputs used during training. In our case a pair of values from `celsius_q` and `fahrenheit_a` at a specific index, such as `(22,72)`.

 %% Cell type:markdown id: tags:

 ## 1. Define the Network

 Next, create the model. We will use the simplest possible model, a Dense network. Since the problem is straightforward, this network will require only a single layer, with a single neuron.

 ### Build a layer

 We'll call the layer `l0` and create it by instantiating `tf.keras.layers.Dense` with the following configuration:

 *   `input_shape=[1]` — This specifies that the input to this layer is a single value. That is, the shape is a one-dimensional array with one member. Since this is the first (and only) layer, that input shape is the input shape of the entire model. The single value is a floating point number, representing degrees Celsius.

 *   `units=1` — This specifies the number of neurons in the layer. The number of neurons defines how many internal variables the layer has to try to learn how to solve the problem (more later). Since this is the final layer, it is also the size of the model's output — a single float value representing degrees Fahrenheit. (In a multi-layered network, the size and shape of the layer would need to match the `input_shape` of the next layer.)
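
As a rough mental model (not the Keras implementation), a Dense layer with `units=1` and a single input computes a weighted input plus a bias. The function `dense_single_unit` and the values below are hypothetical and only serve as an illustration.

```python
def dense_single_unit(x, weight, bias):
    """Output of one neuron with one input and no activation: weight * x + bias."""
    return weight * x + bias


# Hypothetical, untrained values. By default Keras would initialize the
# weight randomly and the bias to zero.
weight, bias = 0.5, 0.0
output = dense_single_unit(10.0, weight, bias)
print(output)
```

Training then amounts to finding values for `weight` and `bias` that map Celsius inputs to Fahrenheit outputs.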

 %% Cell type:code id: tags:

 ``` python
 l0 = tf.keras.layers.Dense(units=1, input_shape=[1])
 ```

 %% Cell type:markdown id: tags:

 ### Assemble layers into the model

 Once layers are defined, they need to be assembled into a model. The Sequential model definition takes a list of layers as argument, specifying the calculation order from the input to the output.

 This model has just a single layer, l0.

 %% Cell type:code id: tags:

 ``` python
 model =  <--------- Your Code here -------------->
 ```

 %% Cell type:markdown id: tags:

 ## 2. Compile the network, with loss and optimizer functions

 Before training, the model has to be compiled. When compiled for training, the model is given:

 - **Loss function** — A way of measuring how far off predictions are from the desired outcome. (The measured difference is called the "loss".)

 - **Optimizer function** — A way of adjusting internal values in order to reduce the loss.
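
To make the loss concrete, here is a minimal plain-Python sketch of the mean squared error. The helper name `mean_squared_error` and the predictions are assumptions made for illustration; Keras computes the same quantity internally when `loss='mean_squared_error'` is used.

```python
def mean_squared_error(targets, predictions):
    """Average of the squared differences between targets and predictions."""
    squared_errors = [(t - p) ** 2 for t, p in zip(targets, predictions)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical predictions for three of the training targets.
targets = [32.0, 46.0, 59.0]
predictions = [30.0, 46.0, 63.0]
loss = mean_squared_error(targets, predictions)
print(loss)
```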

 %% Cell type:markdown id: tags:

 model.compile(loss='mean_squared_error',
              optimizer=tf.keras.optimizers.Adam(0.1))

 %% Cell type:markdown id: tags:

 These are used during training (`model.fit()`, below) to first calculate the loss at each point, and then improve it. In fact, the act of calculating the current loss of a model and then improving it is precisely what training is.

 During training, the optimizer function is used to calculate adjustments to the model's internal variables. The goal is to adjust the internal variables until the model (which is really a math function) mirrors the actual equation for converting Celsius to Fahrenheit.

 TensorFlow uses numerical analysis to perform this tuning. All this complexity is hidden from you, so we will not go into the details here. What is useful to know about these parameters is the following:

 The loss function ([mean squared error](https://en.wikipedia.org/wiki/Mean_squared_error)) and the optimizer ([Adam](https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/)) used here are standard for simple models like this one, but many others are available. It is not important to know how these specific functions work at this point.

 One part of the optimizer you may need to think about when building your own models is the learning rate (`0.1` in the code above). This is the step size taken when adjusting values in the model. If the value is too small, it will take too many iterations to train the model. If it is too large, the updates overshoot and accuracy goes down. Finding a good value often involves some trial and error, but it usually lies between 0.001 (the default) and 0.1.
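
The effect of the step size can be sketched with plain gradient descent on the training data. This is deliberately not Adam: Adam rescales its steps, so the learning rates below are not comparable to the `0.1` passed to `tf.keras.optimizers.Adam`. The function names and the three learning rates are assumptions chosen for illustration.

```python
celsius = [-40.0, -10.0, 0.0, 8.0, 15.0, 22.0, 38.0]
fahrenheit = [-40.0, 14.0, 32.0, 46.0, 59.0, 72.0, 100.0]


def mse(w, b):
    """Mean squared error of the line w * x + b on the training data."""
    return sum((w * x + b - y) ** 2 for x, y in zip(celsius, fahrenheit)) / len(celsius)


def gradient_descent(learning_rate, steps=20):
    """Run plain gradient descent from w = 0, b = 0 and return the final loss."""
    w, b = 0.0, 0.0
    n = len(celsius)
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(celsius, fahrenheit)]
        grad_w = 2 * sum(e * x for e, x in zip(errors, celsius)) / n
        grad_b = 2 * sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return mse(w, b)


initial_loss = mse(0.0, 0.0)
loss_tiny = gradient_descent(1e-6)      # too small: barely moves in 20 steps
loss_moderate = gradient_descent(1e-3)  # makes clear progress
loss_large = gradient_descent(1e-1)     # too large: updates overshoot and diverge
print(initial_loss, loss_tiny, loss_moderate, loss_large)
```

With the largest step size the loss ends up above its starting value, which is the overshooting behaviour described above.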

 %% Cell type:markdown id: tags:

 ## 3. Fit the model

 Train the model by calling the `fit` method.

 During training, the model takes in Celsius values, performs a calculation using the current internal variables (called "weights") and outputs values which are meant to be the Fahrenheit equivalent. Since the weights are initially set randomly, the output will not be close to the correct value. The difference between the actual output and the desired output is calculated using the loss function, and the optimizer function directs how the weights should be adjusted.

 This cycle of calculate, compare, adjust is controlled by the `fit` method. The first argument is the inputs, the second argument is the desired outputs. The `epochs` argument specifies how many times this cycle should be run, and the `verbose` argument controls how much output the method produces.

 %% Cell type:code id: tags:

 ``` python
 history = model.fit(<--- your code here --->, <--- your code here --->, epochs=500, verbose=False)
 print("Finished training the model")
 ```

 %% Cell type:markdown id: tags:

 ## 4. Evaluate the Model - Display training statistics

 The `fit` method returns a history object. We can use this object to plot how the loss of our model goes down after each training epoch. A high loss means that the Fahrenheit degrees the model predicts are far from the corresponding values in `fahrenheit_a`.

 We'll use [Matplotlib](https://matplotlib.org/) to visualize this (you could use another tool). As you can see, our model improves very quickly at first, and then has a steady, slow improvement until it is very near "perfect" towards the end.


 %% Cell type:code id: tags:

 ``` python
 import matplotlib.pyplot as plt
 plt.xlabel('Epoch Number')
 plt.ylabel("Loss Magnitude")
 plt.plot(history.history['loss'])
 ```

 %% Cell type:markdown id: tags:

 ## 5. Use the model to predict values

 Now you have a model that has been trained to learn the relationship between `celsius_q` and `fahrenheit_a`. You can use the `predict` method to have it calculate the Fahrenheit degrees for a previously unseen Celsius value.

 So, for example, if the Celsius value is 100, what do you think the Fahrenheit result will be? Take a guess before you run this code.

 %% Cell type:code id: tags:

 ``` python
 print(model.predict(<---- your code here ---->))
 ```

 %% Cell type:markdown id: tags:

 The correct answer is $100 \times 1.8 + 32 = 212$, so our model is doing really well.

 ### To review


 *   We created a model with a Dense layer
 *   We trained it with 3500 examples (7 pairs, over 500 epochs).

 Our model tuned the variables (weights) in the Dense layer until it was able to return a close approximation of the correct Fahrenheit value for any Celsius value. (Remember, 100 Celsius was not part of our training data.)



 %% Cell type:markdown id: tags:

 ## Looking at the layer weights

 Finally, let's print the internal variables of the Dense layer.

 %% Cell type:code id: tags:

 ``` python
 print("These are the layer variables: {}".format(l0.get_weights()))
 ```

 %% Cell type:markdown id: tags:

 The first variable is close to 1.8 and the second is close to 32. These values (1.8 and 32) are the actual constants in the real conversion formula.

 For a single neuron with a single input and a single output, the internal math of a Dense layer looks the same as [the equation for a line](https://en.wikipedia.org/wiki/Linear_equation#Slope%E2%80%93intercept_form), $y = mx + b$, which has the same form as the conversion equation, $f = 1.8c + 32$.

 Since the form is the same, the variables should converge on the standard values of 1.8 and 32, which is exactly what happened.

 With additional neurons, additional inputs, and additional outputs, the formula becomes much more complex, but the idea is the same.
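
Since a single neuron is just a line, the values it should converge to can be cross-checked with an ordinary least-squares fit. The sketch below uses the textbook closed-form solution in plain Python rather than anything from Keras; the variable names are illustrative.

```python
celsius = [-40.0, -10.0, 0.0, 8.0, 15.0, 22.0, 38.0]
fahrenheit = [-40.0, 14.0, 32.0, 46.0, 59.0, 72.0, 100.0]

n = len(celsius)
mean_c = sum(celsius) / n
mean_f = sum(fahrenheit) / n

# Ordinary least squares for a line: slope = covariance / variance.
covariance = sum((c - mean_c) * (f - mean_f) for c, f in zip(celsius, fahrenheit))
variance = sum((c - mean_c) ** 2 for c in celsius)
slope = covariance / variance
intercept = mean_f - slope * mean_c

print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
```

The fitted values are close to, but not exactly, 1.8 and 32 because some of the Fahrenheit targets are rounded to whole degrees.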

 ### A little experiment

 Just for fun, what if we created more Dense layers with different numbers of units, which therefore also have more variables?

 %% Cell type:code id: tags:

 ``` python
 l0 = tf.keras.layers.Dense(units=4, input_shape=[1])
 l1 = tf.keras.layers.Dense(units=4)
 l2 = tf.keras.layers.Dense(units=1)
 model = tf.keras.Sequential([l0, l1, l2])
 model.compile(loss='mean_squared_error', optimizer=tf.keras.optimizers.Adam(0.1))
 model.fit(celsius_q, fahrenheit_a, epochs=500, verbose=False)
 print("Finished training the model")
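 # Pass a NumPy array; newer Keras versions reject plain Python lists as input.
 prediction = model.predict(np.array([100.0]))
 print(prediction)
 print("Model predicts that 100 degrees Celsius is: {} degrees Fahrenheit".format(prediction))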
 print("These are the l0 variables: {}".format(l0.get_weights()))
 print("These are the l1 variables: {}".format(l1.get_weights()))
 print("These are the l2 variables: {}".format(l2.get_weights()))
 ```

 %% Cell type:markdown id: tags:

 As you can see, this model is also able to predict the corresponding Fahrenheit value really well. But when you look at the variables (weights) in the `l0` and `l1` layers, they are nothing even close to ~1.8 and ~32. The added complexity hides the "simple" form of the conversion equation.
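
One way to see why the simple form is hidden: stacked Dense layers without activation functions still compute a single line overall. The sketch below uses a smaller, hypothetical 1 -> 2 -> 1 network with hand-picked weights (not values learned by the model above) and reads off the effective slope and intercept.

```python
def forward(x, layers):
    """Pass a scalar through Dense-like layers without activation functions."""
    activations = [x]
    for weights, biases in layers:
        # weights[i][j] connects input i to unit j, as in a Dense kernel.
        activations = [
            sum(a * weights[i][j] for i, a in enumerate(activations)) + biases[j]
            for j in range(len(biases))
        ]
    return activations[0]


# Hypothetical weights for a 1 -> 2 -> 1 network; none of them is 1.8 or 32.
layers = [
    ([[0.5, -1.0]], [1.0, 2.0]),  # first layer: 1 input, 2 units
    ([[2.0], [-0.8]], [31.6]),    # second layer: 2 inputs, 1 unit
]

# Without activations the whole network is still a line, so slope and
# intercept can be read off from two evaluations.
intercept = forward(0.0, layers)
slope = forward(1.0, layers) - intercept
print(slope, intercept, forward(100.0, layers))
```

Many different weight combinations produce the same overall line, which is why the individual weights need not resemble 1.8 or 32.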


 %% Cell type:markdown id: tags:

 # Exercise 2 : O-Rings seen with Logistic Regression

 %% Cell type:markdown id: tags:

 This notebook calculates a logistic regression using Keras. It's basically meant to show the principles of Keras.

+### Background
+
+The Space Shuttle Challenger exploded 73 seconds after liftoff on January 28th, 1986. The disaster claimed the lives of all seven astronauts on board, including school teacher Christa McAuliffe.
+
+The details surrounding this disaster were very involved. For the purposes of this analysis, it is sufficient to point out that the engineers who manufactured the large boosters that launched the shuttle were aware of the failures that could happen in cold temperatures. They tried to prevent the launch but were ultimately ignored, and disaster ensued.
+
+The main concern of the engineers in launching the Challenger was the evidence that the large O-rings sealing the sections of the boosters could fail in cold temperatures.
+
+
 ### Dataset

-We investigate the data set of the challenger flight with broken O-rings (`Y=1`) versus start temperature.
+The lowest temperature of any of the 23 prior launches (before the Challenger explosion) was 53°F. This is evident in the data set shown below. Prior to the Challenger launch, engineers suggested that the launch not be attempted below 53°F. The “evidence” that the O-rings could fail below 53°F was based on a simple conclusion: since the launch at 53°F experienced two O-ring failures, it seemed unwise to launch below that temperature. In the following analysis we demonstrate more fully how dangerous it was to launch on this specific day, when the outside temperature at the time of launch was 31°F.
+
+The `Broken O-rings` column in the data set below records whether O-rings experienced failures during that particular launch - if they did, the value for `Broken O-rings` is 1. The `Temperature [F]` column lists the outside temperature at the time of launch.
+
+### Goal of Analysis
+
+We investigate the data set of shuttle launches prior to the Challenger flight, relating broken O-rings (`Y=1`) to the temperature at launch.

 %% Cell type:code id: tags:

 ``` python
 %matplotlib inline
 import numpy as np
 import pandas as pd
 import tensorflow as tf
 import matplotlib.pyplot as plt
 data = np.asarray(pd.read_csv('./challenger.txt', sep=','), dtype='float32')
 plt.plot(data[:,0], data[:,1], 'o')
 plt.axis([40, 85, -0.1, 1.2])
 plt.xlabel('Temperature [F]')
 plt.ylabel('Broken O-rings')
 ```

 %% Output

    Text(0, 0.5, 'Broken O-rings')



 %% Cell type:code id: tags:

 ``` python
 y_values = data[:,1]
 print(y_values)
 ```

 %% Output

    [0. 1. 0. 0. 0. 0. 0. 0. 1. 1. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 1.]

 %% Cell type:markdown id: tags:

 ## Mathematical Notes

 We are considering the probability $P(y_i=1|x_i)$ for the class $y_i=1$ given the $i$-th data point $x_i$ ($x_i$ could be a vector). This is given by:

 $$
 P(y_i=1 | x_i) = \frac{e^{(b +  x_i w)}}{1 + e^{(b + x_i w)}} = [1 + e^{-(b + x_i w)}]^{-1}
 $$

 If we have more than one data point, which we usually do, we have to apply the equation above to each of the $N$ data points. In this case we can use a vectorized version with $x=(x_1,x_2,\ldots,x_N)$ and $y=(y_1,y_2,\ldots,y_N)$.
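
For a single data point, the formula can be sketched in plain Python as follows. The helper names and the parameter values are assumptions for illustration only. The two branches in `sigmoid` correspond to the two equivalent forms of the equation above; each one is used where it avoids overflow in the exponential.

```python
import math


def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


def failure_probability(x, w, b):
    """P(y = 1 | x) for a single temperature x with weight w and bias b."""
    return sigmoid(b + x * w)


# Hypothetical parameter values, used only to exercise the function.
probabilities = [failure_probability(x, w=-0.2, b=13.0) for x in (53.0, 65.0, 81.0)]
print(probabilities)
```

With a negative weight, the predicted failure probability falls as the temperature rises.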

 %% Cell type:markdown id: tags:

 ### Numpy code
 This NumPy code shows the calculation for one choice of $w$ and $b$ (like a single forward pass).

 %% Cell type:code id: tags:

 ``` python
 # Data
 N = len(data)
 x = data[:,0]
 y = data[:,1]
 # Initial Value for the weights
 w = -0.20
 b = 20.0
 # predicted probabilities
 p_1 = 1 / (1 + np.exp(-x*w - b))
 # cross-entropy loss function
 cross_entropy = y * np.log(p_1) + (1-y) * np.log(1-p_1)
 print(-np.mean(cross_entropy))
 print(np.round(p_1,3))
 ```

 %% Output

    3.882916
    [0.999 0.998 0.998 0.998 0.999 0.996 0.996 0.998 1.    0.999 0.998 0.988
     0.999 1.    0.999 0.993 0.998 0.978 0.992 0.985 0.993 0.992 1.   ]

 %% Cell type:markdown id: tags:

 ## Better values from intuition

 Now let's try to find better values for $w$ and $b$. Let's assume $w = -1$. We want the probability
 of damage, $P(y_i=1 | x_i)$, to be $0.5$ at a suitable temperature.
 Determine an appropriate value for $b$.
 Hint: look at the data and decide at which $x$ value $P(y_i=1 | x_i)$ should be $0.5$. At this $x$ value the term $1 + e^{-(b + x_i w)}$ must be $2$.

 **Solution**

 $P(y=1 | x) = 0.5$ at $x \approx 65$

 $-(b + (-1) x_i) = 0 \rightarrow b = 65$
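
The solution can be checked numerically. In general the probability is $0.5$ where $b + w x = 0$, that is, at $x = -b/w$. The sketch below is plain Python with illustrative variable names.

```python
import math

w_val, b_val = -1.0, 65.0

# The probability is 0.5 where b + w * x = 0, i.e. at x = -b / w.
x_half = -b_val / w_val

p_at_boundary = 1.0 / (1.0 + math.exp(-(b_val + w_val * x_half)))
p_colder = 1.0 / (1.0 + math.exp(-(b_val + w_val * 60.0)))  # 5 degrees below
p_warmer = 1.0 / (1.0 + math.exp(-(b_val + w_val * 70.0)))  # 5 degrees above
print(x_half, p_at_boundary, p_colder, p_warmer)
```

With $w = -1$ the transition is very sharp: a few degrees on either side of 65 already push the probability close to 1 or 0.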

 %% Cell type:code id: tags:

 ``` python
 w_val = -1
 b_val = 65
 plt.plot(data[:,0], data[:,1], 'o')
 plt.axis([40, 85, -0.1, 1.2])
 x_pred = np.linspace(40,85)
 x_pred = np.resize(x_pred,[len(x_pred),1])
 y_pred = 1 / (1 + np.exp(-x_pred*w_val - b_val))
 plt.plot(x_pred, y_pred)

 # predicted probabilities
 p_1 = 1 / (1 + np.exp(-x*w_val - b_val))

 # cross-entropy loss function
 cross_entropy = -np.mean(y * np.log(p_1) + (1-y) * np.log(1-p_1))

 print(cross_entropy)
 print(np.round(p_1,3))
 ```

 %% Output

    0.9094435
    [0.269 0.007 0.018 0.047 0.119 0.001 0.    0.007 1.    0.881 0.007 0.
     0.119 1.    0.119 0.    0.007 0.    0.    0.    0.    0.    0.999]



 %% Cell type:markdown id: tags:

 We can see that the value of the cross-entropy has decreased from 3.882916 to 0.9094435.

 %% Cell type:markdown id: tags:

 ## TODO : determine the accuracy of this logistic regression model and the value of the cross-entropy function
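
As a starting point, accuracy is commonly computed by thresholding the predicted probabilities at 0.5 and comparing with the labels. The sketch below uses made-up labels and probabilities rather than the Challenger data, and the helper name `accuracy` is an assumption.

```python
def accuracy(labels, probabilities, threshold=0.5):
    """Fraction of examples whose thresholded probability matches the label."""
    predictions = [1 if p >= threshold else 0 for p in probabilities]
    correct = sum(1 for y, y_hat in zip(labels, predictions) if y == y_hat)
    return correct / len(labels)


# Hypothetical labels and predicted probabilities, not the Challenger data.
labels = [0, 1, 0, 1]
probabilities = [0.1, 0.8, 0.6, 0.4]
print(accuracy(labels, probabilities))
```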

 %% Cell type:markdown id: tags:

 ## TODO : set up a Keras model

 If there are two labels, we use `binary_crossentropy` as the loss function. In this case, we use `sigmoid` as the activation of the output layer.

 %% Cell type:code id: tags:

 ``` python
 l0 = tf.keras.layers.Dense(<--- your code here ---->)
 model = tf.keras.Sequential([l0])
 model.compile(loss='binary_crossentropy', optimizer=tf.keras.optimizers.Adam(0.01))
 model.fit(x, y, epochs=10000, verbose=False)
 ```

 %% Output

    Using TensorFlow backend.

    <tensorflow.python.keras.callbacks.History at 0x14081c048>

 %% Cell type:code id: tags:

 ``` python
 plt.plot(data[:,0], data[:,1], 'o')
 plt.axis([40, 85, -0.1, 1.2])
 x_pred = np.linspace(40,85)
 x_pred = np.resize(x_pred,[len(x_pred),1])
 y_pred = <---- your code here ---->
 plt.plot(x_pred, y_pred)
 ```

 %% Cell type:markdown id: tags:

 # Exercise 3 : MNIST and Multinomial Logistic Regression

 %% Cell type:markdown id: tags:

 In this exercise we use multinomial logistic regression to predict the handwritten digits of the MNIST dataset.

 %% Cell type:markdown id: tags:

 ## TODO : read MNIST data and compute validation accuracy for a multinomial logistic regression model, see [Multinomial Logistic Regression](https://en.wikipedia.org/wiki/Multinomial_logistic_regression)

 If there are more than two labels, we use `categorical_crossentropy` as the loss function, and the output layer should be a `softmax` layer.
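
What a `softmax` output combined with `categorical_crossentropy` computes for one example can be sketched in plain Python. The function names and the scores are hypothetical; this is an illustration of the math, not the Keras implementation.

```python
import math


def softmax(logits):
    """Turn a list of scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(one_hot_label, probabilities):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_label, probabilities))


# Hypothetical scores for three classes; the true class is the second one.
probs = softmax([1.0, 2.0, 0.5])
loss = categorical_crossentropy([0, 1, 0], probs)
print(probs, loss)
```

The loss is small when the model assigns a high probability to the true class and grows as that probability approaches zero.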

 %% Cell type:code id: tags:

 ``` python
 from __future__ import absolute_import, division, print_function, unicode_literals
 from tensorflow.keras.datasets import mnist
 from tensorflow.keras.utils import to_categorical


 # Import TensorFlow
 import tensorflow as tf

 # Helper libraries
 import math
 import numpy as np
 import matplotlib.pyplot as plt

 # Load MNIST data
 (X_train, y_train), (X_test, y_test) = mnist.load_data()

 # One-hot-encoded label vector
 y_train_cat = to_categorical(y_train, 10)
 y_test_cat = to_categorical(y_test, 10)

 model = tf.keras.Sequential()
 model.add(tf.keras.layers.Flatten(<---- your code here ---->))
 model.add(tf.keras.layers.Dense(<---- your code here ---->))
 model.compile(<---- your code here ---->)

 history = model.fit(<---- your code here ---->)
 ```

 %% Cell type:markdown id: tags:

 ## TODO : use different regularization terms, see [Keras Regularizer](https://keras.io/regularizers/)
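
A regularization term adds a penalty on the size of the weights to the loss. The sketch below shows the usual L2 and L1 penalties in plain Python; the weights, the data loss and the factor `0.01` are hypothetical values chosen for illustration.

```python
def l2_penalty(weights, factor):
    """L2 regularization term: factor times the sum of squared weights."""
    return factor * sum(w ** 2 for w in weights)


def l1_penalty(weights, factor):
    """L1 regularization term: factor times the sum of absolute weights."""
    return factor * sum(abs(w) for w in weights)


# Hypothetical weights, data loss and regularization factor.
weights = [0.5, -2.0, 1.0]
data_loss = 0.30
total_loss_l2 = data_loss + l2_penalty(weights, factor=0.01)
total_loss_l1 = data_loss + l1_penalty(weights, factor=0.01)
print(total_loss_l2, total_loss_l1)
```

The L2 penalty punishes large weights more strongly, while the L1 penalty tends to push small weights towards exactly zero.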

 %% Cell type:code id: tags:

 ``` python
 # Import TensorFlow
 import tensorflow as tf

 # Helper libraries
 import math
 import numpy as np
 import matplotlib.pyplot as plt

 # Load MNIST data
 (X_train, y_train), (X_test, y_test) = mnist.load_data()

 # One-hot-encode label vector
 y_train_cat = to_categorical(y_train, 10)
 y_test_cat = to_categorical(y_test, 10)

 # Define Network
 model = tf.keras.Sequential()
 model.add(tf.keras.layers.Flatten(<---- your code here ---->))
 model.add(tf.keras.layers.Dense(<---- your code here ---->,
                                 kernel_regularizer=<---- your code here ---->))

 # Compile Network
 model.compile(<---- your code here ---->)

 # Fit Network
 history = model.fit(<---- your code here ---->)
 ```

 %% Cell type:markdown id: tags:

 # Exercise 4 : Prediction of House Prices

 %% Cell type:markdown id: tags:

 In this exercise, we’ll attempt to predict the median price of homes in a given Boston
 suburb in the mid-1970s, given data points about the suburb at the time, such as the
 crime rate, the local property tax rate, and so on. The dataset has relatively few data points: only
 506, split between 404 training samples and 102 test samples. And each feature in the
 input data (for example, the crime rate) has a different scale. For instance, some values
 are proportions, which take values between 0 and 1, others take values between 1
 and 12, others between 0 and 100, and so on.

 %% Cell type:markdown id: tags:

 ### Loading the Boston housing dataset

 %% Cell type:code id: tags:

 ``` python
 from tensorflow.keras.datasets import boston_housing
 (train_data, train_targets), (test_data, test_targets) = (boston_housing.load_data())
 ```

 %% Output

    Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/boston_housing.npz
    57344/57026 [==============================] - 0s 0us/step
    65536/57026 [==================================] - 0s 0us/step

 %% Cell type:markdown id: tags:

 Let’s look at the data:

 %% Cell type:code id: tags:

 ``` python
 print(train_data.shape)
 print(test_data.shape)
 ```

 %% Output

    (404, 13)
    (102, 13)

 %% Cell type:markdown id: tags:

 As you can see, we have 404 training samples and 102 test samples, each with 13
 numerical features, such as per capita crime rate, average number of rooms per dwelling,
 accessibility to highways, and so on.
 The targets are the median values of owner-occupied homes, in thousands of dollars:

 %% Cell type:code id: tags:

 ``` python
 train_targets
 ```

 %% Output

    array([15.2, 42.3, 50. , 21.1, 17.7, 18.5, 11.3, 15.6, 15.6, 14.4, 12.1,
           17.9, 23.1, 19.9, 15.7,  8.8, 50. , 22.5, 24.1, 27.5, 10.9, 30.8,
           32.9, 24. , 18.5, 13.3, 22.9, 34.7, 16.6, 17.5, 22.3, 16.1, 14.9,
           23.1, 34.9, 25. , 13.9, 13.1, 20.4, 20. , 15.2, 24.7, 22.2, 16.7,
           12.7, 15.6, 18.4, 21. , 30.1, 15.1, 18.7,  9.6, 31.5, 24.8, 19.1,
           22. , 14.5, 11. , 32. , 29.4, 20.3, 24.4, 14.6, 19.5, 14.1, 14.3,
           15.6, 10.5,  6.3, 19.3, 19.3, 13.4, 36.4, 17.8, 13.5, 16.5,  8.3,
           14.3, 16. , 13.4, 28.6, 43.5, 20.2, 22. , 23. , 20.7, 12.5, 48.5,
           14.6, 13.4, 23.7, 50. , 21.7, 39.8, 38.7, 22.2, 34.9, 22.5, 31.1,
           28.7, 46. , 41.7, 21. , 26.6, 15. , 24.4, 13.3, 21.2, 11.7, 21.7,
           19.4, 50. , 22.8, 19.7, 24.7, 36.2, 14.2, 18.9, 18.3, 20.6, 24.6,
           18.2,  8.7, 44. , 10.4, 13.2, 21.2, 37. , 30.7, 22.9, 20. , 19.3,
           31.7, 32. , 23.1, 18.8, 10.9, 50. , 19.6,  5. , 14.4, 19.8, 13.8,
           19.6, 23.9, 24.5, 25. , 19.9, 17.2, 24.6, 13.5, 26.6, 21.4, 11.9,
           22.6, 19.6,  8.5, 23.7, 23.1, 22.4, 20.5, 23.6, 18.4, 35.2, 23.1,
           27.9, 20.6, 23.7, 28. , 13.6, 27.1, 23.6, 20.6, 18.2, 21.7, 17.1,
            8.4, 25.3, 13.8, 22.2, 18.4, 20.7, 31.6, 30.5, 20.3,  8.8, 19.2,
           19.4, 23.1, 23. , 14.8, 48.8, 22.6, 33.4, 21.1, 13.6, 32.2, 13.1,
           23.4, 18.9, 23.9, 11.8, 23.3, 22.8, 19.6, 16.7, 13.4, 22.2, 20.4,
           21.8, 26.4, 14.9, 24.1, 23.8, 12.3, 29.1, 21. , 19.5, 23.3, 23.8,
           17.8, 11.5, 21.7, 19.9, 25. , 33.4, 28.5, 21.4, 24.3, 27.5, 33.1,
           16.2, 23.3, 48.3, 22.9, 22.8, 13.1, 12.7, 22.6, 15. , 15.3, 10.5,
           24. , 18.5, 21.7, 19.5, 33.2, 23.2,  5. , 19.1, 12.7, 22.3, 10.2,
           13.9, 16.3, 17. , 20.1, 29.9, 17.2, 37.3, 45.4, 17.8, 23.2, 29. ,
           22. , 18. , 17.4, 34.6, 20.1, 25. , 15.6, 24.8, 28.2, 21.2, 21.4,
           23.8, 31. , 26.2, 17.4, 37.9, 17.5, 20. ,  8.3, 23.9,  8.4, 13.8,
            7.2, 11.7, 17.1, 21.6, 50. , 16.1, 20.4, 20.6, 21.4, 20.6, 36.5,
            8.5, 24.8, 10.8, 21.9, 17.3, 18.9, 36.2, 14.9, 18.2, 33.3, 21.8,
           19.7, 31.6, 24.8, 19.4, 22.8,  7.5, 44.8, 16.8, 18.7, 50. , 50. ,
           19.5, 20.1, 50. , 17.2, 20.8, 19.3, 41.3, 20.4, 20.5, 13.8, 16.5,
           23.9, 20.6, 31.5, 23.3, 16.8, 14. , 33.8, 36.1, 12.8, 18.3, 18.7,
           19.1, 29. , 30.1, 50. , 50. , 22. , 11.9, 37.6, 50. , 22.7, 20.8,
           23.5, 27.9, 50. , 19.3, 23.9, 22.6, 15.2, 21.7, 19.2, 43.8, 20.3,
           33.2, 19.9, 22.5, 32.7, 22. , 17.1, 19. , 15. , 16.1, 25.1, 23.7,
           28.7, 37.2, 22.6, 16.4, 25. , 29.8, 22.1, 17.4, 18.1, 30.3, 17.5,
           24.7, 12.6, 26.5, 28.7, 13.3, 10.4, 24.4, 23. , 20. , 17.8,  7. ,
           11.8, 24.4, 13.8, 19.4, 25.2, 19.4, 19.4, 29.1])

 %% Cell type:markdown id: tags:

 The prices are typically between 10000 and 50000 USD. If that sounds cheap, remember
 that this was the mid-1970s, and these prices aren’t adjusted for inflation.

 %% Cell type:markdown id: tags:

 ### Preparing the data

 It would be problematic to feed into a neural network values that all take wildly different ranges. The model might be able to automatically adapt to such heterogeneous data, but it would definitely make learning more difficult. A widespread best practice for dealing with such data is to do feature-wise normalization: for each feature in the input data (a column in the input data matrix), we subtract the mean of the feature and divide by the standard deviation, so that the feature is centered around 0 and has a unit standard deviation. This is easily done in NumPy.

 %% Cell type:markdown id: tags:

 ### Normalizing the data

 %% Cell type:code id: tags:

 ``` python
 mean = train_data.mean(axis=0)
 train_data -= mean
 std = train_data.std(axis=0)
 train_data /= std
 test_data -= mean
 test_data /= std
 ```

 %% Cell type:markdown id: tags:

 Note that the quantities used for normalizing the test data are computed using the
 training data. You should never use any quantity computed on the test data in your
 workflow, even for something as simple as data normalization.

 %% Cell type:markdown id: tags:

 ### TODO: Building your model

 Because so few samples are available, we’ll use a very small model with two intermediate layers, each with 64 units, each followed by a `relu` activation function. In general, the less training data you have, the worse overfitting will be, and using a small model is one way to mitigate overfitting.

 %% Cell type:markdown id: tags:

 #### Model definition

 %% Cell type:code id: tags:

 ``` python
 def build_model():
    model = keras.Sequential([
    <---- your code here ---->,
    <---- your code here ---->,
    <---- your code here ---->
    ])
    model.compile(optimizer="rmsprop", loss=<---- your code here ---->, metrics=["mae"])
    return model
 ```

 %% Cell type:markdown id: tags:

 The model ends with a single unit and no activation (it will be a linear layer). This is a
 typical setup for scalar regression (a regression where you’re trying to predict a single
 continuous value). Applying an activation function would constrain the range the output
 can take; for instance, if you applied a sigmoid activation function to the last layer,
 the model could only learn to predict values between 0 and 1. Here, because the last
 layer is purely linear, the model is free to learn to predict values in any range.
 Note that we compile the model with the `mse` loss function (_mean squared error_), the
 average of the squared differences between the predictions and the targets. This is a widely used
 loss function for regression problems.
 We’re also monitoring a new metric during training: _mean absolute error_ (`MAE`). It’s the
 average absolute value of the difference between the predictions and the targets. For instance, an
 MAE of 0.5 on this problem would mean your predictions are off by 500 USD on average.
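
The unit conversion can be made explicit with a small plain-Python sketch. The helper name `mean_absolute_error` and the predictions are assumptions made for illustration; the targets are given in thousands of dollars, as in the dataset.

```python
def mean_absolute_error(targets, predictions):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(targets, predictions)) / len(targets)


# Hypothetical predictions for three targets given in thousands of dollars.
targets = [15.2, 42.3, 50.0]
predictions = [15.7, 41.8, 50.5]
mae = mean_absolute_error(targets, predictions)
mae_in_dollars = mae * 1000  # targets are in thousands of dollars
print(round(mae, 3), round(mae_in_dollars))
```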

 %% Cell type:markdown id: tags:

 ### Validating your approach using K-fold validation

 To evaluate our model while we keep adjusting its parameters (such as the number of
 epochs used for training), we could split the data into a training set and a validation set, as we did in the previous examples. But because we have so few data points, the validation set would end up being very small (for instance, about 100 examples). As a consequence, the validation scores might change a lot depending on which data points we chose for validation and which we chose for training: the validation scores might have a high variance with regard to the validation split. This would prevent us from reliably evaluating our model.

 The best practice in such situations is to use $K$-fold cross-validation. It consists of splitting the available data into K partitions (typically $K = 4$ or $5$), instantiating $K$ identical models, and training each one on $K - 1$ partitions while evaluating on the remaining partition. The validation score for the model used is then the average of the $K$ validation scores obtained. In terms of code, this is straightforward.

 %% Cell type:markdown id: tags:

 #### K-fold validation

 %% Cell type:code id: tags:

 ``` python
 import numpy as np
 from tensorflow import keras
 from tensorflow.keras import layers
 k = 4
 num_val_samples = len(train_data) // k
 num_epochs = 100
 all_scores = []
 for i in range(k):
    print(f"Processing fold #{i}")
    # Prepares the validation data: data from partition #i
    val_data = train_data[i * num_val_samples: (i + 1) * num_val_samples]
    val_targets = train_targets[i * num_val_samples: (i + 1) * num_val_samples]
    # Prepares the training data: data from all other partitions
    partial_train_data = np.concatenate([train_data[:i * num_val_samples], train_data[(i + 1) * num_val_samples:]],axis=0)
    partial_train_targets = np.concatenate([train_targets[:i * num_val_samples], train_targets[(i + 1) * num_val_samples:]],axis=0)
    # Builds the Keras model (already compiled)
    model = build_model()
    # Trains the model (in silent mode, verbose=0)
    history=model.fit(partial_train_data, partial_train_targets, epochs=num_epochs, batch_size=16, verbose=0)
    val_mse, val_mae = model.evaluate(val_data, val_targets, verbose=0)
    all_scores.append(val_mae)
 ```

 %% Output

    Processing fold #0
    Processing fold #1
    Processing fold #2
    Processing fold #3

 %% Cell type:markdown id: tags:

 Running this with `num_epochs = 100` yields the following results:

 %% Cell type:code id: tags:

 ``` python
 all_scores
 ```

 %% Output

    [1.9184445142745972,
     2.4037296772003174,
     2.4944815635681152,
     2.4431681632995605]

 %% Cell type:code id: tags:

 ``` python
 np.mean(all_scores)
 ```

 %% Output

    2.3149559795856476

 %% Cell type:markdown id: tags:

 The different runs do indeed show rather different validation scores, from 1.9 to 2.49.
 The average (2.3) is a much more reliable metric than any single score—that’s the
 entire point of K-fold cross-validation. In this case, we’re off by about 2,315 USD on average, which is significant considering that the prices range from 10,000 to 50,000 USD.
 Let’s try training the model a bit longer: 500 epochs. To keep a record of how well
 the model does at each epoch, we’ll modify the training loop to save the per-epoch
 validation score log for each fold.
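
The summary figures for the four folds can be reproduced directly from the scores printed above. This sketch assumes, as stated elsewhere in this notebook, that the targets are prices in thousands of dollars, so an MAE of 1.0 corresponds to 1,000 USD.

```python
import statistics

# Validation MAE per fold, copied from the output above (targets are in thousands of dollars)
fold_scores = [
    1.9184445142745972,
    2.4037296772003174,
    2.4944815635681152,
    2.4431681632995605,
]

mean_mae = statistics.mean(fold_scores)
spread = max(fold_scores) - min(fold_scores)

print(f"mean MAE: {mean_mae:.4f} (about {mean_mae * 1000:.0f} USD)")
print(f"range across folds: {spread:.4f}")
```

The range between the best and the worst fold is a quick indication of how much a single validation split could mislead us.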

 %% Cell type:markdown id: tags:

 #### Saving the validation logs at each fold

 %% Cell type:code id: tags:

 ``` python
 num_epochs = 500
 all_mae_histories = []
 for i in range(k):
    print(f"Processing fold #{i}")
    # Prepares the validation data: data from partition #i
    val_data = train_data[i * num_val_samples: (i + 1) * num_val_samples]
    val_targets = train_targets[i * num_val_samples: (i + 1) * num_val_samples]
    # Prepares the training data: data from all other partitions
    partial_train_data = np.concatenate([train_data[:i * num_val_samples], train_data[(i + 1) * num_val_samples:]],axis=0)
    partial_train_targets = np.concatenate([train_targets[:i * num_val_samples], train_targets[(i + 1) * num_val_samples:]],axis=0)
    # Builds the Keras model (already compiled)
    model = build_model()
    # Trains the model (in silent mode, verbose=0)
    history = model.fit(partial_train_data, partial_train_targets,
    validation_data=(val_data, val_targets), epochs=num_epochs, batch_size=16, verbose=0)
    mae_history = history.history["val_mae"]
    all_mae_histories.append(mae_history)
 ```

 %% Output

    Processing fold #0
    Processing fold #1
    Processing fold #2
    Processing fold #3

 %% Cell type:markdown id: tags:

 We can then compute the average of the per-epoch MAE scores for all folds.

 %% Cell type:markdown id: tags:

 #### Building the history of successive mean K-fold validation scores

 %% Cell type:code id: tags:

 ``` python
 average_mae_history = [
 np.mean([x[i] for x in all_mae_histories]) for i in range(num_epochs)]
 ```

 %% Cell type:markdown id: tags:

 #### Plotting validation scores

 %% Cell type:code id: tags:

 ``` python
 import matplotlib.pyplot as plt
 plt.plot(range(1, len(average_mae_history) + 1), average_mae_history)
 plt.xlabel("Epochs")
 plt.ylabel("Validation MAE")
 plt.show()
 ```

 %% Cell type:markdown id: tags:

 It may be a little difficult to read the plot, due to a scaling issue: the validation MAE
 for the first few epochs is dramatically higher than the values that follow. Let’s omit
 the first 10 data points, which are on a different scale than the rest of the curve.

 %% Cell type:markdown id: tags:

 #### Plotting validation scores, excluding the first 10 data points

 %% Cell type:code id: tags:

 ``` python
 truncated_mae_history = average_mae_history[10:]
 plt.plot(range(1, len(truncated_mae_history) + 1), truncated_mae_history)
 plt.xlabel("Epochs")
 plt.ylabel("Validation MAE")
 plt.show()
 ```

 %% Cell type:markdown id: tags:

 As you can see in the figure above, validation MAE stops improving significantly after
 120–140 epochs (this number includes the 10 epochs we omitted). Past that point,
 we start overfitting.
 Once you’re finished tuning other parameters of the model (in addition to the
 number of epochs, you could also adjust the size of the intermediate layers), you can
 train a final production model on all of the training data, with the best parameters,
 and then look at its performance on the test data.
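
Besides reading the plot, the epoch with the lowest average validation MAE can be located programmatically. The curve below is a short made-up list, not the real history from this notebook; with the real data you would use `average_mae_history` as computed earlier.

```python
# Hypothetical per-epoch average validation MAE (a short made-up curve, not real results)
average_mae_history = [4.1, 3.2, 2.8, 2.6, 2.5, 2.45, 2.47, 2.52, 2.6]

# Index of the smallest value; epochs are counted from 1
best_index = min(range(len(average_mae_history)), key=average_mae_history.__getitem__)
best_epoch = best_index + 1

print(f"lowest validation MAE {average_mae_history[best_index]} at epoch {best_epoch}")
```

Because real validation curves are noisy, the exact minimum can lie somewhat later than the point where improvement visibly stalls, so this number is best used together with the plot rather than instead of it.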

 %% Cell type:markdown id: tags:

 #### Training the final model

 %% Cell type:code id: tags:

 ``` python
 # Gets a fresh, compiled model
 model = build_model()
 # Trains it on the entirety of the data
 model.fit(train_data, train_targets,
 epochs=130, batch_size=16, verbose=0)
 test_mse_score, test_mae_score = model.evaluate(<---- your code here ---->)
 ```

 %% Cell type:markdown id: tags:

 Here’s the final result:

 %% Cell type:code id: tags:

 ``` python
 test_mae_score
 ```

 %% Cell type:markdown id: tags:

 ### Generating predictions on new data
 When calling `predict()` on our binary classification model, we retrieved a scalar score between 0 and 1 for each input sample. With our multiclass classification model, we retrieved a probability distribution over all classes for each sample. Now, with this scalar regression model, `predict()` returns the model’s guess for the sample’s price in thousands of dollars:

 %% Cell type:code id: tags:

 ``` python
 predictions = <---- your code here ---->
 predictions[0]
 ```

--- a/notebooks/Block_2/Solutions to Exercises - Block 2.ipynb
+++ b/notebooks/Block_2/Solutions to Exercises - Block 2.ipynb
@@ -547,10 +547,24 @@
   "source": [
    "This notebook calculates a logistic regression using Keras. It's basically meant to show the principles of Keras.\n",
    "\n",
+    "### Background\n",
+    "\n",
+    "The Space Shuttle Challenger exploded 73 second after liftoff on January 28th, 1986. The disaster claimed the lives of all seven astronauts on board, including school teacher Christa McAuliffe.\n",
+    "\n",
+    "The details surrounding this disaster were very involved. For the purposes of this analysis, it is sufficient to point out that engineers that manufactured the large boosters that launched the rocket were aware of the possible failures that could happen during cold temperatures. They tried to prevent the launch, but were ultimately ignored and disaster ensued.\n",
+    "\n",
+    "The main concern of engineers in launching the Challenger was the evidence that the large O-rings sealing the several sections of the boosters could fail in cold temperatures.\n",
+    "\n",
+    "\n",
    "###  Datset\n",
    "\n",
-    "We investigate the data set of the challenger flight with broken O-rings (`Y=1`\n",
-    ") vs start temperature."
+    "The lowest temperature of any of the 23 prior launches (before the Challenger explosion) was 53° F. This is evident in the data set shown below. Engineers prior to the Challenger launch suggested that the launch not be attempted below 53°. The “evidence” that the o-rings could fail below 53° was based on a simple conclusion that since the launch at 53° experienced two o-ring failures, it seemed unwise to launch below that temperature. In the following analysis we demonstrate more fully how dangerous it was to launch on this specific day where the outside temperature at the time of the launch was 31°.\n",
+    "\n",
+    "The `Broken O-rings` column in the data set below records whether O-rings experienced failures during that particular launch - if they did, the value for `Broken O-rings` is 1. The `Temperature [F]` column lists the outside temperature at the time of launch.\n",
+    "\n",
+    "### Goal of Analysis\n",
+    "\n",
+    "We investigate the data set of the challenger flight with broken O-rings (`Y=1`) vs start temperature."
   ]
  },
  {

 %% Cell type:markdown id: tags:

 # Exercise 1 : Conversion from Celsius to Fahrenheit (Simple Regression Analysis)

 %% Cell type:markdown id: tags:

 The problem we will solve is to convert from Celsius to Fahrenheit, where the formula is:

 $$ f = c \times 1.8 + 32 $$


 Of course, it would be simple enough to create a conventional Python function that directly performs this calculation, but that wouldn't be machine learning.


 Instead, we will give TensorFlow some sample Celsius values (-40, -10, 0, 8, 15, 22, 38) and their corresponding Fahrenheit values, rounded to whole degrees (-40, 14, 32, 46, 59, 72, 100).
 Then, we will train a model that figures out the above formula through the training process. This is a _simple regression analysis_ problem.
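
For comparison, the conventional (non-learned) function mentioned above could look like the following sketch. The name `celsius_to_fahrenheit` is chosen here for illustration only.

```python
def celsius_to_fahrenheit(celsius):
    """Direct conversion using f = c * 1.8 + 32."""
    return celsius * 1.8 + 32


# The Celsius training inputs used in this exercise
for c in [-40, -10, 0, 8, 15, 22, 38]:
    exact = celsius_to_fahrenheit(c)
    print(f"{c} C = {exact:.1f} F (rounded: {round(exact)} F)")
```

Comparing the exact and rounded columns shows that the Fahrenheit training targets are whole-degree roundings, so even a perfectly trained model can only recover the formula approximately.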

 %% Cell type:markdown id: tags:

 ## Import dependencies

 First, import TensorFlow. Here, we're calling it `tf` for ease of use. We also tell it to only display errors.

 Next, import [NumPy](http://www.numpy.org/) as `np`. NumPy helps us represent our data as high-performance arrays.

 %% Cell type:code id: tags:

 ``` python
 from __future__ import absolute_import, division, print_function, unicode_literals
 import tensorflow as tf
 print(tf.__version__)
 import numpy as np
 ```

 %% Output

    2.7.1

 %% Cell type:code id: tags:

 ``` python
 import logging
 logger = tf.get_logger()
 logger.setLevel(logging.ERROR)
 ```

 %% Cell type:markdown id: tags:

 ## Set up training data

 As we saw before, supervised Machine Learning is all about figuring out an algorithm given a set of inputs and outputs. Since the task in this exercise is to create a model that can give the temperature in Fahrenheit when given the degrees in Celsius, we create two arrays, `celsius_q` and `fahrenheit_a`, that we can use to train our model.

 %% Cell type:code id: tags:

 ``` python
 celsius_q    = np.array([-40, -10,  0,  8, 15, 22,  38],  dtype=float)
 fahrenheit_a = np.array([-40,  14, 32, 46, 59, 72, 100],  dtype=float)

 for i,c in enumerate(celsius_q):
  print("{} degrees Celsius = {} degrees Fahrenheit".format(c, fahrenheit_a[i]))
 ```

 %% Output

    -40.0 degrees Celsius = -40.0 degrees Fahrenheit
    -10.0 degrees Celsius = 14.0 degrees Fahrenheit
    0.0 degrees Celsius = 32.0 degrees Fahrenheit
    8.0 degrees Celsius = 46.0 degrees Fahrenheit
    15.0 degrees Celsius = 59.0 degrees Fahrenheit
    22.0 degrees Celsius = 72.0 degrees Fahrenheit
    38.0 degrees Celsius = 100.0 degrees Fahrenheit

 %% Cell type:markdown id: tags:

 ### Some Machine Learning terminology

 - **Feature** — The input(s) to our model. In this case, a single value — the degrees in Celsius.

 - **Labels/response variable** — The output our model predicts. In this case, a single value — the degrees in Fahrenheit. In a classification setting, we would predict labels (discrete classes), in a regression setting, we predict a continuous response variable, such as Fahrenheit.

 - **Example** — A pair of inputs/outputs used during training. In our case a pair of values from `celsius_q` and `fahrenheit_a` at a specific index, such as `(22,72)`.


 %% Cell type:markdown id: tags:

 ## 1. Define the Network

 Next create the model. We will use the simplest possible model we can, a Dense network. Since the problem is straightforward, this network will require only a single layer, with a single neuron.

 ### Build a layer

 We'll call the layer `l0` and create it by instantiating `tf.keras.layers.Dense` with the following configuration:

 *   `input_shape=[1]` — This specifies that the input to this layer is a single value. That is, the shape is a one-dimensional array with one member. Since this is the first (and only) layer, that input shape is the input shape of the entire model. The single value is a floating point number, representing degrees Celsius.

 *   `units=1` — This specifies the number of units in the layer. The number of units defines how many internal variables the layer has to try to learn how to solve the problem (more later). Since this is the final layer, it is also the size of the model's output — a single float value representing degrees Fahrenheit. (In a multi-layered network, the size and shape of the layer would need to match the `input_shape` of the next layer.)

 %% Cell type:code id: tags:

 ``` python
 l0 = tf.keras.layers.Dense(units=1, input_shape=[1])
 ```

 %% Cell type:markdown id: tags:

 ### Assemble layers into the model

 Once layers are defined, they need to be assembled into a model. The Sequential model definition takes a list of layers as argument, specifying the calculation order from the input to the output.

 This model has just a single layer, `l0`.

 %% Cell type:code id: tags:

 ``` python
 # Assemble the model from the layer l0 defined above
 model = tf.keras.Sequential([l0])
 ```

 %% Cell type:code id: tags:

 ``` python
 model.summary()
 ```

 %% Output

    Model: "sequential_11"
    _________________________________________________________________
     Layer (type)                Output Shape              Param #
    =================================================================
     dense_15 (Dense)            (None, 1)                 2
    
    =================================================================
    Total params: 2
    Trainable params: 2
    Non-trainable params: 0
    _________________________________________________________________

 %% Cell type:markdown id: tags:

 ## 2. Compile the network, with loss and optimizer functions

 Before training, the model has to be compiled. When compiled for training, the model is given:

 - **Loss function** — A way of measuring how far off predictions are from the desired outcome. (The measured difference is called the "loss".)

 - **Optimizer function** — A way of adjusting internal values in order to reduce the loss.

 %% Cell type:code id: tags:

 ``` python
 model.compile(loss='mean_squared_error',
              optimizer=tf.keras.optimizers.Adam(0.1))
 ```

 %% Cell type:markdown id: tags:

 These are used during training (`model.fit()`, below) to first calculate the loss at each point, and then improve it. In fact, the act of calculating the current loss of a model and then improving it is precisely what training is.

 During training, the optimizer function is used to calculate adjustments to the model's internal variables. The goal is to adjust the internal variables until the model (which is really a math function) mirrors the actual equation for converting Celsius to Fahrenheit.

 `TensorFlow` uses numerical analysis to perform this tuning, and all this complexity is hidden from you, so we will not go into the details here. What is useful to know about these parameters is the following:

 The loss function ([mean squared error](https://en.wikipedia.org/wiki/Mean_squared_error)) and the optimizer ([Adam](https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/)) used here are standard for simple models like this one, but many others are available. It is not important to know how these specific functions work at this point.

 One part of the optimizer you may need to think about when building your own models is the learning rate (`0.1` in the code above). This is the step size taken when adjusting values in the model. If the value is too small, it will take too many iterations to train the model. If it is too large, training can overshoot and accuracy goes down. Finding a good value often involves some trial and error, but it usually lies between 0.001 (the default) and 0.1.

 %% Cell type:markdown id: tags:

 ## 3. Fit the model

 Train the model by calling the `fit` method.

 During training, the model takes in Celsius values, performs a calculation using the current internal variables (called "weights") and outputs values which are meant to be the Fahrenheit equivalent. Since the weights are initially set randomly, the output will not be close to the correct value. The difference between the actual output and the desired output is calculated using the loss function, and the optimizer function directs how the weights should be adjusted.

 This cycle of calculate, compare, adjust is controlled by the `fit` method. The first argument is the inputs, the second argument is the desired outputs. The `epochs` argument specifies how many times this cycle should be run, and the `verbose` argument controls how much output the method produces.
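
The calculate, compare, adjust cycle can be sketched in plain Python for the one-neuron model. Note that this sketch swaps in plain gradient descent instead of Adam, and the learning rate of 0.001 and the 10,000 iterations are assumptions chosen so that the loop stays stable on these unscaled inputs; it is an illustration of the idea, not what `fit` does internally.

```python
celsius = [-40, -10, 0, 8, 15, 22, 38]
fahrenheit = [-40, 14, 32, 46, 59, 72, 100]

w, b = 0.0, 0.0          # start from arbitrary weights
learning_rate = 0.001    # assumed value; much larger steps diverge on these unscaled inputs
n = len(celsius)

for epoch in range(10000):
    # Calculate: predictions with the current weights
    predictions = [w * c + b for c in celsius]
    # Compare: gradients of the mean squared error with respect to w and b
    errors = [p - f for p, f in zip(predictions, fahrenheit)]
    grad_w = 2 * sum(e * c for e, c in zip(errors, celsius)) / n
    grad_b = 2 * sum(errors) / n
    # Adjust: take a small step against the gradient
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"w = {w:.3f}, b = {b:.3f}, prediction for 100 C = {w * 100 + b:.2f}")
```

The loop should settle near the least-squares line through the training points, with a slope of about 1.8 and an intercept close to 32. Adaptive optimizers such as Adam are far less sensitive to the input scale, which is one reason a learning rate as large as 0.1 works in the Keras code.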

 %% Cell type:code id: tags:

 ``` python
 history = model.fit(celsius_q, fahrenheit_a, epochs=500, verbose=False)
 print("Finished training the model")
 ```

 %% Output

    Finished training the model

 %% Cell type:markdown id: tags:

 ## 4. Evaluate the Model - Display training statistics

 The `fit` method returns a history object. We can use this object to plot how the loss of our model goes down after each training epoch. A high loss means that the Fahrenheit values the model predicts are far from the corresponding values in `fahrenheit_a`.

 We'll use [Matplotlib](https://matplotlib.org/) to visualize this (you could use another tool). As you can see, our model improves very quickly at first, and then has a steady, slow improvement until it is very near "perfect" towards the end.


 %% Cell type:code id: tags:

 ``` python
 import matplotlib.pyplot as plt
 plt.xlabel('Epoch Number')
 plt.ylabel("Loss Magnitude")
 plt.plot(history.history['loss'])
 ```

 %% Output

    [<matplotlib.lines.Line2D at 0x7ff9ec12a190>]



 %% Cell type:markdown id: tags:

 ## 5. Use the model to predict values

 Now you have a model that has been trained to learn the relationship between `celsius_q` and `fahrenheit_a`. You can use the `predict` method to have it calculate the Fahrenheit value for a previously unseen Celsius value.

 So, for example, if the Celsius value is 100, what do you think the Fahrenheit result will be? Take a guess before you run this code.

 %% Cell type:code id: tags:

 ``` python
 print(model.predict([100.0]))
 ```

 %% Output

    [[211.31021]]

 %% Cell type:markdown id: tags:

 The correct answer is $100 \times 1.8 + 32 = 212$, so our model is doing really well.

 ### To review


 *   We created a model with a Dense layer
 *   We trained it with 3500 examples (7 pairs, over 500 epochs).

 Our model tuned the variables (weights) in the Dense layer until it returned a close approximation of the correct Fahrenheit value for a Celsius value it had not seen. (Remember, 100 Celsius was not part of our training data.)



 %% Cell type:markdown id: tags:

 ## Looking at the layer weights

 Finally, let's print the internal variables of the Dense layer.

 %% Cell type:code id: tags:

 ``` python
 print("These are the layer variables: {}".format(model.get_weights()))
 ```

 %% Output

    These are the layer variables: [array([[1.8242955]], dtype=float32), array([28.880667], dtype=float32)]

 %% Cell type:markdown id: tags:

 The first variable (about 1.82) is close to 1.8, and the second (about 28.9) is in the neighborhood of 32. These values (1.8 and 32) are the actual constants in the real conversion formula.

 For a single neuron with a single input and a single output, the internal math of a Dense layer looks the same as [the equation for a line](https://en.wikipedia.org/wiki/Linear_equation#Slope%E2%80%93intercept_form), $y = mx + b$, which has the same form as the conversion equation, $f = 1.8c + 32$.

 Since the form is the same, the variables should converge toward the standard values of 1.8 and 32, which is approximately what happened after 500 epochs.

 With additional neurons, additional inputs, and additional outputs, the formula becomes much more complex, but the idea is the same.
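
To make the line equation concrete, the prediction for 100 degrees Celsius can be reproduced by hand from the two printed weights. The function name `dense_single_neuron` is introduced only for this sketch.

```python
# Weights printed by model.get_weights() in the output above
weight = 1.8242955
bias = 28.880667


def dense_single_neuron(x):
    """A Dense layer with one input and one unit (no activation): y = weight * x + bias."""
    return weight * x + bias


print(dense_single_neuron(100.0))
```

The result agrees with the value returned by `model.predict` above up to floating-point rounding, which confirms that the layer computes nothing more than $y = mx + b$.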

 ### A little experiment

 Just for fun, what if we created more Dense layers with different numbers of units, which therefore also have more variables?

 %% Cell type:code id: tags:

 ``` python
 l0 = tf.keras.layers.Dense(units=4, input_shape=[1])
 l1 = tf.keras.layers.Dense(units=4)
 l2 = tf.keras.layers.Dense(units=1)
 model = tf.keras.Sequential([l0, l1, l2])
 model.compile(loss='mean_squared_error', optimizer=tf.keras.optimizers.Adam(0.1))
 model.fit(celsius_q, fahrenheit_a, epochs=500, verbose=False)
 print("Finished training the model")
 print(model.predict([100.0]))
 print("Model predicts that 100 degrees Celsius is: {} degrees Fahrenheit".format(model.predict([100.0])))
 print("These are the l0 variables: {}".format(l0.get_weights()))
 print("These are the l1 variables: {}".format(l1.get_weights()))
 print("These are the l2 variables: {}".format(l2.get_weights()))
 ```

 %% Output

    Finished training the model
    [[211.74742]]
    Model predicts that 100 degrees Celsius is: [[211.74742]] degrees Fahrenheit
    These are the l0 variables: [array([[0.5968949 , 0.02139384, 0.01172368, 0.35185102]], dtype=float32), array([ 3.484818 , -3.0073252, -2.7715163,  3.010582 ], dtype=float32)]
    These are the l1 variables: [array([[ 0.20319672, -0.41102245,  0.7771168 , -0.7402758 ],
           [-0.7281887 ,  0.9473572 , -0.15004049,  0.12950596],
           [-0.96478873,  0.05949384, -0.2531712 ,  0.72464406],
           [-0.10805991, -0.88057184,  1.0134443 , -0.15393376]],
          dtype=float32), array([ 1.8714209, -3.126011 ,  3.3828623, -2.6109185], dtype=float32)]
    These are the l2 variables: [array([[ 0.23399544],
           [-0.9353689 ],
           [ 1.1917399 ],
           [-0.6491724 ]], dtype=float32), array([3.2284596], dtype=float32)]

 %% Cell type:markdown id: tags:

 As you can see, this model is also able to predict the corresponding Fahrenheit value really well. But when you look at the variables (weights) in the `l0` and `l1` layers, they are nothing even close to ~1.8 and ~32. The added complexity hides the "simple" form of the conversion equation.
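
One way to see why the simple form is hidden rather than lost: Dense layers without an activation function are linear, so stacking them still yields a single straight line. The sketch below uses small hypothetical weights (not the trained values printed above) and a hand-written `dense` helper to recover that line.

```python
def dense(inputs, weights, biases):
    """Linear Dense layer: weights[i][j] connects input i to unit j."""
    return [
        sum(x * weights[i][j] for i, x in enumerate(inputs)) + biases[j]
        for j in range(len(biases))
    ]


# Small hypothetical weights (not the trained values shown above)
layers = [
    ([[0.5, -1.0]], [1.0, 2.0]),              # 1 input  -> 2 units
    ([[2.0, 0.0], [1.0, 3.0]], [0.5, -0.5]),  # 2 inputs -> 2 units
    ([[1.0], [-2.0]], [4.0]),                 # 2 inputs -> 1 unit
]


def network(x):
    values = [x]
    for weights, biases in layers:
        values = dense(values, weights, biases)
    return values[0]


# Recover the single line that the whole stack is equivalent to
intercept = network(0.0)
slope = network(1.0) - network(0.0)
print(f"slope = {slope}, intercept = {intercept}")
print(network(100.0), slope * 100.0 + intercept)
```

The same two-point trick applied to the trained three-layer model would give its effective slope and intercept, which is where values near 1.8 and 32 would reappear.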


 %% Cell type:markdown id: tags:

 # Exercise 2 : O-Rings seen with Logistic Regression

 %% Cell type:markdown id: tags:

 This notebook calculates a logistic regression using Keras. It's basically meant to show the principles of Keras.

+### Background
+
+The Space Shuttle Challenger exploded 73 seconds after liftoff on January 28th, 1986. The disaster claimed the lives of all seven astronauts on board, including school teacher Christa McAuliffe.
+
+The details surrounding this disaster were very involved. For the purposes of this analysis, it is sufficient to point out that engineers that manufactured the large boosters that launched the rocket were aware of the possible failures that could happen during cold temperatures. They tried to prevent the launch, but were ultimately ignored and disaster ensued.
+
+The main concern of engineers in launching the Challenger was the evidence that the large O-rings sealing the several sections of the boosters could fail in cold temperatures.
+
+
 ###  Dataset

-We investigate the data set of the challenger flight with broken O-rings (`Y=1`
-) vs start temperature.
+The lowest temperature of any of the 23 prior launches (before the Challenger explosion) was 53°F. This is evident in the data set shown below. Before the Challenger launch, engineers suggested that the launch not be attempted below 53°F. The “evidence” that the O-rings could fail below 53°F was a simple inference: since the launch at 53°F experienced two O-ring failures, it seemed unwise to launch below that temperature. In the following analysis we demonstrate more fully how dangerous it was to launch on this specific day, when the outside temperature at the time of the launch was 31°F.
+
+The `Broken O-rings` column in the data set below records whether O-rings experienced failures during that particular launch - if they did, the value for `Broken O-rings` is 1. The `Temperature [F]` column lists the outside temperature at the time of launch.
+
+### Goal of Analysis
+
+We investigate the data set of the Challenger flights with broken O-rings (`Y=1`) versus launch temperature.

 %% Cell type:code id: tags:

 ``` python
 %matplotlib inline
 import numpy as np
 import tensorflow as tf
 import matplotlib.pyplot as plt
 import matplotlib.image as imgplot
 import pandas as pd
 import tempfile
 data = np.asarray(pd.read_csv('./challenger.txt', sep=','), dtype='float32')
 plt.plot(data[:,0], data[:,1], 'o')
 plt.axis([40, 85, -0.1, 1.2])
 plt.xlabel('Temperature [F]')
 plt.ylabel('Broken O-rings')
 ```

 %% Output

    Text(0, 0.5, 'Broken O-rings')



 %% Cell type:code id: tags:

 ``` python
 y_values = data[:,1]
 print(y_values)
 ```

 %% Output

    [0. 1. 0. 0. 0. 0. 0. 0. 1. 1. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 1.]

 %% Cell type:markdown id: tags:

 ## Mathematical Notes

 We are considering the probability $P(y_i=1|x_i)$ of the class $y_i=1$ given the $i$-th data point $x_i$ ($x_i$ could be a vector). This is given by:

 $$
 P(y_i=1 | x_i) = \frac{e^{(b +  x_i w)}}{1 + e^{(b + x_i w)}} = [1 + e^{-(b + x_i w)}]^{-1}
 $$

 If we have more than one data point, which we usually do, we have to apply the equation above to each of the $N$ data points. In this case we can use a vectorized version with $x=(x_1,x_2,\ldots,x_N)$ and $y=(y_1,y_2,\ldots,y_N)$.
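
The code cells in this exercise measure the quality of the fit with the binary cross-entropy. Written in the notation above, with $p_i$ as shorthand for $P(y_i=1 | x_i)$ (the symbols $p_i$ and $L$ are introduced here for convenience), the standard form that matches the `cross_entropy` expression in the NumPy code is:

$$
L(w, b) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right], \qquad p_i = \left[1 + e^{-(b + x_i w)}\right]^{-1}
$$

Each data point contributes $-\log p_i$ if $y_i = 1$ and $-\log(1 - p_i)$ if $y_i = 0$, so confident wrong predictions are penalized heavily.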

 %% Cell type:markdown id: tags:

 ### Numpy code
 This code shows the calculation for one set of parameter values using `NumPy` (like a single forward pass).

 %% Cell type:code id: tags:

 ``` python
 # Data
 N = len(data)
 x = data[:,0]
 y = data[:,1]
 # Initial Value for the weights
 w = -0.20
 b = 20.0
 # predicted probabilities
 p_1 = 1 / (1 + np.exp(-x*w - b))
 # cross-entropy loss function
 cross_entropy = -np.mean(y * np.log(p_1) + (1-y) * np.log(1-p_1))
 print(cross_entropy)
 print(np.round(p_1,3))
 ```

 %% Output

    3.882916
    [0.999 0.998 0.998 0.998 0.999 0.996 0.996 0.998 1.    0.999 0.998 0.988
     0.999 1.    0.999 0.993 0.998 0.978 0.992 0.985 0.993 0.992 1.   ]

 %% Cell type:markdown id: tags:

 ## Better values from intuition

 Now let's try to find better values for $w$ and $b$. Let's assume $w = -1$. We want the probability
 of damage, $P(y_i=1 | x_i)$, to be $0.5$ at the temperature where the data change from mostly damaged to mostly undamaged launches.
 Determine an appropriate value for $b$.
 Hint: at which $x$ value should $P(y_i=1 | x_i)$ be $0.5$? Look at the data. At this $x$ value the term $1 + e^{-(b + x_i w)}$ must be $2$.

 **Solution**

 $P(y=1 | x) = 0.5$ at $x \approx 65$

 $-(b + (-1) x_i) = 0 \rightarrow b = 65$

 %% Cell type:code id: tags:

 ``` python
 w_val = -1
 b_val = 65
 plt.plot(data[:,0], data[:,1], 'o')
 plt.axis([40, 85, -0.1, 1.2])
 x_pred = np.linspace(40,85)
 x_pred = np.resize(x_pred,[len(x_pred),1])
 y_pred = 1 / (1 + np.exp(-x_pred*w_val - b_val))
 plt.plot(x_pred, y_pred)

 # predicted probabilities
 p_1 = 1 / (1 + np.exp(-x*w_val - b_val))

 # cross-entropy loss function
 cross_entropy = -np.mean(y * np.log(p_1) + (1-y) * np.log(1-p_1))

 print(cross_entropy)
 print(np.round(p_1,3))
 ```

 %% Output

    0.9094435
    [0.269 0.007 0.018 0.047 0.119 0.001 0.    0.007 1.    0.881 0.007 0.
     0.119 1.    0.119 0.    0.007 0.    0.    0.    0.    0.    0.999]



 %% Cell type:markdown id: tags:

 We can see that the value of the cross-entropy has decreased from 3.882916 to 0.9094435.

 %% Cell type:markdown id: tags:

 ## TODO : determine the accuracy of this logistic regression model

 %% Cell type:code id: tags:

 ``` python
 y_pred = np.round(p_1, decimals=0).astype('int')
 accuracy = np.mean(y==y_pred)
 print("Accuracy: ", accuracy)
 ```

 %% Output

    Accuracy:  0.8695652173913043

 %% Cell type:markdown id: tags:

 ## TODO : set up a Keras model

 If there are two labels, we use `binary_crossentropy` as the loss function. In this case, the output layer uses a `sigmoid` activation.

 %% Cell type:code id: tags:

 ``` python
 l0 = tf.keras.layers.Dense(units=1, activation = tf.nn.sigmoid, input_shape=[1])
 model = tf.keras.Sequential([l0])
 model.compile(loss='binary_crossentropy', optimizer=tf.keras.optimizers.Adam(0.01), metrics=['accuracy'])
 model.fit(x, y, epochs=10000, verbose=False)
 ```

 %% Output

    <keras.callbacks.History at 0x7ff9e419c450>

 %% Cell type:code id: tags:

 ``` python
 plt.plot(data[:,0], data[:,1], 'o')
 plt.axis([40, 85, -0.1, 1.2])
 x_pred = np.linspace(40,85)
 x_pred = np.resize(x_pred,[len(x_pred),1])
 y_pred = model.predict(x_pred)
 plt.plot(x_pred, y_pred)
 ```

 %% Output

    [<matplotlib.lines.Line2D at 0x7ff9ec0405d0>]



 %% Cell type:code id: tags:

 ``` python
 print(model.get_weights())
 ```

 %% Output

    [array([[-0.23217289]], dtype=float32), array([15.043545], dtype=float32)]

 %% Cell type:code id: tags:

 ``` python
 # Extract the fitted weight and bias as scalars so the arrays below keep their shapes
 w_val = model.get_weights()[0][0, 0]
 b_val = model.get_weights()[1][0]
 plt.plot(data[:,0], data[:,1], 'o')
 plt.axis([40, 85, -0.1, 1.2])
 x_pred = np.linspace(40,85)
 x_pred = np.resize(x_pred,[len(x_pred),1])
 y_pred = 1 / (1 + np.exp(-x_pred*w_val - b_val))
 plt.plot(x_pred, y_pred)

 # predicted probabilities
 p_1 = 1 / (1 + np.exp(-x*w_val - b_val))

 # cross-entropy loss function
 cross_entropy = -np.mean(y * np.log(p_1) + (1-y) * np.log(1-p_1))
 print("Cross-entropy: ", cross_entropy)
 y_pred = np.round(p_1, decimals=0).astype('int')
 accuracy = np.mean(y==y_pred)
 print("Accuracy: ", accuracy)
 ```

 %% Output

    Cross-entropy:  0.4416346
    Accuracy:  0.8695652173913043



 %% Cell type:code id: tags:

 ``` python
 model.evaluate(x, y)
 ```

 %% Output

    1/1 [==============================] - 0s 118ms/step - loss: 0.4416 - accuracy: 0.8696

    [0.4416346251964569, 0.8695651888847351]

 %% Cell type:markdown id: tags:

 The value of the cross-entropy loss function decreased from 0.9094 (hand-picked parameters) to 0.4416 (fitted parameters).
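
With the fitted parameters printed above, the model can be evaluated at the launch-day temperature of 31°F mentioned in the background. This is a sketch only: 31°F lies far below the coldest observed launch (53°F), so the number is an extrapolation of the logistic curve, and `failure_probability` is a name introduced here for illustration.

```python
import math

# Fitted parameters printed by model.get_weights() above
w_fit = -0.23217289
b_fit = 15.043545


def failure_probability(temperature_f):
    """Logistic model: P(y=1 | x) = 1 / (1 + exp(-(b + w * x)))."""
    return 1.0 / (1.0 + math.exp(-(b_fit + w_fit * temperature_f)))


# Temperature at which the model predicts a 50% chance (b + w * x = 0)
boundary = -b_fit / w_fit

print(f"P(failure) at 31 F: {failure_probability(31.0):.4f}")
print(f"50% boundary: {boundary:.1f} F")
```

The fitted 50% boundary lands close to the value of 65 chosen by eye earlier, and the predicted probability of O-ring damage at 31°F is very close to 1.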

 %% Cell type:markdown id: tags:

 # Exercise 3 : MNIST and Multinomial Logistic Regression

 %% Cell type:markdown id: tags:

 In this exercise we use multinomial logistic regression to predict which digit is shown in each handwritten image of the MNIST dataset.

 %% Cell type:markdown id: tags:

 ## TODO : read MNIST data and compute validation accuracy for a multinomial logistic regression model, see [Multinomial Logistic Regression](https://en.wikipedia.org/wiki/Multinomial_logistic_regression)

 If there are more than two classes, we use `categorical_crossentropy` as the loss function and the output layer should be a `softmax` layer.
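
What these two pieces compute for a single sample can be sketched in plain Python. The scores and the three-class setup below are hypothetical and chosen only to keep the example short; the function names mirror the Keras terms but are hand-written here.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = [z - max(logits) for z in logits]  # subtract the max for numerical stability
    exps = [math.exp(z) for z in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(one_hot_target, probabilities):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probabilities))


# Hypothetical scores for a 3-class problem (MNIST would have 10 classes)
logits = [2.0, 1.0, 0.1]
probabilities = softmax(logits)
target = [1, 0, 0]  # one-hot label: the true class is class 0

print([round(p, 3) for p in probabilities])
print(round(categorical_crossentropy(target, probabilities), 3))
```

Because the target is one-hot, only the probability of the true class enters the loss, which is why the labels are converted with `to_categorical` before training.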

 %% Cell type:code id: tags:

 ``` python
 from __future__ import absolute_import, division, print_function, unicode_literals
 from tensorflow.keras.datasets import mnist
 from tensorflow.keras.utils import to_categorical


 # Import TensorFlow
 import tensorflow as tf

 # Helper libraries
 import math
 import numpy as np
 import matplotlib.pyplot as plt

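 # Load MNIST data (pixel values are left in the raw 0-255 range here;
 # scaling them to [0, 1] is a common preprocessing step and would change the loss values)
 (X_train, y_train), (X_test, y_test) = mnist.load_data()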

 # One-hot-encoded label vector
 y_train_cat = to_categorical(y_train, 10)
 y_test_cat = to_categorical(y_test, 10)

 model = tf.keras.Sequential()
 model.add(tf.keras.layers.Flatten(input_shape=(28, 28)))
 model.add(tf.keras.layers.Dense(10, activation=tf.nn.softmax, batch_input_shape=(None, 784)))
 model.compile(loss='categorical_crossentropy',
              optimizer='sgd',
              metrics=['accuracy'])

 history = model.fit(X_train,
                    y_train_cat,
                    epochs=10,
                    validation_data=(X_test, y_test_cat))
 ```

 %% Output

    Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
    11493376/11490434 [==============================] - 0s 0us/step
    11501568/11490434 [==============================] - 0s 0us/step
    Epoch 1/10
    1875/1875 [==============================] - 9s 5ms/step - loss: 320.4593 - accuracy: 0.8417 - val_loss: 242.9400 - val_accuracy: 0.8801
    Epoch 2/10
    1875/1875 [==============================] - 8s 4ms/step - loss: 257.2356 - accuracy: 0.8691 - val_loss: 381.1338 - val_accuracy: 0.8126
    Epoch 3/10
    1875/1875 [==============================] - 8s 5ms/step - loss: 246.9380 - accuracy: 0.8741 - val_loss: 225.4031 - val_accuracy: 0.8802
    Epoch 4/10
    1875/1875 [==============================] - 8s 4ms/step - loss: 238.4442 - accuracy: 0.8784 - val_loss: 312.3720 - val_accuracy: 0.8513
    Epoch 5/10
    1875/1875 [==============================] - 9s 5ms/step - loss: 238.5885 - accuracy: 0.8783 - val_loss: 195.5100 - val_accuracy: 0.9018
    Epoch 6/10
    1875/1875 [==============================] - 9s 5ms/step - loss: 238.2655 - accuracy: 0.8773 - val_loss: 253.4705 - val_accuracy: 0.8795
    Epoch 7/10
    1875/1875 [==============================] - 9s 5ms/step - loss: 237.9472 - accuracy: 0.8798 - val_loss: 250.7999 - val_accuracy: 0.8824
    Epoch 8/10
    1875/1875 [==============================] - 8s 5ms/step - loss: 237.2793 - accuracy: 0.8802 - val_loss: 285.5836 - val_accuracy: 0.8707
    Epoch 9/10
    1875/1875 [==============================] - 8s 4ms/step - loss: 235.1660 - accuracy: 0.8808 - val_loss: 277.7219 - val_accuracy: 0.8809
    Epoch 10/10
    1875/1875 [==============================] - 9s 5ms/step - loss: 233.3158 - accuracy: 0.8822 - val_loss: 263.0470 - val_accuracy: 0.8765

 %% Cell type:code id: tags:

 ``` python
 test_loss, test_accuracy = model.evaluate(X_test, y_test_cat)
 print('Accuracy on test dataset:', test_accuracy)
 ```

 %% Output

    313/313 [==============================] - 1s 4ms/step - loss: 263.0470 - accuracy: 0.8765
    Accuracy on test dataset: 0.8765000104904175

 %% Cell type:markdown id: tags:

 ## TODO : use different regularization terms, see [Keras Regularizer](https://keras.io/regularizers/)
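
As a reminder of what a regularization term does, the sketch below computes L1 and L2 penalties for a small weight matrix by hand. In Keras, `tf.keras.regularizers.l2(0.01)` adds 0.01 times the sum of the squared kernel weights to the loss; `l1_penalty` and `l2_penalty` here are hypothetical helpers that mirror those formulas, and the weights and data loss are made-up values.

```python
import numpy as np

def l2_penalty(weights, factor=0.01):
    """L2 penalty: factor times the sum of squared weights."""
    return factor * float(np.sum(np.square(weights)))

def l1_penalty(weights, factor=0.01):
    """L1 penalty: factor times the sum of absolute weights."""
    return factor * float(np.sum(np.abs(weights)))

W = np.array([[0.5, -1.0],
              [2.0, 0.0]])
data_loss = 0.30  # made-up cross-entropy value for illustration

print(l2_penalty(W))              # 0.01 * (0.25 + 1 + 4)
print(l1_penalty(W))              # 0.01 * (0.5 + 1 + 2)
print(data_loss + l2_penalty(W))  # total loss minimized during training
```

Because the penalty is part of the reported loss, the `loss` values of a regularized model are not directly comparable with those of an unregularized one; the accuracy metric is.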

 %% Cell type:code id: tags:

 ``` python
 # Import TensorFlow
 import tensorflow as tf

 # Helper libraries
 import math
 import numpy as np
 import matplotlib.pyplot as plt

 # Load MNIST data
 (X_train, y_train), (X_test, y_test) = mnist.load_data()

 # One-hot-encode label vector
 y_train_cat = to_categorical(y_train, 10)
 y_test_cat = to_categorical(y_test, 10)

 # Define Network
 model = tf.keras.Sequential()
 model.add(tf.keras.layers.Flatten(input_shape=(28, 28)))
 # Softmax output layer with an L2 penalty (factor 0.01) on its weights
 model.add(tf.keras.layers.Dense(10,
                                 activation=tf.nn.softmax,
                                 kernel_regularizer=tf.keras.regularizers.l2(0.01)))

 # Compile Network
 model.compile(loss='categorical_crossentropy',
              optimizer='sgd',
              metrics=['accuracy'])

 # Fit Network
 history = model.fit(X_train,
                    y_train_cat,
                    epochs=10,
                    validation_data=(X_test, y_test_cat))
 ```

 %% Output

    Epoch 1/10
    1875/1875 [==============================] - 9s 5ms/step - loss: 338.3148 - accuracy: 0.8376 - val_loss: 217.6844 - val_accuracy: 0.8881
    Epoch 2/10
    1875/1875 [==============================] - 9s 5ms/step - loss: 287.3254 - accuracy: 0.8573 - val_loss: 281.7627 - val_accuracy: 0.8726
    Epoch 3/10
    1875/1875 [==============================] - 9s 5ms/step - loss: 288.3959 - accuracy: 0.8602 - val_loss: 320.5165 - val_accuracy: 0.8475
    Epoch 4/10
    1875/1875 [==============================] - 9s 5ms/step - loss: 296.8785 - accuracy: 0.8590 - val_loss: 222.1539 - val_accuracy: 0.8899
    Epoch 5/10
    1875/1875 [==============================] - 9s 5ms/step - loss: 283.6608 - accuracy: 0.8618 - val_loss: 460.9875 - val_accuracy: 0.8016
    Epoch 6/10
    1875/1875 [==============================] - 9s 5ms/step - loss: 277.9357 - accuracy: 0.8639 - val_loss: 245.5682 - val_accuracy: 0.8875
    Epoch 7/10
    1875/1875 [==============================] - 9s 5ms/step - loss: 287.6865 - accuracy: 0.8610 - val_loss: 252.3817 - val_accuracy: 0.8678
    Epoch 8/10
    1875/1875 [==============================] - 9s 5ms/step - loss: 284.3436 - accuracy: 0.8620 - val_loss: 228.4689 - val_accuracy: 0.8884
    Epoch 9/10
    1875/1875 [==============================] - 9s 5ms/step - loss: 282.4845 - accuracy: 0.8623 - val_loss: 258.1628 - val_accuracy: 0.8879
    Epoch 10/10
    1875/1875 [==============================] - 9s 5ms/step - loss: 285.5985 - accuracy: 0.8620 - val_loss: 484.2548 - val_accuracy: 0.7784

 %% Cell type:code id: tags:

 ``` python
 # Evaluate Network
 test_loss, test_accuracy = model.evaluate(X_test, y_test_cat)
 print('Accuracy on test dataset:', test_accuracy)
 ```

 %% Output

    313/313 [==============================] - 1s 3ms/step - loss: 484.2548 - accuracy: 0.7784
    Accuracy on test dataset: 0.7784000039100647

 %% Cell type:markdown id: tags:

 # Exercise 4 : Prediction of House Prices

 %% Cell type:markdown id: tags:

 In this exercise, we’ll attempt to predict the median price of homes in a given Boston
 suburb in the mid-1970s, given data points about the suburb at the time, such as the
 crime rate, the local property tax rate, and so on. The dataset has relatively few data points: only
 506, split between 404 training samples and 102 test samples. And each feature in the
 input data (for example, the crime rate) has a different scale. For instance, some values
 are proportions, which take values between 0 and 1, others take values between 1
 and 12, others between 0 and 100, and so on.

 %% Cell type:markdown id: tags:

 ### Loading the Boston housing dataset

 %% Cell type:code id: tags:

 ``` python
 from tensorflow.keras.datasets import boston_housing
 (train_data, train_targets), (test_data, test_targets) = (boston_housing.load_data())
 ```

 %% Output

    Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/boston_housing.npz
    57344/57026 [==============================] - 0s 0us/step
    65536/57026 [==================================] - 0s 0us/step

 %% Cell type:markdown id: tags:

 Let’s look at the data:

 %% Cell type:code id: tags:

 ``` python
 print(train_data.shape)
 print(test_data.shape)
 ```

 %% Output

    (404, 13)
    (102, 13)

 %% Cell type:markdown id: tags:

 As you can see, we have 404 training samples and 102 test samples, each with 13
 numerical features, such as per capita crime rate, average number of rooms per dwelling,
 accessibility to highways, and so on.
 The targets are the median values of owner-occupied homes, in thousands of dollars:

 %% Cell type:code id: tags:

 ``` python
 train_targets
 ```

 %% Output

    array([15.2, 42.3, 50. , 21.1, 17.7, 18.5, 11.3, 15.6, 15.6, 14.4, 12.1,
           17.9, 23.1, 19.9, 15.7,  8.8, 50. , 22.5, 24.1, 27.5, 10.9, 30.8,
           32.9, 24. , 18.5, 13.3, 22.9, 34.7, 16.6, 17.5, 22.3, 16.1, 14.9,
           23.1, 34.9, 25. , 13.9, 13.1, 20.4, 20. , 15.2, 24.7, 22.2, 16.7,
           12.7, 15.6, 18.4, 21. , 30.1, 15.1, 18.7,  9.6, 31.5, 24.8, 19.1,
           22. , 14.5, 11. , 32. , 29.4, 20.3, 24.4, 14.6, 19.5, 14.1, 14.3,
           15.6, 10.5,  6.3, 19.3, 19.3, 13.4, 36.4, 17.8, 13.5, 16.5,  8.3,
           14.3, 16. , 13.4, 28.6, 43.5, 20.2, 22. , 23. , 20.7, 12.5, 48.5,
           14.6, 13.4, 23.7, 50. , 21.7, 39.8, 38.7, 22.2, 34.9, 22.5, 31.1,
           28.7, 46. , 41.7, 21. , 26.6, 15. , 24.4, 13.3, 21.2, 11.7, 21.7,
           19.4, 50. , 22.8, 19.7, 24.7, 36.2, 14.2, 18.9, 18.3, 20.6, 24.6,
           18.2,  8.7, 44. , 10.4, 13.2, 21.2, 37. , 30.7, 22.9, 20. , 19.3,
           31.7, 32. , 23.1, 18.8, 10.9, 50. , 19.6,  5. , 14.4, 19.8, 13.8,
           19.6, 23.9, 24.5, 25. , 19.9, 17.2, 24.6, 13.5, 26.6, 21.4, 11.9,
           22.6, 19.6,  8.5, 23.7, 23.1, 22.4, 20.5, 23.6, 18.4, 35.2, 23.1,
           27.9, 20.6, 23.7, 28. , 13.6, 27.1, 23.6, 20.6, 18.2, 21.7, 17.1,
            8.4, 25.3, 13.8, 22.2, 18.4, 20.7, 31.6, 30.5, 20.3,  8.8, 19.2,
           19.4, 23.1, 23. , 14.8, 48.8, 22.6, 33.4, 21.1, 13.6, 32.2, 13.1,
           23.4, 18.9, 23.9, 11.8, 23.3, 22.8, 19.6, 16.7, 13.4, 22.2, 20.4,
           21.8, 26.4, 14.9, 24.1, 23.8, 12.3, 29.1, 21. , 19.5, 23.3, 23.8,
           17.8, 11.5, 21.7, 19.9, 25. , 33.4, 28.5, 21.4, 24.3, 27.5, 33.1,
           16.2, 23.3, 48.3, 22.9, 22.8, 13.1, 12.7, 22.6, 15. , 15.3, 10.5,
           24. , 18.5, 21.7, 19.5, 33.2, 23.2,  5. , 19.1, 12.7, 22.3, 10.2,
           13.9, 16.3, 17. , 20.1, 29.9, 17.2, 37.3, 45.4, 17.8, 23.2, 29. ,
           22. , 18. , 17.4, 34.6, 20.1, 25. , 15.6, 24.8, 28.2, 21.2, 21.4,
           23.8, 31. , 26.2, 17.4, 37.9, 17.5, 20. ,  8.3, 23.9,  8.4, 13.8,
            7.2, 11.7, 17.1, 21.6, 50. , 16.1, 20.4, 20.6, 21.4, 20.6, 36.5,
            8.5, 24.8, 10.8, 21.9, 17.3, 18.9, 36.2, 14.9, 18.2, 33.3, 21.8,
           19.7, 31.6, 24.8, 19.4, 22.8,  7.5, 44.8, 16.8, 18.7, 50. , 50. ,
           19.5, 20.1, 50. , 17.2, 20.8, 19.3, 41.3, 20.4, 20.5, 13.8, 16.5,
           23.9, 20.6, 31.5, 23.3, 16.8, 14. , 33.8, 36.1, 12.8, 18.3, 18.7,
           19.1, 29. , 30.1, 50. , 50. , 22. , 11.9, 37.6, 50. , 22.7, 20.8,
           23.5, 27.9, 50. , 19.3, 23.9, 22.6, 15.2, 21.7, 19.2, 43.8, 20.3,
           33.2, 19.9, 22.5, 32.7, 22. , 17.1, 19. , 15. , 16.1, 25.1, 23.7,
           28.7, 37.2, 22.6, 16.4, 25. , 29.8, 22.1, 17.4, 18.1, 30.3, 17.5,
           24.7, 12.6, 26.5, 28.7, 13.3, 10.4, 24.4, 23. , 20. , 17.8,  7. ,
           11.8, 24.4, 13.8, 19.4, 25.2, 19.4, 19.4, 29.1])

 %% Cell type:markdown id: tags:

 The prices are typically between 10000 and 50000 USD. If that sounds cheap, remember
 that this was the mid-1970s, and these prices aren’t adjusted for inflation.

 %% Cell type:markdown id: tags:

 ### Preparing the data

 It would be problematic to feed into a neural network values that all take wildly different ranges. The model might be able to automatically adapt to such heterogeneous data, but it would definitely make learning more difficult. A widespread best practice for dealing with such data is to do feature-wise normalization: for each feature in the input data (a column in the input data matrix), we subtract the mean of the feature and divide by the standard deviation, so that the feature is centered around 0 and has a unit standard deviation. This is easily done in NumPy.

 %% Cell type:markdown id: tags:

 ### Normalizing the data

 %% Cell type:code id: tags:

 ``` python
 mean = train_data.mean(axis=0)
 train_data -= mean
 std = train_data.std(axis=0)
 train_data /= std
 test_data -= mean
 test_data /= std
 ```

 %% Cell type:markdown id: tags:

 Note that the quantities used for normalizing the test data are computed using the
 training data. You should never use any quantity computed on the test data in your
 workflow, even for something as simple as data normalization.
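
The same rule can be written as a small function that does not modify its inputs. This is a sketch on synthetic data, with `standardize` as a hypothetical helper; it shows that the training features end up with mean 0 and standard deviation 1 exactly, while the test features are only approximately standardized, because they are transformed with the training statistics.

```python
import numpy as np

def standardize(train, test):
    """Standardize both arrays with the mean and std of the training array."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    return (train - mean) / std, (test - mean) / std

# Synthetic stand-in data: three features on very different scales.
rng = np.random.default_rng(0)
toy_train = rng.normal(loc=[0.5, 6.0, 50.0], scale=[0.1, 2.0, 25.0], size=(200, 3))
toy_test = rng.normal(loc=[0.5, 6.0, 50.0], scale=[0.1, 2.0, 25.0], size=(50, 3))

train_std, test_std = standardize(toy_train, toy_test)
print(train_std.mean(axis=0).round(6), train_std.std(axis=0).round(6))
print(test_std.mean(axis=0).round(3), test_std.std(axis=0).round(3))
```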

 %% Cell type:markdown id: tags:

 ### Building your model

 Because so few samples are available, we’ll use a very small model with two intermediate layers, each with 64 units. In general, the less training data you have, the worse overfitting will be, and using a small model is one way to mitigate overfitting.

 %% Cell type:markdown id: tags:

 #### Model definition

 %% Cell type:code id: tags:

 ``` python
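 from tensorflow import keras
 from tensorflow.keras import layers

 def build_model():
     # Two hidden layers with 64 units each and a single linear output unit
     model = keras.Sequential([
         layers.Dense(64, activation="relu"),
         layers.Dense(64, activation="relu"),
         layers.Dense(1)
     ])
     # MSE as the training loss, MAE as an additional monitored metric
     model.compile(optimizer="rmsprop", loss="mse", metrics=["mae"])
     return model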
 ```

 %% Cell type:markdown id: tags:

 The model ends with a single unit and no activation (it will be a linear layer). This is a
 typical setup for scalar regression (a regression where you’re trying to predict a single
 continuous value). Applying an activation function would constrain the range the output
 can take; for instance, if you applied a sigmoid activation function to the last layer,
 the model could only learn to predict values between 0 and 1. Here, because the last
 layer is purely linear, the model is free to learn to predict values in any range.
 Note that we compile the model with the `mse` loss function (_mean squared error_), the
 mean of the squared differences between the predictions and the targets. This is a widely used
 loss function for regression problems.
 We’re also monitoring a new metric during training: _mean absolute error_ (`MAE`). It’s the
 mean of the absolute values of the differences between the predictions and the targets. For instance, an
 MAE of 0.5 on this problem would mean your predictions are off by 500 USD on average, because the targets are expressed in thousands of dollars.
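
A short worked example may help to keep the two quantities apart. The prices and predictions below are made-up values in thousands of USD, chosen so that the arithmetic is easy to follow by hand.

```python
import numpy as np

y_true = np.array([20.0, 15.0, 30.0, 50.0])  # made-up prices, thousands of USD
y_pred = np.array([22.0, 14.0, 27.0, 50.0])  # made-up predictions

errors = y_pred - y_true
mse = float(np.mean(errors ** 2))     # (4 + 1 + 9 + 0) / 4
mae = float(np.mean(np.abs(errors)))  # (2 + 1 + 3 + 0) / 4

print(mse)
print(mae)
print(mae * 1000)  # MAE converted to USD
```

MSE penalizes the largest error (3) much more strongly than MAE does, which is why MAE is often easier to interpret as a typical error size.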

 %% Cell type:markdown id: tags:

 ### Validating your approach using K-fold validation

 To evaluate our model while we keep adjusting its parameters (such as the number of
 epochs used for training), we could split the data into a training set and a validation set, as we did in the previous examples. But because we have so few data points, the validation set would end up being very small (for instance, about 100 examples). As a consequence, the validation scores might change a lot depending on which data points we chose for validation and which we chose for training: the validation scores might have a high variance with regard to the validation split. This would prevent us from reliably evaluating our model.

 The best practice in such situations is to use $K$-fold cross-validation. It consists of splitting the available data into $K$ partitions (typically $K = 4$ or $5$), instantiating $K$ identical models, and training each one on $K - 1$ partitions while evaluating on the remaining partition. The validation score for the model used is then the average of the $K$ validation scores obtained. In terms of code, this is straightforward.
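
The partitioning itself can be sketched independently of any model. The hypothetical helper `kfold_indices` below yields index arrays for contiguous, unshuffled folds; it uses `np.array_split`, so it would also handle sample counts that are not divisible by $K$. The numbers 404 and 4 match the training set size and fold count used in this exercise.

```python
import numpy as np

def kfold_indices(num_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    folds = np.array_split(np.arange(num_samples), k)
    for i in range(k):
        val_idx = folds[i]
        # All folds except the i-th form the training part
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

for fold, (train_idx, val_idx) in enumerate(kfold_indices(404, 4)):
    print(fold, len(train_idx), len(val_idx))
```

Every sample appears in exactly one validation fold, so each data point contributes once to the averaged validation score.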

 %% Cell type:markdown id: tags:

 #### K-fold validation

 %% Cell type:code id: tags:

 ``` python
 import numpy as np
 from tensorflow import keras
 from tensorflow.keras import layers
 k = 4
 num_val_samples = len(train_data) // k
 num_epochs = 100
 all_scores = []
 for i in range(k):
    print(f"Processing fold #{i}")
    # Prepares the validation data: data from partition #i
    val_data = train_data[i * num_val_samples: (i + 1) * num_val_samples]
    val_targets = train_targets[i * num_val_samples: (i + 1) * num_val_samples]
    # Prepares the training data: data from all other partitions
    partial_train_data = np.concatenate([train_data[:i * num_val_samples], train_data[(i + 1) * num_val_samples:]],axis=0)
    partial_train_targets = np.concatenate([train_targets[:i * num_val_samples], train_targets[(i + 1) * num_val_samples:]],axis=0)
    # Builds the Keras model (already compiled)
    model = build_model()
    # Trains the model (in silent mode, verbose=0)
    history=model.fit(partial_train_data, partial_train_targets, epochs=num_epochs, batch_size=16, verbose=0)
    val_mse, val_mae = model.evaluate(val_data, val_targets, verbose=0)
    all_scores.append(val_mae)
 ```

 %% Output

    Processing fold #0
    Processing fold #1
    Processing fold #2
    Processing fold #3

 %% Cell type:markdown id: tags:

 Running this with `num_epochs = 100` yields the following results:

 %% Cell type:code id: tags:

 ``` python
 all_scores
 ```

 %% Output

    [1.9184445142745972,
     2.4037296772003174,
     2.4944815635681152,
     2.4431681632995605]

 %% Cell type:code id: tags:

 ``` python
 np.mean(all_scores)
 ```

 %% Output

    2.3149559795856476

 %% Cell type:markdown id: tags:

 The different runs do indeed show rather different validation scores, from 1.9 to 2.5.
 The average (2.3) is a much more reliable metric than any single score, which is the
 entire point of K-fold cross-validation. In this case, we’re off by about 2315 USD on average, which is significant considering that most prices lie between 10000 and 50000 USD.
 Let’s try training the model a bit longer: 500 epochs. To keep a record of how well
 the model does at each epoch, we’ll modify the training loop to save the per-epoch
 validation score log for each fold.

 %% Cell type:markdown id: tags:

 #### Saving the validation logs at each fold

 %% Cell type:code id: tags:

 ``` python
 num_epochs = 500
 all_mae_histories = []
 for i in range(k):
    print(f"Processing fold #{i}")
    # Prepares the validation data: data from partition #i
    val_data = train_data[i * num_val_samples: (i + 1) * num_val_samples]
    val_targets = train_targets[i * num_val_samples: (i + 1) * num_val_samples]
    # Prepares the training data: data from all other partitions
    partial_train_data = np.concatenate([train_data[:i * num_val_samples], train_data[(i + 1) * num_val_samples:]],axis=0)
    partial_train_targets = np.concatenate([train_targets[:i * num_val_samples], train_targets[(i + 1) * num_val_samples:]],axis=0)
    # Builds the Keras model (already compiled)
    model = build_model()
    # Trains the model (in silent mode, verbose=0)
    history = model.fit(partial_train_data, partial_train_targets,
    validation_data=(val_data, val_targets), epochs=num_epochs, batch_size=16, verbose=0)
    mae_history = history.history["val_mae"]
    all_mae_histories.append(mae_history)
 ```

 %% Output

    Processing fold #0
    Processing fold #1
    Processing fold #2
    Processing fold #3

 %% Cell type:markdown id: tags:

 We can then compute the average of the per-epoch MAE scores for all folds.

 %% Cell type:markdown id: tags:

 #### Building the history of successive mean K-fold validation scores

 %% Cell type:code id: tags:

 ``` python
 average_mae_history = [
 np.mean([x[i] for x in all_mae_histories]) for i in range(num_epochs)]
 ```

 %% Cell type:markdown id: tags:

 #### Plotting validation scores

 %% Cell type:code id: tags:

 ``` python
 import matplotlib.pyplot as plt
 plt.plot(range(1, len(average_mae_history) + 1), average_mae_history)
 plt.xlabel("Epochs")
 plt.ylabel("Validation MAE")
 plt.show()
 ```

 %% Output



 %% Cell type:markdown id: tags:

 It may be a little difficult to read the plot, due to a scaling issue: the validation MAE
 for the first few epochs is dramatically higher than the values that follow. Let’s omit
 the first 10 data points, which are on a different scale than the rest of the curve.

 %% Cell type:markdown id: tags:

 #### Plotting validation scores, excluding the first 10 data points

 %% Cell type:code id: tags:

 ``` python
 truncated_mae_history = average_mae_history[10:]
 plt.plot(range(1, len(truncated_mae_history) + 1), truncated_mae_history)
 plt.xlabel("Epochs")
 plt.ylabel("Validation MAE")
 plt.show()
 ```

 %% Output



 %% Cell type:markdown id: tags:

 As the plot above shows, validation MAE stops improving significantly after
 120–140 epochs (this number includes the 10 epochs we omitted). Past that point,
 we start overfitting.
 Once you’re finished tuning other parameters of the model (in addition to the
 number of epochs, you could also adjust the size of the intermediate layers), you can
 train a final production model on all of the training data, with the best parameters,
 and then look at its performance on the test data.
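
The epoch with the lowest averaged validation MAE can also be read off numerically instead of from the plot. The sketch below uses made-up histories for three folds and five epochs; with the real data, the same two lines would be applied to the list of per-fold histories collected in the training loop.

```python
import numpy as np

# Made-up per-epoch validation MAE for 3 folds and 5 epochs.
toy_histories = [
    [4.0, 3.0, 2.5, 2.6, 2.8],
    [4.2, 3.1, 2.4, 2.5, 2.9],
    [3.8, 2.9, 2.6, 2.7, 3.0],
]

toy_average = np.mean(toy_histories, axis=0)  # average over folds, per epoch
best_epoch = int(np.argmin(toy_average)) + 1  # +1 because epochs are counted from 1

print(toy_average)
print(best_epoch)
```

On noisy real curves the exact minimum can shift between runs, so it is reasonable to treat the result as a rough range rather than a precise optimum.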

 %% Cell type:markdown id: tags:

 #### Training the final model

 %% Cell type:code id: tags:

 ``` python
 # Gets a fresh, compiled model
 model = build_model()
 # Trains it on the entirety of the training data
 model.fit(train_data, train_targets,
 epochs=130, batch_size=16, verbose=0)
 test_mse_score, test_mae_score = model.evaluate(test_data, test_targets)
 ```

 %% Output

    4/4 [==============================] - 0s 3ms/step - loss: 17.8865 - mae: 2.7644

 %% Cell type:markdown id: tags:

 Here’s the final result:

 %% Cell type:code id: tags:

 ``` python
 test_mae_score
 ```

 %% Output

    2.7643771171569824

 %% Cell type:markdown id: tags:

 On the test set we’re still off by a bit under 2800 USD, which is somewhat higher than the cross-validated MAE of about 2.3. As in the previous exercises, you can try varying the number of layers in the model, or the number of units per layer, to see if you can squeeze out a lower test error.

 %% Cell type:markdown id: tags:

 ### Generating predictions on new data
 When calling `predict()` on our binary classification model, we retrieved a scalar score between 0 and 1 for each input sample. With our multiclass classification model, we retrieved a probability distribution over all classes for each sample. Now, with this scalar regression model, `predict()` returns the model’s guess for the sample’s price in thousands of dollars:

 %% Cell type:code id: tags:

 ``` python
 predictions = model.predict(test_data)
 predictions[0]
 ```

 %% Output

    array([8.708372], dtype=float32)

 %% Cell type:markdown id: tags:

 The first house in the test set is predicted to have a price of about 8700 USD.