diff --git a/README.md b/README.md index 1a226a88e387bc1d179b85e3026d77748b606e86..d2b6e1c6cf1935945aad867800ae221f2e452ae6 100644 --- a/README.md +++ b/README.md @@ -3,36 +3,55 @@ ## Introduction This is a Renku project - basically a git repository with some -bells and whistles. You'll find we have already created some -useful things like `data` and `notebooks` directories and -a `Dockerfile`. +bells and whistles. This README explains how to do the exercises on Renku. -## Working with the project +## Fork the Project -The simplest way to start your project is right from the Renku -platform - just click on the `Environments` tab and start a new session. -This will start an interactive environment right in your browser. +After clicking on the ADML HSLU HS22 project, you will see a "Fork" button on the Overview tab. Clicking it duplicates this repository into your own namespace. -To work with the project anywhere outside the Renku platform, -click the `Settings` tab where you will find the -git repo URLs - use `git` to clone the project on whichever machine you want. + -### Changing interactive environment dependencies +Below you see the prompt you get when clicking the Fork button. You don't need to change anything; just click "Fork Project". (You won't get the title error shown here, because you don't have this repository yet.) -Initially we install a very minimal set of packages to keep the images small. -However, you can add python and conda packages in `requirements.txt` and -`environment.yml` to your heart's content. If you need more fine-grained -control over your environment, please see [the documentation](https://renku.readthedocs.io/en/latest/user/advanced_interfaces.html#dockerfile-modifications). + -## Project configuration +Forking has the advantage that you can save the changes to your notebooks while working on them. When you now go to your projects page, you will see two "ADML HSLU HS22" projects; the one in your namespace is the one to use from now on. 
+**Always use the ADML HSLU HS22 project with your name below** -Project options can be found in `.renku/renku.ini`. In this -project there is currently only one option, which specifies -the default type of environment to open, in this case `/lab` for -JupyterLab. You may also choose `/tree` to get to the "classic" Jupyter -interface. + + +## Starting a Session + +Click on the project, go to the Sessions tab, and click the "New Session" button. You should then see the screen below. + + + +When opening a session for the first time after forking the project, you will encounter the "Docker image not available" error. This is not a problem; just click the option to build the branch image. Renku will now build your Docker image, which can take a few minutes. Afterwards you have a running session which looks like this: + + + +Click the Open button and your session will start. Alternatively, click the three dots and choose to open it in a new tab for better readability. You should now see the folder structure as it is in the repository. Open the `notebooks` folder to see the notebooks, and open them by clicking on them. + + + +To save your work, click the save button from time to time. + + + +Then, in the menu bar on the left, click the third symbol (the git symbol). + + + +Then click the plus next to the "Changed" tab. This adds your changes to the repository. + + + +As a last step, write a short description of what you did, e.g. "Worked on Exercise 08B", and then click the commit button. This way, your changes will be restored when you open a new session. + + + +## Advanced Use + +If you are familiar with git, you can use the project as a normal git repository, open it on GitLab, and even open sessions from different commits. -## Moving forward -Once you feel at home with your project, we recommend that you replace -this README file with your own project documentation! Happy data wrangling! 
diff --git a/img/add_change.png b/img/add_change.png new file mode 100644 index 0000000000000000000000000000000000000000..e9057c098db53d9518d0ac6a416bfbc2867a4a5c Binary files /dev/null and b/img/add_change.png differ diff --git a/img/commit.png b/img/commit.png new file mode 100644 index 0000000000000000000000000000000000000000..52612c61ed52683ac997ff1a56dfa1158a4681ad Binary files /dev/null and b/img/commit.png differ diff --git a/img/folder_structure.png b/img/folder_structure.png new file mode 100644 index 0000000000000000000000000000000000000000..11f3c3841630804f4c87bda14e309e8ef540d33d Binary files /dev/null and b/img/folder_structure.png differ diff --git a/img/fork.png b/img/fork.png new file mode 100644 index 0000000000000000000000000000000000000000..4fcd5e458592fda2aa64cae00252289624b40b93 Binary files /dev/null and b/img/fork.png differ diff --git a/img/fork_prompt.png b/img/fork_prompt.png new file mode 100644 index 0000000000000000000000000000000000000000..d504c4ed7d3736b9a0a3c13d1b9ede331ea20207 Binary files /dev/null and b/img/fork_prompt.png differ diff --git a/img/git_interface.png b/img/git_interface.png new file mode 100644 index 0000000000000000000000000000000000000000..444eb06a17ae4d97afe944995ec4928dd40d402c Binary files /dev/null and b/img/git_interface.png differ diff --git a/img/saving.png b/img/saving.png new file mode 100644 index 0000000000000000000000000000000000000000..4239eb5f0b5db9d63dbd8372f3286882a8a3467c Binary files /dev/null and b/img/saving.png differ diff --git a/img/session_overview.png b/img/session_overview.png new file mode 100644 index 0000000000000000000000000000000000000000..0d581b8bcffa1b494eb25778b15dbf26918bb274 Binary files /dev/null and b/img/session_overview.png differ diff --git a/img/your_project.png b/img/your_project.png new file mode 100644 index 0000000000000000000000000000000000000000..5c89b639a71fc94865ed7a2507ee8113d3dbdb02 Binary files /dev/null and b/img/your_project.png differ diff --git 
a/notebooks/01A Data Quality Assessment/Data Quality Assessment Examples.ipynb b/notebooks/01A Data Quality Assessment/Data Quality Assessment Examples.ipynb deleted file mode 100644 index 5631bfe133cc6911f2ae5c400208347c52b464be..0000000000000000000000000000000000000000 --- a/notebooks/01A Data Quality Assessment/Data Quality Assessment Examples.ipynb +++ /dev/null @@ -1,378 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Data Quality Assessment" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": {}, - "outputs": [ - { - "ename": "ModuleNotFoundError", - "evalue": "No module named 'wordcloud'", - "output_type": "error", - "traceback": [ - "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", - "\u001b[0;31mModuleNotFoundError\u001b[0m Traceback (most recent call last)", - "Input \u001b[0;32mIn [1]\u001b[0m, in \u001b[0;36m<cell line: 7>\u001b[0;34m()\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01msklearn\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mlinear_model\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m LinearRegression\n\u001b[1;32m 6\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01msklearn\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mmetrics\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m mean_squared_error\n\u001b[0;32m----> 7\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01mwordcloud\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m WordCloud\n\u001b[1;32m 8\u001b[0m \u001b[38;5;28;01mimport\u001b[39;00m \u001b[38;5;21;01mwarnings\u001b[39;00m\n\u001b[1;32m 9\u001b[0m warnings\u001b[38;5;241m.\u001b[39mfilterwarnings(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mignore\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n", - "\u001b[0;31mModuleNotFoundError\u001b[0m: No module named 'wordcloud'" - ] - } - ], - "source": [ - "import pandas as pd\n", - 
"import numpy as np\n", - "import seaborn as sns\n", - "import matplotlib.pyplot as plt\n", - "from sklearn.linear_model import LinearRegression\n", - "from sklearn.metrics import mean_squared_error\n", - "from wordcloud import WordCloud\n", - "import warnings\n", - "warnings.filterwarnings(\"ignore\")\n", - "%matplotlib inline" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "df = pd.read_csv(\"cars.csv\")\n", - "df.head()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Skewed Data" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "horsepower = df[\"Horsepower\"]\n", - "horsepower.plot.hist()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "print(\"Mean:\", horsepower.mean())\n", - "print(\"Mode:\", int(horsepower.mode()))\n", - "print(\"\")\n", - "print(\"Mean - Mode = \", horsepower.mean() - int(horsepower.mode()))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Log-Transform Data" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": true - }, - "outputs": [], - "source": [ - "horsepower_log = horsepower.apply(lambda x: np.log(x))\n", - "horsepower_log.plot.hist()\n", - "\n", - "print(\"Mean:\", horsepower_log.mean())\n", - "print(\"Mode:\", int(horsepower_log.mode()))\n", - "print(\"\\nMean - Mode = \", horsepower_log.mean() - int(horsepower_log.mode()))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Boxplots" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Horsepower" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "fig, ax = plt.subplots(figsize=(12,8))\n", - "horsepower.plot.box(ax=ax)" - ] - }, - { - "cell_type": "code", - 
"execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "horsepower.describe()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "q3 = horsepower.describe().loc['75%']\n", - "q1 = horsepower.describe().loc['25%']\n", - "iqr = q3 - q1\n", - "upper_boundary = q3 + 1.5 * iqr\n", - "lower_boundary = q1 - 1.5 * iqr\n", - "\n", - "print(\"Upper boundary:\", upper_boundary, \"Lower boundary:\", lower_boundary)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Year" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "year = df[\"Year\"]\n", - "\n", - "fig, ax = plt.subplots(figsize=(12,8))\n", - "year.plot.box(ax=ax)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "year.describe()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "q3 = year.describe().loc['75%']\n", - "q1 = year.describe().loc['25%']\n", - "iqr = q3 - q1\n", - "upper_boundary = q3 + 1.5 * iqr\n", - "lower_boundary = q1 - 1.5 * iqr\n", - "\n", - "print(\"Upper boundary:\", upper_boundary, \"Lower boundary:\", lower_boundary)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Correlation" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "plt.subplots(figsize=(10, 8))\n", - "sns.heatmap(df.corr(), annot=True, cmap='RdYlGn_r', linewidths=0.5)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Dummy Variable Trap" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "colors = pd.get_dummies(df.Color)\n", - "colors.head()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - 
"source": [ - "plt.subplots(figsize=(10, 8))\n", - "sns.heatmap(colors.corr(), annot=True, cmap='RdYlGn_r', linewidths=0.5)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Avoid Trap" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "colors = pd.get_dummies(df.Color, drop_first=True)\n", - "colors.head()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Numerical Encoding of Text" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### TF-IDF" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from sklearn.feature_extraction.text import TfidfVectorizer" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "corpus = [\n", - " \"The Limmat flows out of the lake.\", \n", - " \"The bears are in the bear pit near the river.\",\n", - " \"The Rhône flows out of Lake Geneva.\",\n", - " ]\n", - "\n", - "vectorizer = TfidfVectorizer()\n", - "vectors= vectorizer.fit_transform(corpus)\n", - "feature_names = vectorizer.get_feature_names()\n", - "dense = vectors.todense().tolist()\n", - "pd.DataFrame(dense, columns=feature_names).transpose()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Word Embeddings" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": true - }, - "outputs": [], - "source": [ - "from gensim.models import KeyedVectors\n", - "\n", - "vectors = KeyedVectors.load(\"../../cc.de.300-distilled.vec\", mmap=\"r\")\n", - "vectors.syn0norm = vectors.syn0" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "vectors[\"Mann\"]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "most_similar = 
vectors.wv.most_similar(positive=[\"Frau\", \"König\"], negative=[\"Mann\"])[0]\n", - "print(\"König - Mann + Frau = \", most_similar)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Profile Report" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import pandas_profiling\n", - "profile = df.profile_report(html={'style':{'full_width':True}})\n", - "# Save report\n", - "profile.to_file(output_file=\"c\")\n", - "profile" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.9.12" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/notebooks/01A Data Quality Assessment/Data Quality Assessment_Solution.ipynb b/notebooks/01A Data Quality Assessment/Data Quality Assessment.ipynb similarity index 100% rename from notebooks/01A Data Quality Assessment/Data Quality Assessment_Solution.ipynb rename to notebooks/01A Data Quality Assessment/Data Quality Assessment.ipynb diff --git a/notebooks/02A ML Fundamentals/Machine Learning Fundamentals - AutoScout.ipynb b/notebooks/02A ML Fundamentals/Machine Learning Fundamentals - AutoScout.ipynb index cfc69f13611b82e36fcc714bcfb71f5d60e9f55a..0513748e7da43eaf641d4c2f306dfbd8536a13fb 100644 --- a/notebooks/02A ML Fundamentals/Machine Learning Fundamentals - AutoScout.ipynb +++ b/notebooks/02A ML Fundamentals/Machine Learning Fundamentals - AutoScout.ipynb @@ -35,7 +35,7 @@ "\n", "import warnings\n", "from sklearn.exceptions import DataConversionWarning\n", - "warnings.filterwarnings(action='ignore', category=DataConversionWarning)\n", + "warnings.filterwarnings(action='ignore')\n", "\n", "%matplotlib 
inline" ] diff --git a/notebooks/07B Support Vector Machines/Support Vector Machine as Demo.ipynb b/notebooks/07B Support Vector Machines/Support Vector Machine as Demo.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..b1255cad192ce113e013859152f8112cc78df698 --- /dev/null +++ b/notebooks/07B Support Vector Machines/Support Vector Machine as Demo.ipynb @@ -0,0 +1,520 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Support Vector Machines" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Support Vector Machines (SVMs) are a powerful supervised learning algorithm used for both **classification** and **regression**. SVMs establish a hyperplane that separates the two classes by maximizing the margin." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%matplotlib inline\n", + "\n", + "import matplotlib.pyplot as plt\n", + "\n", + "from scipy import stats\n", + "\n", + "import random\n", + "import seaborn\n", + "\n", + "import numpy as np\n", + "import pandas as pd\n", + "import pylab as pl\n", + "\n", + "import sklearn\n", + "\n", + "from sklearn.tree import DecisionTreeClassifier\n", + "from sklearn.datasets import make_blobs, make_circles\n", + "from sklearn.model_selection import train_test_split\n", + "\n", + "from sklearn.svm import SVC\n", + "from sklearn.metrics import accuracy_score, f1_score\n", + "\n", + "seaborn.set()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## A Simple Example" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "X, y = make_blobs(n_samples=60, centers=2, random_state=0, cluster_std=0.60)\n", + "plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='spring')\n", + "plt.xlim(-1, 3.5);" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We fit a support vector machine with a 
linear kernel." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "solution2": "shown" + }, + "outputs": [], + "source": [ + "clf = SVC(kernel='linear')\n", + "clf.fit(X, y)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We plot the decision boundary. In the following plot the dashed lines touch the *support vectors*, which are stored in the ``support_vectors_`` attribute of the classifier." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def plot_svc_decision_function(clf, ax=None):\n", + " \"\"\"Plot the decision function for a 2D SVC\"\"\"\n", + " if ax is None:\n", + " ax = plt.gca()\n", + " x = np.linspace(plt.xlim()[0], plt.xlim()[1], 30)\n", + " y = np.linspace(plt.ylim()[0], plt.ylim()[1], 30)\n", + " Y, X = np.meshgrid(y, x)\n", + " P = np.zeros_like(X)\n", + " for i, xi in enumerate(x):\n", + " for j, yj in enumerate(y):\n", + " P[i, j] = clf.decision_function([[xi, yj]])[0]\n", + " # plot the margins\n", + " ax.contour(X, Y, P, colors='k', levels=[-1, 0, 1], alpha=0.5, linestyles=['--', '-', '--'])\n", + " \n", + "plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='spring')\n", + "plot_svc_decision_function(clf)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We additionally highlight the support vectors." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='spring')\n", + "plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1], s=200, facecolors='none',edgecolors=\"black\");\n", + "plot_svc_decision_function(clf)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The dataset above was non-overlapping (or linearly separable), which means we could come up with a hyperplane that separated the dataset perfectly. 
Let us now consider a dataset where no perfect separation is possible. In this case the SVM tries to minimize the datapoints lying on the wrong side of the hyperplane. These datapoints are considered support vectors as well." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "At first, we generate the datapoints of the first class by sampling from a normal distribution with standard deviation 1.3 and mean (2,4)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "num_entries=100\n", + "X=np.zeros((2*num_entries,2))\n", + "\n", + "for i in range(0,num_entries):\n", + " X[i,0]=np.random.normal()*1.3+2\n", + " X[i,1]=np.random.normal()*1.3+4\n", + "y = num_entries*[0]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Next, we sample the data points from the second class with standard deviation 1.0 and mean (1,0). " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "solution2": "shown" + }, + "outputs": [], + "source": [ + "for i in range(num_entries,2*num_entries):\n", + " X[i,0]=np.random.normal()+1\n", + " X[i,1]=np.random.normal()\n", + "y2 = num_entries*[1]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let us combine the class vectors `y` and `y2`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "y.extend(y2)\n", + "\n", + "assert len(X) == len(y)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let us visualize the generated data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='spring')\n", + "_ = plt.xlim(-1, 3.5)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now fit a linear SVM to find the best separating hyperplane." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "solution2": "shown" + }, + "outputs": [], + "source": [ + "clf = SVC(kernel='linear')\n", + "clf.fit(X, y)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let us again visualize the hyperplane and the support vectors." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='spring')\n", + "plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1], s=200, facecolors='none',edgecolors=\"black\");\n", + "plot_svc_decision_function(clf)\n", + "_ = plt.xlim(-1, 3.5)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Support Vector Machine with Kernels\n", + "\n", + "Kernels are useful when the decision boundary is not linear. A Kernel is a similarity measure of two data points after projection to some higher dimensional space. Let us generate a data set that is even less linearly separable than the one before." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "X_circles, y_circles = make_circles(100, factor=.1, noise=.1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Fit a linear SVM to `X_circles` and `y_circles` and visualize it" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "clf = SVC(kernel='linear').fit(X_circles, y_circles)\n", + "\n", + "plt.scatter(X_circles[:, 0], X_circles[:, 1], c=y_circles, s=50, cmap='spring')\n", + "plot_svc_decision_function(clf);" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The **radial basis function (RBF)** kernel can separate this data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "solution2": "shown" + }, + "outputs": [], + "source": [ + "clf = SVC(kernel='rbf')\n", + "clf.fit(X_circles, y_circles)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "plt.scatter(X_circles[:, 0], X_circles[:, 1], c=y_circles, s=50, cmap='spring')\n", + "plot_svc_decision_function(clf)\n", + "plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1], s=200, facecolors='none');" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Skin Disease Dataset\n", + "\n", + "We want to apply the SVM to segment skin diseases. Each row is an image pixel to which 14 different image filters have been applied (feature engineering, columns t0 to t13). The class (target variable) indicates whether the pixel shows healthy skin or a skin disease (labels from medical doctors)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df = pd.read_csv(\"skin_disease.csv\")\n", + "df.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "SVMs are not very fast. 
In order to save time, we only use 10000 entries for training and validation. We also display a histogram of the target variable and observe that the data is extremely imbalanced. This is why we will use the F1 score for performance measurement below. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df = df.sample(10000)\n", + "\n", + "_ = df['class'].hist()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let us split this dataset into a training and a validation set (we do not need a test set here)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "train, valid = train_test_split(df, test_size=0.5)\n", + "\n", + "X_train = train.drop('class', axis=1)\n", + "X_valid = valid.drop('class', axis=1)\n", + "\n", + "y_train = train[\"class\"]\n", + "y_valid = valid[\"class\"]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We train and evaluate an SVM classifier on this dataset, which can take some minutes. 
\n", + "We compare an SVM with the `rbf` kernel and a `gamma` value of 0.1, an SVM with a linear kernel, and a decision tree.\n", + "We measure the F1 score on the validation set" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "solution2": "hidden", + "solution2_first": true + }, + "outputs": [], + "source": [ + "classifiers = {\n", + " 'SVM with RBF kernel' : SVC(kernel='rbf', gamma=0.1),\n", + " 'SVM with linear kernel' : SVC(kernel='linear'),\n", + " 'Decision Tree' : DecisionTreeClassifier(max_depth=5)\n", + "}\n", + "\n", + "\n", + "for name, model in classifiers.items():\n", + "\n", + " model.fit(X_train, y_train)\n", + " y_pred = model.predict(X_valid)\n", + " f1 = f1_score(y_valid, y_pred)\n", + " \n", + " print (\"F1 score of {} is {:.3f}\".format(name, f1))\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Playground for Exercises" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "data = [[0,0,-1], [3,0,-1], [0,2,1], [2,3,1]]\n", + "df = pd.DataFrame(data, columns=['x', 'y', 'label'])\n", + "\n", + "_ = plt.scatter(df['x'], df['y'], c=df['label'], s=50, cmap='rainbow')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "clf = SVC(kernel='linear').fit(df[['x', 'y']].values, df['label'])\n", + "\n", + "plt.scatter(df['x'], df['y'], c=df['label'], s=50, cmap='rainbow')\n", + "plt.quiver([0], [1], [0], [1], angles='xy', scale_units='xy', scale=1)\n", + "plot_svc_decision_function(clf);" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "x = np.linspace(-5,5,100)\n", + "\n", + "plt.scatter(df['x'], df['y'], c=df['label'], s=50, cmap='rainbow')\n", + "plt.plot(x, 1/3 * x + 1, '-r', label='y = 1/3 * x + 1')\n", + "plt.plot(x, 0*x + 1, '-b', label='y = 1')\n", + "plt.quiver([0,0], [1,1], [0,1/3], [1,-1], angles='xy', scale_units='xy', scale=1)\n", + "\n", + 
"plt.axis((-5,5,-5,5))\n", + "\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "solution2": "hidden" + }, + "outputs": [], + "source": [ + "y_pred = clf.predict(X_test)\n", + "accuracy = accuracy_score(y_test, y_pred)\n", + "f1 = f1_score(y_test, y_pred)\n", + "\n", + "print (\"f1 SVM:\", f1)\n", + "print (\"accuracy SVM:\", accuracy)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "jupytext": { + "text_representation": { + "extension": ".py", + "format_name": "percent", + "format_version": "1.2", + "jupytext_version": "0.8.6" + } + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.12" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/notebooks/07B Support Vector Machines/Support Vector Machines as Demo.ipynb b/notebooks/07B Support Vector Machines/Support Vector Machines as Demo.ipynb deleted file mode 100644 index c2394b839deed0ce2449213a573f220bc2847718..0000000000000000000000000000000000000000 --- a/notebooks/07B Support Vector Machines/Support Vector Machines as Demo.ipynb +++ /dev/null @@ -1,756 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Support Vector Machines" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Support Vector Machines (SVMs) are a powerful supervised learning algorithm used for **classification** or for **regression**. SVMs establish a hyperplane that separates the dataset by maximizing the margin." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%matplotlib inline\n", - "import matplotlib.pyplot as plt\n", - "from mpl_toolkits import mplot3d\n", - "\n", - "import numpy as np\n", - "import seaborn; \n", - "\n", - "import sklearn\n", - "from sklearn.linear_model import LinearRegression\n", - "from sklearn.linear_model import LogisticRegression\n", - "from sklearn.datasets import make_blobs\n", - "from sklearn.datasets import make_circles\n", - "from sklearn.model_selection import train_test_split\n", - "from sklearn.svm import SVC\n", - "from sklearn.metrics import accuracy_score\n", - "from sklearn.metrics import f1_score\n", - "\n", - "import warnings\n", - "warnings.filterwarnings(\"ignore\", category=FutureWarning)\n", - "\n", - "from scipy import stats\n", - "import pylab as pl\n", - "import random\n", - "import pandas as pd\n", - "#from IPython.html.widgets.interaction import interact\n", - "import ipywidgets as widgets\n", - "seaborn.set()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Before starting with the exercises, watch the video from Josh Starmer on youtube." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from IPython.display import YouTubeVideo\n", - "YouTubeVideo('efR1C6CvhmE')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Part 1 - Simple Example" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "X, y = make_blobs(n_samples=60, centers=2,\n", - " random_state=0, cluster_std=0.60)\n", - "xfit = np.linspace(-1, 3.5)\n", - "plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='spring')\n", - "plt.xlim(-1, 3.5);" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "> Now fit the model by using the [SVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) class from scikit-learn. Use a `linear` kernel." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "solution2": "hidden", - "solution2_first": true - }, - "outputs": [], - "source": [ - "# START YOUR CODE\n", - "# clf = ...\n", - "# END YOUR CODE" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "*Click on the dots to display the solution*" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "jupyter": { - "source_hidden": true - }, - "solution2": "hidden", - "tags": [] - }, - "outputs": [], - "source": [ - "clf = SVC(kernel='linear')\n", - "clf.fit(X, y)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We plot de decision boundary. In the following plot the dashed lines touch a couple of the points known as *support vectors*, which are stored in the ``support_vectors_`` attribute of the classifier." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "def plot_svc_decision_function(clf, ax=None):\n", - " \"\"\"Plot the decision function for a 2D SVC\"\"\"\n", - " if ax is None:\n", - " ax = plt.gca()\n", - " x = np.linspace(plt.xlim()[0], plt.xlim()[1], 30)\n", - " y = np.linspace(plt.ylim()[0], plt.ylim()[1], 30)\n", - " Y, X = np.meshgrid(y, x)\n", - " P = np.zeros_like(X)\n", - " for i, xi in enumerate(x):\n", - " for j, yj in enumerate(y):\n", - " P[i, j] = clf.decision_function([[xi, yj]])\n", - " # plot the margins\n", - " ax.contour(X, Y, P, colors='k',\n", - " levels=[-1, 0, 1], alpha=0.5,\n", - " linestyles=['--', '-', '--'])\n", - " \n", - "plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='spring')\n", - "print(clf.decision_function([[2,2],[1,3]]))\n", - "plot_svc_decision_function(clf)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now we want to indicate the support vectors by sourrounding circles." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='spring')\n", - "plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],\n", - " s=200, facecolors='none',edgecolors=\"black\");\n", - "plot_svc_decision_function(clf)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The dataset above was non-overlapping, which means we could come up with a hyperplane that separated the dataset perfectly. Let us now consider a dataset where no perfect separation is possible. In this case the SVM tries to minimize the datapoints lying on the wrong side of the hyperplane. These datapoints are considered support vectors as well." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "At first, we generate the datapoints of the first class by sampling from a normal distribution with standard deviation 1.3 and mean (2,4)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "num_entries=100\n", - "X=np.zeros((2*num_entries,2))\n", - "for i in range(0,num_entries):\n", - " X[i,0]=np.random.normal()*1.3+2\n", - " X[i,1]=np.random.normal()*1.3+4\n", - "y = num_entries*[0]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "> Now we sample the data points from the second class with standard deviation 1.0 and mean (1,0). " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "solution2": "hidden", - "solution2_first": true - }, - "outputs": [], - "source": [ - "# START YOUR CODE\n", - "# y2 = \n", - "# END YOUR CODE" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "*Click on the dots to display the solution*" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "jupyter": { - "source_hidden": true - }, - "solution2": "hidden", - "tags": [] - }, - "outputs": [], - "source": [ - "for i in range(num_entries,2*num_entries):\n", - " X[i,0]=np.random.normal()+1\n", - " X[i,1]=np.random.normal()\n", - "y2 = num_entries*[1]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Let us combine the class vectors `y` and `y2`" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "y.extend(y2)\n", - "\n", - "print (\"len X: \",len(X))\n", - "print (\"len y: \",len(y))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Let us visualize the generated data samples. 
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='spring')\n", - "plt.xlim(-1, 3.5)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "> Now we train a linear SVM to find the best separating hyperplane." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "solution2": "hidden", - "solution2_first": true - }, - "outputs": [], - "source": [ - "# START YOUR CODE\n", - "# clf = ...\n", - "# END YOUR CODE" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "*Click on the dots to display the solution*" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "jupyter": { - "source_hidden": true - }, - "solution2": "hidden", - "tags": [] - }, - "outputs": [], - "source": [ - "clf = SVC(kernel='linear')\n", - "clf.fit(X, y)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Let us again visualize the hyperplane and the support vectors." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='spring')\n", - "plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],\n", - " s=200, facecolors='none',edgecolors=\"black\");\n", - "plot_svc_decision_function(clf)\n", - "plt.xlim(-1, 3.5)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Part 2 - Support Vector Machine with Kernels Classifier\n", - "\n", - "Kernels are useful when the decision boundary is not linear. A Kernel is some functional transformation of the input data. SVMs have clever tricks to ensure kernel calculations are efficient. In the example below, a linear boundary is not useful in separating the groups of points." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "X_circles, y_circles = make_circles(100, factor=.1, noise=.1)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "> Create a linear SVM and fit it to X and y" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "solution2": "hidden", - "solution2_first": true - }, - "outputs": [], - "source": [ - "# START YOUR CODE\n", - "# clf = \n", - "# END YOUR CODE" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "*Click on the dots to display the solution*" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "jupyter": { - "source_hidden": true - }, - "solution2": "hidden", - "tags": [] - }, - "outputs": [], - "source": [ - "clf = SVC(kernel='linear').fit(X_circles, y_circles)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "plt.scatter(X_circles[:, 0], X_circles[:, 1], c=y_circles, s=50, cmap='spring')\n", - "plot_svc_decision_function(clf);" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "A simple model that could be useful is a **radial basis function (rbf)**:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "r = np.exp(-(X_circles[:, 0] ** 2 + X_circles[:, 1] ** 2))\n", - "\n", - "@widgets.interact(elev=[-90, 90], azip=(-180, 180))\n", - "def plot_3D(elev=30, azim=30):\n", - " #fig = plt.figure(figsize=(12,12))\n", - " #ax = fig.add_subplot(1, 1, 1, projection='3d')\n", - " ax = plt.subplot(projection='3d')\n", - " ax.scatter3D(X_circles[:, 0], X_circles[:, 1], r, c=y_circles, s=50, cmap='spring')\n", - " ax.view_init(elev=elev, azim=azim)\n", - " ax.set_xlabel('x')\n", - " ax.set_ylabel('y')\n", - " ax.set_zlabel('r')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "In 
three dimensions, there is a clear separation between the data. \n", - "> Run the SVM with the `rbf` kernel." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "solution2": "hidden", - "solution2_first": true - }, - "outputs": [], - "source": [ - "# START YOUR CODE\n", - "# clf = ... # create an SVM with kernel=\"rbf\" (abbreviation for Radial Basis Function)\n", - "# END YOUR CODE" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "*Click on the dots to display the solution*" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "jupyter": { - "source_hidden": true - }, - "solution2": "hidden", - "tags": [] - }, - "outputs": [], - "source": [ - "clf = SVC(kernel='rbf')\n", - "clf.fit(X_circles, y_circles)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "plt.scatter(X_circles[:, 0], X_circles[:, 1], c=y_circles, s=50, cmap='spring')\n", - "plot_svc_decision_function(clf)\n", - "plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],\n", - " s=200, facecolors='none');" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Part 3 - Skin disease dataset\n", - "Now we want to apply the SVM on our skin disease data. Load this dataset using pandas." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "df = pd.read_csv(\"skin_disease.csv\")\n", - "df.head()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "In order to save time, we only use 100000 entries for training / testing" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "df = df.sample(frac=1) # shuffling the data\n", - "df = df.iloc[0:100000]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Let us split this dataset into training and test set" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "train, test = train_test_split(df, test_size=0.5)\n", - "X_train = train.drop('class', axis=1)\n", - "X_test = test.drop('class', axis=1)\n", - "y_train = train[\"class\"]\n", - "y_test = test[\"class\"]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "> Now train an SVM classifier on this dataset, which can take some minutes. Use the `rbf` kernel and a `gamma` value of 0.1." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "solution2": "hidden", - "solution2_first": true - }, - "outputs": [], - "source": [ - "# START YOUR CODE\n", - "\n", - "# END YOUR CODE" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "*Click on the dots to display the solution*" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "jupyter": { - "source_hidden": true - }, - "solution2": "hidden", - "tags": [] - }, - "outputs": [], - "source": [ - "clf = SVC(kernel='rbf', gamma=0.1)\n", - "clf.fit(X_train, y_train)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "> Next, determine f-score and accuracy on the testset." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "solution2": "shown", - "solution2_first": true - }, - "outputs": [], - "source": [ - "# START YOUR CODE\n", - "# print (\"f1 SVM:\", f1)\n", - "# print (\"accuracy SVM:\", accuracy)\n", - "# END YOUR CODE" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "tags": [] - }, - "source": [ - "*Click on the dots to display the solution*" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "jupyter": { - "source_hidden": true - }, - "solution2": "shown", - "tags": [] - }, - "outputs": [], - "source": [ - "y_pred = clf.predict(X_test)\n", - "accuracy = accuracy_score(y_test, y_pred)\n", - "f1 = f1_score(y_test, y_pred)\n", - "\n", - "print (\"f1 SVM:\", f1)\n", - "print (\"accuracy SVM:\", accuracy)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "> and compare the result with a logistic regression model. Use the class [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) from scikit-learn." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "solution2": "hidden", - "solution2_first": true - }, - "outputs": [], - "source": [ - "# START YOUR CODE\n", - "# print (\"accuracy logistic regression: \", accuracy)\n", - "# print (\"f1 logistic regression: \", f1)\n", - "# END YOUR CODE" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "*Click on the dots to display the solution*" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "jupyter": { - "source_hidden": true - }, - "solution2": "hidden", - "tags": [] - }, - "outputs": [], - "source": [ - "logReg = LogisticRegression()\n", - "logReg.fit(X_train, y_train)\n", - "\n", - "y_pred = logReg.predict(X_test)\n", - "accuracy = accuracy_score(y_test, y_pred)\n", - "f1 = f1_score(y_test, y_pred)\n", - "\n", - "print (\"accuracy logistic regression: \", accuracy)\n", - "print (\"f score logistic regression: \", f1)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Final remarks:\n", - "* When using an SVM you need to choose the right values for parameters such as `C` and `gamma`. Model validation can help to determine these optimal values by trial and error.\n", - "* SVMs run in $O(n^3)$ performance. [LinearSVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html) is scalable, [SVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) does not seem to be scalable. For large data sets try transforming the data to a smaller space and use LinearSVC with rbf." 
- ] - } - ], - "metadata": { - "jupytext": { - "text_representation": { - "extension": ".py", - "format_name": "percent", - "format_version": "1.2", - "jupytext_version": "0.8.6" - } - }, - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.9.12" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/notebooks/09A Convolutional Neural Network/convolutional_neural_network.ipynb b/notebooks/09A Convolutional Neural Network/convolutional_neural_network.ipynb index f5af201860f1b0e089f59eab43a5af61a33b3981..485171886a52e98adf5a99674bee37b7324ea22c 100644 --- a/notebooks/09A Convolutional Neural Network/convolutional_neural_network.ipynb +++ b/notebooks/09A Convolutional Neural Network/convolutional_neural_network.ipynb @@ -50,7 +50,7 @@ "from tensorflow.keras.optimizers import SGD, Adam\n", "\n", "import warnings\n", - "warnings.filterwarnings(\"ignore\", category=FutureWarning)" + "warnings.filterwarnings(\"ignore\")" ] }, { @@ -321,7 +321,7 @@ "# model.add(...)\n", "\n", "# Define stochastic gradient descent with a defined learning rate\n", - "opt = Adam(lr=initial_lr)\n", + "opt = Adam(learning_rate=initial_lr)\n", "\n", "# Compile the model, using the optimizer, cross-entropy loss and accuracy as metric\n", "# model.compile(...)\n", diff --git a/notebooks/09B Transfer Learning/transfer_learning.ipynb b/notebooks/09B Transfer Learning/transfer_learning.ipynb index 02c1c80cddb4570f5c6bb34f96aaa74ff5930984..ddaef6607d72b19b602260812eece3124c685f12 100644 --- a/notebooks/09B Transfer Learning/transfer_learning.ipynb +++ b/notebooks/09B Transfer Learning/transfer_learning.ipynb @@ -47,7 +47,7 @@ "from tensorflow.keras.optimizers import SGD, Adam\n", "\n", "import 
warnings\n", - "warnings.filterwarnings(\"ignore\", category=FutureWarning)" + "warnings.filterwarnings(\"ignore\")" ] }, { diff --git a/notebooks/11A Clustering/Clustering Examples.ipynb b/notebooks/11A Clustering/Clustering Examples.ipynb deleted file mode 100644 index 105c350e4e7fb77d84462cdaabc67c7bcd5a42ac..0000000000000000000000000000000000000000 --- a/notebooks/11A Clustering/Clustering Examples.ipynb +++ /dev/null @@ -1,815 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Examples for lecture Clustering" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": {}, - "outputs": [], - "source": [ - "import numpy as np\n", - "import matplotlib.pyplot as plt\n", - "import numpy as np\n", - "import seaborn as sns; sns.set()\n", - "from sklearn.datasets import make_blobs\n", - "from sklearn.cluster import KMeans\n", - "from tqdm.notebook import tqdm\n", - "import pandas as pd\n", - "from ipywidgets import interact\n", - "\n", - "from mlxtend.preprocessing import TransactionEncoder\n", - "from mlxtend.frequent_patterns import apriori, association_rules\n", - "\n", - "import warnings\n", - "warnings.filterwarnings(\"ignore\", category=FutureWarning)\n", - "\n", - "%matplotlib inline" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# k-Means Clustering" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## k-Means from Scratch" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Generate Data" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [], - "source": [ - "X, _ = make_blobs(n_samples=20, centers=3, n_features=2, random_state=0)" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "<matplotlib.collections.PathCollection at 0x7f1ebcb4a130>" - ] - }, - "execution_count": 3, - "metadata": {}, - "output_type": 
"execute_result" - }, - { - "data": { - "image/png": "\n", - "text/plain": [ - "<Figure size 864x504 with 1 Axes>" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], - "source": [ - "fig, ax = plt.subplots(figsize=(12,7))\n", - "ax.scatter(X[:, 0], X[:, 1])" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Randomly choose k cluster centers" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": {}, - "outputs": [], - "source": [ - "def initialize_centroids(X, k):\n", - " centroids = X.copy()\n", - " np.random.shuffle(centroids)\n", - " return centroids[:k]" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "<matplotlib.collections.PathCollection at 0x7f1ed8a2e580>" - ] - }, - "execution_count": 5, - "metadata": {}, - "output_type": "execute_result" - }, - { - "data": { - "image/png": "\n", - "text/plain": [ - "<Figure size 864x504 with 1 Axes>" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], - "source": [ - "centroids = initialize_centroids(X, k=3)\n", - "\n", - "fig, ax = plt.subplots(figsize=(12,7))\n", - "ax.scatter(X[:, 0], X[:, 1])\n", - "ax.scatter(centroids[:, 0], centroids[:, 1], s=300, marker=\"*\", c=\"g\", edgecolor=\"k\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Assign each data point to its nearest cluster center" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "metadata": {}, - "outputs": [], - "source": [ - "def closest_centroid(X, centroids):\n", - " distances = np.sqrt(((X - centroids[:, np.newaxis])**2).sum(axis=2))\n", - " return np.argmin(distances, axis=0)" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "array([2, 0, 1, 1, 1, 0, 2, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1])" - ] - }, - 
"execution_count": 7, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "closest = closest_centroid(X, centroids)\n", - "closest" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "Text(0.5, 1.0, 'Initial')" - ] - }, - "execution_count": 8, - "metadata": {}, - "output_type": "execute_result" - }, - { - "data": { - "image/png": "\n", - "text/plain": [ - "<Figure size 1080x504 with 1 Axes>" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], - "source": [ - "fig, ax = plt.subplots(figsize=(15,7))\n", - "ax.scatter(X[:, 0], X[:, 1])\n", - "ax.scatter(centroids[:, 0], centroids[:, 1], s=300, marker=\"*\", c=\"g\", edgecolor=\"k\")\n", - "ax.set_title(\"Initial\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Update each cluster center to the mean of all assigned data points" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": {}, - "outputs": [], - "source": [ - "def move_centroids(X, closest, centroids):\n", - " return np.array([X[closest==k].mean(axis=0) for k in range(centroids.shape[0])])" - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "array([[-1.18890637, 2.86477332],\n", - " [ 2.28681654, 2.14842427],\n", - " [-1.05643028, 5.31335925]])" - ] - }, - "execution_count": 10, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "move_centroids(X, closest, centroids)" - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/opt/conda/lib/python3.9/site-packages/ipykernel/pylab/backend_inline.py:10: DeprecationWarning: `ipykernel.pylab.backend_inline` is deprecated, directly use `matplotlib_inline.backend_inline`\n", - " warnings.warn(\n" - ] - }, - { - 
"data": { - "application/vnd.jupyter.widget-view+json": { - "model_id": "7b2371eca72648f096ab2f8395d31a24", - "version_major": 2, - "version_minor": 0 - }, - "text/plain": [ - "interactive(children=(IntSlider(value=0, description='iteration', max=11), Output()), _dom_classes=('widget-in…" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "num_iterations = 10\n", - "centroids_hist = [centroids]\n", - "centroids_moved = centroids\n", - "\n", - "for i in range(num_iterations):\n", - " closest = closest_centroid(X, centroids_moved)\n", - " centroids_moved = move_centroids(X, closest, centroids_moved)\n", - " centroids_hist.append(centroids_moved)\n", - "\n", - "@interact(iteration=(0, len(centroids_hist)))\n", - "def k_means_centroids(iteration=0):\n", - " fig, ax = plt.subplots(figsize=(15,7))\n", - " centroids = centroids_hist[iteration]\n", - " ax.scatter(X[:, 0], X[:, 1])\n", - " ax.scatter(centroids[:, 0], centroids[:, 1], s=300, marker=\"*\", c=\"g\", edgecolor=\"k\")\n", - " ax.set_title(\"Iteration {}\".format(iteration))\n", - " plt.show()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Implement the algorithm" - ] - }, - { - "attachments": { - "image.png": { - "image/png": "" - } - }, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "" - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "metadata": {}, - "outputs": [], - "source": [ - "def fit(X, k):\n", - " centroids = initialize_centroids(X, k)\n", - " closest_before = closest_centroid(X, centroids)\n", - "\n", - " while True:\n", - " closest = closest_centroid(X, centroids)\n", - " centroids = move_centroids(X, closest, centroids)\n", - " \n", - " if np.all(closest == closest_before):\n", - " break\n", - " closest_before = closest\n", - " return closest, centroids " - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - 
"<matplotlib.collections.PathCollection at 0x7f1ed88a48b0>" - ] - }, - "execution_count": 13, - "metadata": {}, - "output_type": "execute_result" - }, - { - "data": { - "image/png": "\n", - "text/plain": [ - "<Figure size 1080x504 with 1 Axes>" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], - "source": [ - "closest, centroids = fit(X, k=3)\n", - "\n", - "fig, ax = plt.subplots(figsize=(15,7))\n", - "ax.scatter(X[:, 0], X[:, 1])\n", - "ax.scatter(centroids[:, 0], centroids[:, 1], s=300, marker=\"*\", c=\"g\", edgecolor=\"k\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Scikit-Learn" - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "metadata": {}, - "outputs": [], - "source": [ - "model = KMeans(n_clusters=3, random_state=0).fit(X)" - ] - }, - { - "cell_type": "code", - "execution_count": 15, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "array([2, 0, 2, 1, 1, 0, 0, 1, 2, 2, 1, 2, 0, 1, 2, 0, 2, 0, 1, 2],\n", - " dtype=int32)" - ] - }, - "execution_count": 15, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "model.labels_" - ] - }, - { - "cell_type": "code", - "execution_count": 16, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "<matplotlib.collections.PathCollection at 0x7f1ed884a190>" - ] - }, - "execution_count": 16, - "metadata": {}, - "output_type": "execute_result" - }, - { - "data": { - "image/png": "\n", - "text/plain": [ - "<Figure size 1080x504 with 1 Axes>" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], - "source": [ - "centroids = model.cluster_centers_\n", - "\n", - "fig, ax = plt.subplots(figsize=(15,7))\n", - "ax.scatter(X[:, 0], X[:, 1])\n", - "ax.scatter(centroids[:, 0], centroids[:, 1], s=300, marker=\"*\", c=\"g\", edgecolor=\"k\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 
Distortion" - ] - }, - { - "attachments": { - "image.png": { - "image/png": "" - } - }, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "" - ] - }, - { - "cell_type": "code", - "execution_count": 17, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "33.94471513755734" - ] - }, - "execution_count": 17, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "model.inertia_" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Average Distortion" - ] - }, - { - "attachments": { - "image.png": { - "image/png": "
yRmCvA+R2Gd7WdaAI5ArFU5s4NF8H8nC+Hh63PpBuv43BXnWgCOR6syc1T42F6oT15O91YNwuQ9uLDhSBXKveqY38enQWnucxlT25QeoFpD33owjkmo+zjTXWWT0/13yd6zAMsWudZ/CtA2V1YDHkzLeZa+e+toEH59EZ1wI6XpxjC7OsMD2eHs+YDiyGnEKBNje/ZmGOjf6E7eyGc5huZYwpo/PW0YsikG8pHHbP5QxMqbZhqIhISEuV6XLWUWqP69VxbRpyPZpb60UR7c0n+mA6ktsHYMW6qlgej/2MR9OQo0hsd9WCXmnFEuQsFGpH31p1lW67y9sPZFvLonnIaw2gITc0tXStdD2GfOSnqQy5IS8NX63ymoSc1XmgY8W+xocjEIYhN+S1oCxdT3OQs9jGQhir3GyNHe6XVz5QTv3p3fjYIBtyQx7TixbymoOcx1jaWcd35LTJRoPNx/Smwq3rc/8B1ZAbculYa2mTkGuQ2U5b6/10Q27IpXetpc1BrgHGm/Pcesn34lTWmNSQG/IxerLHa5qFnNdaa/5HDUNuyPcI8Jg2NQk5q+vMxfUmGxtUtB8eD08IP+eXeh+egTTkhnwMUHu8pknIWXwjVNcCHLALclbfWSWf81N5MUEZckMe04sW8pqEnEdnhOoMMGDmPHApIfDfYDAsfAWnVJkux4ajhg40CTlem5dS9PrqWu+mY0Dw4EQKYfhPHr8aAnIdNgRLdaBJyJd22vcbnHPSAUM+cu/6OSmF+9qXETTkhtzTjs51wJB3LmB75b688hx5GnJDbk/euQ4Y8s4FPMfy+56+vL8hN+T25J3rgCHvXMD2yn155TnyNOSG3J68cx34D6fTBQTcrFQAAAAAAElFTkSuQmCC" - } - }, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "" - ] - }, - { - "cell_type": "code", - "execution_count": 18, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "1.6972357568778669" - ] - }, - "execution_count": 18, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "model.inertia_ / len(X)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Convergence" - ] - }, - { - "cell_type": "code", - "execution_count": 19, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "<matplotlib.collections.PathCollection at 0x7f1ebca93640>" - ] - }, - "execution_count": 19, - "metadata": {}, - "output_type": "execute_result" - }, - { - "data": { - "image/png": 
"[base64-encoded PNG image data omitted]\n", - "text/plain": [ - "<Figure size 864x504 with 1 Axes>" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], - "source": [ - "X = np.array([\n", - " [2, 2],\n", - " [2, 8],\n", - " [10, 8],\n", - " [10, 2]\n", - "])\n", - "\n", - "fig, ax = plt.subplots(figsize=(12,7))\n", - "ax.set_xlim(0, 12)\n", - "ax.set_ylim(0, 10)\n", - "ax.scatter(X[:, 0], X[:, 1])" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Our implementation" - ] - }, - { - "cell_type": "code", - "execution_count": 20, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "<matplotlib.collections.PathCollection at 0x7f1ebc9a7670>" - ] - }, - "execution_count": 20, - "metadata": {}, - "output_type": "execute_result" - }, - { - "data": { - "image/png": "\n", - "text/plain": [ - "<Figure size 1080x504 with 1 Axes>" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], - "source": [ - "closest, centroids = fit(X, k=2)\n", - "\n", - "fig, ax = plt.subplots(figsize=(15,7))\n", - "ax.scatter(X[:, 0], X[:, 1])\n", - "ax.scatter(centroids[:, 0], centroids[:, 1], s=300, marker=\"*\", c=\"g\", edgecolor=\"k\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Scikit-Learn" - ] - }, - { - "cell_type": "code", - "execution_count": 21, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": 
[ - "<matplotlib.collections.PathCollection at 0x7f1ebc998400>" - ] - }, - "execution_count": 21, - "metadata": {}, - "output_type": "execute_result" - }, - { - "data": { - "image/png": "\n", - "text/plain": [ - "<Figure size 1080x504 with 1 Axes>" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], - "source": [ - "model = KMeans(n_clusters=2, random_state=0, n_init=10, init=\"k-means++\").fit(X)\n", - "centroids = model.cluster_centers_\n", - "\n", - "fig, ax = plt.subplots(figsize=(15,7))\n", - "ax.scatter(X[:, 0], X[:, 1])\n", - "ax.scatter(centroids[:, 0], centroids[:, 1], s=300, marker=\"*\", c=\"g\", edgecolor=\"k\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Number of Clusters" - ] - }, - { - "cell_type": "code", - "execution_count": 22, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "<matplotlib.collections.PathCollection at 0x7f1ebc9db2b0>" - ] - }, - "execution_count": 22, - "metadata": {}, - "output_type": "execute_result" - }, - { - "data": { - "image/png": "\n", - "text/plain": [ - "<Figure size 864x504 with 1 Axes>" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], - "source": [ - "X, _ = make_blobs(n_samples=100, centers=4, n_features=2, random_state=0)\n", - "fig, ax = plt.subplots(figsize=(12,7))\n", - "ax.scatter(X[:, 0], X[:, 1])" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Elbow Plot" - ] - }, - { - "cell_type": "code", - "execution_count": 23, - "metadata": {}, - "outputs": [ - { - "data": { - "application/json": { - "ascii": false, - "bar_format": null, - "colour": null, - "elapsed": 0.19910264015197754, - "initial": 0, - "n": 0, - "ncols": null, - "nrows": null, - "postfix": null, - "prefix": "", - "rate": null, - "total": 9, - "unit": "it", - "unit_divisor": 1000, - "unit_scale": false - }, - "application/vnd.jupyter.widget-view+json": { - 
"model_id": "ef68a6d987de4bb3a56051c1363d3767", - "version_major": 2, - "version_minor": 0 - }, - "text/plain": [ - " 0%| | 0/9 [00:00<?, ?it/s]" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/plain": [ - "<AxesSubplot:xlabel='x'>" - ] - }, - "execution_count": 23, - "metadata": {}, - "output_type": "execute_result" - }, - { - "data": { - "image/png": "\n", - "text/plain": [ - "<Figure size 576x432 with 1 Axes>" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], - "source": [ - "distortions = []\n", - "k_values = range(1,10)\n", - "for k in tqdm(k_values):\n", - " model = KMeans(n_clusters=k, random_state=0).fit(X)\n", - " distortions.append(model.inertia_)\n", - " \n", - "pd.DataFrame(dict(x=k_values, y=distortions)).plot(x=\"x\", y=\"y\", xticks=k_values, grid=True, figsize=(8,6))" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.9.12" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/notebooks/11A Clustering/Clustering.ipynb b/notebooks/11A Clustering/Clustering.ipynb index 1460ec9dad8053d61552d2c8728f8049acc47bb0..b327d1df7a44bedd69f0ba19c144c059bce67495 100644 --- a/notebooks/11A Clustering/Clustering.ipynb +++ b/notebooks/11A Clustering/Clustering.ipynb @@ -31,7 +31,7 @@ "\n", "from tqdm.notebook import tqdm\n", "import warnings\n", - "warnings.filterwarnings(\"ignore\", category=FutureWarning)\n", + "warnings.filterwarnings(\"ignore\")\n", "\n", "%matplotlib inline" ] @@ -211,9 +211,6 @@ { "cell_type": "markdown", "metadata": { - "jupyter": { - "source_hidden": true - }, "solution2": "hidden", "tags": [] }, diff --git 
a/notebooks/12A Association Rules/Association Rules Examples.ipynb b/notebooks/12A Association Rules/Association Rules Examples.ipynb deleted file mode 100644 index 59d06fd321364a6adf1755a4c6fe71ad75a797d1..0000000000000000000000000000000000000000 --- a/notebooks/12A Association Rules/Association Rules Examples.ipynb +++ /dev/null @@ -1,152 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Examples for lecture Association Rules" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import numpy as np\n", - "import matplotlib.pyplot as plt\n", - "import numpy as np\n", - "import seaborn as sns; sns.set()\n", - "from tqdm.notebook import tqdm\n", - "import pandas as pd\n", - "from ipywidgets import interact\n", - "\n", - "from mlxtend.preprocessing import TransactionEncoder\n", - "from mlxtend.frequent_patterns import apriori, association_rules\n", - "import warnings\n", - "warnings.filterwarnings(\"ignore\", category=FutureWarning)\n", - "%matplotlib inline" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Association Rules" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "transactions = [\n", - " [\"burger\", \"salad\", \"coke\", \"ice cream\"],\n", - " [\"burger\", \"salad\", \"coke\", \"ice cream\"],\n", - " [\"burger\", \"fries\", \"coke\", \"pie\"],\n", - " [\"burger\", \"salad\", \"coke\", \"choc bar\"],\n", - " [\"burger\", \"salad\", \"coke\", \"muffin\"],\n", - " [\"sandwich\", \"fries\", \"fanta\", \"pie\"],\n", - " [\"sandwich\", \"fries\", \"coke\", \"pie\"],\n", - " [\"sandwich\", \"onion rings\", \"water\", \"muffin\"]\n", - "]\n", - "\n", - " \n", - "transactions = pd.DataFrame(data={\"Items\":transactions}, index=range(1,9))\n", - "transactions.index.name = 'Id'\n", - "\n", - "with pd.option_context('display.max_colwidth', 80):\n", - " print(transactions)" 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Apriori Algorithm" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 1. Encode ransactions" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "te = TransactionEncoder()\n", - "te_ary = te.fit_transform(transactions.Items.values.tolist())\n", - "df = pd.DataFrame(te_ary, columns=te.columns_)\n", - "df" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 2. Extract frequent item sets" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "freq_itemsets = apriori(df, use_colnames=True, min_support=0.5)\n", - "freq_itemsets" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 3. Generate rules" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Execute the following code to show the solution\n", - "association_rules(freq_itemsets, metric='confidence', min_threshold=0.75)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.9.12" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -}
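Note on the removed `Association Rules Examples.ipynb`: its three steps (encode transactions, extract frequent item sets, generate rules) can be reproduced without `mlxtend`. The sketch below is plain Python over the same eight transactions. The helper names `support`, `frequent_itemsets`, and `confidence` are hypothetical and not part of the notebook or of `mlxtend`. The candidate generation is a simplified level-wise search, not the library's implementation.

```python
# Minimal sketch of support / confidence and a level-wise frequent-itemset
# search. Helper names are illustrative assumptions, not mlxtend's API.

transactions = [
    ["burger", "salad", "coke", "ice cream"],
    ["burger", "salad", "coke", "ice cream"],
    ["burger", "fries", "coke", "pie"],
    ["burger", "salad", "coke", "choc bar"],
    ["burger", "salad", "coke", "muffin"],
    ["sandwich", "fries", "fanta", "pie"],
    ["sandwich", "fries", "coke", "pie"],
    ["sandwich", "onion rings", "water", "muffin"],
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]
    result = {}
    size = 1
    while candidates:
        # Keep only the candidates that are frequent enough.
        kept = {}
        for candidate in candidates:
            s = support(candidate, transactions)
            if s >= min_support:
                kept[candidate] = s
        result.update(kept)
        # Build the next level from unions of frequent sets of the current size.
        size += 1
        candidates = list({a | b for a in kept for b in kept if len(a | b) == size})
    return result


def confidence(antecedent, consequent, transactions):
    """Confidence of the rule antecedent -> consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


freq = frequent_itemsets(transactions, min_support=0.5)
for itemset, s in sorted(freq.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)

print(confidence({"burger"}, {"coke"}, transactions))  # every burger order includes coke
```

With `min_support=0.5` only `burger`, `coke`, `salad` and their combinations survive, which matches the threshold the notebook passed to `apriori`.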