Jupyter Notebook Block 1 - Introduction to Image Classification.ipynb 401 KiB
   "outputs": [
    {
     "data": {
      "image/png": "\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
Mirko Birbaumer's avatar
Mirko Birbaumer committed
   "source": [
    "plt.imshow(dists, interpolation='none')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "7855beeb-d7e2-4ea7-994f-1adfa5a2c886"
    }
   },
   "source": [
    "Let us now predict labels with $k = 1$, which is the plain nearest-neighbor classifier."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "nbpresent": {
     "id": "219d7522-e633-4136-aa98-9abe80ca7bf3"
    }
   },
   "outputs": [],
   "source": [
    "y_test_pred = classifier.predict_labels(dists, k=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "f083926f-4bd0-488f-8ba9-e77dc946fac8"
    }
   },
   "source": [
    "We compute and print the fraction of correctly predicted examples."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "nbpresent": {
     "id": "f1ac90b4-5005-4940-9663-0bfd9574dc8c"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Got 137 / 500 correct => accuracy: 0.274000\n"
     ]
    }
   ],
   "source": [
    "num_correct = np.sum(y_test_pred == y_test)\n",
    "accuracy = float(num_correct) / num_test\n",
    "print('Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "7a33b48c-c106-4903-ba68-769ce91ccb8b"
    }
   },
   "source": [
    "We now predict labels again, this time with $k = 10$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "nbpresent": {
     "id": "7a4433f3-d7d4-4b7c-bd21-6f6d5272c837"
    }
   },
   "outputs": [],
   "source": [
    "y_test_pred = classifier.predict_labels(dists, k=10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "8cede653-c157-4396-a534-b4a8741251e2"
    }
   },
   "source": [
    "We compute and print the fraction of correctly predicted examples."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "nbpresent": {
     "id": "445220c9-4974-41a0-a36c-a309d395490b"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Got 141 / 500 correct => accuracy: 0.282000\n"
     ]
    }
   ],
   "source": [
    "num_correct = np.sum(y_test_pred == y_test)\n",
    "accuracy = float(num_correct) / len(y_test_pred)\n",
    "print('Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Confusion Matrix"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "\n",
      "text/plain": [
       "<Figure size 648x648 with 2 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "# utility function for plotting confusion matrix\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "from sklearn.metrics import confusion_matrix\n",
    "\n",
    "def plot_confmat(y_true, y_pred):\n",
    "    \"\"\"\n",
    "    Plot the confusion matrix as an annotated heatmap, labeling ticks with the global `classes` list.\n",
    "    \"\"\"\n",
    "    conf_matrix = confusion_matrix(y_true, y_pred)\n",
    "    fig = plt.figure(figsize=(9,9))\n",
    "    ax = fig.add_subplot(111)\n",
    "    sns.heatmap(conf_matrix,\n",
    "                annot=True,\n",
    "                fmt='.0f')\n",
    "    plt.title('Confusion matrix')\n",
    "    ax.set_xticklabels( classes)\n",
    "    ax.set_yticklabels( classes)\n",
    "    plt.ylabel('True')\n",
    "    plt.xlabel('Predicted')\n",
    "    \n",
    "plot_confmat(y_test, y_test_pred)    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "df615e0b-aeeb-4074-abef-075af4118640"
    }
   },
   "source": [
    "## Algebra and Performance of Distance Matrix Computation\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "04f92811-3067-4a08-8227-ed55c42fed50"
    }
   },
   "source": [
    "To ensure that our vectorized implementation is correct, we make sure that it\n",
    "agrees with the naive implementation. There are many ways to decide whether\n",
    "two matrices are similar; one of the simplest is the Frobenius norm of their difference. In case\n",
    "you haven't seen it before, the Frobenius norm of the difference of two matrices is the square\n",
    "root of the sum of squared differences of all elements; in other words, reshape\n",
    "the matrices into vectors and compute the Euclidean distance between them."
   ]
  },
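  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a minimal sketch of the statement above, the snippet below checks on two small made-up matrices `A` and `B` (illustrative values, not data from this notebook) that the Frobenius norm of their difference equals the Euclidean distance between the flattened matrices.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "# Two small illustrative matrices (made-up values)\n",
    "A = np.array([[1.0, 2.0], [3.0, 4.0]])\n",
    "B = np.array([[1.0, 0.0], [0.0, 4.0]])\n",
    "\n",
    "# Frobenius norm of the difference\n",
    "fro = np.linalg.norm(A - B, ord='fro')\n",
    "\n",
    "# Euclidean distance between the flattened matrices\n",
    "euclid = np.sqrt(np.sum((A.ravel() - B.ravel()) ** 2))\n",
    "\n",
    "# Both routes give the same number\n",
    "assert np.isclose(fro, euclid)\n",
    "```"
   ]
  },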
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "nbpresent": {
     "id": "edecc2dc-bbf4-47bb-8902-6910fef3eae0"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Difference was: 0.000000\n",
      "Good! The distance matrices are the same\n",
      "Difference was: 0.000000\n",
      "Good! The distance matrices are the same\n"
     ]
    }
   ],
   "source": [
    "dists_two  = classifier.compute_distances_two_loops(X_test)\n",
    "dists_one  = classifier.compute_distances_one_loop(X_test)\n",
    "dists_zero = classifier.compute_distances_no_loops(X_test)\n",
    "\n",
    "\n",
    "difference_two_2_one = np.linalg.norm(dists_two - dists_one, ord='fro')\n",
    "print('Difference was: %f' % (difference_two_2_one, ))\n",
    "if difference_two_2_one < 0.001:\n",
    "  print('Good! The distance matrices are the same')\n",
    "else:\n",
    "  print('Uh-oh! The distance matrices are different')\n",
    "\n",
    "difference_one_2_zero = np.linalg.norm(dists_one - dists_zero, ord='fro')\n",
    "print('Difference was: %f' % (difference_one_2_zero, ))\n",
    "if difference_one_2_zero < 0.001:\n",
    "  print('Good! The distance matrices are the same')\n",
    "else:\n",
    "  print('Uh-oh! The distance matrices are different')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "94c6dacb-929f-4378-b80f-4859256bd7f4"
    }
   },
   "source": [
    "Let's compare how fast the three implementations are."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "nbpresent": {
     "id": "1d3c6b0c-9a33-4f71-b283-0b1eb8061e77"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Two loop version took 62.240666 seconds\n",
      "One loop version took 75.840250 seconds\n",
      "No loop version took 0.912069 seconds\n"
     ]
    }
   ],
   "source": [
    "def time_function(f, *args):\n",
    "  \"\"\"\n",
    "  Call a function f with args and return the time (in seconds) that it took to execute.\n",
    "  \"\"\"\n",
    "  import time\n",
    "  tic = time.time()\n",
    "  f(*args)\n",
    "  toc = time.time()\n",
    "  return toc - tic\n",
    "\n",
    "two_loop_time = time_function(classifier.compute_distances_two_loops, X_test)\n",
    "print('Two loop version took %f seconds' % two_loop_time)\n",
    "\n",
    "one_loop_time = time_function(classifier.compute_distances_one_loop, X_test)\n",
    "print('One loop version took %f seconds' % one_loop_time)\n",
    "\n",
    "no_loop_time = time_function(classifier.compute_distances_no_loops, X_test)\n",
    "print('No loop version took %f seconds' % no_loop_time)\n"
   ]
  },
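  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The timings above show the vectorized version running far faster than the looped ones. One common way to remove both loops for the L2 distance is the expansion $||x - y||^2 = ||x||^2 - 2 x^T y + ||y||^2$, which turns the pairwise computation into a single matrix product. Whether `compute_distances_no_loops` is implemented this way is an assumption, since its source is not shown here; the helper `pairwise_l2` below is a hypothetical sketch, checked against a naive computation on small random arrays.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "def pairwise_l2(X, X_train):\n",
    "    # Hypothetical helper: L2 distances between all rows of X and X_train\n",
    "    test_sq = np.sum(X ** 2, axis=1)[:, np.newaxis]   # shape (num_test, 1)\n",
    "    train_sq = np.sum(X_train ** 2, axis=1)           # shape (num_train,)\n",
    "    cross = X @ X_train.T                             # shape (num_test, num_train)\n",
    "    sq_dists = test_sq - 2.0 * cross + train_sq       # broadcasting fills the matrix\n",
    "    # Clip tiny negative values caused by floating-point round-off\n",
    "    return np.sqrt(np.maximum(sq_dists, 0.0))\n",
    "\n",
    "rng = np.random.default_rng(0)\n",
    "X = rng.normal(size=(4, 3))\n",
    "X_train = rng.normal(size=(6, 3))\n",
    "\n",
    "# Naive reference: one norm per (test, train) pair\n",
    "naive = np.array([[np.linalg.norm(x - t) for t in X_train] for x in X])\n",
    "assert np.allclose(pairwise_l2(X, X_train), naive)\n",
    "```"
   ]
  },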
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "e55a0c49-3d30-47b3-bbfc-2ba53025a0eb"
    }
   },
   "source": [
    "#  3. k-fold cross validation\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "nbpresent": {
     "id": "48a7d639-21bd-4b58-892d-c54a818111aa"
    }
   },
   "outputs": [],
   "source": [
    "num_folds = 5\n",
    "\n",
    "k_choices = [1, 3, 5, 7, 9, 10, 12, 15, 18, 20, 50, 100]\n",
    "\n",
    "X_train_folds = []\n",
    "y_train_folds = []"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "8b1aa44f-7099-4511-8b20-168c0f37edb9"
    }
   },
   "source": [
    "Split up the training data into folds. After splitting, `X_train_folds` and    \n",
    "`y_train_folds` should each be lists of length `num_folds`, where                \n",
    "`y_train_folds[i]` is the label vector for the points in `X_train_folds[i]`.     "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "nbpresent": {
     "id": "ee7f2e26-fa37-45b0-af4c-c225369eedc2"
    }
   },
   "outputs": [],
   "source": [
    "num_train = X_train.shape[0]\n",
    "fold_size = np.ceil(num_train/num_folds).astype('int')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "235c4927-a8f9-475f-83c4-54fb4b1de699"
    }
   },
   "source": [
    "In the case of `num_train = 5000` and 5 folds, we obtain \n",
    "`X_train_folds = np.split(X_train, [1000, 2000, 3000, 4000])`\n",
    "`y_train_folds = np.split(y_train, [1000, 2000, 3000, 4000])`"
   ]
  },
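  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To illustrate how `np.split` treats a list of split indices, here is a small sketch on a made-up array of ten elements (the names `data`, `folds`, and `folds_extra` are illustrative only). Passing the four interior boundaries yields five pieces; also passing the final boundary yields a sixth, empty piece.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "data = np.arange(10)\n",
    "\n",
    "# Four interior boundaries give five folds of two elements each\n",
    "folds = np.split(data, [2, 4, 6, 8])\n",
    "assert len(folds) == 5\n",
    "assert all(len(f) == 2 for f in folds)\n",
    "\n",
    "# Including the end of the array as a boundary adds an empty trailing piece\n",
    "folds_extra = np.split(data, [2, 4, 6, 8, 10])\n",
    "assert len(folds_extra) == 6\n",
    "assert folds_extra[-1].size == 0\n",
    "```"
   ]
  },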
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "nbpresent": {
     "id": "dd9d3e91-fb0d-4ea1-8e37-6282e1eea5f5"
    }
   },
   "outputs": [],
   "source": [
    "# Split at the num_folds - 1 interior boundaries so that exactly num_folds folds result\n",
    "X_train_folds = np.split(X_train, [(i + 1)*fold_size for i in range(num_folds - 1)])\n",
    "y_train_folds = np.split(y_train, [(i + 1)*fold_size for i in range(num_folds - 1)])\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(1000, 3072)"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_train_folds[1].shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "99d20b22-bc30-49c6-85a1-86f153b21fe0"
    }
   },
   "source": [
    "`k_to_accuracies` is a dictionary holding the accuracies that we find for different values of $k$\n",
    "when running cross-validation. After running cross-validation,\n",
    "`k_to_accuracies[k]` should be a list of length `num_folds` giving the different\n",
    "accuracy values that we found when using that value of $k$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "nbpresent": {
     "id": "a14b3164-b63a-49eb-980e-57c74b2304db"
    }
   },
   "outputs": [],
   "source": [
    "k_to_accuracies = {}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "369cc408-fb92-4899-9e37-92c02a9ef4c1"
    }
   },
   "source": [
    "We perform cross-validation over `num_folds` folds to find the best value of $k$. For each\n",
    "candidate value of $k$, we run the $k$-nearest-neighbor algorithm `num_folds` times,\n",
    "each time using all but one of the folds as training data and the\n",
    "remaining fold as a validation set. We store the accuracies for all folds and all\n",
    "values of $k$ in the `k_to_accuracies` dictionary."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "nbpresent": {
     "id": "6c869757-5e74-48cc-b7ef-14246b832a99"
    }
   },
   "outputs": [],
   "source": [
    "for k in k_choices:\n",
    "  \n",
    "  k_to_accuracies[k] = []\n",
    "  classifier = KNearestNeighbor()\n",
    "  for i in range(num_folds):\n",
    "      # Train on every fold except fold i; the index j avoids shadowing the outer k\n",
    "      X_cv_training = np.concatenate([x for j, x in enumerate(X_train_folds) if j != i], axis=0)\n",
    "      y_cv_training = np.concatenate([y for j, y in enumerate(y_train_folds) if j != i], axis=0)\n",
    "      classifier.train(X_cv_training, y_cv_training)\n",
    "      dists = classifier.compute_distances_no_loops(X_train_folds[i])\n",
    "      y_test_pred = classifier.predict_labels(dists, k=k)\n",
    "      k_to_accuracies[k].append(np.mean(y_train_folds[i] == y_test_pred))\n",
    "  \n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "c10d6b24-607c-470b-bffd-614c8fa0be2c"
    }
   },
   "source": [
    "We print out the computed accuracies."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {
    "nbpresent": {
     "id": "d7c42393-850e-4329-91db-5c052fe247e0"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "k = 1, accuracy = 0.263000\n",
      "k = 1, accuracy = 0.257000\n",
      "k = 1, accuracy = 0.264000\n",
      "k = 1, accuracy = 0.278000\n",
      "k = 1, accuracy = 0.266000\n",
      "k = 3, accuracy = 0.239000\n",
      "k = 3, accuracy = 0.249000\n",
      "k = 3, accuracy = 0.240000\n",
      "k = 3, accuracy = 0.266000\n",
      "k = 3, accuracy = 0.254000\n",
      "k = 5, accuracy = 0.248000\n",
      "k = 5, accuracy = 0.266000\n",
      "k = 5, accuracy = 0.280000\n",
      "k = 5, accuracy = 0.292000\n",
      "k = 5, accuracy = 0.280000\n",
      "k = 7, accuracy = 0.261000\n",
      "k = 7, accuracy = 0.279000\n",
      "k = 7, accuracy = 0.268000\n",
      "k = 7, accuracy = 0.288000\n",
      "k = 7, accuracy = 0.276000\n",
      "k = 9, accuracy = 0.259000\n",
      "k = 9, accuracy = 0.283000\n",
      "k = 9, accuracy = 0.270000\n",
      "k = 9, accuracy = 0.285000\n",
      "k = 9, accuracy = 0.285000\n",
      "k = 10, accuracy = 0.265000\n",
      "k = 10, accuracy = 0.296000\n",
      "k = 10, accuracy = 0.276000\n",
      "k = 10, accuracy = 0.284000\n",
      "k = 10, accuracy = 0.280000\n",
      "k = 12, accuracy = 0.260000\n",
      "k = 12, accuracy = 0.295000\n",
      "k = 12, accuracy = 0.279000\n",
      "k = 12, accuracy = 0.283000\n",
      "k = 12, accuracy = 0.280000\n",
      "k = 15, accuracy = 0.252000\n",
      "k = 15, accuracy = 0.289000\n",
      "k = 15, accuracy = 0.278000\n",
      "k = 15, accuracy = 0.282000\n",
      "k = 15, accuracy = 0.274000\n",
      "k = 18, accuracy = 0.266000\n",
      "k = 18, accuracy = 0.275000\n",
      "k = 18, accuracy = 0.281000\n",
      "k = 18, accuracy = 0.284000\n",
      "k = 18, accuracy = 0.282000\n",
      "k = 20, accuracy = 0.270000\n",
      "k = 20, accuracy = 0.279000\n",
      "k = 20, accuracy = 0.279000\n",
      "k = 20, accuracy = 0.282000\n",
      "k = 20, accuracy = 0.285000\n",
      "k = 50, accuracy = 0.271000\n",
      "k = 50, accuracy = 0.288000\n",
      "k = 50, accuracy = 0.278000\n",
      "k = 50, accuracy = 0.269000\n",
      "k = 50, accuracy = 0.266000\n",
      "k = 100, accuracy = 0.256000\n",
      "k = 100, accuracy = 0.270000\n",
      "k = 100, accuracy = 0.263000\n",
      "k = 100, accuracy = 0.256000\n",
      "k = 100, accuracy = 0.263000\n"
     ]
    }
   ],
   "source": [
    "for k in sorted(k_to_accuracies):\n",
    "    for accuracy in k_to_accuracies[k]:\n",
    "        print('k = %d, accuracy = %f' % (k, accuracy))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We plot the raw observations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {
    "nbpresent": {
     "id": "e81573f1-9d05-44e2-a581-ffa01100b7af"
    }
   },
   "outputs": [
    {
     "data": {
      "image/png": "\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "for k in k_choices:\n",
    "  accuracies = k_to_accuracies[k]\n",
    "  plt.scatter([k] * len(accuracies), accuracies)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "21f79bed-12f0-4e15-abdd-1105b4467cf0"
    }
   },
   "source": [
    "We plot the mean accuracy for each value of $k$, with error bars showing the standard deviation across folds."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {
    "nbpresent": {
     "id": "c9af79e8-2cfa-42ed-84fe-efbdadcf65fd"
    }
   },
   "outputs": [
    {
     "data": {
      "image/png": "\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "accuracies_mean = np.array([np.mean(v) for k,v in sorted(k_to_accuracies.items())])\n",
    "accuracies_std = np.array([np.std(v) for k,v in sorted(k_to_accuracies.items())])\n",
    "plt.errorbar(k_choices, accuracies_mean, yerr=accuracies_std)\n",
    "plt.title('Cross-validation on k')\n",
    "plt.xlabel('k')\n",
    "plt.ylabel('Cross-validation accuracy')\n",
    "plt.show()"
   ]
  },
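  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To turn cross-validation results into a choice of $k$, one option is to pick the value with the highest mean accuracy. The sketch below does this for three of the values of $k$ reported above ($k = 1$, $10$, and $100$), with the per-fold accuracies copied from the printed output; the names `subset`, `mean_acc`, and `best_k` are assumptions made for illustration.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "# Per-fold accuracies for three values of k, copied from the printed output\n",
    "subset = {\n",
    "    1: [0.263, 0.257, 0.264, 0.278, 0.266],\n",
    "    10: [0.265, 0.296, 0.276, 0.284, 0.280],\n",
    "    100: [0.256, 0.270, 0.263, 0.256, 0.263],\n",
    "}\n",
    "\n",
    "# Mean accuracy per k, then the k with the largest mean\n",
    "mean_acc = {k: np.mean(v) for k, v in subset.items()}\n",
    "best_k = max(mean_acc, key=mean_acc.get)\n",
    "print(best_k)  # 10 for this subset\n",
    "```\n",
    "\n",
    "The same two lines applied to the full `k_to_accuracies` dictionary would select among all values in `k_choices`."
   ]
  },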
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "301c698f-4817-4bc5-8e35-ee37caebacba"
    }
   },
   "source": [
    "# K-Nearest Neighbor with L1 distance"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {
    "nbpresent": {
     "id": "ce60718f-a584-4026-8071-292b5943eca4"
    }
   },
   "outputs": [],
   "source": [
    "class KNearestNeighbor_L1(KNearestNeighbor):\n",
    "  \"\"\" a kNN classifier with L1 distance \"\"\"\n",
    "\n",
    "  def __init__(self):\n",
    "    super().__init__()\n",
    "    \n",
    "\n",
    "  def compute_distances_one_loop(self, X):\n",
    "    \"\"\"\n",
    "    Override the compute_distances_one_loop method of the parent class\n",
    "    KNearestNeighbor.\n",
    "    Compute the distance between each test point in X and each training point\n",
    "    in self.X_train using one loop and the L1 distance measure.\n",
    "\n",
    "    Input / Output: Same as compute_distances_two_loops\n",
    "    \"\"\"\n",
    "    num_test = X.shape[0]\n",
    "    num_train = self.X_train.shape[0]\n",
    "    dists = np.zeros((num_test, num_train))\n",
    "    X = X.astype('float')\n",
    "    for i in range(num_test):\n",
    "      dists[i, :] = (np.sum(np.abs(self.X_train - X[i,:]), axis = 1))\n",
    "    return dists \n",
    "  \n",
    "  def compute_distances_two_loops(self, X):\n",
    "    \"\"\"\n",
    "    Compute the distance between each test point in X and each \n",
    "    training point in self.X_train using a nested loop over both \n",
    "    the training data and the test data.\n",
    "\n",
    "    Inputs:\n",
    "    - X: A numpy array of shape (num_test, D) containing test data.\n",
    "\n",
    "    Returns:\n",
    "    - dists: A numpy array of shape (num_test, num_train) where \n",
    "      dists[i, j] is the L1 distance between the ith test \n",
    "      point and the jth training point.\n",
    "    \"\"\"\n",
    "    num_test = X.shape[0]\n",
    "    num_train = self.X_train.shape[0]\n",
    "    dists = np.zeros((num_test, num_train))\n",
    "    X = X.astype('float')\n",
    "    for i in range(num_test):\n",
    "      for j in range(num_train):\n",
    "          dists[i, j] = np.sum(np.abs(self.X_train[j,:] - X[i,:]))\n",
    "        \n",
    "        \n",
    "     \n",
    "    return dists\n",
    "       "
   ]
  },
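  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The one-loop method above computes one row of the distance matrix per test point. As a minimal sketch, the snippet below shows a fully vectorized L1 variant using NumPy broadcasting on small random arrays; the helper name `pairwise_l1` is hypothetical and not part of the notebook's classes. The intermediate array has shape `(num_test, num_train, D)`, so for 500 test points, 5000 training points, and 3072 features it would hold about 7.7 billion entries, which is one reason a loop over test points can be the more practical choice here.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "def pairwise_l1(X, X_train):\n",
    "    # Hypothetical helper: L1 distances between all rows of X and X_train\n",
    "    # diff has shape (num_test, num_train, D); this can be very large\n",
    "    diff = X[:, np.newaxis, :] - X_train[np.newaxis, :, :]\n",
    "    return np.sum(np.abs(diff), axis=2)\n",
    "\n",
    "rng = np.random.default_rng(0)\n",
    "X = rng.normal(size=(4, 3))\n",
    "X_train = rng.normal(size=(6, 3))\n",
    "\n",
    "# Reference: the one-loop formulation, one row per test point\n",
    "loop = np.zeros((4, 6))\n",
    "for i in range(4):\n",
    "    loop[i, :] = np.sum(np.abs(X_train - X[i, :]), axis=1)\n",
    "\n",
    "assert np.allclose(pairwise_l1(X, X_train), loop)\n",
    "```"
   ]
  },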
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "d5745a61-1071-4704-8b71-6c0d175de9fc"
    }
   },
   "source": [
    "We create an instance `classifier` of the class `KNearestNeighbor_L1`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {
    "nbpresent": {
     "id": "235c3d13-a428-4dae-a286-6ea912f8a0b2"
    }
   },
   "outputs": [],
   "source": [
    "classifier = KNearestNeighbor_L1()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "94df5594-5eff-4354-bc83-889aca850336"
    }
   },
   "source": [
    "We call the `train` method, which `KNearestNeighbor_L1` inherits from the `KNearestNeighbor` class."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {
    "nbpresent": {
     "id": "627b4ca8-b0df-473d-8e53-3bcc2e31acd8"
    }
   },
   "outputs": [],
   "source": [
    "classifier.train(X_train, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "b96d32ad-0526-4a52-a91e-dffb4a9e634a"
    }
   },
   "source": [
    "We test our implementation with one loop."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {
    "nbpresent": {
     "id": "f6ecd69e-e8b4-44a5-8ec1-8aeb47fbc5b5"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(500, 5000)"
      ]
     },
     "execution_count": 44,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dists = classifier.compute_distances_one_loop(X_test)\n",
    "dists.shape  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "cd4c75ed-d9f1-4f3f-8990-2259f4f2f0d5"
    }
   },
   "source": [
    "We now predict labels with the L1 classifier, using $k = 10$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {
    "nbpresent": {
     "id": "606e2720-6672-45f3-ae46-761df5c2066d"
    }
   },
   "outputs": [],
   "source": [
    "y_test_pred = classifier.predict_labels(dists, k=10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "3408b28c-0781-4186-b1cc-f8d0040ecf8f"
    }
   },
   "source": [
    "We compute and print the fraction of correctly predicted examples."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {
    "nbpresent": {
     "id": "1919eb5a-988f-4bee-a646-d110372bbca6"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Got 139 / 500 correct => accuracy: 0.278000\n"
     ]
    }
   ],
   "source": [
    "num_correct = np.sum(y_test_pred == y_test)\n",
    "accuracy = float(num_correct) / len(y_test_pred)\n",
    "print('Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy)) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The confusion matrix looks as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "\n",
      "text/plain": [
       "<Figure size 648x648 with 2 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "# plot_confmat is defined in the Confusion Matrix section earlier in this notebook\n",
    "plot_confmat(y_test, y_test_pred)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "09892b80-b73f-41f3-8671-04ebc8f58ece"
    }
   },
   "source": [
    "# k-fold cross validation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {
    "nbpresent": {
     "id": "4d4d5599-4959-4aa9-8250-7ccd99c0eef6"
    }
   },
   "outputs": [],
   "source": [
    "num_folds = 5\n",
    "\n",
    "k_choices = [1, 3, 5, 7, 9, 10, 12, 15, 18, 20, 50, 100]\n",
    "\n",
    "X_train_folds = []\n",
    "y_train_folds = []"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "eeb5aeda-0ce5-4581-9fbe-e7605376384a"
    }
   },
   "source": [
    "We split up the training data into folds. After splitting, `X_train_folds` and\n",
    "`y_train_folds` should each be lists of length `num_folds`, where\n",
    "`y_train_folds[i]` is the label vector for the points in `X_train_folds[i]`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {
    "nbpresent": {
     "id": "50f9138b-3378-411f-96a1-5e3fe013e396"
    }
   },
   "outputs": [],
   "source": [
    "num_train = X_train.shape[0]\n",
    "fold_size = np.ceil(num_train/num_folds).astype('int')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "267d72f8-6485-4abd-a1d7-d26b4be5a6cc"
    }
   },
   "source": [
    "In the case of `num_train = 5000` and 5 folds, we obtain \n",
    "`X_train_folds = np.split(X_train, [1000, 2000, 3000, 4000])`\n",
    "`y_train_folds = np.split(y_train, [1000, 2000, 3000, 4000])`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {
    "nbpresent": {
     "id": "ca3a1d8c-4b8a-42d6-94e7-793e87cebdea"
    }
   },
   "outputs": [],
   "source": [
    "# Split at the num_folds - 1 interior boundaries so that exactly num_folds folds result\n",
    "X_train_folds = np.split(X_train, [(i + 1)*fold_size for i in range(num_folds - 1)])\n",
    "y_train_folds = np.split(y_train, [(i + 1)*fold_size for i in range(num_folds - 1)])\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "e1be1d21-0776-4587-9b88-d38e804eab71"
    }
   },
   "source": [