{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# SkinAnaliticAI, Skin Cancer Detection with AI Deep Learning\n",
    "\n",
    "## __Evaluation of Harvard Dataset with different AI classiffication techniques using FastClassAI papeline__\n",
    "Author: __Pawel Rosikiewicz__   \n",
    "prosikiewicz@gmail.com      \n",
    "License: __MIT__    \n",
    "ttps://opensource.org/licenses/MIT        \n",
    "Copyright (C) 2021.01.30 Pawel Rosikiewicz        "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Goals:\n",
    "* quick EDA on data provided by dataset Authors, \n",
    "* to prepare a summary table with filtered images (eg no missing data) with all metadata that could be used lateron, in in the project \n",
    "* if metadat are not available, you need to find following informaiton on each image:  \n",
    "    * image names, \n",
    "    * original classs label for each image that will correspon dot orinal labels in config files, \n",
    "\n",
    "## Key Observations:\n",
    "* __classed_to_poredict/target__\n",
    "    * presented in dx column, \n",
    "    * 7 classes \n",
    "* __columns with dx_type__,   \n",
    "    * indicates, how the images were classified, From documentarion we know that there were three methods:\n",
    "    * histo - histopatological, using biopsy to classify the lession - considered most reliable (53.7% of images), \n",
    "    * followup - image \n",
    "* __missing data__ \n",
    "    * approximately 0.1% of rows have missing data  \n",
    "    * these were found only in one column: age\n",
    "* __duplicates__\n",
    "    * all image_id are unique, \n",
    "    * howvever, many images are technical duplicates, ie, these are images of the same skin chnages (lesions), taken at different time, angle, magniffication etc.. \n",
    "\n",
    "## Caution\n",
    "* all config files are based on original class labeling, \n",
    "* other labelling dictionaries were created to for example, merge different classes easily with each other, \n",
    "* if classes are not available, or are \"weird\", just select one classyficaiton system and then work with it, \n",
    "* class labels can be very easily changed, in that project"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### standard imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os # allow changing, and navigating files and folders, \n",
    "import sys\n",
    "import shutil\n",
    "import re # module to use regular expressions, \n",
    "import glob # lists names in folders that match Unix shell patterns\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import warnings\n",
    "warnings.filterwarnings(\"ignore\")\n",
    "from tensorflow.keras.preprocessing.image import ImageDataGenerator"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "ename": "ModuleNotFoundError",
     "evalue": "No module named 'cv2'",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mModuleNotFoundError\u001b[0m                       Traceback (most recent call last)",
      "\u001b[0;32m<ipython-input-2-d840e6510e5a>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m     11\u001b[0m \u001b[0;31m# caution, loaded only form basedir,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     12\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0msrc\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mutils\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mEDA_Helpers2\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0meda_helpers\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 13\u001b[0;31m \u001b[0;32mfrom\u001b[0m \u001b[0msrc\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mutils\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mannotated_pie_charts\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mannotated_pie_chart_with_class_and_group\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mprepare_img_classname_and_groupname\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m     14\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0msrc\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mutils\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdata_preparation_tools\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mcopy_and_organize_files_for_keras_image_generators\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     15\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0msrc\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mutils\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdata_preparation_tools\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mcreate_file_catalogue\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m/work/amld-2021-workshop/src/utils/annotated_pie_charts.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m     42\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0msrc\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mutils\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mimage_augmentation\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0;34m*\u001b[0m \u001b[0;31m# to create batch_labels files,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     43\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0msrc\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mutils\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdata_loaders\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mload_encoded_imgbatch_using_logfile\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mload_raw_img_batch\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 44\u001b[0;31m \u001b[0;32mfrom\u001b[0m \u001b[0msrc\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mutils\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtools_for_plots\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mcreate_class_colors_dict\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m     45\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     46\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m/work/amld-2021-workshop/src/utils/tools_for_plots.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m     27\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mrandom\u001b[0m \u001b[0;31m# functions that use and generate random numbers\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     28\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 29\u001b[0;31m \u001b[0;32mimport\u001b[0m \u001b[0mcv2\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m     30\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mnumpy\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0mnp\u001b[0m \u001b[0;31m# support for multi-dimensional arrays and matrices\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     31\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mpandas\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0mpd\u001b[0m \u001b[0;31m# library for data manipulation and analysis\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;31mModuleNotFoundError\u001b[0m: No module named 'cv2'"
     ]
    }
   ],
   "source": [
    "# setup basedir\n",
    "basedir = os.path.dirname(os.getcwd())\n",
    "os.chdir(basedir)\n",
    "sys.path.append(basedir)\n",
    "\n",
    "# set up paths for the project\n",
    "PATH_raw = os.path.join(basedir, \"data/raw\")\n",
    "PATH_interim = os.path.join(basedir, \"data/interim\")\n",
    "PATH_interim_dataset_summary_tables = os.path.join(PATH_interim, \"dataset_summary_tables\") # create in that notebook, \n",
    "\n",
    "# caution, loaded only form basedir,\n",
    "import src.utils.EDA_Helpers2 as eda_helpers \n",
    "from src.utils.annotated_pie_charts import annotated_pie_chart_with_class_and_group, prepare_img_classname_and_groupname\n",
    "from src.utils.data_preparation_tools import copy_and_organize_files_for_keras_image_generators\n",
    "from src.utils.data_preparation_tools import create_file_catalogue\n",
    "from src.utils.data_preparation_tools import create_keras_comptatible_file_subset_with_class_folders\n",
    "from src.utils.example_plots import *\n",
    "from src.utils.feature_extraction_tools import encode_images_with_tfhubmodule # with tf.compat.v1 functions, for tf.__version___ >= 1.15\n",
    "from src.utils.clustered_histogram import find_n_examples_in_each_class, clustered_histogram_with_image_examples, calculate_linkage_for_images_with_extracted_features\n",
    "from src.utils.clustered_histogram import add_descriptive_notes_to_each_cluster_in_batch_labels, find_clusters_on_dendrogram, create_clustered_heatmap_with_img_examples\n",
    "from src.utils.data_loaders import load_encoded_imgbatch_using_logfile, load_raw_img_batch\n",
    "from src.utils.example_plots_after_clustering import plot_img_examples, create_spaces_between_img_clusters, plot_img_examples_from_dendrogram\n",
    "from src.utils.annotated_pie_charts import annotated_pie_chart_with_class_and_group, prepare_img_classname_and_groupname\n",
    "from src.utils.tools_for_plots import create_class_colors_dict\n",
    "from src.utils.data_preparation_tools import create_data_subsets\n",
    "\n",
    "# load project configs\n",
    "from src.configs.project_configs import PROJECT_NAME\n",
    "from src.configs.project_configs import CLASS_DESCRIPTION # information on each class, including descriptive class name and diegnostic description - used to help wiht the project\n",
    "from src.configs.tfhub_configs import TFHUB_MODELS # names of TF hub modules that I presenlected for featuress extraction with all relevant info,\n",
    "from src.configs.dataset_configs import DATASET_CONFIGS # names created for clases, assigned to original one, and colors assigned to these classes\n",
    "from src.configs.dataset_configs import CLASS_LABELS_CONFIGS # names created for clases, assigned to original one, and colors assigned to these classes\n",
    "from src.configs.dataset_configs import DROPOUT_VALUE # str, special value to indicate samples to remoce in class labels"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### variables used to clean the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "PROJECT_NAME                      = \"SkinAnaliticAI_Harvard_dataset_evaluation\" # \n",
    "DATASET_NAME                      = \"HAM10000\"  # name used in config files to identify all info on that dataset variant\n",
    "\n",
    "# metadata table cleaning \n",
    "ORIGINAL_METADATA_FILENAME        = \"HAM10000_metadata.csv\" # this is something you must check, I had only one csv file in that case\n",
    "COLNAME_ORIGINAL_FILENAMES        = \"image_id\"\n",
    "COLNAME_ORIGINAL_CLASS_LABELS     = \"dx\" # we will use original_labels column later on to have standard name\n",
    "COLNAME_POTENTIAL_DUPLICATES      = [\"lesion_id\"] # each treated separately, \n",
    "COLNAME_ORGINAL_CLASS_TO_REDUCE   = {\"nv\":3000}  # ie. 2500 randomly selected images in nv class will be removed, \n",
    "\n",
    "#.. adding new info to metadata\n",
    "DATA_TYPE                         = \"raw_data\"   # no image augmentation was applied, added to metadata o each raw image, \n",
    "DATASET_VARIANTS                  = DATASET_CONFIGS[DATASET_NAME][\"labels\"] # class labels that will be used, SORT_FILES_WITH   must be included\n",
    "\n",
    "# sorting images into class-labelled folders, \n",
    "SORT_FILES_WITH                   = \"original_labels\" # these class labels will be used to sort images, the other classes will be stored \n",
    "INPUT_DATA_DIRNAME_LIST           = [\"HAM10000_images_part_1\", \"HAM10000_images_part_2\"] # name of files where are located raw files/images that will be ssegragated with one of the class labe systems\n",
    "OUTPUT_DIRNAME                    = DATASET_NAME # dir name in basedir/intrim where sorted images will be stored, "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load anmd explore metadata, \n",
    "* join mutiple files into one df, if required"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "os.chdir(PATH_raw); \n",
    "\n",
    "# load metadata for images and check the df, \n",
    "img_metadata = pd.read_csv(ORIGINAL_METADATA_FILENAME)\n",
    "display(img_metadata.head(5))\n",
    "    \n",
    "# summarize the table with DATA FRAME EXPLORER\n",
    "data_examples, top_val_names, top_val_perc = eda_helpers.summarize_df(df=img_metadata) \n",
    "eda_helpers.examine_df_visually(data_examples = data_examples, top_values_perc = top_val_perc) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## CLEAN THE IMAGE LABELS FROM MISSING DATA AND DUPLICATES, [BALANCE THE CLASSES]\n",
    "* Remove duplicates and rows with missing data, \n",
    "* optionally, reduce the number of images in one unbalanced classes\n",
    "* create or duplicatec column with one class_labels system names \"original_class\"\n",
    "    * here i was using dsata form one source, so that was not an issue, but it becomes one, if you need to used data form many different sources, having different names for the same class\n",
    "\n",
    "* The final table shodul have columns with:\n",
    "    * original_filename - original, and unique file/image name\n",
    "    * original_class - see in the above, \n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# load-again original metadata \n",
    "'''\n",
    "    re-loading, helps if you introduce some chnages in cleaning process, \n",
    "    but only if df is relatively small, otherwise make a copy here, \n",
    "    or develope cleaning procedure on a smaller subset\n",
    "'''\n",
    "os.chdir(PATH_raw);  \n",
    "dataset_summary_table  = pd.read_csv(ORIGINAL_METADATA_FILENAME)\n",
    "\n",
    "\n",
    "# .......................................................\n",
    "# Global modiffication\n",
    "\n",
    "# (a) find, or create columns with original_filename & original_labels, and name them like that, \n",
    "dataset_summary_table[\"original_filenames\"]  = dataset_summary_table[COLNAME_ORIGINAL_FILENAMES]\n",
    "dataset_summary_table[\"original_labels\"]     = dataset_summary_table[COLNAME_ORIGINAL_CLASS_LABELS]\n",
    "\n",
    "# (b) add columns with project metadata\n",
    "dataset_summary_table[\"project_name\"]       = PROJECT_NAME # no image augmentation was applied\n",
    "dataset_summary_table[\"dataset_name\"]       = DATASET_NAME\n",
    "dataset_summary_table[\"data_type\"]          = DATA_TYPE\n",
    "\n",
    "# (c) keep raw data, to compare them with cleaned data, later on, \n",
    "dataset_summary_table_before_cleaning = dataset_summary_table.copy()\n",
    "dataset_summary_table_before_cleaning[\"dataset_cleaing\"] = \"before_cleaning\"\n",
    "\n",
    "\n",
    "\n",
    "# .......................................................\n",
    "# Remove rows with missing data, and duplicates,\n",
    "\n",
    "# (a) remove rows with missing data\n",
    "print(\"table shape - no chnages: \",dataset_summary_table.shape)\n",
    "dataset_summary_table = dataset_summary_table.dropna(axis=0)\n",
    "print(\"table shape - without na: \", dataset_summary_table.shape)\n",
    "\n",
    "# (b) remove rows with duplicated image_ids\n",
    "for one_colname in COLNAME_POTENTIAL_DUPLICATES:\n",
    "    dataset_summary_table = dataset_summary_table.drop_duplicates(subset=one_colname, keep='first')\n",
    "    dataset_summary_table.reset_index(inplace=True)\n",
    "print(\"table shape - without duplicates: \", dataset_summary_table.shape)\n",
    "\n",
    "# (c) again keep raw data, to compare them with cleaned data, later on, \n",
    "dataset_summary_table_without_NA_duplicates = dataset_summary_table.copy()\n",
    "dataset_summary_table_without_NA_duplicates[\"dataset_cleaing\"] = \"NA_and_dupl_removed\"\n",
    "\n",
    "\n",
    "\n",
    "# .......................................................\n",
    "# Balance the datset: remove random images form selected classes, \n",
    "''' here, for simplicity,  decided to provide numbers, instead of % values, '''\n",
    "\n",
    "# (a) print info,\n",
    "print(\"\\nclass_counts before balancing\")\n",
    "display(dataset_summary_table.loc[:, COLNAME_ORIGINAL_CLASS_LABELS].value_counts())\n",
    "\n",
    "\n",
    "# (b) remove requested number of images from selected classes\n",
    "print(\"\\n\")\n",
    "if len(COLNAME_ORGINAL_CLASS_TO_REDUCE)>0:\n",
    "    for cl, nr in COLNAME_ORGINAL_CLASS_TO_REDUCE.items():\n",
    "        print(f\" - removing {nr} files from {cl}\")\n",
    "        idx_to_remove = np.random.choice(np.where(dataset_summary_table.loc[:, COLNAME_ORIGINAL_CLASS_LABELS]==cl)[0], nr, replace=False)\n",
    "        dataset_summary_table = dataset_summary_table.drop(idx_to_remove, axis=0)\n",
    "        dataset_summary_table.reset_index(drop=True, inplace=True)\n",
    "        \n",
    "# (c) add note on data cleanign status\n",
    "dataset_summary_table[\"dataset_cleaing\"] = \"balanced_and_cleaned\"\n",
    "    \n",
    "# (d) quickly test results, \n",
    "print(\"\\nclass_counts after balancing\")\n",
    "display(dataset_summary_table.loc[:, COLNAME_ORIGINAL_CLASS_LABELS].value_counts())\n",
    "display(dataset_summary_table.head(2))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Save summary tables, \n",
    "PATH_results = os.path.join(PATH_interim, f'dataset_summary_tables')\n",
    "try:\n",
    "    os.mkdir(PATH_results)\n",
    "except:\n",
    "    pass\n",
    "os.chdir(PATH_results)\n",
    "\n",
    "# raw data, without NA and duplicates \n",
    "file_name = f\"{DATASET_NAME}_cleaned__dataset_summary_table.csv\"\n",
    "dataset_summary_table_without_NA_duplicates.to_csv(file_name, header=True, index=False)\n",
    "print(\"saved: \",file_name, \" - shape \", dataset_summary_table_without_NA_duplicates.shape)\n",
    "\n",
    "# cleaned data \n",
    "file_name = f\"{DATASET_NAME}_cleaned_and_balanced__dataset_summary_table.csv\"\n",
    "dataset_summary_table.to_csv(file_name, header=True, index=False)\n",
    "print(\"saved: \",file_name, \" - shape \", dataset_summary_table.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Test How Data Cleaning affected each data varinat compositions with images/files in different classes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# prepare plot_data\n",
    "'''\n",
    "    concatenated df's need to have different values in dataset_cleaing column\n",
    "    this will also include droppped out files/images (if any included), \n",
    "    that are removed, later-on, optionally while training a model,\n",
    "'''\n",
    "plot_data = pd.concat([dataset_summary_table_before_cleaning, \n",
    "                       dataset_summary_table_without_NA_duplicates],axis=0)\n",
    "\n",
    "\n",
    "    \n",
    "# plot pie charts with number of images in each class, in each classyficaiton system\n",
    "for one_dataset_variant in DATASET_VARIANTS:\n",
    "    \n",
    "    # Prepare the data, and add them to df, \n",
    "    \n",
    "    #.. map new class names on original labels\n",
    "    ds_mapped_classnames = plot_data.loc[:, COLNAME_ORIGINAL_CLASS_LABELS].map(CLASS_LABELS_CONFIGS[one_dataset_variant]['class_labels_dict'])\n",
    "    \n",
    "    #..  add new columns to df, \n",
    "    plot_data[f\"{one_dataset_variant}\"] = ds_mapped_classnames\n",
    "    \n",
    "        \n",
    "    # Pie chart with dataset composition\n",
    "    annotated_pie_chart_with_class_and_group(\n",
    "        title          = f'{one_dataset_variant}',\n",
    "        classnames     = plot_data.loc[:, one_dataset_variant].values.tolist(),\n",
    "        class_colors   = CLASS_LABELS_CONFIGS[one_dataset_variant]['class_labels_colors'],\n",
    "        groupnames     = plot_data.dataset_cleaing.values.tolist(),\n",
    "        # plot aestetics \n",
    "        figsze_scale=1.5,\n",
    "        ax_title_fonsize_scale=0.6,\n",
    "        wedges_fontsize_scale=1,\n",
    "        add_group_item_perc_to_numbers_in_each_pie=False,\n",
    "        title_ha=\"center\",\n",
    "        mid_pie_circle_color=\"lightblue\",\n",
    "        tight_lyout=True,\n",
    "        subplots_adjust_top=0.9,\n",
    "        legend_fontsize_scale=1.5,\n",
    "        legend_loc=(0.05, 0.81),\n",
    "        legend=True, # because each class is annotated, \n",
    "        legend_ncol=2,\n",
    "        n_subplots_in_row=1\n",
    "    )    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# STEP 3. CREATE DATSET WITH IMAGES/FILES SORTED INTO CLASS-NAMED FOLDERS\n",
    "* __IMPORTANT__\n",
    "    * I will crerate dataset folder in basedir/data/interim\n",
    "    * there will be at least three folders, test, valid and train datasubsets\n",
    "    * each folder may be divided into smaller subset, and they may or may not be called as train/valid"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load dataset summary table with all images that will be used\n",
    "os.chdir(PATH_interim_dataset_summary_tables)\n",
    "file_name             = f\"{DATASET_NAME}_cleaned__dataset_summary_table.csv\"\n",
    "dataset_summary_table = pd.read_csv(file_name)\n",
    "dataset_summary_table.reset_index(drop=True, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## STEP 3a) Create dataset variant with class-sorted images"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "## STEP 3a) CREATE FOLDER FOR EACH DATASET VARIANT, WITH SORTED IMAGES\n",
    "for dt_i, dataset_variant in enumerate(DATASET_VARIANTS):\n",
    "    print(f'- {dt_i} - Preparing: {dataset_variant}')\n",
    "\n",
    "    # exctract filenames and original class labels, \n",
    "    original_filenames      = pd.Series(dataset_summary_table.loc[:, \"original_filenames\"])\n",
    "    original_labels         = pd.Series(dataset_summary_table.loc[:, \"original_labels\"])\n",
    "     \n",
    "    # Find class labels for dataset variant   \n",
    "    class_labels_variant_dict = CLASS_LABELS_CONFIGS[dataset_variant][\"class_labels_dict\"]\n",
    "    new_class_labels = original_labels.map(class_labels_variant_dict)\n",
    "    \n",
    "    # Remove dropout class or files, \n",
    "    '''here you may add the code, to remove speciffic files/images'''\n",
    "    idx_to_remove      = np.where(new_class_labels==DROPOUT_VALUE)[0].tolist()\n",
    "    new_class_labels   = new_class_labels.drop(idx_to_remove).values.tolist()\n",
    "    original_filenames = original_filenames.drop(idx_to_remove).values.tolist()    \n",
    "    \n",
    "    # copy files to temporary directory and organize with new class labels\n",
    "    _, _ =  copy_and_organize_files_for_keras_image_generators(\n",
    "            # ... files description\n",
    "            file_name_list          = original_filenames,   # list, names of files to be copied, if they contain file extension, see ad d nothing below, \n",
    "            class_name_list         = new_class_labels,     # list of classses, same lenght as file_names_list, Caution, no special characters allowed !\n",
    "            # ... inputs\n",
    "            src_path                = PATH_raw,                 # str, path to file, that holds at least one, specified folder with files (eg images.jpg) to copy, to  data_dst_path/class_name/files eg .jpg\n",
    "            src_dataset_name_list   = INPUT_DATA_DIRNAME_LIST,  # names of directories, that shodul be found in input_data_path\n",
    "            # ... outputs\n",
    "            dst_path                = PATH_interim,        # str, path to file, where the new_dataset file will be created, as follow data_dst_path/dataset_name/subset_name/class_name/files eg .jpg\n",
    "            dst_dataset_name        = f\"{DATASET_NAME}__{dataset_variant}\",  # str, used to save the file, in data_dst_path/dataset_name/subset_name/class_name/files eg .jpg\n",
    "            dst_subset_name         = \"All_files_organized_by_class\",    # str, same as above, eg=train data, data_dst_path/dataset_name/subset_name/class_name/files eg .jpg\n",
    "            file_extension_list     = [\".jpg\"],            # file extensions that shoudl be tranferred, dont forget about the dot., the fucntion will also accept \"\" as for no extension\n",
    "            # ...\n",
    "            verbose=False,\n",
    "            track_progres=False,  \n",
    "            return_logfiles=True,                   # returns, list with \n",
    "            create_only_logfiles=False              # bool, special option, the function bechaves in the same way, except it do not copy files, but create logfiles only,\n",
    "                                                    #        with classified, items, grouped in dct where key is a class name, it rtunrs two logfiles, with present and mising files,\n",
    "        ) \n",
    "\n",
    "    # test if the images were copied as expected\n",
    "    _ = create_file_catalogue(\n",
    "            path = os.path.join(PATH_interim, \n",
    "                                f\"{DATASET_NAME}__{dataset_variant}\",\n",
    "                                \"All_files_organized_by_class\"\n",
    "                               ),\n",
    "            searched_class_name_list = None, # if none, catalog, all, \n",
    "            verbose=True)\n",
    "    print(f\"\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## STEP 3c) CREATE TRAIN, TEST AND VALIDATION DATASET SUBSETS IN EACH DATASET VARIANT\n",
    "* each of them will have 10% of images, with the same proportions of images in each class as in the source\n",
    "* All_files_organized_by_class folder shodul be empty at the end, \n",
    "* IMPORTANT: I'am moving images, instead of copying them, to avoid creating duplicates between subsets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for dt_i, dataset_variant in enumerate(DATASET_VARIANTS):\n",
    "    print(f'- {dt_i} - Preparing subsets for {dataset_variant}')\n",
    "\n",
    "    # names,\n",
    "    src_path                 = os.path.join(PATH_interim, f'{DATASET_NAME}__{dataset_variant}')\n",
    "    src_subset_name          = f\"All_files_organized_by_class\"\n",
    "    dst_path                 = src_path\n",
    "    new_data_subset_names    = [    'test_01', 'test_02',\n",
    "                                    'valid_01', \"valid_02\", \n",
    "                                    \"train_01\", \"train_02\",\"train_03\", \n",
    "                                     \"train_04\",\"train_05\", \"train_06\",\"train_07\"]\n",
    "    file_proportions         = [0.05,0.05]+[0.1]*8+[0.99] # the last one is larger to ensure that all files will be tranferred and none left due to rounding errors,                  \n",
    "        \n",
    "    # create subsets with equal proportins of images from each class\n",
    "    create_data_subsets(\n",
    "                src_path                 = src_path,\n",
    "                src_subset_name          = src_subset_name,\n",
    "                dst_path                 = dst_path,\n",
    "                dst_subset_name_list     = new_data_subset_names,\n",
    "                # ...\n",
    "                new_subset_size          = file_proportions, # list, ==len(dst_subset_name_list)\n",
    "                min_new_subset_size      = 0.025,\n",
    "                # ...\n",
    "                move_files               = True,\n",
    "                random_state_nr          = 0,\n",
    "                fix_random_nr            = True,\n",
    "                verbose                  = False\n",
    "            )            \n",
    "    print(\"done ........................ no more files to transfer\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Check how many images and from what class were placed in each subset in each dtataset variant"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# find all newly created folders wiht subsets in dataset folder, \n",
    "for dataset_variant in DATASET_VARIANTS:\n",
    "    path = os.path.join(PATH_interim, f\"{DATASET_NAME}__{dataset_variant}\")\n",
    "    os.chdir(path)\n",
    "    \n",
    "    filenames = []\n",
    "    class_long_list = list()\n",
    "    file_long_list = list()\n",
    "    for file in glob.glob(f\"[test|valid|train]*\"):\n",
    "        filenames.append(file)    \n",
    "    \n",
    "    for s_name in filenames:\n",
    "        log = create_file_catalogue(\n",
    "                path = os.path.join(path, s_name),\n",
    "                searched_class_name_list = None, # if none, catalog, all, \n",
    "                verbose=False)\n",
    "        \n",
    "        for one_key in list(log.keys()):\n",
    "            class_long_list.extend([one_key]*len(log[one_key]))\n",
    "            file_long_list.extend([s_name]*len(log[one_key]))\n",
    "\n",
    "    # .. Pie chart with dataset composition\n",
    "    annotated_pie_chart_with_class_and_group(\n",
    "        title=f'Class size/percentage in every data subset created from \\n {DATASET_NAME}, {dataset_variant}',\n",
    "        classnames=class_long_list,\n",
    "        groupnames=file_long_list,\n",
    "        class_colors=CLASS_LABELS_CONFIGS[dataset_variant][\"class_labels_colors\"], \n",
    "        figsze_scale=1.5,\n",
    "        ax_title_fonsize_scale=0.4,\n",
    "        wedges_fontsize_scale=1,\n",
    "        add_group_item_perc_to_numbers_in_each_pie=False,\n",
    "\n",
    "        title_ha=\"center\",\n",
    "        mid_pie_circle_color=\"lightblue\",\n",
    "        tight_lyout=True,\n",
    "        subplots_adjust_top=0.9,\n",
    "        legend_loc=(0.1, 0.89),\n",
    "        legend=True, # because each class is annotated, \n",
    "        legend_ncol=4,\n",
    "        legend_fontsize_scale=4,\n",
    "\n",
    "        n_subplots_in_row=3\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Show examples of images from different clasees form different subsets in each dataset variant "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# find all newly created folders wiht subsets in dataset folder, \n",
    "for dataset_variant in DATASET_VARIANTS:\n",
    "    print(f\"\\n - {dataset_variant} - \\n\")\n",
    "    path = os.path.join(PATH_interim, f\"{DATASET_NAME}__{dataset_variant}\")\n",
    "    os.chdir(path)\n",
    "    \n",
    "    # find filenames with images\n",
    "    filenames = []\n",
    "    for file in glob.glob(f\"[test|valid|train]*\"):\n",
    "        filenames.append(file)   \n",
    "        \n",
    "    # reduce number if dataset to display (too much space)\n",
    "    filenames = filenames[0:1]    \n",
    "         \n",
    "    # create data generator with batch size for up to 1000 randomly selceted images\n",
    "    datagen = ImageDataGenerator()#rescale=1/255) \n",
    "    dataiter_dict = dict()\n",
    "    for sn in filenames:\n",
    "        # .. create proper iterator, that allowss loading all availble images, - here it will always load all files, \n",
    "        dataiter_dict[sn]  = datagen.flow_from_directory(\n",
    "                            os.path.join(path, sn), \n",
    "                            target_size=(200, 200),\n",
    "                            batch_size=200, #img_nr_in_one_subset, \n",
    "                            shuffle=True # done later on by my fucntion        \n",
    "        )    \n",
    "\n",
    "\n",
    "    # Plot two examples of each class from each dataset, \n",
    "    for setname in filenames:\n",
    "        display(plot_example_images_using_generator(\n",
    "            dataiter_dict[setname],\n",
    "            title=setname, \n",
    "            pixel_size=200, # only one value, as both heigh and width will be the same\n",
    "            class_n_examples=2)\n",
    "               )"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}