* __Alternative 2.__ Create the file structure and copy/paste the contents of the src and notebook folders into the corresponding directories.
* For this solution, you need to create the file structure for storing scripts, notebooks, input data, etc. for the FastClassAI pipeline. If necessary, you may modify basedir manually in each notebook.
* Follow the instructions below.
## Step 1. Create a basedir directory for your project, e.g. myproject/
* Then navigate to that directory and follow the instructions below.
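
If you prefer to do Step 1 from Python rather than from a terminal, a minimal sketch could look like the one below. The helper name `create_basedir` is my own illustration and is not part of FastClassAI; the demo runs in a temporary directory so it does not touch your real workspace.

```python
import os
import tempfile
from pathlib import Path


def create_basedir(parent_dir, project_name="myproject"):
    """Create <parent_dir>/<project_name> if it is missing and return its path.

    Hypothetical helper, shown only to illustrate Step 1.
    """
    basedir = Path(parent_dir) / project_name
    basedir.mkdir(parents=True, exist_ok=True)
    return basedir


# Demo in a temporary directory; replace `tmp` with your own parent folder.
original_cwd = Path.cwd()
with tempfile.TemporaryDirectory() as tmp:
    basedir = create_basedir(tmp)
    os.chdir(basedir)                # "navigate to that directory"
    current_name = Path.cwd().name   # name of the directory we are now in
    os.chdir(original_cwd)           # step back before the temp dir is removed

print(current_name)
```

In a real project you would call `create_basedir` once with your own parent folder and keep working inside the returned directory.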
%% Cell type:markdown id: tags:
## Step 2. Set up the FastClassAI directory structure in basedir
%% Cell type:code id: tags:
``` python
# imports
import os  # allows changing and navigating files and folders
import sys
import re  # module for regular expressions
import glob  # lists names in folders that match Unix shell patterns

# basedir
basedir = os.path.dirname(os.getcwd())
os.chdir(basedir)
sys.path.append(basedir)
print(basedir)  # should be ../myproject/

# create folders holding different types of data for notebooks
files_to_create={
"for whatever I dont use but wish to keep":os.path.join(basedir,"bin"),
"for random notes and materials created on project development":os.path.join(basedir,"notes"),
* At the time of this project's development, the HAM10000 dataset had only one version, published in 2018.
* __Related Publications__
* Tschandl, P., Rosendahl, C. & Kittler, H. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Sci. Data 5, 180161 (2018). doi: 10.1038/sdata.2018.161 https://www.nature.com/articles/sdata2018161
* data (images and metadata) can be found at the bottom of the source site, in the download section
* The Harvard site contains 6 files that can be downloaded. The following three are required for this project:
* HAM10000_images_part_1.zip
* HAM10000_images_part_2.zip
* HAM10000_metadata.tab
* Unpack the files and store them all in basedir/data/raw
%% Cell type:markdown id: tags:
## Step 2. __Download Tf-hub Models used for feature extraction__
### NOTES
* To work more reliably, I downloaded several pretrained models for feature extraction from images from tf-hub.
* My function, in section __Data Preparation__, can also use URLs; however, this may be problematic with a slow internet connection or with repeated feature extractions performed on different data subsets (timeouts occur frequently in these cases).
* Important: the function that I implemented in section __Data Preparation__ for feature extraction accepts models constructed with TF1 and TF2.
### __Module Description__
* __Module name used in the project__
* BiT_M # working name resnet,
* __Full Module Name__
* bit_m-r101x1_1
* __url__
* https://tfhub.dev/google/bit/m-r101x1/1
* __Info__
* __Input Image size__
* (?, 224, 224, 3)
* __Output Feature Number__
* (?, 2048)
* __Short Description__
* Big Transfer (BiT) is a recipe for pre-training image classification models on large supervised datasets and efficiently fine-tuning them on any given target task. The recipe achieves excellent performance on a wide variety of tasks, even when using very few labeled examples from the target dataset.
* This module implements the R101x1 architecture (ResNet-101), trained to perform multi-label classification on ImageNet-21k, a dataset with 14 million images labeled with 21,843 classes. Its outputs are the 2048-dimensional feature vectors, before the multi-label classification head. This model can be used as a feature extractor or for fine-tuning on a new target task.
images = ... # A batch of images with shape [batch_size, height, width, 3].
features = module(images) # Features with shape [batch_size, 2048].
%% Cell type:markdown id: tags:
---
# PART 3. Prepare Config Files - examples below
---
The goal of this part is to define dataset names, dataset variant names, which TF Hub models you use, the colors you assign to each class in a project, etc.
* there are 4 basic config files that must be prepared
* __tfhub_configs.py__
* file that contains info on tf hub modules used for feature extraction
* __project_configs.py__
* basic description of the dataset
* __dataset_configs.py__
* contains dictionaries used to label images in each class, provide colors etc...
* and select classes for statistics
* __config_functions.py__
* .py file with special functions used to select files for data processing and module training,
* additionally, there is a config file that contains model parameters used when training various AI models
* this will be described later on
## Notes
* config files with CLASS_COLORS and CLASS_DESCRIPTION were prepared based on:
* Links from: https://dermoscopedia.org
* Tschandl, P., Rosendahl, C. & Kittler, H. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Sci. Data 5, 180161 (2018). doi: 10.1038/sdata.2018.161 https://www.nature.com/articles/sdata2018161
%% Cell type:markdown id: tags:
### Step 1. Prepare tfhub_configs.py
* this config file contains one dictionary, TFHUB_MODELS
* it is used for extracting features from images using downloaded TF Hub modules
* each module has a unique name and a working name that may be more descriptive and is used on plots
* the modules can be downloaded from tf-hub and stored in basedir/models, or you may add "module_url" to each dictionary, which is also accepted by the FastClassAI function
"note":"tested on swissroads dataset, where it worked very well"
}
}# end
```
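
The choice between a module stored in basedir/models and a "module_url" can be sketched as below. This is only an illustration under my own assumptions: the helper `resolve_module_source` and the key "module_dirname" are hypothetical and are not taken from FastClassAI; only "module_url" and the basedir/models location come from the notes above.

```python
import os


def resolve_module_source(model_config, basedir):
    """Return the local module folder if it exists, otherwise the module URL.

    `model_config` stands for one entry of TFHUB_MODELS. The key
    "module_dirname" is an assumption made for this sketch.
    """
    local_path = os.path.join(basedir, "models", model_config["module_dirname"])
    if os.path.isdir(local_path):
        return local_path  # downloaded copy: no network access needed
    url = model_config.get("module_url")
    if url is None:
        raise FileNotFoundError(
            f"No local module at {local_path} and no 'module_url' configured"
        )
    return url


# Hypothetical entry, using the module described earlier in this notebook.
example_config = {
    "module_dirname": "bit_m-r101x1_1",
    "module_url": "https://tfhub.dev/google/bit/m-r101x1/1",
}
```

Preferring the local copy keeps repeated feature extraction independent of the internet connection, which is the reason I downloaded the modules in the first place.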
%% Cell type:markdown id: tags:
### Create project_configs.py
* two variables are the most important:
* PROJECT_NAME : a string with a fixed project name that will be used in the project
* CLASS_DESCRIPTION : contains a description of each class in the original data, plus extra information such as links to external data sources and a class_description (created manually) that may be very useful later in the project, while evaluating the results or in EDA
# Purpose: information on each class, used for creating new class arrangements and for providing info on each class
# Location: project_configs.py
#
# "key" : str, class name used in the original dataset downloaded from the database
# "original_name" : str, same as the key, but you can introduce other values if necessary
# "class_full_name" : str, class name used on images, saved data etc. (more descriptive than class names, or sometimes the same, depending on the situation)
# "class_group" : str, group of classes, if the classes are hierarchical
# "class_description" : str, used as notes, or for a class description available to the user/client
# "links" : list, with links to more data on each class
CLASS_DESCRIPTION={
'akiec':{
"original_name":'akiec',
"class_full_name":"squamous_cell_carcinoma",# previously called "Actinic_keratoses" in my dataset, but this name is easier to find in online resources; both names are correct
"class_group":"Tumour_Benign",
"class_description":"Class that contains two subclasses:(A) Actinic_Keratoses or (B) Bowen’s disease. Actinic Keratoses (Solar Keratoses) and Intraepithelial Carcinoma (Bowen’s disease) are common non-invasive, variants of squamous cell carcinoma that can be treated locally without surgery. These lesions may progress to invasive squamous cell carcinoma – which is usually not pigmented. Both neoplasms commonly show surface scaling and commonly are devoid of pigment, Actinic keratoses are more common on the face and Bowen’s disease is more common on other body sites. Because both types are induced by UV-light the surrounding skin is usually typified by severe sun damaged except in cases of Bowen’s disease that are caused by human papilloma virus infection and not by UV. Pigmented variants exist for Bowen’s disease and for actinic keratoses",
* this is the config file with the largest number of variables,
* it contains information on
* DROPOUT_VALUE : a keyword/value that can be introduced into batch labels and will be recognised by the FastClassAI function, so that images labelled with it are not used for model training, e.g. to undersample one or more classes, or to exclude images from some classes in model training
* CLASS_COLORS
* a dictionary with colors assigned to original class labels,
* key: original class label, value: color (any name accepted by Matplotlib)
* CLASS_COLORS_zorder
* because some classes can be merged to build larger classes in different dataset variants, I created this variable to assign the proper color to a class that emerges from joining two or more classes,
* e.g. if we join class 1: yellow (zorder=1) and class 2: blue (zorder=100), the new class will have the blue color,
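
The zorder rule and the DROPOUT_VALUE filter described above can be sketched as follows. This is a minimal illustration, not the FastClassAI implementation: the helper names, the placeholder value of DROPOUT_VALUE, and the assumption that CLASS_COLORS_zorder is a dictionary mapping each original class label to an integer are all mine.

```python
# Assumed shapes of the config variables (illustrative values only).
DROPOUT_VALUE = "to_dropout"  # hypothetical placeholder, not the real value
CLASS_COLORS = {"class_1": "yellow", "class_2": "blue"}
CLASS_COLORS_zorder = {"class_1": 1, "class_2": 100}


def merged_class_color(original_classes, class_colors, class_colors_zorder):
    """Return the color of the member class with the highest zorder."""
    top_class = max(original_classes, key=lambda name: class_colors_zorder[name])
    return class_colors[top_class]


def keep_for_training(filenames, labels, dropout_value=DROPOUT_VALUE):
    """Drop every (filename, label) pair whose label equals dropout_value."""
    return [(f, lab) for f, lab in zip(filenames, labels) if lab != dropout_value]


# Joining class_1 (zorder=1) and class_2 (zorder=100) gives the blue color.
print(merged_class_color(["class_1", "class_2"], CLASS_COLORS, CLASS_COLORS_zorder))

# Images relabelled with DROPOUT_VALUE are excluded from the training list.
print(keep_for_training(["a.jpg", "b.jpg"], ["class_1", DROPOUT_VALUE]))
```

Using a single numeric priority per original class means the merged color does not depend on the order in which classes are listed in a dataset variant.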