#81: Decoupling the dataset from the ML logic and a lot more

Alessandro Maissen requested to merge 81-decouple-ml-and-dataset into master

This MR aims to decouple the Dataset from the ML-related functionality. To achieve this, major refactoring and enhancements were necessary in several locations:

  • Complete revision of the DataModule
  • Complete revision of the checkpoint logic in CondAEModel so that a checkpoint also stores datamodule parameters (e.g., batch_size) and other extra parameters (e.g., fitting parameters such as the maximum number of epochs)
  • Created a dependency between DataModule and CondAEModel: if the model was created from the DataModule (in particular with CondAEModel.from_datamodule(...)), the DataModule can be restored from the model, i.e., from the checkpoint, with its fitted transformations and normalisation but without the training data
  • Complete revision of the per-data-block normalisation, as we need this normalisation to be picklable
  • New DataBlock called TransformableDataBlock to keep track of transformed data objects and their dimensions. This should finally help to solve many of the problems we had with categorical variables.
  • Added tests for the ML model and the DataModule. These are far from complete, but better than having no tests.
  • Adjusted the Semiramis example to the new workflow
  • Other stuff: Removed a lot of unused state, fixed some minor bugs in the categorical encoder
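
The checkpoint idea in the list above can be sketched roughly as follows. This is a minimal illustration only, not the code in this MR: `StandardNormalizer`, `save_checkpoint` and `restore_from_checkpoint` are hypothetical names, and the real CondAEModel and DataModule APIs may look quite different. The sketch assumes the fitted normalisation is kept as plain data, so it pickles cleanly and can be restored without the training data.

```python
import pickle


class StandardNormalizer:
    """Hypothetical per-block normaliser (not the class used in this MR)."""

    def __init__(self, mean=0.0, std=1.0):
        self.mean = mean
        self.std = std

    def fit(self, values):
        # Estimate mean and (population) standard deviation from training data.
        n = len(values)
        self.mean = sum(values) / n
        variance = sum((v - self.mean) ** 2 for v in values) / n
        self.std = variance ** 0.5 or 1.0  # avoid dividing by zero
        return self

    def transform(self, values):
        return [(v - self.mean) / self.std for v in values]

    def state(self):
        # Plain floats only, so the state is trivially picklable.
        return {"mean": self.mean, "std": self.std}


def save_checkpoint(weights, normalizers, datamodule_params, extra_params):
    """Bundle model weights with everything needed to rebuild the data side."""
    checkpoint = {
        "weights": weights,
        "normalizers": {name: n.state() for name, n in normalizers.items()},
        "datamodule_params": datamodule_params,  # e.g. batch_size
        "extra_params": extra_params,            # e.g. max number of epochs
    }
    return pickle.dumps(checkpoint)


def restore_from_checkpoint(blob):
    """Rebuild fitted normalisers and parameters; no training data required."""
    checkpoint = pickle.loads(blob)
    normalizers = {
        name: StandardNormalizer(**state)
        for name, state in checkpoint["normalizers"].items()
    }
    return normalizers, checkpoint["datamodule_params"], checkpoint["extra_params"]


# Usage: fit once, save, then restore the fitted state elsewhere.
fitted = {"x": StandardNormalizer().fit([1.0, 2.0, 3.0])}
blob = save_checkpoint({"w": [0.1]}, fitted, {"batch_size": 32}, {"max_epochs": 10})
restored, dm_params, extra = restore_from_checkpoint(blob)
```

Keeping the fitted state as plain values rather than lambdas or closures is one common way to stay picklable; whether the MR takes this route is not stated here.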

Deferred

Some things are not tackled in this part of the MR. This includes adjustments to the sampler and the plotter, so they might be in a buggy state after the merge. In a second step, @sluis will take care of the sampler, while @alessandro.maissen revises the plotter. This MR already adds comments at locations where further revision is required.

Breaking Changes

  • Many; examples need to be updated. See the Semiramis example for the new workflow.
Edited by Alessandro Maissen
