#81: Decoupling the dataset from the ML logic and a lot more
This MR aims to decouple the Dataset from the ML-related functionality. To achieve this, major refactoring and enhancements were necessary in several places:
- Complete revision of the `DataModule`
- Complete revision of the checkpoint logic in `CondAEModel` to store datamodule parameters (e.g., `batch_size`) or other extra parameters (e.g., fitting parameters such as the maximum number of epochs)
- Created a dependency between `DataModule` and `CondAEModel`: if the model was created from the `DataModule` (in particular with `CondAEModel.from_datamodule(...)`), the `DataModule` can be restored from the model, i.e., from the checkpoint, with its fitted transformations and normalisation and without the training data
- Complete revision of the per-data-block normalisation, as we need this normalisation to be picklable
- New `DataBlock` called `TransformableDataBlock` to keep track of transformed data objects and their dimensions. This should finally help to solve many of the problems we had with categorical variables.
- Added tests for the ML model and `DataModule`. These are far from complete, but better than having no tests.
- Adjusted the Semiramis example to the new workflow
- Other stuff: Removed a lot of unused state, fixed some minor bugs in the categorical encoder
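
To illustrate the checkpoint idea described above, here is a minimal sketch. It is not the actual `CondAEModel` or `DataModule` code: `MinMaxNormaliser`, `ToyDataModule`, `save_checkpoint` and `load_datamodule` are hypothetical names invented for this example, and the checkpoint is a plain pickled dict rather than a real model checkpoint. The sketch assumes the fitted normalisation can be reduced to plain data, which keeps it trivially picklable and lets the datamodule be rebuilt without the training data.

```python
import pickle
from dataclasses import asdict, dataclass, field


@dataclass
class MinMaxNormaliser:
    """Hypothetical per-block normaliser whose fitted state is plain data."""
    lo: float = 0.0
    hi: float = 1.0

    def fit(self, values):
        self.lo, self.hi = min(values), max(values)
        return self

    def transform(self, values):
        span = (self.hi - self.lo) or 1.0  # avoid dividing by zero for constant blocks
        return [(v - self.lo) / span for v in values]


@dataclass
class ToyDataModule:
    """Hypothetical stand-in for the DataModule: hyperparameters plus fitted state."""
    batch_size: int = 32
    normalisers: dict = field(default_factory=dict)

    def fit(self, blocks):
        # Fit one normaliser per named data block.
        for name, values in blocks.items():
            self.normalisers[name] = MinMaxNormaliser().fit(values)
        return self

    def state(self):
        # Everything needed to rebuild the module, excluding the training data.
        return {
            "batch_size": self.batch_size,
            "normalisers": {k: asdict(n) for k, n in self.normalisers.items()},
        }

    @classmethod
    def from_state(cls, state):
        normalisers = {k: MinMaxNormaliser(**s) for k, s in state["normalisers"].items()}
        return cls(batch_size=state["batch_size"], normalisers=normalisers)


def save_checkpoint(weights, datamodule, **extra):
    """Bundle weights, datamodule state and extra parameters into one payload."""
    return pickle.dumps({"weights": weights, "datamodule": datamodule.state(), "extra": extra})


def load_datamodule(checkpoint_bytes):
    """Restore the fitted datamodule from a checkpoint alone."""
    return ToyDataModule.from_state(pickle.loads(checkpoint_bytes)["datamodule"])


dm = ToyDataModule(batch_size=64).fit({"area": [10.0, 20.0, 30.0]})
ckpt = save_checkpoint(weights=[0.1, 0.2], datamodule=dm, max_epochs=100)
restored = load_datamodule(ckpt)
```

In this sketch `restored` carries the same `batch_size` and fitted normalisers as `dm`, although the training data was never stored in the checkpoint.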
Deferred
Some things are not tackled in this part of the MR. This includes adjustments to the sampler and the plotter, so they might be in a buggy state after the merge. In a second step, @sluis will take care of the sampler, while @alessandro.maissen revises the plotter. This MR already adds comments at locations where further revision is required.
Breaking Changes
- Many. Existing examples need to be updated; see the Semiramis example.
Edited by Alessandro Maissen