Data reformatters
To make data handling more natural for the end user, I want the inputs and outputs of some methods to support formats that are more human-readable and closer to the definitions of the data objects. For example:
- inputs & outputs of the custom callback functions (e.g. analyser callback)
- outputs of the generator

Currently these require or return either a flattened dataframe or an np.array.
We would add two formats:
- pandas dataframe whose columns correspond exactly to the data objects (e.g. the column name matches the data object name, and a cell contains a list of values if dim > 1)
- dictionary, one for each sample, with data object names as keys. [!] Requires that object names are unique! Preferably used only for small numbers of samples, e.g. as the output of a generator.
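To make the difference concrete, here is a minimal sketch of the same two samples in the current flattened layout and in the two proposed formats. The data objects ("height" with dim 1, "position" with dim 3) and the `_0`, `_1`, ... column suffixes are hypothetical and only serve the illustration; the actual naming scheme may differ.

```python
import pandas as pd

# Hypothetical data objects: "height" (dim 1) and "position" (dim 3).
# Flattened layout, as currently required/returned: one value per cell.
# The "_<i>" suffixes are an assumed naming scheme.
df_flat = pd.DataFrame(
    {
        "height": [2.5, 3.0],
        "position_0": [0.0, 1.0],
        "position_1": [0.5, 1.5],
        "position_2": [1.0, 2.0],
    }
)

# Proposed format: one column per data object; a cell holds a list if dim > 1.
df_lists = pd.DataFrame(
    {
        "height": [2.5, 3.0],
        "position": [[0.0, 0.5, 1.0], [1.0, 1.5, 2.0]],
    }
)

# Proposed format: one dictionary per sample, keyed by data object name.
samples = df_lists.to_dict(orient="records")
```

The dictionary form only works if the object names are unique, since each name becomes a key.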
We would also allow the user to specify the input/output format with a flag argument (dataformat) where applicable.
Example:
AnalysisCallback('Analysis function', func_callback=[analysis_pipeline], dataset=dataset, dataformat='df')
With this, the user is free to decide which input/output format is easiest for them to handle when writing the callback function.
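As a rough illustration of why the choice matters, the sketch below writes the same analysis twice: once against a flattened array, and once against one dictionary per sample. Everything here is hypothetical (the data objects "width" with dim 1 and "load" with dim 2, the "score" output, and both function names); it does not reflect the actual callback signature.

```python
from __future__ import annotations

import numpy as np


def analysis_flat(data: np.ndarray) -> np.ndarray:
    """Flattened input: the author must know which columns belong to which object."""
    width = data[:, 0]   # "width" occupies column 0
    load = data[:, 1:3]  # "load" occupies columns 1 and 2
    return width * load.sum(axis=1)


def analysis_dicts(samples: list[dict]) -> list[dict]:
    """Dictionary input: values are addressed by data object name."""
    return [{"score": s["width"] * sum(s["load"])} for s in samples]


# The same two samples in both layouts.
flat = np.array([[2.0, 1.0, 3.0], [0.5, 4.0, 6.0]])
dicts = [
    {"width": 2.0, "load": [1.0, 3.0]},
    {"width": 0.5, "load": [4.0, 6.0]},
]

scores_flat = analysis_flat(flat)
scores_dicts = analysis_dicts(dicts)
```

The array version is vectorised and likely faster on large batches, while the dictionary version is easier to read and less error-prone for a handful of samples.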
To ease all conversions, we'll add some helper functions that convert data between the formats. See sketch in this notebook. The formats to convert from and to:
1. nested list
2. dataframe flattened (one value per cell only)
3. dataframe with lists (a cell can contain a list of values if a data object has a dimension > 1)
4. list of dictionaries
In the intended usage, each row (formats 2 and 3) or item in the main list (formats 1 and 4) corresponds to a sample. Columns (2, 3), sublists (1) or dictionaries (4) contain the values of the data objects. Depending on where it's used, the data involved may be design parameters, performance attributes, requested attributes, input ML, output ML or any cherry-picked subset of data objects.
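The conversions between the four formats could look roughly like the sketch below. It is not the notebook's implementation: all function names are hypothetical, `dims` is an assumed mapping from data object name to dimension (in column order), flattened columns are assumed to be named `<object>_<i>`, and the nested list is assumed to hold one sublist per sample with one entry per data object. It also assumes at least one sample.

```python
import pandas as pd


def flat_to_lists(df_flat, dims):
    """Flattened dataframe -> dataframe with lists, grouping consecutive columns."""
    out, start = {}, 0
    for name, dim in dims.items():
        block = df_flat.iloc[:, start:start + dim]
        # Scalars stay scalars; objects with dim > 1 become one list per cell.
        out[name] = block.iloc[:, 0].tolist() if dim == 1 else block.values.tolist()
        start += dim
    return pd.DataFrame(out, index=df_flat.index)


def lists_to_flat(df_lists):
    """Dataframe with lists -> flattened dataframe with '<object>_<i>' columns."""
    out = {}
    for name in df_lists.columns:
        first = df_lists[name].iloc[0]
        if isinstance(first, (list, tuple)):
            for i in range(len(first)):
                out[f"{name}_{i}"] = [cell[i] for cell in df_lists[name]]
        else:
            out[name] = df_lists[name].tolist()
    return pd.DataFrame(out, index=df_lists.index)


def lists_to_dicts(df_lists):
    """Dataframe with lists -> list of dictionaries (one per sample)."""
    return df_lists.to_dict(orient="records")


def dicts_to_nested(samples, names):
    """List of dictionaries -> nested list, with objects ordered as in `names`."""
    return [[s[n] for n in names] for s in samples]


def nested_to_dicts(nested, names):
    """Nested list -> list of dictionaries, pairing each entry with its name."""
    return [dict(zip(names, row)) for row in nested]


# Usage with two hypothetical data objects.
dims = {"height": 1, "position": 3}
df_flat = pd.DataFrame(
    [[2.5, 0.0, 0.5, 1.0], [3.0, 1.0, 1.5, 2.0]],
    columns=["height", "position_0", "position_1", "position_2"],
)
df_lists = flat_to_lists(df_flat, dims)
samples = lists_to_dicts(df_lists)
nested = dicts_to_nested(samples, list(dims))
```

Routing every conversion through the dataframe with lists or the list of dictionaries keeps the number of helpers small: any pair of formats can be reached by chaining at most a few of these functions.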