# Enhancing `DataBlock` and subclasses

This issue discusses problems with `DataBlock` and its subclasses and proposes some solutions.
`DataBlock` by default supports reading from two types of files, either a pickle file or a json file. Later, the formats `npz` and `png` were introduced for `DesignRepresentation`.

- `png` and `json` are not implemented at all; those cases are just `pass`'ed.
- Internally, the data is by default stored as a `pd.DataFrame`; however, there is a special case for `npz`, where it is saved as a `np.array`. This is problematic since, when using the `DataBlock.data()` getter or setter, one has to treat the two cases separately. It would be more user friendly to handle this under the hood, or to agree on one internal storage format.
## Example from `Dataset.load()`

```python
if self.design_rep[key].format_file == "npz":
    self.design_rep[key].data = np.concatenate(dr_mat, axis=0)[(vec_errors == 0)].astype(eval("np.{}".format(d_type)))
else:
    self.design_rep[key].data = pd.concat(dr_mat).iloc[(vec_errors == 0)].reset_index()
```

This is related to listing 3 of #50.
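If we agree on `pd.DataFrame` as the single internal format, the branching above could be hidden behind a small coercion helper. This is only a sketch; the helper name `to_dataframe` and its `columns` argument are assumptions, not existing API:

```python
import numpy as np
import pandas as pd

def to_dataframe(raw, columns=None):
    """Coerce loaded data to a DataFrame, whatever the source format.

    `raw` may be a np.ndarray (as produced by the npz branch) or already
    a pd.DataFrame (pickle/json); a DataFrame always comes out.
    """
    if isinstance(raw, np.ndarray):
        return pd.DataFrame(raw, columns=columns)
    return raw
```

With this in place, both branches of `Dataset.load()` would assign through the helper, and the `DataBlock.data()` getter/setter would never need to distinguish formats.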
- The `uid` problem: some subclasses of `DataBlock` expect a `uid` in the data (e.g., `PerformanceAttributes` and `DesignParameters`), while others do not (e.g., `InputML` and `OutputML`). This results in special treatment when accessing the data; a good example can be found in `Dataset._datamat_forML(...)` on line 1147. I am not sure whether `DataRepresentation` requires a `uid` or not. One idea to solve this would be to introduce a new subclass of `DataObject`, called `MetaData`, to store data that is not relevant for the model.
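A minimal sketch of the `MetaData` idea (the stand-in `DataObject` base and the helper `model_objs` are hypothetical): blocks that carry a `uid` wrap it in a `MetaData` object, and model-facing code filters uniformly instead of special-casing subclasses.

```python
class DataObject:  # minimal stand-in for the real base class
    def __init__(self, name: str):
        self.name = name

class MetaData(DataObject):
    """Carries data (e.g. the uid) that must round-trip with the block
    but is irrelevant for the model."""

def model_objs(dobj_list):
    # No per-subclass special cases: every block just skips its
    # MetaData entries, whether it has any or not.
    return [d for d in dobj_list if not isinstance(d, MetaData)]

print([d.name for d in model_objs([MetaData("uid"), DataObject("x")])])  # ['x']
```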
- We frequently want to retrieve a subset of the `DataObject`s, with the corresponding data matrix, from a single or multiple `DataBlock`s, e.g. in `plotter.py:264-376`, in particular the methods `Plotter._df_mat_aux(...)`, `Plotter._get_blocks(...)`, `Plotter._return_data(...)` and `Plotter._get_names`. These are a lot of helper methods that exist only to get the right data for plotting. Similar logic can also be found in `Dataset` when loading the `InputML` and `OutputML` data (some issues are tackled in Draft-MR30, #54). So I suggest adding a method like
```python
def data_objs(self, names: List[str] = None) -> List[DataObject]:
    if names is None:
        return self.dobj_list
    else:
        return [dobj for dobj in self.dobj_list if dobj.name in names]
```
for getting the desired objects. For getting the data matrix we could add a method of the form
```python
def data_mat(self, names: List[str] = None):
    data_objs = self.data_objs(names)
    data = np.asarray(self.data)[:, 1:]  # remove the uid column
    # Slice column-wise: each DataObject occupies dim columns
    # starting at position_index.
    return [data[:, dobj.position_index : dobj.position_index + dobj.dim] for dobj in data_objs]
```
Implementing the above two methods will be much easier once the potential issues with the storage format and the `uid` column are solved.
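To sanity-check the two proposed methods end-to-end, here is a self-contained sketch with minimal stand-ins for `DataObject` and `DataBlock`. It assumes that `position_index` and `dim` count columns after the `uid` column is dropped; the example data is made up:

```python
from dataclasses import dataclass
from typing import List, Optional

import numpy as np

@dataclass
class DataObject:        # minimal stand-in for the real class
    name: str
    position_index: int  # first column of this object (uid excluded)
    dim: int             # number of columns it occupies

class DataBlock:         # minimal stand-in with the proposed methods
    def __init__(self, dobj_list, data):
        self.dobj_list = dobj_list
        self.data = data  # rows are samples; column 0 is the uid

    def data_objs(self, names: Optional[List[str]] = None) -> List[DataObject]:
        if names is None:
            return self.dobj_list
        return [dobj for dobj in self.dobj_list if dobj.name in names]

    def data_mat(self, names: Optional[List[str]] = None):
        data = np.asarray(self.data)[:, 1:]  # drop the uid column
        return [data[:, d.position_index : d.position_index + d.dim]
                for d in self.data_objs(names)]

# "a" spans one column, "b" spans two.
block = DataBlock(
    [DataObject("a", 0, 1), DataObject("b", 1, 2)],
    [[101, 1.0, 2.0, 3.0],
     [102, 4.0, 5.0, 6.0]],
)
(mat_b,) = block.data_mat(["b"])
print(mat_b.tolist())  # [[2.0, 3.0], [5.0, 6.0]]
```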
- `name` vs. `dblock_name`: two ways of naming the `DataBlock`. IMO one should be enough; two are confusing. If `dblock_name` is only used for printing, one could either just use `self.__class__.__name__` or implement `def __repr__(self)` if we want something more complicated.
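For illustration, a `__repr__` along these lines would make `dblock_name` redundant for printing (a sketch; the constructor signature is an assumption):

```python
class DataBlock:  # minimal stand-in for the real class
    def __init__(self, name: str):
        self.name = name

    def __repr__(self) -> str:
        # Subclasses inherit this for free: __class__.__name__ resolves
        # to e.g. 'DesignParameters' without any dblock_name attribute.
        return f"{self.__class__.__name__}(name={self.name!r})"

print(repr(DataBlock("design")))  # DataBlock(name='design')
```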
- There are two kinds of possible normalizations: per-`DataBlock` normalization and per-`DataObject` normalization. The former is defined by providing an instance of the class `Normalization`, while the latter is defined by a string or an object with a `__call__` function. I wonder whether per-`DataObject` normalization alone would be enough: briefly looking into the per-`DataBlock` use case for the semiramis, normalization is applied per `DataObject` there as well. Also see the discussion in #49 (closed).
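A sketch of how per-`DataObject` normalization alone could cover both cases (the registry, the spec strings, and the function names are assumptions): a per-`DataBlock` normalization is simply the same spec repeated for every column.

```python
import numpy as np

# Hypothetical registry of named normalizations.
NORMALIZATIONS = {
    "minmax": lambda col: (col - col.min()) / (col.max() - col.min()),
    "zscore": lambda col: (col - col.mean()) / col.std(),
}

def resolve_norm(spec):
    # A spec is either a string key or any object with __call__.
    return spec if callable(spec) else NORMALIZATIONS[spec]

def normalize_block(mat, specs):
    """Apply one normalization per column (i.e. per DataObject)."""
    mat = np.asarray(mat, dtype=float)
    return np.column_stack(
        [resolve_norm(s)(mat[:, i]) for i, s in enumerate(specs)]
    )

out = normalize_block([[0, 10], [2, 20], [4, 30]], ["minmax", "minmax"])
print(out.tolist())  # [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
```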
- Changing the names under the hood results in code duplication: there are around 20 occurrences of `str.replace(" ", "_")` and `str.replace("_", " ")`.
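The duplicated calls could be centralized in two tiny helpers (the names `to_internal`/`to_display` are hypothetical), so the mapping lives in one place:

```python
def to_internal(name: str) -> str:
    """Display name -> internal key, e.g. 'design params' -> 'design_params'."""
    return name.replace(" ", "_")

def to_display(key: str) -> str:
    """Internal key -> display name, e.g. 'design_params' -> 'design params'."""
    return key.replace("_", " ")

print(to_internal("design params"))  # design_params
print(to_display("design_params"))   # design params
```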
- The assignment `self.dobj_list_orig = dobj_list` in the constructor is confusing. See the discussion in #38.
- Are the methods `pop(...)`, `append(...)`, `add(...)` and `update_obj(...)` deprecated?
- To implement or to remove: `_apply_transf(...)` and `_check_consistency(...)`.