# Enhancing `DataBlock` and subclasses

This issue discusses problems with `DataBlock` and its subclasses and proposes some solutions.
`DataBlock` by default supports reading from two types of files, either a pickle file or a json file. Later, the formats `npz` and `png` were introduced for `DesignRepresentation`.

- `png` and `json` are not implemented at all; those cases are just `pass`'ed.
- Internally, the data is by default stored as a `pd.DataFrame`; however, there is a special case for `npz`, where it is saved as a `np.array`. This is problematic since, when using the `DataBlock.data()` getter or setter, one has to treat the two cases separately. It would be more user friendly to handle this under the hood, or to agree on one internal storage format.
## Example from `Dataset.load()`

```python
if self.design_rep[key].format_file == "npz":
    self.design_rep[key].data = np.concatenate(dr_mat, axis=0)[(vec_errors == 0)].astype(eval("np.{}".format(d_type)))
else:
    self.design_rep[key].data = pd.concat(dr_mat).iloc[(vec_errors == 0)].reset_index()
```

This is related to listing 3 of #50.
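If we agree on `pd.DataFrame` as the single internal format, the branching above could be hidden behind a small coercion helper. This is only a sketch; the helper name `to_dataframe` and its `columns` argument are assumptions, not existing API:

```python
import numpy as np
import pandas as pd

def to_dataframe(raw, columns=None):
    """Coerce loaded data to a DataFrame, whatever the source format.

    `raw` may be a np.ndarray (as produced by the npz branch) or already
    a pd.DataFrame (pickle/json); a DataFrame always comes out.
    """
    if isinstance(raw, np.ndarray):
        return pd.DataFrame(raw, columns=columns)
    return raw
```

With this in place, both branches of `Dataset.load()` would assign through the helper, and the `DataBlock.data()` getter/setter would never need to distinguish formats.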
- The `uid` problem: some subclasses of `DataBlock` expect a `uid` in the data (e.g., `PerformanceAttributes` and `DesignParameters`), while others do not (e.g., `InputML` and `OutputML`). This results in special treatment when accessing the data; a good example can be found in `Dataset._datamat_forML(...)` on line 1147. I am not sure whether `DataRepresentation` requires a `uid` or not. One idea to solve this would be to introduce a new subclass of `DataObject`, called `MetaData`, to store data that is not relevant for the model.
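A minimal sketch of the `MetaData` idea (the stand-in `DataObject` base and the helper `model_objs` are hypothetical): blocks that carry a `uid` wrap it in a `MetaData` object, and model-facing code filters uniformly instead of special-casing subclasses.

```python
class DataObject:  # minimal stand-in for the real base class
    def __init__(self, name: str):
        self.name = name

class MetaData(DataObject):
    """Carries data (e.g. the uid) that must round-trip with the block
    but is irrelevant for the model."""

def model_objs(dobj_list):
    # No per-subclass special cases: every block just skips its
    # MetaData entries, whether it has any or not.
    return [d for d in dobj_list if not isinstance(d, MetaData)]

print([d.name for d in model_objs([MetaData("uid"), DataObject("x")])])  # ['x']
```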
- We frequently want to retrieve a subset of the `DataObject`s, with the corresponding data matrix, from a single or multiple `DataBlock`s, e.g. in `plotter.py:264-376`, in particular the methods `Plotter._df_mat_aux(...)`, `Plotter._get_blocks(...)`, `Plotter._return_data(...)` and `Plotter._get_names`. These are a lot of helper methods that exist only to get the right data for plotting. Similar logic can also be found in `Dataset` when loading the `InputML` and `OutputML` data (some issues are tackled in Draft-MR30, #54). So I suggest adding a method like
```python
def data_objs(self, names: List[str] = None) -> List[DataObject]:
    if names is None:
        return self.dobj_list
    else:
        return [dobj for dobj in self.dobj_list if dobj.name in names]
```
for getting the desired objects. For getting the data matrix we could add a method of the form
```python
def data_mat(self, names: List[str] = None):
    data_objs = self.data_objs(names)
    data = np.asarray(self.data)[:, 1:]  # remove the uid column
    # Slice column-wise: each DataObject occupies dim columns
    # starting at position_index.
    return [data[:, dobj.position_index : dobj.position_index + dobj.dim] for dobj in data_objs]
```
Implementing the above two methods will be much easier once the potential issues with the storage format and the `uid` column are solved.
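To sanity-check the two proposed methods end-to-end, here is a self-contained sketch with minimal stand-ins for `DataObject` and `DataBlock`. It assumes that `position_index` and `dim` count columns after the `uid` column is dropped; the example data is made up:

```python
from dataclasses import dataclass
from typing import List, Optional

import numpy as np

@dataclass
class DataObject:        # minimal stand-in for the real class
    name: str
    position_index: int  # first column of this object (uid excluded)
    dim: int             # number of columns it occupies

class DataBlock:         # minimal stand-in with the proposed methods
    def __init__(self, dobj_list, data):
        self.dobj_list = dobj_list
        self.data = data  # rows are samples; column 0 is the uid

    def data_objs(self, names: Optional[List[str]] = None) -> List[DataObject]:
        if names is None:
            return self.dobj_list
        return [dobj for dobj in self.dobj_list if dobj.name in names]

    def data_mat(self, names: Optional[List[str]] = None):
        data = np.asarray(self.data)[:, 1:]  # drop the uid column
        return [data[:, d.position_index : d.position_index + d.dim]
                for d in self.data_objs(names)]

# "a" spans one column, "b" spans two.
block = DataBlock(
    [DataObject("a", 0, 1), DataObject("b", 1, 2)],
    [[101, 1.0, 2.0, 3.0],
     [102, 4.0, 5.0, 6.0]],
)
(mat_b,) = block.data_mat(["b"])
print(mat_b.tolist())  # [[2.0, 3.0], [5.0, 6.0]]
```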
- `name` vs. `dblock_name`: two ways of naming the `DataBlock`. IMO one should be enough; two are confusing. If `dblock_name` is only used for printing, one could either just use `self.__class__.__name__` or implement `def __repr__(self)` if we want something more complicated.
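For illustration, a `__repr__` along these lines would make `dblock_name` redundant for printing (a sketch; the constructor signature is an assumption):

```python
class DataBlock:  # minimal stand-in for the real class
    def __init__(self, name: str):
        self.name = name

    def __repr__(self) -> str:
        # Subclasses inherit this for free: __class__.__name__ resolves
        # to e.g. 'DesignParameters' without any dblock_name attribute.
        return f"{self.__class__.__name__}(name={self.name!r})"

print(repr(DataBlock("design")))  # DataBlock(name='design')
```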
- There are two kinds of possible normalizations: per-`DataBlock` normalization and per-`DataObject` normalization. The former is defined by providing an instance of the class `Normalization`, while the latter is defined by a string or an object with a `__call__` function. I wonder whether per-`DataObject` normalization alone would be enough: briefly looking into the per-`DataBlock` use case for the semiramis, normalization is applied per `DataObject` there as well. Also see the discussion in #49 (closed).
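A sketch of how per-`DataObject` normalization alone could cover both cases (the registry, the spec strings, and the function names are assumptions): a per-`DataBlock` normalization is simply the same spec repeated for every column.

```python
import numpy as np

# Hypothetical registry of named normalizations.
NORMALIZATIONS = {
    "minmax": lambda col: (col - col.min()) / (col.max() - col.min()),
    "zscore": lambda col: (col - col.mean()) / col.std(),
}

def resolve_norm(spec):
    # A spec is either a string key or any object with __call__.
    return spec if callable(spec) else NORMALIZATIONS[spec]

def normalize_block(mat, specs):
    """Apply one normalization per column (i.e. per DataObject)."""
    mat = np.asarray(mat, dtype=float)
    return np.column_stack(
        [resolve_norm(s)(mat[:, i]) for i, s in enumerate(specs)]
    )

out = normalize_block([[0, 10], [2, 20], [4, 30]], ["minmax", "minmax"])
print(out.tolist())  # [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
```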
- Changing the names under the hood results in code duplication: there are around 20 occurrences of `str.replace(" ", "_")` and `str.replace("_", " ")`.
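The duplicated calls could be centralized in two tiny helpers (the names `to_internal`/`to_display` are hypothetical), so the mapping lives in one place:

```python
def to_internal(name: str) -> str:
    """Display name -> internal key, e.g. 'design params' -> 'design_params'."""
    return name.replace(" ", "_")

def to_display(key: str) -> str:
    """Internal key -> display name, e.g. 'design_params' -> 'design params'."""
    return key.replace("_", " ")

print(to_internal("design params"))  # design_params
print(to_display("design_params"))   # design params
```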
- The assignment `self.dobj_list_orig = dobj_list` in the constructor is confusing. See the discussion in #38.
- Are the methods `pop(...)`, `append(...)`, `add(...)` and `update_obj(...)` deprecated?
- To implement or to remove: `_apply_transf(...)` and `_check_consistency(...)`.