Commit fccd8bbd authored by Chandrasekhar Ramakrishnan

refactor: initialize converters with the atlas

parent 8877d3c1
2 merge requests: !107 US Census, !103 standardize-data
%% Cell type:code id: tags:
``` python
%load_ext autoreload
%autoreload 2
```
%% Cell type:markdown id: tags:
## Standardizing the various Covid-19 datasets
This notebook demonstrates the value of pulling data from various datasets together in one place. A lot of information gets lost when numbers are compared across entities that are too large. For example, we have excellent data available for Italy broken down by region (and even province), and data for Switzerland per canton. These datasets, however, each have their own schemas and peculiarities, so some upfront work is needed to treat them uniformly.
We have implemented a set of "converters" to standardize the various datasets to a common subset of useful fields. Each converter knows the details of its dataset and produces a view that is homogenized with the others. This lets us visualize data of very different origins with a few simple commands.
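As a rough illustration of what "homogenizing" means here, the sketch below renames source-specific columns to a shared schema and keeps only the common fields. It is not the project's converter code: the `homogenize` helper, the source column names, and the numbers are all made up for the example; only the target column names come from the `common_columns` list in this repository.

``` python
import pandas as pd

# Subset of the common schema used by the converters in this repository.
COMMON_COLUMNS = ["date", "country", "region_label", "positive", "deceased"]


def homogenize(df, conversion_dict, country):
    """Hypothetical helper: rename columns to the common schema and drop the rest."""
    out = df.rename(columns=conversion_dict)
    out["country"] = country
    out["date"] = pd.to_datetime(out["date"])
    # reindex keeps only the common columns, in a fixed order.
    return out.reindex(columns=COMMON_COLUMNS)


# Two toy sources with different (invented) schemas and invented values.
source_a = pd.DataFrame(
    {"day": ["2020-03-20"], "canton": ["Ticino"], "cases": [10],
     "deaths": [1], "hospitalized": [3]}
)
source_b = pd.DataFrame(
    {"fecha": ["2020-03-20"], "comunidad": ["Madrid"], "casos": [20],
     "fallecidos": [2]}
)

df_all = pd.concat(
    [
        homogenize(
            source_a,
            {"day": "date", "canton": "region_label",
             "cases": "positive", "deaths": "deceased"},
            "CHE",
        ),
        homogenize(
            source_b,
            {"fecha": "date", "comunidad": "region_label",
             "casos": "positive", "fallecidos": "deceased"},
            "ESP",
        ),
    ]
).reset_index(drop=True)
print(df_all)
```

Once every source is reduced to the same columns, a single `pd.concat` is enough to compare regions from different countries in one chart.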
%% Cell type:code id: tags:
``` python
from pathlib import Path
import altair as alt
import pandas as pd
from covid_19_dashboard import helper, plotting
from covid_19_dashboard.converters import CaseConverter
from covid_19_dashboard.converters.switzerland import OpenZHCaseConverter
from covid_19_dashboard.converters.covidtracking import CovidtrackingCaseConverter
from covid_19_dashboard.converters.spain import SpainCaseConverter
```
%% Cell type:code id: tags:
``` python
converter = CaseConverter("../../data/atlas/")
```
%% Cell type:code id: tags:
``` python
df_list = []
for path in [
    '../../data/openzh-covid-19',
    '../../data/covid-19-italy',
    '../../data/covidtracking/',
    '../../data/covid-19-spain'
]:
-    df_list.append(CaseConverter.read_convert(path))
+    df_list.append(converter.read_convert(path))
df_all = pd.concat(df_list).reset_index(drop=True)
df_all['date'] = pd.to_datetime(df_all.date)
```
%% Output
using: <class 'covid_19_dashboard.converters.covidtracking.CovidtrackingCaseConverter'>
using: <class 'covid_19_dashboard.converters.italy.ItalyCaseConverter'>
using: <class 'covid_19_dashboard.converters.spain.SpainCaseConverter'>
using: <class 'covid_19_dashboard.converters.switzerland.OpenZHCaseConverter'>
using: <class 'covid_19_dashboard.converters.covidtracking.CovidtrackingCaseConverter'>
using: <class 'covid_19_dashboard.converters.italy.ItalyCaseConverter'>
using: <class 'covid_19_dashboard.converters.covidtracking.CovidtrackingCaseConverter'>
using: <class 'covid_19_dashboard.converters.covidtracking.CovidtrackingCaseConverter'>
using: <class 'covid_19_dashboard.converters.italy.ItalyCaseConverter'>
using: <class 'covid_19_dashboard.converters.spain.SpainCaseConverter'>
%% Cell type:code id: tags:
``` python
df_esp = SpainCaseConverter.read_data('../../data/covid-19-spain')
```
%% Cell type:code id: tags:
``` python
SpainCaseConverter.convert(df_esp)
```
%% Output
date country region_iso region_label tested positive deceased \
0 2020-02-27 ESP ES-AN Andalucía None 1 NaN
1 2020-02-28 ESP ES-AN Andalucía None 6 NaN
2 2020-02-29 ESP ES-AN Andalucía None 8 NaN
3 2020-03-01 ESP ES-AN Andalucía None 12 NaN
4 2020-03-02 ESP ES-AN Andalucía None 12 NaN
.. ... ... ... ... ... ... ...
622 2020-03-26 ESP ES-RI La Rioja None 995 43.0
623 2020-03-27 ESP ES-RI La Rioja None 1236 55.0
624 2020-03-28 ESP ES-RI La Rioja None 1436 65.0
625 2020-03-29 ESP ES-RI La Rioja None 1629 68.0
626 2020-03-30 ESP ES-RI La Rioja None 1733 71.0
population positive_100k deceased_100k
0 8409738 0.011891 NaN
1 8409738 0.071346 NaN
2 8409738 0.095128 NaN
3 8409738 0.142692 NaN
4 8409738 0.142692 NaN
.. ... ... ...
622 315675 315.197592 13.621604
623 315675 391.541934 17.422982
624 315675 454.898234 20.590797
625 315675 516.037063 21.541142
626 315675 548.982339 22.491486
[627 rows x 10 columns]
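The `positive_100k` and `deceased_100k` values in the output above are consistent with a plain per-capita scaling, `value / population * 100000`. The sketch below reproduces two of the printed rows under that assumption; `add_per_100k` is a hypothetical helper written for this example, not a function from `covid_19_dashboard`.

``` python
import pandas as pd


def add_per_100k(df, columns=("positive", "deceased")):
    """Hypothetical helper: add <column>_100k = column / population * 100,000."""
    out = df.copy()
    for col in columns:
        out[f"{col}_100k"] = out[col] / out["population"] * 100_000
    return out


# Two rows copied from the converted Spanish data shown above
# (row 0 for Andalucía, row 622 for La Rioja).
sample = pd.DataFrame(
    {
        "region_label": ["Andalucía", "La Rioja"],
        "positive": [1, 995],
        "deceased": [float("nan"), 43.0],
        "population": [8409738, 315675],
    }
)
result = add_per_100k(sample)
print(result[["region_label", "positive_100k", "deceased_100k"]])
```

Missing counts stay missing: a `NaN` in `deceased` propagates to `deceased_100k`, which matches the early Andalucía rows in the output.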
%% Cell type:code id: tags:
``` python
regions = ['Lombardy', 'Ticino', 'Zürich', 'Madrid', 'New York', 'Washington', 'Louisiana']
```
%% Cell type:code id: tags:
``` python
base = alt.Chart(df_all[df_all.region_label.isin(regions)])
base.mark_line().encode(alt.X('date'), alt.Y('positive', scale=alt.Scale(type='linear')), color='region_label')
```
%% Output
alt.Chart(...)
%% Cell type:code id: tags:
``` python
since_df_positive = helper.make_since_df(df_all[df_all.region_label.isin(regions)], region_column='region_label', start_case=100)
base = alt.Chart(since_df_positive).properties(height=300,width=300)
days_log = plotting.make_region_since_chart(base, 'positive', 'sinceDay0', 'region_label', 'Days since 100th case', 'Cases', 'Cases', 'Region')
days_log_100k = plotting.make_region_since_chart(base, 'positive_100k', 'sinceDay0', 'region_label', 'Days since 100th case', 'Cases/100k', 'Cases/100k', 'Region')
alt.hconcat(days_log, days_log_100k)
```
%% Output
alt.HConcatChart(...)
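The implementation of `helper.make_since_df` is not part of this commit. As a minimal sketch of the idea behind the "days since 100th case" axis, assuming the helper keeps only rows at or above the threshold and counts days from each region's first such row, something like the following would produce a `sinceDay0` column. The function name `days_since_threshold` and the toy numbers are invented for illustration.

``` python
import pandas as pd


def days_since_threshold(df, column="positive", region_column="region_label",
                         start_case=100):
    """Hypothetical stand-in for helper.make_since_df (real behavior may differ)."""
    # Keep only the rows where the region has reached the threshold.
    above = df[df[column] >= start_case].copy()
    # First date on which each region reached the threshold.
    first_day = above.groupby(region_column)["date"].transform("min")
    above["sinceDay0"] = (above["date"] - first_day).dt.days
    return above.reset_index(drop=True)


# Toy data with invented values: region A crosses 100 on its 2nd day,
# region B on its 2nd day as well, but a week and a half later.
toy = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2020-03-01", "2020-03-02", "2020-03-03", "2020-03-04",
             "2020-03-10", "2020-03-11", "2020-03-12"]
        ),
        "region_label": ["A", "A", "A", "A", "B", "B", "B"],
        "positive": [50, 120, 300, 500, 90, 100, 250],
    }
)
aligned = days_since_threshold(toy)
print(aligned)
```

Aligning on `sinceDay0` rather than on calendar date is what lets regions whose outbreaks started weeks apart be compared on the same axis.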
%% Cell type:code id: tags:
``` python
since_df_deceased = helper.make_since_df(df_all[df_all.region_label.isin(regions)], column='deceased', region_column='region_label', start_case=10)
base = alt.Chart(since_df_deceased).properties(height=300,width=300)
days_log = plotting.make_region_since_chart(base, 'deceased', 'sinceDay0', 'region_label', 'Days since 10th death', 'Deaths', 'Deaths', 'Region')
days_log_100k = plotting.make_region_since_chart(base, 'deceased_100k', 'sinceDay0', 'region_label', 'Days since 10th death', 'Deaths/100k', 'Deaths/100k', 'Region')
alt.hconcat(days_log, days_log_100k)
```
%% Output
alt.HConcatChart(...)
%% Cell type:code id: tags:
```
```
......
@@ -30,6 +30,47 @@ class CaseConverter():
         "deceased_100k",
     ]

+    def __init__(self, atlas_folder):
+        """Initialize the converter with the path to the atlas"""
+        self.atlas_folder = atlas_folder
+        self.converters = []
+        self.init_converters()
+
+    def init_converters(self):
+        self.converters = [cls(self.atlas_folder) for cls in CaseConverter._converter_registry]
+
+    def read_convert(self, path):
+        """Converts the Dataframe into the common format."""
+        for converter in self.converters:
+            if converter.can_convert(path):
+                print(f'Using {converter} for {path}')
+                return converter.convert(converter.read_data(path))
+        raise NotImplementedError(f"{path} could not be read and converted.")
+
+
+class CaseConverterImpl:
+    """Base converter class."""
+
+    conversion_dict = {}
+    column_list = []
+    common_columns = [
+        "date",
+        "country",
+        "region_iso",
+        "region_label",
+        "tested",
+        "positive",
+        "deceased",
+        "population",
+        "positive_100k",
+        "deceased_100k",
+    ]
+
+    def __init__(self, atlas_folder):
+        """Initialize the converter with the path to the atlas"""
+        self.atlas_folder = atlas_folder
+
     @classmethod
     def can_convert(cls, path):
         """Returns true if the class can convert the Dataframe."""
@@ -39,15 +80,6 @@ class CaseConverter():
             return False
         return all([col in df.columns for col in cls.column_list])

-    @classmethod
-    def read_convert(cls, path):
-        """Converts the Dataframe into the common format."""
-        for converter in cls._converter_registry:
-            if converter.can_convert(path):
-                print(f'Using {converter} for {path}')
-                return converter.convert(converter.read_data(path))
-        raise NotImplementedError(f"{path} could not be read and converted.")
-
     @classmethod
     def read_data(cls, path):
         """Read in the data from a directory path."""
......
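To make the shape of this refactor easier to follow, here is a self-contained sketch of the pattern the hunks above move to: a facade object that is constructed with the atlas folder, instantiates every registered implementation with it, and dispatches `read_convert` to the first implementation whose `can_convert` accepts the path. How `_converter_registry` is actually populated is not visible in this commit, so the `register` decorator, the class names, and the string-based `can_convert` check below are assumptions made purely for illustration.

``` python
class ConverterImplSketch:
    """Stand-in for CaseConverterImpl: each instance holds the atlas path."""

    def __init__(self, atlas_folder):
        self.atlas_folder = atlas_folder

    @classmethod
    def can_convert(cls, path):
        return False

    def convert(self, path):
        raise NotImplementedError


class ConverterFacade:
    """Stand-in for the refactored CaseConverter."""

    _converter_registry = []

    def __init__(self, atlas_folder):
        # One instance per registered implementation, all sharing the atlas.
        self.converters = [cls(atlas_folder) for cls in self._converter_registry]

    @classmethod
    def register(cls, impl):
        # Assumed registration mechanism; the real one is not shown in the diff.
        cls._converter_registry.append(impl)
        return impl

    def read_convert(self, path):
        for converter in self.converters:
            if converter.can_convert(path):
                return converter.convert(path)
        raise NotImplementedError(f"{path} could not be read and converted.")


@ConverterFacade.register
class SpainLikeConverter(ConverterImplSketch):
    @classmethod
    def can_convert(cls, path):
        # The real converters inspect the data; a name check keeps this runnable.
        return "spain" in path

    def convert(self, path):
        return f"converted {path} using atlas at {self.atlas_folder}"


facade = ConverterFacade("atlas/")
print(facade.read_convert("data/covid-19-spain"))
```

Compared with the removed classmethod version, the dispatch loop is unchanged; the difference is that it now iterates over instances, so each implementation can use `self.atlas_folder` during conversion.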
@@ -7,7 +7,7 @@ from pathlib import Path
 import pandas as pd
-from . import CaseConverter
+from . import CaseConverterImpl as CaseConverter
 from .. import helper
......
@@ -6,7 +6,7 @@ from pathlib import Path
 import pandas as pd
-from . import CaseConverter
+from . import CaseConverterImpl as CaseConverter
 from .. import helper
......
@@ -6,7 +6,7 @@ from pathlib import Path
 import pandas as pd
-from . import CaseConverter
+from . import CaseConverterImpl as CaseConverter
 from .. import helper
 # regional populations extracted from wikidata and wikipedia
......
@@ -6,7 +6,7 @@ from pathlib import Path
 import pandas as pd
-from . import CaseConverter
+from . import CaseConverterImpl as CaseConverter
 from .. import helper
......