Commit d20d834d authored by Chandrasekhar Ramakrishnan, committed by renku 0.9.1

renku run papermill -p ts_folder ./data/covid-19_jhu-csse/ -p wb_path ./data/worldbank/SP.POP.TOTL.zip -p geodata_path ./data/geodata/geo_data.csv -p out_folder ./data/covid-19_rates/ --inject-paths notebooks/process/ToRates.ipynb runs/ToRates.run.ipynb
parent 638d6c01
arguments: []
baseCommand:
- papermill
class: CommandLineTool
cwlVersion: v1.0
hints: []
inputs:
  input_1:
    default: ts_folder
    inputBinding:
      position: 1
      prefix: -p
      separate: true
      shellQuote: true
    streamable: false
    type: string
  input_10:
    default: runs/ToRates.run.ipynb
    inputBinding:
      position: 10
      separate: true
      shellQuote: true
    streamable: false
    type: string
  input_2:
    default:
      class: Directory
      listing: []
      path: ../../data/covid-19_jhu-csse
    inputBinding:
      position: 2
      separate: true
      shellQuote: true
    streamable: false
    type: Directory
  input_3:
    default: wb_path
    inputBinding:
      position: 3
      prefix: -p
      separate: true
      shellQuote: true
    streamable: false
    type: string
  input_4:
    default:
      class: File
      path: ../../data/worldbank/SP.POP.TOTL.zip
    inputBinding:
      position: 4
      separate: true
      shellQuote: true
    streamable: false
    type: File
  input_5:
    default: geodata_path
    inputBinding:
      position: 5
      prefix: -p
      separate: true
      shellQuote: true
    streamable: false
    type: string
  input_6:
    default:
      class: File
      path: ../../data/geodata/geo_data.csv
    inputBinding:
      position: 6
      separate: true
      shellQuote: true
    streamable: false
    type: File
  input_7:
    default: out_folder
    inputBinding:
      position: 7
      prefix: -p
      separate: true
      shellQuote: true
    streamable: false
    type: string
  input_8:
    default: data/covid-19_rates
    inputBinding:
      position: 8
      separate: true
      shellQuote: true
    streamable: false
    type: string
  input_9:
    default:
      class: File
      path: ../../notebooks/process/ToRates.ipynb
    inputBinding:
      position: 9
      prefix: --inject-paths
      separate: true
      shellQuote: true
    streamable: false
    type: File
outputs:
  output_0:
    outputBinding:
      glob: $(inputs.input_10)
    streamable: false
    type: File
  output_1:
    outputBinding:
      glob: $(inputs.input_8)
    streamable: false
    type: Directory
permanentFailCodes: []
requirements:
- class: InlineJavascriptRequirement
- class: InitialWorkDirRequirement
  listing:
  - entry: '$({"listing": [], "class": "Directory"})'
    entryname: runs
    writable: true
  - entry: '$({"listing": [], "class": "Directory"})'
    entryname: data/covid-19_rates
    writable: true
  - entry: $(inputs.input_2)
    entryname: data/covid-19_jhu-csse
    writable: false
  - entry: $(inputs.input_4)
    entryname: data/worldbank/SP.POP.TOTL.zip
    writable: false
  - entry: $(inputs.input_6)
    entryname: data/geodata/geo_data.csv
    writable: false
  - entry: $(inputs.input_9)
    entryname: notebooks/process/ToRates.ipynb
    writable: false
successCodes: []
temporaryFailCodes: []
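The input bindings above can be read as a recipe for the recorded command line: sort the inputs by `position`, emit the `prefix` when there is one, then the value. The following is only a rough sketch of that idea, not how renku or a CWL runner actually implements it. The `bindings` list is a hand-copied subset of the inputs (with the path for position 2 taken from the commit message rather than the CWL default), and `build_command` is a hypothetical helper.

```python
import shlex

# Hand-copied (position, prefix, value) triples for a subset of the inputs.
# Deliberately listed out of order to show that position decides the layout.
bindings = [
    (10, None, "runs/ToRates.run.ipynb"),
    (1, "-p", "ts_folder"),
    (2, None, "./data/covid-19_jhu-csse/"),
    (9, "--inject-paths", "notebooks/process/ToRates.ipynb"),
]


def build_command(base, bindings):
    """Hypothetical helper: assemble argv by ascending position."""
    argv = list(base)
    for _, prefix, value in sorted(bindings, key=lambda b: b[0]):
        if prefix is not None:
            argv.append(prefix)
        argv.append(value)
    return argv


command = build_command(["papermill"], bindings)
print(shlex.join(command))
```

Note that the binding at position 9 pairs `--inject-paths` with the input notebook path, which mirrors how the two tokens sit next to each other in the recorded command.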
Source diffs for three files could not be displayed: they are stored in LFS.
%% Cell type:markdown id: tags:
# Convert Series to Rates per 100,000
%% Cell type:code id: tags:
``` python
import pandas as pd
import os
```
%% Cell type:code id: tags:parameters
``` python
ts_folder = "../data/covid-19_jhu-csse/"
wb_path = "../data/worldbank/SP.POP.TOTL.zip"
geodata_path = "../data/geodata/geo_data.csv"
out_folder = None
PAPERMILL_OUTPUT_PATH = None
```
%% Cell type:code id: tags:injected-parameters
``` python
# Parameters
PAPERMILL_INPUT_PATH = "/tmp/fsb4wn_r/notebooks/ToRates.ipynb"
PAPERMILL_INPUT_PATH = "notebooks/process/ToRates.ipynb"
PAPERMILL_OUTPUT_PATH = "runs/ToRates.run.ipynb"
ts_folder = "/tmp/fsb4wn_r/data/covid-19_jhu-csse"
wb_path = "/tmp/fsb4wn_r/data/worldbank/SP.POP.TOTL.zip"
geodata_path = "/tmp/fsb4wn_r/data/geodata/geo_data.csv"
out_folder = "data/covid-19_rates"
ts_folder = "./data/covid-19_jhu-csse/"
wb_path = "./data/worldbank/SP.POP.TOTL.zip"
geodata_path = "./data/geodata/geo_data.csv"
out_folder = "./data/covid-19_rates/"
```
%% Cell type:markdown id: tags:parameters
## Read in JHU CSSE data
I plan to switch to [xarray](http://xarray.pydata.org/en/stable/), but for now plain pandas is easier.
%% Cell type:code id: tags:
``` python
def read_jhu_covid_region_df(name):
    filename = os.path.join(ts_folder, f"time_series_19-covid-{name}.csv")
    df = pd.read_csv(filename)
    df = df.set_index(['Country/Region', 'Province/State', 'Lat', 'Long'])
    # The remaining columns are dates; parse them so they sort and slice as timestamps.
    df.columns = pd.to_datetime(df.columns)
    # Sum provinces/states up to one row per country/region.
    region_df = df.groupby(level='Country/Region').sum()
    # Use the mean position of the provinces/states as the region's location.
    loc_df = df.reset_index(['Lat', 'Long']).groupby(level='Country/Region').mean()[['Long', 'Lat']]
    return region_df.join(loc_df).set_index(['Long', 'Lat'], append=True)
```
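To make the aggregation step concrete, here is a small sketch of the same `set_index` / `groupby(level=...)` pattern on an inline table. The rows, province names, and counts are made up for illustration; only the column layout follows the JHU CSSE files.

```python
import io

import pandas as pd

# Made-up rows in the same column layout as the JHU CSSE time series.
csv_text = """Province/State,Country/Region,Lat,Long,1/22/20,1/23/20
Hubei,China,30.0,112.0,10,20
Beijing,China,40.0,116.0,1,2
Lazio,Italy,43.0,12.0,0,3
"""

df = pd.read_csv(io.StringIO(csv_text))
df = df.set_index(["Country/Region", "Province/State", "Lat", "Long"])
df.columns = pd.to_datetime(df.columns)

# One row per country/region: the two Chinese provinces are summed.
region_df = df.groupby(level="Country/Region").sum()
print(region_df)
```

Grouping on the index level keeps the date columns untouched, so the result is still one column per day.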
%% Cell type:code id: tags:
``` python
frames_map = {
    "confirmed": read_jhu_covid_region_df("Confirmed"),
    "deaths": read_jhu_covid_region_df("Deaths"),
    "recovered": read_jhu_covid_region_df("Recovered")
}
```
%% Cell type:markdown id: tags:
## Read in World Bank data
%% Cell type:code id: tags:
``` python
import zipfile
zf = zipfile.ZipFile(wb_path)
pop_df = pd.read_csv(zf.open("API_SP.POP.TOTL_DS2_en_csv_v2_821007.csv"), skiprows=4)
```
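The cell above reads a CSV member straight out of the zip archive and skips the leading metadata lines. A self-contained sketch of the same pattern follows, using an in-memory archive so it runs without the World Bank download. The member name `population.csv`, the preamble lines, and the country rows are all made up.

```python
import io
import zipfile

import pandas as pd

# Four made-up preamble lines, then the real header row and two data rows.
csv_text = (
    "Data Source,Example\n"
    "Last Updated Date,2020-01-01\n"
    "Note,first placeholder line\n"
    "Note,second placeholder line\n"
    "Country Name,Country Code,2018\n"
    "Exampleland,EXL,1000\n"
    "Samplestan,SMP,2500\n"
)

# Write the CSV into an in-memory zip archive.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("population.csv", csv_text)

# Read it back without extracting to disk, skipping the preamble.
buffer.seek(0)
with zipfile.ZipFile(buffer) as zf:
    with zf.open("population.csv") as member:
        pop_df = pd.read_csv(member, skiprows=4)
print(pop_df)
```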
%% Cell type:markdown id: tags:
2018 population data is available for every country/region except Eritrea; the only other row without a value is the "Not classified" placeholder.
%% Cell type:code id: tags:
``` python
pop_df[pd.isna(pop_df['2018'])]
```
%% Output
Country Name Country Code Indicator Name Indicator Code 1960 \
67 Eritrea ERI Population, total SP.POP.TOTL 1007590.0
108 Not classified INX Population, total SP.POP.TOTL NaN
1961 1962 1963 1964 1965 ... 2011 \
67 1033328.0 1060486.0 1088854.0 1118159.0 1148189.0 ... 3213972.0
108 NaN NaN NaN NaN NaN ... NaN
2012 2013 2014 2015 2016 2017 2018 2019 Unnamed: 64
67 NaN NaN NaN NaN NaN NaN NaN NaN NaN
108 NaN NaN NaN NaN NaN NaN NaN NaN NaN
[2 rows x 65 columns]
%% Cell type:markdown id: tags:
Fix the country/region names that differ between the World Bank population data and the JHU CSSE data.
%% Cell type:code id: tags:
``` python
region_wb_jhu_map = {
    'Brunei Darussalam': 'Brunei',
    'Czech Republic': 'Czechia',
    'Egypt, Arab Rep.': 'Egypt',
    'Hong Kong SAR, China': 'Hong Kong SAR',
    'Iran, Islamic Rep.': 'Iran',
    'Korea, Rep.': 'Korea, South',
    'Macao SAR, China': 'Macao SAR',
    'Russian Federation': 'Russia',
    'Slovak Republic': 'Slovakia',
    'St. Martin (French part)': 'Saint Martin',
    'United States': 'US'
}
current_pop_ser = pop_df[['Country Name', '2018']].copy().replace(region_wb_jhu_map).set_index('Country Name')['2018']
data_pop_ser = current_pop_ser[current_pop_ser.index.isin(frames_map['confirmed'].index.levels[0])]
```
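The renaming and filtering can be tried on a tiny example. This is a sketch with made-up population figures and a hypothetical `jhu_names` index standing in for the country level of the JHU frames; it also shows how to list the names that remain unmatched.

```python
import pandas as pd

# Made-up population table using World Bank style names.
pop_df = pd.DataFrame({
    "Country Name": ["United States", "Czech Republic", "Exampleland"],
    "2018": [300.0, 10.0, 1.0],
})
name_map = {"United States": "US", "Czech Republic": "Czechia"}

# Hypothetical stand-in for frames_map['confirmed'].index.levels[0].
jhu_names = pd.Index(["US", "Czechia", "Taiwan*"])

# Rename to the JHU spelling, then keep only names the JHU data knows.
pop_ser = pop_df.replace(name_map).set_index("Country Name")["2018"]
matched = pop_ser[pop_ser.index.isin(jhu_names)]

# JHU names that still have no population entry.
unmatched = jhu_names[~jhu_names.isin(pop_ser.index)]
print(matched)
print(unmatched.tolist())
```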
%% Cell type:code id: tags:
``` python
# Use this to find the name in the series
# current_pop_ser[current_pop_ser.index.str.contains('Czech')]
```
%% Cell type:markdown id: tags:
Some regions still cannot be matched to a World Bank entry; they are listed below and ignored for now.
%% Cell type:code id: tags:
``` python
frames_map['confirmed'].loc[
    ~frames_map['confirmed'].index.levels[0].isin(data_pop_ser.index)
].iloc[:,-2:]
```
%% Output
2020-03-13 00:00:00 \
2020-03-16 00:00:00 \
Country/Region Long Lat
Congo (Brazzaville) 21.7587 -4.0383 1
Congo (Kinshasa) 21.7587 -4.0383 2
Cruise Ship 139.6380 35.4437 696
French Guiana -53.1258 3.9339 5
Guadeloupe -61.5510 16.2650 1
Guernsey -2.5800 49.4500 0
Guernsey -2.5800 49.4500 1
Holy See 12.4534 41.9029 1
Jersey -2.1100 49.1900 0
Martinique -61.0242 14.6415 3
Reunion 55.5364 -21.1151 5
Saint Lucia -60.9789 13.9094 0
Saint Vincent and the Grenadines -61.2872 12.9843 0
Taiwan* 121.0000 23.7000 50
Venezuela -66.5897 6.4238 0
occupied Palestinian territory 35.2332 31.9522 0
Jersey -2.1100 49.1900 2
Martinique -61.0242 14.6415 15
Republic of the Congo 15.5560 -1.4400 1
Saint Lucia -60.9789 13.9094 2
Saint Vincent and the Grenadines -61.2872 12.9843 1
Taiwan* 121.0000 23.7000 67
The Bahamas -76.0000 24.2500 1
The Gambia -16.6000 13.4667 0
Venezuela -66.5897 6.4238 17
2020-03-14 00:00:00
2020-03-17 00:00:00
Country/Region Long Lat
Congo (Kinshasa) 21.7587 -4.0383 2
Congo (Brazzaville) 21.7587 -4.0383 1
Congo (Kinshasa) 21.7587 -4.0383 3
Cruise Ship 139.6380 35.4437 696
French Guiana -53.1258 3.9339 5
Guadeloupe -61.5510 16.2650 1
Guernsey -2.5800 49.4500 1
Holy See 12.4534 41.9029 1
Jersey -2.1100 49.1900 2
Martinique -61.0242 14.6415 9
Reunion 55.5364 -21.1151 6
Saint Lucia -60.9789 13.9094 1
Martinique -61.0242 14.6415 16
Republic of the Congo 15.5560 -1.4400 1
Saint Lucia -60.9789 13.9094 2
Saint Vincent and the Grenadines -61.2872 12.9843 1
Taiwan* 121.0000 23.7000 53
Venezuela -66.5897 6.4238 2
occupied Palestinian territory 35.2332 31.9522 0
Taiwan* 121.0000 23.7000 77
The Bahamas -76.0000 24.2500 1
The Gambia -16.6000 13.4667 1
Venezuela -66.5897 6.4238 33
%% Cell type:markdown id: tags:
## Read in geodata to get additional population numbers
%% Cell type:code id: tags:
``` python
geodata_df = pd.read_csv(geodata_path).drop('Unnamed: 0', axis=1).set_index('name_jhu')
```
%% Cell type:markdown id: tags:
Add populations for the missing countries, taken from the `pop_est` column of the geodata.
%% Cell type:code id: tags:
``` python
missing_countries = frames_map['confirmed'].loc[
    ~frames_map['confirmed'].index.levels[0].isin(data_pop_ser.index)
].iloc[:,-2:].reset_index()['Country/Region']
display(geodata_df.loc[geodata_df.index.isin(missing_countries)])
# Series.append was removed in pandas 2.0; pd.concat gives the same result.
data_pop_ser = pd.concat([data_pop_ser, geodata_df.loc[geodata_df.index.isin(missing_countries), 'pop_est']])
```
%% Output
%% Cell type:markdown id: tags:
## Compute rates per 100,000 for regions
%% Cell type:code id: tags:
``` python
def cases_to_rates_df(df):
    per_100000_df = df.reset_index([1, 2], drop=True)
    per_100000_df = per_100000_df.div(data_pop_ser, 'index').mul(100000).dropna()
    per_100000_df.index.name = 'Country/Region'
    return per_100000_df


def frames_to_rates(frames_map):
    return {k: cases_to_rates_df(v) for k,v in frames_map.items()}


rates_map = frames_to_rates(frames_map)
```
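As a worked example of the rate computation: 50 cases in a population of 1,000,000 is 50 / 1,000,000 × 100,000 = 5 per 100,000. The sketch below applies the same `div(..., axis="index")` alignment to made-up numbers. Region `B` has no population entry and region `C` has no cases, so both fall out in `dropna()`.

```python
import pandas as pd

# Made-up case counts for two regions on two days.
cases = pd.DataFrame(
    {"2020-03-16": [50, 8], "2020-03-17": [100, 8]},
    index=pd.Index(["A", "B"], name="Country/Region"),
)

# Made-up populations; note that B is missing and C has no cases.
population = pd.Series({"A": 1_000_000, "C": 500_000})

# Divide each row by the population with the matching index label,
# scale to cases per 100,000, and drop rows that could not be matched.
rates = cases.div(population, axis="index").mul(100000).dropna()
print(rates)
```

Because `div` aligns on index labels, regions without a population silently become `NaN` rather than raising, which is why the notebook follows it with `dropna()`.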
%% Cell type:code id: tags:
``` python
if PAPERMILL_OUTPUT_PATH:
    for k, v in rates_map.items():
        out_path = os.path.join(out_folder, f"ts_rates_19-covid-{k}.csv")
        v.reset_index().to_csv(out_path)
```