Commit fecea644 authored by CR (covid cron), committed by Chandrasekhar Ramakrishnan

renku update --with-siblings

parent 336b1e6d
Pipeline #22566 passed with stage in 38 seconds
class: Workflow
cwlVersion: v1.0
hints: []
inputs:
  input_1:
    default:
      class: File
      path: ../../notebooks/covidtracking-dashboard.ipynb
    streamable: false
    type: File
  input_10:
    default:
      class: File
      path: ../../data/ch-population-statistics/ch-population-by-age-canton.xls
    streamable: false
    type: File
  input_11:
    default: ts_folder
    streamable: false
    type: string
  input_12:
    default: runs/ToRates.run.ipynb
    streamable: false
    type: string
  input_13:
    default:
      class: Directory
      listing: []
      path: ../../data/covid-19_jhu-csse
    streamable: false
    type: Directory
  input_14:
    default: wb_path
    streamable: false
    type: string
  input_15:
    default:
      class: File
      path: ../../data/worldbank/SP.POP.TOTL.zip
    streamable: false
    type: File
  input_16:
    default: geodata_path
    streamable: false
    type: string
  input_17:
    default:
      class: File
      path: ../../data/geodata/geo_data.csv
    streamable: false
    type: File
  input_18:
    default: out_folder
    streamable: false
    type: string
  input_19:
    default: data/covid-19_rates
    streamable: false
    type: string
  input_2:
    default: runs/covidtracking-dashboard.ipynb
    streamable: false
    type: string
  input_20:
    default:
      class: File
      path: ../../notebooks/process/ToRates.ipynb
    streamable: false
    type: File
  input_21:
    default: ts_folder
    streamable: false
    type: string
  input_22:
    default:
      class: Directory
      listing: []
      path: ../../data/covid-19_jhu-csse
    streamable: false
    type: Directory
  input_23:
    default: rates_folder
    streamable: false
    type: string
  input_24:
    default: geodata_path
    streamable: false
    type: string
  input_25:
    default:
      class: File
      path: ../../data/geodata/geo_data.csv
    streamable: false
    type: File
  input_26:
    default:
      class: File
      path: ../../notebooks/Dashboard.ipynb
    streamable: false
    type: File
  input_27:
    default: runs/Dashboard.run.ipynb
    streamable: false
    type: string
  input_28:
    default:
      class: File
      path: ../../notebooks/examples/italy-covid-19.ipynb
    streamable: false
    type: File
  input_29:
    default: runs/italy-covid-19.ipynb
    streamable: false
    type: string
  input_3:
    default: data_path
    streamable: false
    type: string
  input_30:
    default: data_folder
    streamable: false
    type: string
  input_31:
    default:
      class: Directory
      listing: []
      path: ../../data/covid-19-italy
    streamable: false
    type: Directory
  input_4:
    default:
      class: Directory
      listing: []
      path: ../../data/covidtracking
    streamable: false
    type: Directory
  input_5:
    default:
      class: File
      path: ../../data/geodata/us_pop_fung_2019.csv
    streamable: false
    type: File
  input_6:
    default:
      class: File
      path: ../../notebooks/openzh-covid-19-dashboard.ipynb
    streamable: false
    type: File
  input_7:
    default: runs/openzh-covid-19-dashboard.run.ipynb
    streamable: false
    type: string
  input_8:
    default: data_path
    streamable: false
    type: string
  input_9:
    default:
      class: Directory
      listing: []
      path: ../../data/openzh-covid-19
    streamable: false
    type: Directory
outputs:
  output_0:
    outputSource: step_1/output_0
    streamable: false
    type: File
  output_1:
    outputSource: step_5/output_0
    streamable: false
    type: File
  output_2:
    outputSource: step_2/output_0
    streamable: false
    type: File
  output_3:
    outputSource: step_3/output_1
    streamable: false
    type: Directory
  output_4:
    outputSource: step_4/output_0
    streamable: false
    type: File
  output_5:
    outputSource: step_3/output_0
    streamable: false
    type: File
requirements: []
steps:
  step_1:
    in:
      input_1: input_1
      input_2: input_2
      input_3: input_3
      input_4: input_4
      input_5: input_5
    out:
    - output_0
    run: 30e7c6a1fdb74ccea2652434f5c83d13_papermill.cwl
  step_2:
    in:
      input_1: input_6
      input_2: input_7
      input_3: input_8
      input_4: input_9
      input_5: input_10
    out:
    - output_0
    run: 0415b23203ef422185f0ecf77290cbd3_papermill.cwl
  step_3:
    in:
      input_1: input_11
      input_10: input_12
      input_2: input_13
      input_3: input_14
      input_4: input_15
      input_5: input_16
      input_6: input_17
      input_7: input_18
      input_8: input_19
      input_9: input_20
    out:
    - output_0
    - output_1
    run: 41e261c434eb41bcb97fe7c953d91917_papermill.cwl
  step_4:
    in:
      input_2: input_21
      input_3: input_22
      input_4: input_23
      input_5: step_3/output_1
      input_6: input_24
      input_7: input_25
      input_8: input_26
      input_9: input_27
    out:
    - output_0
    run: c57735c62b424192b938f9538b4d65c0_papermill.cwl
  step_5:
    in:
      input_1: input_28
      input_2: input_29
      input_3: input_30
      input_4: input_31
    out:
    - output_0
    run: f2c255fbcb87444ea80501341f0b7f1a_papermill.cwl
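In the workflow above, the only step that consumes another step's output is step_4, whose `input_5` is wired to `step_3/output_1`. The following is a minimal sketch of how such dependencies could be read off the `in:` mappings and turned into an execution order. The `step_inputs` dictionary is a hand-copied summary of the sources listed above, and `upstream_steps` is a hypothetical helper, not part of renku or any CWL runner.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Sources of each step's inputs, copied by hand from the `steps:` section.
step_inputs = {
    "step_1": ["input_1", "input_2", "input_3", "input_4", "input_5"],
    "step_2": ["input_6", "input_7", "input_8", "input_9", "input_10"],
    "step_3": ["input_11", "input_12", "input_13", "input_14", "input_15",
               "input_16", "input_17", "input_18", "input_19", "input_20"],
    "step_4": ["input_21", "input_22", "input_23", "step_3/output_1",
               "input_24", "input_25", "input_26", "input_27"],
    "step_5": ["input_28", "input_29", "input_30", "input_31"],
}


def upstream_steps(sources):
    """Hypothetical helper: a source of the form 'step/output' names an upstream step."""
    return {src.split("/", 1)[0] for src in sources if "/" in src}


# Map each step to the set of steps it depends on.
deps = {step: upstream_steps(sources) for step, sources in step_inputs.items()}

# Any valid execution order must place step_3 before step_4.
order = list(TopologicalSorter(deps).static_order())
print(deps["step_4"], order)
```

Steps 1, 2, 3 and 5 have no upstream steps, so a runner would be free to execute them in any order or in parallel.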
%% Cell type:markdown id: tags:
# Convert Series to Rates per 100,000
%% Cell type:code id: tags:
``` python
import pandas as pd
import os
```
%% Cell type:code id: tags:parameters
``` python
ts_folder = "../../data/covid-19_jhu-csse/"
wb_path = "../../data/worldbank/SP.POP.TOTL.zip"
geodata_path = "../../data/geodata/geo_data.csv"
out_folder = None
PAPERMILL_OUTPUT_PATH = None
```
%% Cell type:code id: tags:injected-parameters
``` python
# Parameters
PAPERMILL_INPUT_PATH = "notebooks/process/ToRates.ipynb"
PAPERMILL_INPUT_PATH = "/tmp/o7ht8_17/notebooks/process/ToRates.ipynb"
PAPERMILL_OUTPUT_PATH = "runs/ToRates.run.ipynb"
ts_folder = "./data/covid-19_jhu-csse/"
wb_path = "./data/worldbank/SP.POP.TOTL.zip"
geodata_path = "./data/geodata/geo_data.csv"
out_folder = "./data/covid-19_rates/"
ts_folder = "/tmp/o7ht8_17/data/covid-19_jhu-csse"
wb_path = "/tmp/o7ht8_17/data/worldbank/SP.POP.TOTL.zip"
geodata_path = "/tmp/o7ht8_17/data/geodata/geo_data.csv"
out_folder = "data/covid-19_rates"
```
%% Cell type:markdown id: tags:parameters
## Read in JHU CSSE data
I plan to switch to [xarray](http://xarray.pydata.org/en/stable/), but for now it is easier to stay with pandas.
%% Cell type:code id: tags:
``` python
def read_jhu_covid_region_df(name):
    """Load one JHU CSSE global time series and aggregate it per country/region."""
    filename = os.path.join(ts_folder, f"time_series_covid19_{name}_global.csv")
    df = pd.read_csv(filename)
    df = df.set_index(['Country/Region', 'Province/State', 'Lat', 'Long'])
    df.columns = pd.to_datetime(df.columns)
    # Sum the counts over all provinces/states of each country/region.
    region_df = df.groupby(level='Country/Region').sum()
    # Use the mean of the province coordinates as the location of the region.
    loc_df = df.reset_index([2,3]).groupby(level='Country/Region').mean()[['Long', 'Lat']]
    return region_df.join(loc_df).set_index(['Long', 'Lat'], append=True)
```
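As a rough illustration of what the function above does, here is a minimal sketch on a toy frame. The country names, coordinates and counts are made up, and the explicit date `format` is an assumption about how the JHU column headers look; only the aggregation pattern mirrors the notebook.

```python
import pandas as pd

# Toy stand-in for one JHU CSSE time-series file (all values are invented).
raw = pd.DataFrame({
    "Country/Region": ["A", "A", "B"],
    "Province/State": ["North", "South", "All"],
    "Lat": [10.0, 20.0, 50.0],
    "Long": [100.0, 110.0, 5.0],
    "4/5/20": [1, 2, 10],
    "4/6/20": [3, 4, 12],
})

df = raw.set_index(["Country/Region", "Province/State", "Lat", "Long"])
df.columns = pd.to_datetime(df.columns, format="%m/%d/%y")

# Country-level totals: sum the counts over all provinces of a country.
totals = df.groupby(level="Country/Region").sum()

# Country-level location: mean of the province coordinates.
centroids = (
    df.reset_index(["Lat", "Long"])
      .groupby(level="Country/Region")[["Long", "Lat"]]
      .mean()
)

print(totals)
print(centroids)
```

Country "A" ends up with one row holding the sum of its two provinces, located at the average of their coordinates.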
%% Cell type:code id: tags:
``` python
frames_map = {
    "confirmed": read_jhu_covid_region_df("confirmed"),
    "deaths": read_jhu_covid_region_df("deaths"),
}
```
%% Cell type:markdown id: tags:
# Read in World Bank data
%% Cell type:code id: tags:
``` python
import zipfile
zf = zipfile.ZipFile(wb_path)
pop_df = pd.read_csv(zf.open("API_SP.POP.TOTL_DS2_en_csv_v2_821007.csv"), skiprows=4)
```
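The cell above hardcodes the archive member name, which includes a version suffix (`v2_821007`) that may change when the World Bank file is refreshed. A minimal sketch of looking the member up by prefix instead is shown below. `find_data_member` is a hypothetical helper, and the in-memory archive with an extra metadata file is an assumption made only so the example is self-contained.

```python
import io
import zipfile


def find_data_member(zf, prefix="API_SP.POP.TOTL"):
    """Hypothetical helper: return the CSV member whose name starts with `prefix`."""
    candidates = sorted(
        name for name in zf.namelist()
        if name.startswith(prefix) and name.endswith(".csv")
    )
    if not candidates:
        raise FileNotFoundError(f"no member starting with {prefix!r} in archive")
    return candidates[0]


# Build a small in-memory archive so the sketch runs without the real download.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("Metadata_Country_API_SP.POP.TOTL_DS2_en_csv_v2_821007.csv", "meta\n")
    zf.writestr("API_SP.POP.TOTL_DS2_en_csv_v2_821007.csv", "Country Name,2018\nX,1\n")
buf.seek(0)

with zipfile.ZipFile(buf) as zf:
    member = find_data_member(zf)
    first_line = zf.open(member).readline().decode().strip()

print(member, first_line)
```

The returned name can be passed to `zf.open(...)` exactly as the hardcoded string is in the notebook.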
%% Cell type:markdown id: tags:
2018 population data is available for every country/region except Eritrea; the only other row without a 2018 value is the "Not classified" aggregate.
%% Cell type:code id: tags:
``` python
pop_df[pd.isna(pop_df['2018'])]
```
%%%% Output: execute_result
Country Name Country Code Indicator Name Indicator Code 1960 \
67 Eritrea ERI Population, total SP.POP.TOTL 1007590.0
108 Not classified INX Population, total SP.POP.TOTL NaN
1961 1962 1963 1964 1965 ... 2011 \
67 1033328.0 1060486.0 1088854.0 1118159.0 1148189.0 ... 3213972.0
108 NaN NaN NaN NaN NaN ... NaN
2012 2013 2014 2015 2016 2017 2018 2019 Unnamed: 64
67 NaN NaN NaN NaN NaN NaN NaN NaN NaN
108 NaN NaN NaN NaN NaN NaN NaN NaN NaN
[2 rows x 65 columns]
%% Cell type:markdown id: tags:
Fix the country/region names that differ between the World Bank population data and the JHU CSSE data.
%% Cell type:code id: tags:
``` python
region_wb_jhu_map = {
    'Brunei Darussalam': 'Brunei',
    'Czech Republic': 'Czechia',
    'Egypt, Arab Rep.': 'Egypt',
    'Hong Kong SAR, China': 'Hong Kong SAR',
    'Iran, Islamic Rep.': 'Iran',
    'Korea, Rep.': 'Korea, South',
    'Macao SAR, China': 'Macao SAR',
    'Russian Federation': 'Russia',
    'Slovak Republic': 'Slovakia',
    'St. Martin (French part)': 'Saint Martin',
    'United States': 'US'
}
current_pop_ser = pop_df[['Country Name', '2018']].copy().replace(region_wb_jhu_map).set_index('Country Name')['2018']
data_pop_ser = current_pop_ser[current_pop_ser.index.isin(frames_map['confirmed'].index.levels[0])]
```
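The renaming step can be illustrated on a small scale. This is a sketch under stated assumptions: the population figures and the list of JHU names are invented, and the mapping is restricted to the `Country Name` column (the notebook applies it to the whole frame), which is a deliberate variation rather than the notebook's exact call.

```python
import pandas as pd

# Invented World Bank style rows; the numbers are placeholders.
wb = pd.DataFrame({
    "Country Name": ["United States", "Czech Republic", "France"],
    "2018": [300.0, 10.0, 60.0],
})
name_map = {"United States": "US", "Czech Republic": "Czechia"}

# Names as they might appear in the JHU CSSE index (assumed for this example).
jhu_names = pd.Index(["US", "Czechia", "France", "Taiwan*"])

# Rename only within the name column, then index the 2018 values by name.
pop = wb.replace({"Country Name": name_map}).set_index("Country Name")["2018"]

matched = pop[pop.index.isin(jhu_names)]
unmatched = jhu_names.difference(pop.index)

print(matched)
print(list(unmatched))
```

Listing `unmatched` after each change to the mapping is a quick way to see which JHU names still lack a population value.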
%% Cell type:code id: tags:
``` python
# Use this to find the name in the series
# current_pop_ser[current_pop_ser.index.str.contains('Czech')]
```
%% Cell type:markdown id: tags:
There are some regions that we cannot resolve, but we will just ignore these.
%% Cell type:code id: tags:
``` python
frames_map['confirmed'].loc[
    ~frames_map['confirmed'].index.get_level_values('Country/Region').isin(data_pop_ser.index)
].iloc[:,-2:]
```
%%%% Output: execute_result
- 2020-04-04 00:00:00 \
+ 2020-04-05 00:00:00 \
  Country/Region Long Lat
  Bahamas -77.396300 25.034300 28
  Burma 95.956000 21.916200 21
- Congo (Brazzaville) 21.758700 -4.038300 22
+ Congo (Brazzaville) 21.758700 -4.038300 45
  Congo (Kinshasa) 21.758700 -4.038300 154
  Diamond Princess 0.000000 0.000000 712
  Gambia -15.310100 13.443200 4
  Holy See 12.453400 41.902900 7
- Kyrgyzstan 74.766100 41.204400 144
- Laos 102.495496 19.856270 10
+ Kyrgyzstan 74.766100 41.204400 147
+ Laos 102.495496 19.856270 11
  MS Zaandam 0.000000 0.000000 9
- Saint Kitts and Nevis -62.782998 17.357822 9
+ Saint Kitts and Nevis -62.782998 17.357822 10
  Saint Lucia -60.978900 13.909400 14
  Saint Vincent and the Grenadines -61.287200 12.984300 7
- Syria 38.996815 34.802075 16
- Taiwan* 121.000000 23.700000 355
- Venezuela -66.589700 6.423800 155
- Western Sahara -12.885800 24.215500 0
+ Syria 38.996815 34.802075 19
+ Taiwan* 121.000000 23.700000 363
+ Venezuela -66.589700 6.423800 159
+ Western Sahara -12.885800 24.215500 4
- 2020-04-05 00:00:00
+ 2020-04-06 00:00:00
  Country/Region Long Lat
- Bahamas -77.396300 25.034300 28
- Burma 95.956000 21.916200 21
+ Bahamas -77.396300 25.034300 29
+ Burma 95.956000 21.916200 22
  Congo (Brazzaville) 21.758700 -4.038300 45
- Congo (Kinshasa) 21.758700 -4.038300 154
+ Congo (Kinshasa) 21.758700 -4.038300 161
  Diamond Princess 0.000000 0.000000 712
  Gambia -15.310100 13.443200 4
  Holy See 12.453400 41.902900 7
- Kyrgyzstan 74.766100 41.204400 147
- Laos 102.495496 19.856270 11
+ Kyrgyzstan 74.766100 41.204400 216
+ Laos 102.495496 19.856270 12
  MS Zaandam 0.000000 0.000000 9
  Saint Kitts and Nevis -62.782998 17.357822 10
  Saint Lucia -60.978900 13.909400 14
  Saint Vincent and the Grenadines -61.287200 12.984300 7
  Syria 38.996815 34.802075 19
- Taiwan* 121.000000 23.700000 363
- Venezuela -66.589700 6.423800 159
+ Taiwan* 121.000000 23.700000 373
+ Venezuela -66.589700 6.423800 165
  Western Sahara -12.885800 24.215500 4
%% Cell type:markdown id: tags:
# Read in geodata to get additional population numbers
%% Cell type:code id: tags:
``` python
geodata_df = pd.read_csv(geodata_path).drop('Unnamed: 0', axis=1).set_index('name_jhu')
```
%% Cell type:markdown id: tags:
Add in populations for missing countries
%% Cell type:code id: tags:
``` python
missing_countries = frames_map['confirmed'].loc[
    ~frames_map['confirmed'].index.get_level_values('Country/Region').isin(data_pop_ser.index)
].iloc[:,-2:].reset_index()['Country/Region']
display(geodata_df.loc[geodata_df.index.isin(missing_countries)])
# Series.append was removed in pandas 2.0; pd.concat is the equivalent.
data_pop_ser = pd.concat([data_pop_ser, geodata_df.loc[geodata_df.index.isin(missing_countries), 'pop_est']])
```
%%%% Output: display_data
%% Cell type:markdown id: tags:
# Compute rates per 100,000 for regions
%% Cell type:code id: tags:
``` python
def cases_to_rates_df(df):
    # Drop the Long/Lat index levels so the rows align with the population series.
    per_100000_df = df.reset_index([1, 2], drop=True)
    per_100000_df = per_100000_df.div(data_pop_ser, 'index').mul(100000).dropna()
    per_100000_df.index.name = 'Country/Region'
    return per_100000_df

def frames_to_rates(frames_map):
    return {k: cases_to_rates_df(v) for k,v in frames_map.items()}

rates_map = frames_to_rates(frames_map)
```
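The computation in the cell above amounts to rate = cases / population × 100,000, applied row by row with the population aligned on the country/region index. A minimal sketch with invented numbers follows; country "C" deliberately has no population entry, to show how `dropna()` removes regions that cannot be resolved.

```python
import pandas as pd

# Invented case counts for three regions on two dates.
cases = pd.DataFrame(
    {"2020-04-05": [50, 200, 7], "2020-04-06": [100, 400, 9]},
    index=pd.Index(["A", "B", "C"], name="Country/Region"),
)

# Invented populations; "C" is missing on purpose.
population = pd.Series({"A": 1_000_000, "B": 4_000_000})

# Divide each row by the matching population, scale to per-100,000,
# and drop rows where no population was available.
rates = cases.div(population, axis="index").mul(100_000).dropna()

print(rates)
```

Regions "A" and "B" have the same rates here even though their raw counts differ by a factor of four, which is the point of normalizing by population.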
%% Cell type:code id: tags:
``` python
if PAPERMILL_OUTPUT_PATH:
    for k, v in rates_map.items():
        out_path = os.path.join(out_folder, f"ts_rates_19-covid-{k}.csv")
        v.reset_index().to_csv(out_path)
```
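One side effect of `reset_index().to_csv(path)` is worth noting: the fresh integer index is written as an unnamed first column, which pandas labels `Unnamed: 0` when the file is read back (the geodata cell earlier drops a column of that name). The sketch below, using an invented one-row frame and in-memory buffers instead of real files, compares the default with `index=False`.

```python
import io

import pandas as pd

# Invented one-row rates frame.
rates = pd.DataFrame(
    {"2020-04-05": [5.0]},
    index=pd.Index(["A"], name="Country/Region"),
)

# Default: the integer index produced by reset_index() is written too.
with_index = io.StringIO()
rates.reset_index().to_csv(with_index)
with_index.seek(0)
cols_default = list(pd.read_csv(with_index).columns)

# index=False: only the data columns are written.
without_index = io.StringIO()
rates.reset_index().to_csv(without_index, index=False)
without_index.seek(0)
cols_no_index = list(pd.read_csv(without_index).columns)

print(cols_default)
print(cols_no_index)
```

Whether to switch to `index=False` depends on what the downstream notebooks expect, so this is an option to consider rather than a required change.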