
Covid-19 Public Data Collaboration Project

This project aggregates data from various public sources to better understand the spread and effect of COVID-19. The goal is to provide a central place where data, analysis, and discussion can be conducted and shared by a global community struggling to make sense of the current public health emergency.

The main goal is not to create another dashboard, or collection of dashboards, for that matter. There are already many excellent ones available, for example, on https://covid19dashboards.com.

The goal here is different: aggregate data from multiple public sources, standardize the data formats, and make the data around COVID-19 easy to work with. Rather than presenting answers to our questions, we want to make it easy for you to explore the data, and formulate and answer your own questions. Some questions can be answered by looking at just one data source, but many cannot. It may also be worthwhile to rerun an analysis that was developed for one data source on another. This project aims to make these things possible, and hopefully even easy. For a simple example, have a look at the Global and Regional COVID-19 summary notebook.

For each data source, we provide a simple summary notebook with interactive figures; where available, it is listed in the Example column of the Dataset Summary table below.

Case data is complemented by population figures from various sources. A summary of all the data can be found in the table below.

Getting started with the project

The project provides a place for easy access to the relevant data for the purposes of analysis and collaboration. It is meant to be hands-on: with a few clicks you can be analysing the latest data from around the globe. We hope that this will make it easier for domain specialists to team up with analysts and data scientists to tackle the open questions together.

The simplest way to start is to browse the available notebooks and take a look at the provided visualizations. If you would like something changed, or would like different data included, feel free to start a discussion!

When you are ready to get your hands on the data, start by making an account or logging in on https://renkulab.io and forking the project. Then, start an interactive environment and use the hosted JupyterLab or RStudio to explore the data.

If you don't know how to do something, drop us a line on Discourse, chat with us on Gitter, or open an issue, and someone will be able to help out.

Is there a great data source that you wish we had included? Start a discussion!

Working with the data

A summary of the datasets available in this project is in the table below. In order to work more efficiently with the data, we have implemented a set of "converters" to standardize the various datasets to a subset of useful fields. Each converter is aware of the details of each dataset and produces a view of the dataset that is homogenized with the others. In this way, data from different sources can be used efficiently with minimal boilerplate code.

For example, to work with the JHU-CSSE country-level data as well as the more detailed dataset from Spain:

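from covid_19_utils.converters import CaseConverter

# Create the converter from the ./data/atlas directory
converter = CaseConverter('./data/atlas')

# Read each dataset and convert it to the common, standardized structure
jhu_df = converter.read_convert('./data/covid-19_jhu-csse')
spain_df = converter.read_convert('./data/covid-19-spain')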

The resulting DataFrames have exactly the same structure, so they can be used interchangeably in any analysis or plotting code. See the Global and Regional COVID-19 summary notebook for a more complete example.
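
The idea behind a converter can be illustrated with a small, self-contained sketch. It is not the project's CaseConverter: the source names, the field names, and the shared schema (`date`, `region`, `positive`) below are hypothetical and chosen only for illustration. Each source gets a mapping from its own column names to the shared ones, so downstream code only ever sees one record layout.

```python
# Illustrative sketch only: NOT the covid_19_utils implementation.
# All source names and field names here are hypothetical.

# Map each source's column names onto a shared schema.
FIELD_MAPS = {
    "source_a": {"Date": "date", "Country/Region": "region", "Confirmed": "positive"},
    "source_b": {"fecha": "date", "CCAA": "region", "casos": "positive"},
}


def convert(records, source):
    """Return records renamed to the shared schema, dropping unmapped fields."""
    field_map = FIELD_MAPS[source]
    converted = []
    for record in records:
        row = {common: record[original] for original, common in field_map.items()}
        row["positive"] = int(row["positive"])  # normalise the count to an integer
        converted.append(row)
    return converted


# Made-up input rows in two different source layouts.
raw_a = [{"Date": "2020-03-01", "Country/Region": "Italy", "Confirmed": "1694", "Lat": "41.9"}]
raw_b = [{"fecha": "2020-03-01", "CCAA": "Madrid", "casos": "10"}]

# After conversion, both sources share one structure and can be combined.
combined = convert(raw_a, "source_a") + convert(raw_b, "source_b")
total_positive = sum(row["positive"] for row in combined)
```

Because every converted row has the same keys, the final two lines do not need to know which source a row came from. That is the property the real converters provide for DataFrames.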

Updating your branch or fork

The data in the master branch of this project is updated daily, so how can you keep your fork or branch up to date? To avoid merge conflicts as much as possible, we recommend that you do not change the files and directories that are updated automatically. These include the datasets in the data/ directory and the notebooks in notebooks/ and runs/. For notebooks in particular, the easiest way to avoid conflicts is to create a new directory for your own work.

When you are ready to pull in changes from master, run the following from a terminal in your branch or fork:

git remote add upstream https://renkulab.io/gitlab/covid-19/covid-19-public-data.git
git fetch upstream
git merge upstream/master

This will sync your branch or fork with the latest changes from the master branch of the parent repository.
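
One practical wrinkle: `git remote add` fails if a remote named `upstream` already exists, so the first command only needs to run once. As a sketch of how the sequence could be scripted, the hypothetical helper below (not part of the project) builds the list of git commands and skips the `remote add` step when the remote is already configured.

```python
import subprocess

UPSTREAM_URL = "https://renkulab.io/gitlab/covid-19/covid-19-public-data.git"


def sync_commands(existing_remotes, remote="upstream", branch="master"):
    """Build the git commands needed to merge the parent repository's branch."""
    commands = []
    if remote not in existing_remotes:
        # Only register the remote the first time.
        commands.append(["git", "remote", "add", remote, UPSTREAM_URL])
    commands.append(["git", "fetch", remote])
    commands.append(["git", "merge", f"{remote}/{branch}"])
    return commands


def sync_with_upstream():
    """Run the commands in the current repository, stopping at the first failure."""
    remotes = subprocess.run(
        ["git", "remote"], check=True, capture_output=True, text=True
    ).stdout.split()
    for command in sync_commands(remotes):
        subprocess.run(command, check=True)
```

Calling `sync_with_upstream()` from inside the repository runs the same three steps as above. A merge conflict makes `git merge` exit with a non-zero status, which surfaces as `subprocess.CalledProcessError` and still has to be resolved by hand.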

Project structure

data/: contains all of the datasets.

notebooks/: contains the sample notebooks. The ones in the base directory are executed automatically every time the project is updated and their rendered versions can be found in the runs/ directory.

runs/: contains executed (rendered) versions of various pre- and post-processing notebooks.

src/covid-19/covid_19_utils: contains the data converters as well as some useful helper and plotting functions that are used in the sample notebooks.

Dataset Summary

Source	Dataset	Location	Example
Covid-19 Data Repository at JHU CSSE	covid-19_jhu-csse	`data/covid-19_jhu-csse`	notebooks/Dashboard.ipynb
Covid-19 data collected by the ECDC	covid-19-ecdc	`data/covid-19-ecdc`	notebooks/covid-19-ecdc.ipynb
covidtracking.com	covidtracking	`data/covidtracking`	notebooks/covidtracking.ipynb
New York Times Covid-19 Data	covid-19-us-nyt	`data/covid-19-us-nyt`	notebooks/covid-19-us-nyt.ipynb
Swiss Cantonal Data	openzh-covid-19	`data/openzh-covid-19`	notebooks/openzh-covid-19.ipynb
Covid-19 data for Italy	covid-19-italy	`data/covid-19-italy`	notebooks/covid-19-italy.ipynb
Covid-19 data for Chile	covid-19-chile	`data/covid-19-chile`	notebooks/examples-R/covid19-chile.ipynb
Covid-19 data for Spain	covid-19-spain	`data/covid-19-spain`	N/A
Oxford COVID-19 Government Response Tracker	covidtracker	`data/covidtracker`	N/A
Covid-19 tweet IDs	covid-19-tweet-ids	`data/covid-19-tweet-ids`	N/A
Apple Google BU	Measures of social distancing	`data/distancing-metrics`	notebooks/examples/distancing-measures.ipynb

Covid-19 Data Repository JHU CSSE

This is a global Covid-19 dataset maintained and regularly updated by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE). The dashboard summarizes this data in combination with population data from the World Bank.

Covid-19 Data collected by the ECDC

A global dataset collected by a team of epidemiologists at the European Centre for Disease Prevention and Control (ECDC).

Covid tracking crowdsourcing project

Covid tracking is a crowd-sourced dataset for US state-level data. It is updated by hand by an army of volunteers and notably contains data on testing.

New York Times Covid-19 Dataset

The New York Times Covid-19 Dataset provides open access to data on Covid-19 cases and deaths per U.S. state and county. Please note the geographic exceptions section.

Covid-19 Data for Swiss Cantons

Swiss cantonal data collected by the Zürich statistical office. Parts are updated manually; others are gradually being automated.

Case data for Italy

Detailed data compiled by the Civil Protection of Italy. Includes data on hospitalizations, ICU cases, etc., broken down to the provincial level.

Case data for Spain

Data for regions of Spain compiled by Datadista, an investigative journalism team in Spain.

Oxford COVID-19 Government Response Tracker

A compilation of global COVID-19 policy responses led by the Blavatnik School of Government at the University of Oxford. More details here.

Covid-19 related tweet IDs

A collection of tweet IDs related to Covid-19 from https://github.com/echen102/COVID-19-TweetIDs.

General

https://data.worldbank.org/indicator/SP.POP.TOTL
https://worldmap.harvard.edu/data/geonode:country_centroids_az8
https://wikidata.com (for population figures)

How is the data updated?

Each morning, an automatic pipeline fetches new data from each of the data sources. Renku is then used to run whichever pipelines are necessary to update the pre-processed data and the rendered notebooks so that they reflect the changes in the data.

Derived Dataset Summary

Dataset	Location	Code
Case population rates	`data/covid-19_rates`	notebooks/process/ToRates.ipynb
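
The source does not spell out how the rates are normalised. As a rough sketch of what a population rate typically looks like, the hypothetical function below expresses a case count per 100,000 inhabitants. The scale factor and the function name are assumptions; the notebook listed in the table remains the authoritative definition.

```python
def cases_per_100k(cases, population):
    """Cases per 100,000 inhabitants (assumed scale, not confirmed by the project)."""
    if population <= 0:
        raise ValueError("population must be positive")
    return cases * 100_000 / population


# Made-up numbers purely for illustration.
rate = cases_per_100k(250, 1_000_000)
```

Normalising by population in this way is what makes regions of very different sizes comparable.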

Contributing

If you are interested in working on this project, we would love to get contributions. We would really like to collect more data sources and make them available here! Please provide ideas for data sources that are relevant to understanding covid-19.

If you want to add a new data source yourself, see the section Adding a new data source.

Data Sources to Add

See the data sources issue.

Adding a new data source

Adding a new data source is easy! To do so, in your fork or branch of the project, do the following:

1. Create a renku dataset using renku dataset create [dataset name].
2. Add any files or folders using renku dataset add. See the renku dataset documentation for more details.
3. Create a notebook that shows how to read and work with the dataset in the notebooks/examples folder.
   - Protip: use a unique name for the notebook to avoid merge conflicts.
4. Add an issue to the project for any suggestions on things to do with the data.
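
For the notebook step, a first cell usually loads the files that were added to the dataset. The sketch below assumes the dataset consists of CSV files stored under `data/<dataset name>`, which matches the locations in the Dataset Summary table; the CSV format and the function name are assumptions. It uses only the standard library, so it does not depend on the project's environment.

```python
import csv
from pathlib import Path


def read_dataset_rows(dataset_dir):
    """Yield (file name, row dict) for every CSV file below dataset_dir."""
    for path in sorted(Path(dataset_dir).rglob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                yield path.name, row
```

For example, `rows = list(read_dataset_rows("data/my-dataset"))` would collect every row of a hypothetical dataset named `my-dataset`. Note that all values are returned as strings and need converting before any numeric analysis.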
