Covid-19 Public Data Collaboration Project
This project aggregates data from various public sources to better understand the spread and effect of COVID-19. The goal is to provide a central place where data, analysis, and discussion can be conducted and shared by a global community struggling to make sense of the current public health emergency.
For each data source, we provide a simple summary notebook with interactive figures:
- Summary of global data from JHU CSSE
- Global data from ECDC
- U.S. state-level data from covidtracking.com
- U.S. county-level data from the New York Times
- Regional data for Italy from the Italian Civil Protection
- Cantonal data for Switzerland collected by the Zürich Statistical Office
Case data is complemented by population figures from various sources. A summary of all the data can be found in the table below.
Getting started with the project
The goal of this project is not to build yet another dashboard. Instead, it provides a place for easy access to the relevant data for the purposes of analysis and collaboration. The project is meant to be hands-on: with a few clicks you can be analyzing the latest data from around the globe.
The simplest way to start is to make an account (or log in) and fork the project. Then start an interactive environment and use the hosted JupyterLab or RStudio to explore the data.
If you don't know how to do something, drop us a line on Discourse, chat with us on Gitter, or open an issue, and someone will help out.
Is there a great data source that you wish we had included? Start a discussion!
Working with the data
A summary of the datasets available in this project is in the table below. To work more efficiently with the data, we have implemented a set of "converters" that standardize the various datasets to a subset of useful fields. Each converter knows the details of its dataset and produces a view that is homogenized with the others. In this way, data from different sources can be used together with minimal boilerplate code.
For example, to work with the JHU-CSSE country-level data as well as the more detailed dataset from Spain:
```python
from covid_19_utils.converters import CaseConverter

converter = CaseConverter('./data/atlas')
jhu_df = converter.read_convert('./data/covid-19_jhu-csse')
spain_df = converter.read_convert('./data/covid-19-spain')
```
The resulting DataFrames have exactly the same structure, so they can be used interchangeably in any analysis or plotting code. See the standardization notebook for a more complete example.
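As a quick illustration of what the shared structure buys you, here is a minimal sketch of one helper applied to both frames. The column names used below (`date`, `region`, `confirmed`) are assumptions for the example and may differ from the fields the converter actually produces; check the standardization notebook for the real schema.

```python
from covid_19_utils.converters import CaseConverter

converter = CaseConverter('./data/atlas')
jhu_df = converter.read_convert('./data/covid-19_jhu-csse')
spain_df = converter.read_convert('./data/covid-19-spain')


def latest_totals(df, region_col="region", date_col="date", value_col="confirmed"):
    """Return the most recent value of ``value_col`` for each region.

    The column names are placeholders for the standardized fields; adjust
    them to match the schema produced by the converter in your checkout.
    """
    latest = df.sort_values(date_col).groupby(region_col).tail(1)
    return latest[[region_col, date_col, value_col]]


# Because both frames share the standardized schema, the same helper works
# on either of them without any source-specific code.
print(latest_totals(jhu_df).head())
print(latest_totals(spain_df).head())
```

The same pattern extends to any of the other converted datasets.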
Updating your branch or fork
The data in the main master branch of this project is updated daily, so how can you keep your fork or branch up to date? We recommend that you do not make changes to the files and directories that are automatically updated, so as to avoid merge conflicts as much as possible. This includes the datasets in the `data/` directory and the notebooks in `notebooks/` and `runs/`. Especially for notebooks, the easiest way to avoid conflicts is to make a new directory for your own work.
When you are ready to pull in changes from master, run the following from a terminal while working on your branch or fork:
```bash
git remote add upstream https://renkulab.io/gitlab/covid-19/covid-19-public-data.git
git fetch upstream
git merge upstream/master
```
This will sync your branch or fork with the latest changes from the master branch of the parent repository.