Rok Roškar authored
Covid-19 Public Data Collaboration Project
This project aggregates data from various public sources to better understand the spread and effect of COVID-19. The goal is to provide a central place where data, analysis, and discussion can be conducted and shared by a global community struggling to make sense of the current public health emergency.
The project includes simple tools that make it very easy to work with the heterogeneous case data from various sources using common code. For a simple example, have a look at the Global and Regional COVID-19 summary notebook.
For each data source, we provide a summary notebook with interactive figures that can be used as starting points for further exploration:
- Summary of global data from JHU CSSE
- Global data from ECDC
- U.S. state-level data from covidtracking.com
- U.S. county-level data from the New York Times
- Regional data for Italy from the Italian Civil Protection
- Switzerland cantonal data collected by the Zürich Statistical Office
Case data is complemented by population figures from various sources. A summary of all the data can be found in the table below.
Getting started with the project
The project provides a place for easy access to the relevant data for the purposes of analysis and collaboration. It is envisioned to be hands-on; with a few clicks you can be analysing the latest data from around the globe. We hope that this will make it easier for domain specialists to team up with analysts and data scientists to tackle the open questions together.
The simplest way to start is to browse the available notebooks and take a look at the provided visualizations. If you would like something changed or would like some different data included, feel free to start a discussion!
When you are ready to get your hands on the data, start by making an account or logging in on https://renkulab.io and forking the project. Then, start an interactive environment and use the hosted JupyterLab or RStudio to explore the data.
If you don't know how to do something, drop us a line on Discourse, chat with us on Gitter, or open an issue and someone will be able to help out.
Is there a great data source that you wish we had included? Start a discussion!
Working with the data
A summary of the datasets available in this project is in the table below. In order to work more efficiently with the data, we have implemented a set of "converters" to standardize the various datasets to a subset of useful fields. Each converter is aware of the details of each dataset and produces a view of the dataset that is homogenized with the others. In this way, data from different sources can be used efficiently with minimal boilerplate code.
For example, to work with the JHU-CSSE country-level data as well as the more detailed dataset from Spain:
from covid_19_utils.converters import CaseConverter

converter = CaseConverter('./data/atlas')

# Read each dataset and convert it to the common set of fields
jhu_df = converter.read_convert('./data/covid-19_jhu-csse')
spain_df = converter.read_convert('./data/covid-19-spain')
The resulting DataFrames have exactly the same structure, so they can be used interchangeably in any analysis or plotting code. See the Global and Regional COVID-19 summary notebook for a more complete example.
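To make the idea behind the converters more concrete, here is a minimal, self-contained sketch of the general technique: renaming source-specific columns to a shared set of fields so that downstream code does not care where the data came from. This is not the project's CaseConverter. The source names, column names, common field names, and toy values below are all assumptions made up for illustration, and the sketch uses plain lists of dictionaries rather than DataFrames to stay dependency-free.

```python
# Hypothetical sketch of the "converter" idea. This is NOT the project's
# CaseConverter; all source names, field names, and values are made up.

# Per-source mapping from source-specific column names to common field names.
FIELD_MAPS = {
    "source_a": {"Date": "date", "Country/Region": "region", "Confirmed": "positive"},
    "source_b": {"fecha": "date", "CCAA": "region", "casos": "positive"},
}


def convert(rows, source):
    """Return rows renamed to the common fields, dropping all other columns."""
    mapping = FIELD_MAPS[source]
    return [
        {common: row[original] for original, common in mapping.items()}
        for row in rows
    ]


def total_positive(rows):
    """Works on any converted dataset because the field names are shared."""
    return sum(row["positive"] for row in rows)


# Toy input rows with different layouts (illustrative values only).
rows_a = [{"Date": "2020-03-20", "Country/Region": "Spain", "Confirmed": 100, "Lat": 40.0}]
rows_b = [{"fecha": "2020-03-20", "CCAA": "Madrid", "casos": 40}]

std_a = convert(rows_a, "source_a")
std_b = convert(rows_b, "source_b")

# Both results expose the same keys, so the same function handles either one.
print(total_positive(std_a), total_positive(std_b))
```

Keeping the per-source knowledge in one mapping table means that adding a new data source only requires a new entry, while analysis code such as total_positive stays unchanged.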
Updating your branch or fork
The data in the main master branch of this project is updated daily - how can you keep your fork or branch up-to-date? We recommend that you do not make changes to the files and directories that are automatically updated so as to avoid merge conflicts as much as possible. This includes the datasets in the data/ directory and the notebooks in notebooks/ and runs/. Especially for notebooks, the easiest way to avoid conflicts would be to simply make a new directory where you put your work.
When you are ready to pull in changes from master, run the following from a terminal while working on your branch or fork: