Covid-19 Public Data Collaboration Project

This project aggregates data from various public sources to better understand the spread and effect of COVID-19. The goal is to provide a central place where a global community trying to make sense of the current public health emergency can share data and analyses and discuss them.

For each data source, we provide a simple summary notebook with interactive figures:

Case data is complemented by population figures from various sources. A summary of all the data can be found in the table below.

Getting started with the project

The goal of this project is not to build yet another dashboard - instead, it provides a place for easy access to the relevant data for the purposes of analysis and collaboration. This project is envisioned to be hands-on; with a few clicks you can be analysing the latest data from around the globe.

The simplest way to start is to create an account (or log in) and fork the project. Then start an interactive environment and use the hosted JupyterLab or RStudio to explore the data.

If you don't know how to do something, drop us a line on Discourse, chat with us on Gitter, or open an issue, and someone will be able to help out.

Is there a great data source that you wish we had included? Start a discussion!

Working with the data

A summary of the datasets available in this project is in the table below. To make working with the data more efficient, we have implemented a set of "converters" that standardize the various datasets to a common subset of useful fields. Each converter knows the details of its dataset and produces a view of it that is homogenized with the others. In this way, data from different sources can be used together with minimal boilerplate code.

For example, to work with the JHU-CSSE country-level data as well as the more detailed dataset from Spain:

from covid_19_utils.converters import CaseConverter

converter = CaseConverter('./data/atlas')
jhu_df = converter.read_convert('./data/covid-19_jhu-csse')
spain_df = converter.read_convert('./data/covid-19-spain')

The resulting DataFrames have exactly the same structure, so they can be used interchangeably in any analysis or plotting code. See the standardization notebook for a more complete example.
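
Because the converted DataFrames share one structure, they can also be stacked into a single table for side-by-side analysis. The snippet below is a minimal sketch of that idea, not part of covid_19_utils: the helper name combine_sources and the added "source" column are hypothetical, and it assumes the converted frames do not already contain a column called "source".

```python
import pandas as pd


def combine_sources(frames):
    """Stack identically structured DataFrames into one table.

    `frames` maps a label of your choosing to a DataFrame. A hypothetical
    "source" column records which frame each row came from.
    """
    columns = None
    for label, df in frames.items():
        if columns is None:
            # Remember the column layout of the first frame.
            columns = list(df.columns)
        elif list(df.columns) != columns:
            raise ValueError(f"columns of {label!r} differ from the first frame")
    # Tag each frame with its label, then concatenate with a fresh index.
    tagged = [df.assign(source=label) for label, df in frames.items()]
    return pd.concat(tagged, ignore_index=True)
```

With the frames from the example above, this could be called as combine_sources({"jhu": jhu_df, "spain": spain_df}). The column check is there so that a mismatch between sources fails loudly instead of silently producing columns full of missing values.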

Updating your branch or fork

The data in the master branch of this project is updated daily, so how can you keep your fork or branch up to date? To avoid merge conflicts as much as possible, we recommend that you do not change the files and directories that are updated automatically. These include the datasets in the data/ directory and the notebooks in notebooks/ and runs/. For notebooks especially, the easiest way to avoid conflicts is to create a new directory and put your work there.

When you are ready to pull in changes from master, run the following from a terminal while on your branch or fork:

git remote add upstream https://renkulab.io/gitlab/covid-19/covid-19-public-data.git
git fetch upstream
git merge upstream/master

This will sync your branch or fork with the latest changes from the master branch of the parent repository.
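
Before merging, it can help to check whether your branch touches any of the automatically updated directories, since those are the likely source of conflicts. The following is a rough sketch of such a check and is not part of the project: the function names are hypothetical, and it assumes the upstream remote has been added and fetched as shown above.

```python
import subprocess
from pathlib import PurePosixPath

# Top-level directories this README describes as automatically updated.
AUTO_UPDATED = ("data", "notebooks", "runs")


def touches_auto_updated(paths, protected=AUTO_UPDATED):
    """Return the paths whose top-level directory is in `protected`."""
    flagged = []
    for path in paths:
        parts = PurePosixPath(path).parts
        if parts and parts[0] in protected:
            flagged.append(path)
    return flagged


def changed_paths(base="upstream/master"):
    """List files changed on the current branch since it diverged from `base`."""
    result = subprocess.run(
        ["git", "diff", "--name-only", f"{base}...HEAD"],
        capture_output=True,
        text=True,
        check=True,
    )
    return [line for line in result.stdout.splitlines() if line]
```

Calling touches_auto_updated(changed_paths()) from inside the repository lists the files most likely to conflict with the daily updates. An empty list suggests the merge should go through cleanly for those directories.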