# Covid-19 Public Data Collaboration Project

This project aggregates data from various public sources to better understand
the spread and effect of COVID-19. The goal is to provide a central place where
data, analysis, and discussion can be conducted and shared by a global community
struggling to make sense of the current public health emergency.
The main goal is not to create another dashboard, or collection of dashboards, for
that matter. There are already many excellent ones available, for example, on
https://covid19dashboards.com.

The goal here is different: aggregate data from multiple
public sources, standardize the data formats, and make the data around COVID-19 easy to
work with. Rather than presenting answers to our questions, we want to make it easy for
you to explore the data, and formulate and answer your own questions. Some questions
can be answered by looking at just one data source, but many cannot. And it may be
worthwhile to run an analysis initially developed against one data source against another.
This project aims to make it possible, and hopefully even easy, to do these things.
For a simple example, have a look at the [Global and Regional COVID-19 summary
notebook](https://renkulab.io/projects/covid-19/covid-19-public-data/files/blob/runs/dataset_summary.run.ipynb).
For each data source, we provide a simple summary notebook with interactive
figures:
* [Summary of global data from JHU CSSE](https://renkulab.io/projects/covid-19/covid-19-public-data/files/blob/runs/Dashboard.run.ipynb)
* [Global data from ECDC](https://renkulab.io/projects/covid-19/covid-19-public-data/files/blob/runs/covid-19-ecdc.run.ipynb)
* [U.S. state-level data from covidtracking.com](https://renkulab.io/projects/covid-19/covid-19-public-data/files/blob/runs/covidtracking.run.ipynb)
* [U.S. county-level data from the New York Times](https://renkulab.io/projects/covid-19/covid-19-public-data/files/blob/runs/covid-19-us-nyt.run.ipynb)
* [Regional data for Italy from the Italian Civil Protection](https://renkulab.io/projects/covid-19/covid-19-public-data/files/blob/runs/covid-19-italy.run.ipynb)
* [Switzerland cantonal data collected by the Zürich Statistical Office](https://renkulab.io/projects/covid-19/covid-19-public-data/files/blob/runs/openzh-covid-19.run.ipynb)

Case data is complemented by population figures from various sources. A summary
of all the data can be found in the table below.

## Getting started with the project

The project is intended to provide a place for easy access to the relevant data for
the purposes of analysis and collaboration. It is envisioned to be hands-on;
with a few clicks you can be analysing the latest data from around the globe. We
hope that this will make it easier for domain specialists to team up with
analysts and data scientists to tackle the open questions together.

The simplest way to start is to browse the available notebooks and take a
look at the provided visualizations. If you would like something to be changed
or to include some different data, feel free to start a
[discussion](https://renkulab.io/projects/covid-19/covid-19-public-data/collaboration/issues)!
When you are ready to get your hands on the data, start by making an account or
logging in on https://renkulab.io and forking the project. Then, [start an interactive
environment](https://renkulab.io/projects/covid-19/covid-19-public-data/environments/new)
and use the hosted JupyterLab or RStudio to explore the data.

If you don't know how to do something, drop us a line [on
Discourse](https://renku.discourse.group), chat with us on
[Gitter](https://gitter.im/SwissDataScienceCenter/renku), or [open an
issue](https://renkulab.io/projects/covid-19/covid-19-public-data/collaboration/issues)
and someone will be able to help out.

Is there a great data source that you wish we had included? Start a
[discussion](https://renkulab.io/projects/covid-19/covid-19-public-data/collaboration/issues)!

## Working with the data

A summary of the datasets available in this project is in the table below. In
order to work more efficiently with the data, we have implemented a set of
"converters" to standardize the various datasets to a subset of useful fields.
Each converter is aware of the details of each dataset and produces a view of
the dataset that is homogenized with the others. In this way, data from
different sources can be used efficiently with minimal boilerplate code.

For example, to work with the JHU-CSSE country-level data as well as the more
detailed dataset from Spain:

```python
from covid_19_utils.converters import CaseConverter

converter = CaseConverter('./data/atlas')
jhu_df = converter.read_convert('./data/covid-19_jhu-csse')
spain_df = converter.read_convert('./data/covid-19-spain')
```

The resulting DataFrames have exactly the same structure so they can be used
interchangeably in any analysis or plotting code. See the [Global and Regional
COVID-19 summary
notebook](https://renkulab.io/projects/covid-19/covid-19-public-data/files/blob/runs/datasets_summary.run.ipynb)
for a more complete example.
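
To illustrate the idea behind the converters, here is a minimal, self-contained sketch that maps two differently shaped sources onto one common set of fields. It does not use the `covid_19_utils` API. The field names (`date`, `region`, `cases`), the `standardize` helper, and the sample records are all assumptions made up for illustration, not the schema the project actually uses.

```python
# Illustrative only: hypothetical field names and made-up sample records,
# not the actual schema or code used by covid_19_utils.

COMMON_FIELDS = ("date", "region", "cases")


def standardize(records, field_map):
    """Rename source-specific keys to the common field names.

    field_map maps each common field name to the key used by the source.
    """
    return [
        {field: record[field_map[field]] for field in COMMON_FIELDS}
        for record in records
    ]


# Two sources that describe the same kind of data with different keys.
source_a = [{"Date": "2020-04-01", "Country/Region": "A", "Confirmed": 10}]
source_b = [{"fecha": "2020-04-01", "ccaa": "B", "total": 7}]

rows_a = standardize(
    source_a, {"date": "Date", "region": "Country/Region", "cases": "Confirmed"}
)
rows_b = standardize(
    source_b, {"date": "fecha", "region": "ccaa", "cases": "total"}
)

# After standardization, the same analysis code works on both sources.
for rows in (rows_a, rows_b):
    print(sum(row["cases"] for row in rows))
```

Once every source is reduced to the same fields, plotting and analysis code only has to be written once, which is the benefit the converters aim for.
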
### Updating your branch or fork

The data in the master branch of this project is updated daily, so how can
you keep your fork or branch up-to-date? We recommend that you do not make
changes to the files and directories that are automatically updated so as to
avoid merge conflicts as much as possible. This includes the datasets in the
`data/` directory and the notebooks in `notebooks/` and `runs/`. Especially for
notebooks, the easiest way to avoid conflicts would be to simply make a new
directory where you put your work.

When you are ready to pull in changes from master, you can do the following from
a terminal, when working on your branch or fork:

```
git remote add upstream https://renkulab.io/gitlab/covid-19/covid-19-public-data.git
git fetch upstream
git merge upstream/master
```

This will sync your branch or fork with the latest changes from the master
branch of the parent repository.
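
If you sync regularly, note that `git remote add` fails when the `upstream` remote already exists. The sketch below wraps the same three commands in a small Python helper that skips the `remote add` step when it is not needed. The function names `sync_commands` and `sync` are hypothetical; only the git commands and the URL come from the instructions above.

```python
import subprocess

# URL taken from the git commands shown above.
UPSTREAM_URL = "https://renkulab.io/gitlab/covid-19/covid-19-public-data.git"


def sync_commands(existing_remotes, remote="upstream", branch="master"):
    """Return the git commands needed to merge the parent's branch."""
    commands = []
    if remote not in existing_remotes:
        # `git remote add` fails if the remote is already configured.
        commands.append(["git", "remote", "add", remote, UPSTREAM_URL])
    commands.append(["git", "fetch", remote])
    commands.append(["git", "merge", f"{remote}/{branch}"])
    return commands


def sync():
    """Run the sync inside the current git working tree."""
    listed = subprocess.run(
        ["git", "remote"], check=True, capture_output=True, text=True
    )
    for command in sync_commands(listed.stdout.split()):
        subprocess.run(command, check=True)


# Show what would run in a fork that has no `upstream` remote yet.
for command in sync_commands(existing_remotes=["origin"]):
    print(" ".join(command))
```

Calling `sync()` from the root of your fork or branch performs the merge; any conflicts still have to be resolved by hand.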

### Project structure

`data/`: contains all of the datasets.

`notebooks/`: contains the sample notebooks. The ones in the base directory are executed automatically every time the project is updated and their rendered versions can be found in the `runs/` directory.

`runs/`: contains executed (rendered) versions of various pre- and post-processing notebooks.

`src/covid-19/covid_19_utils`: contains the data converters as well as some
useful helper and plotting functions that are used in the sample notebooks.

## Dataset Summary

<table class="table">
<thead>
<tr>
<th>Source</th>
<th>Dataset</th>
<th>Location</th>
<th>Example</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://github.com/CSSEGISandData/COVID-19">Covid-19 Data Repository at JHU CSSE</a></td>
<td><a href="https://renkulab.io/projects/covid-19/covid-19-public-data/datasets/f6726a5b-f973-45d5-b873-30fa0dff772f/">covid-19_jhu-csse</a></td>
<td><code>data/covid-19_jhu-csse</code></td>
<td><a href="https://renkulab.io/projects/covid-19/covid-19-public-data/files/blob/runs/Dashboard.run.ipynb">notebooks/Dashboard.ipynb</a></td>
</tr>
<tr>
<td><a href="https://www.ecdc.europa.eu/en/publications-data/download-todays-data-geographic-distribution-covid-19-cases-worldwide">Covid-19 data collected by the ECDC</a></td>
<td><a href="https://renkulab.io/projects/covid-19/covid-19-public-data/datasets/78a35752-cc00-443d-8ed8-e37a82599099/">covid-19-ecdc</a></td>
<td><code>data/covid-19-ecdc</code></td>
<td><a href="https://renkulab.io/projects/covid-19/covid-19-public-data/files/blob/runs/covid-19-ecdc.run.ipynb">notebooks/covid-19-ecdc.ipynb</a></td>
</tr>
<tr>
<td><a href="https://covidtracking.com/">covidtracking.com</a></td>
<td><a href="https://renkulab.io/projects/covid-19/covid-19-public-data/datasets/c8bec148-5332-4602-9dc3-e39bbe92ed67/">covidtracking</a></td>
<td><code>data/covidtracking</code></td>
<td><a href="https://renkulab.io/projects/covid-19/covid-19-public-data/files/blob/runs/covidtracking.run.ipynb">notebooks/covidtracking.ipynb</a></td>
</tr>
<tr>
<td><a href="https://github.com/nytimes/covid-19-data">New York Times Covid-19 Data</a></td>
<td><a href="https://renkulab.io/projects/covid-19/covid-19-public-data/datasets/dcac07eb-4c9c-40c5-b541-5072c8302750/">covid-19-us-nyt</a></td>
<td><code>data/covid-19-us-nyt</code></td>
<td><a href="https://renkulab.io/projects/covid-19/covid-19-public-data/files/blob/runs/covid-19-us-nyt.run.ipynb">notebooks/covid-19-us-nyt.ipynb</a></td>
</tr>
<tr>
<td><a href="https://github.com/openZH/covid_19">Swiss Cantonal Data</a></td>
<td><a href="https://renkulab.io/projects/covid-19/covid-19-public-data/datasets/c9295d7a-0380-4a1b-8731-5c36d76cb8e7/">openzh-covid-19</a></td>
<td><code>data/openzh-covid-19</code></td>
<td><a href="https://renkulab.io/projects/covid-19/covid-19-public-data/files/blob/runs/openzh-covid-19.run.ipynb">notebooks/openzh-covid-19.ipynb</a></td>
</tr>
<tr>
<td><a href="https://github.com/pcm-dpc/COVID-19">Covid-19 data for Italy</a></td>
<td><a href="https://renkulab.io/projects/covid-19/covid-19-public-data/datasets/286c58b1-dbbc-4caa-a23a-fcb001d5ac51/">covid-19-italy</a></td>
<td><code>data/covid-19-italy</code></td>
<td><a href="https://renkulab.io/projects/covid-19/covid-19-public-data/files/blob/runs/covid-19-italy.run.ipynb">notebooks/covid-19-italy.ipynb</a></td>
</tr>
<tr>
    <td><a href="https://github.com/itoledor/coronavirus.git">Covid-19 data for Chile</a></td>
    <td><a href="https://renkulab.io/projects/covid-19/covid-19-public-data/datasets/e7bc5616-1e7c-44a9-995f-bce3cba304b5/">covid-19-chile</a></td>
    <td><code>data/covid-19-chile</code></td>
    <td><a href="https://renkulab.io/projects/covid-19/covid-19-public-data/files/blob/notebooks/examples-R/covid19-chile.ipynb">notebooks/examples-R/covid19-chile.ipynb</a></td>
</tr>
<tr>
    <td><a href="https://github.com/datadista/datasets.git">Covid-19 data for Spain</a></td>
    <td><a href="https://renkulab.io/projects/covid-19/covid-19-public-data/datasets/4de0e2e6-c748-4aaf-a2ac-4a3fb0257ed1/">covid-19-spain</a></td>
    <td><code>data/covid-19-spain</code></td>
    <td>N/A</td>
</tr>
<tr>
    <td><a href="https://covidtracker.bsg.ox.ac.uk/">Oxford COVID-19 Government Response Tracker</a></td>
    <td><a href="https://renkulab.io/projects/covid-19/covid-19-public-data/datasets/9864a18c-fed2-4f19-bc6e-0854b65b282a/">covidtracker</a></td>
    <td><code>data/covidtracker</code></td>
    <td>N/A</td>
</tr>
<tr>
<td><a href="https://github.com/echen102/COVID-19-TweetIDs">Covid-19 tweet IDs</a></td>
<td><a href="https://renkulab.io/projects/covid-19/covid-19-public-data/datasets/0fc08252-cb39-4b59-bc82-9b213ec0bec6/">covid-19-tweet-ids</a></td>
<td><code>data/covid-19-tweet-ids</code></td>
<td>N/A</td>
</tr>
<tr>
<td><a href="https://www.apple.com/covid19/mobility">Apple</a>
    <a href="https://www.google.com/covid19/mobility/">Google</a>
    <a href="https://docs.google.com/spreadsheets/d/1zu9qEWI8PsOI_i8nI_S29HDGHlIp2lfVMsGxpQ5tvAQ/edit#gid=2102005060">BU</a>
</td>
<td><a href="https://renkulab.io/datasets/46c02f05-393a-4a78-bd5d-d3bf2c23d577">Measures of social distancing</a></td>
<td><code>data/distancing-metrics</code></td>
<td><a href="https://renkulab.io/projects/covid-19/covid-19-public-data/files/blob/notebooks/examples/distancing-measures.ipynb">notebooks/examples/distancing-measures.ipynb</a></td>
</tr>
</tbody>
</table>

### Covid-19 Data Repository at JHU CSSE

This is a global Covid-19 dataset updated regularly from [Johns Hopkins
University Center for Systems Science and Engineering (JHU
CSSE)](https://github.com/CSSEGISandData/COVID-19). The
[dashboard](https://renkulab.io/projects/covid-19/covid-19-public-data/files/blob/runs/Dashboard.run.ipynb) summarizes
this data in combination with population data from the World Bank.

### Covid-19 Data collected by the ECDC

A global dataset collected by a team of epidemiologists at the [European Centre
for Disease Prevention and
Control](https://www.ecdc.europa.eu/en/publications-data/download-todays-data-geographic-distribution-covid-19-cases-worldwide).

### Covid tracking crowdsourcing project

[Covid tracking](https://covidtracking.com) is a crowd-sourced dataset for US
state-level data. It is updated by hand by an army of volunteers and notably
contains data on testing.

### New York Times Covid-19 Dataset

The [New York Times Covid-19 Dataset](https://github.com/nytimes/covid-19-data)
provides open access to data on Covid-19 cases and deaths per U.S. state
and county. Please note the [geographic exceptions
section](https://github.com/nytimes/covid-19-data#geographic-exceptions).

### Covid-19 Data for Swiss Cantons

The [Swiss cantonal data](https://github.com/openZH/covid_19) is collected by the
Zürich statistical office. Parts are updated manually, others are starting to
become automated.

### Covid-19 Data for Italy

Detailed data compiled by the [Civil Protection of
Italy](https://github.com/pcm-dpc/COVID-19). Includes detailed data on
hospitalizations, ICU cases, etc., broken down to the provincial level.

### Case data for Spain

[Data for regions of
Spain](https://github.com/datadista/datasets/tree/master/COVID%2019) compiled by
[Datadista](https://github.com/datadista), an investigative journalism team in
Spain.

### Oxford COVID-19 Government Response Tracker

A compilation of global COVID-19 policy responses led by the Blavatnik School of
Government at the University of Oxford. More details
[here](https://covidtracker.bsg.ox.ac.uk/).

### Covid-19 related tweet IDs

A collection of tweet IDs related to Covid-19 from https://github.com/echen102/COVID-19-TweetIDs.

### General

- https://data.worldbank.org/indicator/SP.POP.TOTL
- https://worldmap.harvard.edu/data/geonode:country_centroids_az8
- https://wikidata.com (for population figures)

### How is the data updated?

Each morning an automatic pipeline is executed that fetches new data from each
of the data sources. `renku` is then used to run whichever pipelines are
necessary to update the pre-processed data and the rendered notebooks in order
to reflect the changes in the data.


## Derived Dataset Summary

<table class="table">
<thead>
<tr>
<th>Dataset</th>
<th>Location</th>
<th>Code</th>
</tr>
</thead>
<tbody>
<tr>
<td>Case population rates</td>
<td><code>data/covid-19_rates</code></td>
<td><a href="https://renkulab.io/projects/covid-19/covid-19-public-data/files/blob/notebooks/process/ToRates.ipynb">notebooks/process/ToRates.ipynb</a></td>
</tr>
</tbody>
</table>
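
As a rough illustration of what a population rate is, the sketch below converts case counts to cases per 100,000 inhabitants. The function name, the scaling factor, and the sample numbers are assumptions chosen for illustration. The `ToRates.ipynb` notebook linked in the table is the authoritative source for how the derived dataset is actually computed.

```python
def cases_per_100k(cases, population):
    """Return cases per 100,000 inhabitants."""
    if population <= 0:
        raise ValueError("population must be positive")
    # Multiply before dividing to keep the arithmetic exact for integers.
    return cases * 100_000 / population


# Made-up numbers, for illustration only.
regions = {
    "Region A": {"cases": 500, "population": 1_000_000},
    "Region B": {"cases": 500, "population": 250_000},
}

for name, values in regions.items():
    rate = cases_per_100k(values["cases"], values["population"])
    print(f"{name}: {rate:.1f} cases per 100,000")
```

Normalizing by population makes regions of very different sizes comparable: the two hypothetical regions report the same case count but very different rates.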

## Contributing

If you are interested in working on this project, we would love to get
contributions. We would really like to collect more data sources and make them
available here! Please suggest any relevant data sources you know of.

If you want to add a new data source yourself, see the section [Adding a new data
source](#adding-a-new-data-source).

## Data Sources to Add

See the [data sources issue](https://renkulab.io/projects/covid-19/covid-19-public-data/collaboration/issues/1/).

## Adding a new data source

Adding a new data source is easy! To do so, in your fork or branch of the project, do the following:

* Create a renku dataset using `renku dataset create [dataset name]`
* Add any files or folders using `renku dataset add`. See the [renku dataset documentation](https://renku-python.readthedocs.io/en/latest/commands.html#module-renku.cli.dataset) for more details.
* Create a notebook that shows how to read and work with the dataset in the `notebooks/examples` folder
    * Protip: use a unique name for the notebook to avoid merge conflicts
* Add an issue to the project for any suggestions on things to do with the data
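
An example notebook for a new dataset could start from something like the following sketch, which lists the CSV files in a dataset folder and reads the first few rows of each using only the standard library. The dataset name `my-new-dataset` and the helper `preview_dataset` are placeholders. The assumption that the files are CSVs stored under `data/<dataset name>` mirrors the layout of the existing datasets and may not hold for yours.

```python
import csv
from pathlib import Path


def preview_dataset(dataset_dir, max_rows=5):
    """Return {relative file path: first rows as dicts} for each CSV found."""
    root = Path(dataset_dir)
    if not root.is_dir():
        return {}
    previews = {}
    for path in sorted(root.rglob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle)
            # Read at most max_rows rows so large files stay cheap to preview.
            rows = [row for _, row in zip(range(max_rows), reader)]
        previews[str(path.relative_to(root))] = rows
    return previews


# "my-new-dataset" is a placeholder; replace it with your dataset's name.
for name, rows in preview_dataset("data/my-new-dataset").items():
    print(name, rows)
```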