COVID-19 Open-Data

COVID-19 Open-Data aims to assemble the largest COVID-19 epidemiological database, together with an extensive set of covariates. It includes open, publicly sourced, licensed data relating to demographics, economy, epidemiology, geography, health, hospitalizations, mobility, government response, weather, and more.

Details are available in the project's GitHub repository.

It's easy to insert this data into ClickHouse...

Note

The following commands were executed on a Production instance of ClickHouse Cloud. You can easily run them on a local install as well.

  1. Let's see what the data looks like:
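
A minimal sketch of this inspection step, assuming the epidemiology table is published as a CSV at the Google Cloud Storage URL below (the URL is an assumption, since the text here does not give it). Running DESCRIBE on the url table function asks ClickHouse to infer column names and types from the file without loading it into a table:

```sql
-- Assumed dataset location; not stated in the surrounding text.
-- 'CSVWithNames' tells ClickHouse that the first row holds column names.
DESCRIBE url(
    'https://storage.googleapis.com/covid19-open-data/v3/epidemiology.csv',
    'CSVWithNames'
);
```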

The CSV file has 10 columns:

  2. Now let's view some of the rows:
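
A sketch of such a preview query, again assuming the same CSV location (an assumption, not given in the text). The LIMIT of 100 is an arbitrary sample size chosen for illustration:

```sql
-- Read rows straight from the remote CSV; no table is needed yet.
-- The URL is an assumed location for the epidemiology file.
SELECT *
FROM url(
    'https://storage.googleapis.com/covid19-open-data/v3/epidemiology.csv',
    'CSVWithNames'
)
LIMIT 100;
```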

Notice the url function easily reads data from a CSV file:

  3. Now that we know what the data looks like, we will create a table:
  4. The following command inserts the entire dataset into the covid19 table:
  5. The insert runs quickly. Let's see how many rows were inserted:
  6. Let's see how many total cases of COVID-19 were recorded:
  7. You will notice that the data has many 0 values, either for weekends or for days when numbers were not reported. We can use a window function to smooth the daily new-case counts with a moving average:
  8. This query determines the latest values for each location. We can't use max(date) because not all countries reported every day, so we grab the last row for each location using ROW_NUMBER:
  9. We can use lagInFrame to compare each day's new cases with the previous day's. In this query we filter by the US_DC location:
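
The seven steps above can be sketched as one script. This is an illustration under stated assumptions, not necessarily the exact statements used originally: the column names and types follow the dataset's epidemiology schema as I understand it, the URL is an assumed location, Int64 is a conservative choice for the counters, and the five-row smoothing window is an arbitrary example.

```sql
-- Create the table (column names and types are assumptions).
CREATE TABLE covid19
(
    date Date,
    location_key LowCardinality(String),
    new_confirmed Int64,
    new_deceased Int64,
    new_recovered Int64,
    new_tested Int64,
    cumulative_confirmed Int64,
    cumulative_deceased Int64,
    cumulative_recovered Int64,
    cumulative_tested Int64
)
ENGINE = MergeTree
ORDER BY (location_key, date);

-- Insert the entire CSV. The explicit structure makes url() parse
-- each field as the table's non-Nullable type.
INSERT INTO covid19
SELECT *
FROM url(
    'https://storage.googleapis.com/covid19-open-data/v3/epidemiology.csv',
    'CSVWithNames',
    'date Date,
     location_key LowCardinality(String),
     new_confirmed Int64,
     new_deceased Int64,
     new_recovered Int64,
     new_tested Int64,
     cumulative_confirmed Int64,
     cumulative_deceased Int64,
     cumulative_recovered Int64,
     cumulative_tested Int64'
);

-- Count the inserted rows in a human-readable form.
SELECT formatReadableQuantity(count())
FROM covid19;

-- Total recorded cases across all rows.
SELECT formatReadableQuantity(sum(new_confirmed))
FROM covid19;

-- Moving average of new cases per location:
-- the current row plus two rows before and two rows after.
SELECT
    avg(new_confirmed) OVER (
        PARTITION BY location_key
        ORDER BY date
        ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING
    ) AS cases_smoothed,
    new_confirmed,
    location_key,
    date
FROM covid19;

-- Latest row per location: number rows newest-first, keep the first.
WITH latest_data AS
(
    SELECT
        location_key,
        date,
        new_deceased,
        new_confirmed,
        ROW_NUMBER() OVER (PARTITION BY location_key ORDER BY date DESC) AS rn
    FROM covid19
)
SELECT location_key, date, new_deceased, new_confirmed
FROM latest_data
WHERE rn = 1;

-- Day-over-day change in new cases for one location.
SELECT
    new_confirmed - lagInFrame(new_confirmed, 1) OVER (
        PARTITION BY location_key
        ORDER BY date
    ) AS confirmed_cases_delta,
    new_confirmed,
    location_key,
    date
FROM covid19
WHERE location_key = 'US_DC';
```

Sorting the table by (location_key, date) matches the way the later queries partition by location and order by date. For the first row of each location, lagInFrame returns the column type's default value (0 here), so the first delta equals that day's new_confirmed.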

The response looks like:

  10. This query calculates the percentage change in new cases each day, and includes a simple increase or decrease column in the result set:
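
One way to write such a query is sketched below. It assumes the covid19 table has date, location_key, and new_confirmed columns. The names confirmed_previous_day, percent_change, and trend are illustrative, and so are the US_DC filter and the zero-denominator guard:

```sql
WITH confirmed_lag AS
(
    -- Attach the previous day's new cases to every row.
    SELECT
        *,
        lagInFrame(new_confirmed) OVER (
            PARTITION BY location_key
            ORDER BY date
        ) AS confirmed_previous_day
    FROM covid19
),
confirmed_percent_change AS
(
    -- Percentage change versus the previous day.
    -- When the previous day is 0, report 0 instead of dividing by zero.
    SELECT
        *,
        if(
            confirmed_previous_day = 0,
            0,
            round((new_confirmed - confirmed_previous_day) / confirmed_previous_day * 100)
        ) AS percent_change
    FROM confirmed_lag
)
SELECT
    date,
    new_confirmed,
    percent_change,
    CASE
        WHEN percent_change > 0 THEN 'increase'
        WHEN percent_change = 0 THEN 'no change'
        ELSE 'decrease'
    END AS trend
FROM confirmed_percent_change
WHERE location_key = 'US_DC';
```

The guard matters because the dataset contains many zero-count days. Without it, dividing by a previous-day value of 0 would yield inf or nan rather than a usable percentage.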

The results look like:

Note

As mentioned in the GitHub repo, the dataset is no longer updated as of September 15, 2022.