Data frames
Working with tabular data in Julia
Julia was created with large computations in mind. Its main type for large collections of data is thus the array. The array is Julia’s equivalent to NumPy’s ndarray (the data collection object used for efficient computations in Python). It is also equivalent to PyTorch’s tensors (the data collection object used for deep learning in this Python library).
Julia’s equivalent to pandas' DataFrame (in Python) and data.frames or their variations such as tibbles and data.tables (in R) is provided by the DataFrames package.
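To make this concrete, here is a minimal sketch of building a small DataFrame from scratch (the column names and values are invented for illustration):

```julia
using DataFrames

# A DataFrame is built from named columns (vectors of equal length)
df = DataFrame(country = ["Canada", "France"], population_millions = [38.2, 67.5])

size(df)    # (2, 2)
names(df)   # ["country", "population_millions"]
df.country  # access a column as a vector
```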
DataFrame through an example
Let’s introduce DataFrame through an example typical of tabular data workflows using Covid-19 data from Johns Hopkins University. This data is open source, easy to access from a repository on GitHub, and is the data used in the JHU Covid-19 dashboard.
Load packages
We will need a number of external packages for this. Remember that external packages first need to be installed:
] add CSV DataFrames StatsPlots TimeSeries BSON
There is a version conflict between the latest CSV package and the version of Julia running in our JupyterHub. For this reason, we have to pin CSV to v0.8.5 (for those running Julia on their computer, this is not necessary):
] pin CSV@0.8.5
Then, you have to load them to make them accessible in the current session:
using CSV
using DataFrames
using Dates # from the standard Julia library
using BSON
Reduce printed output size of DataFrames
By default, the Jupyter Julia kernel prints the first 30 rows of DataFrames. Let’s reduce this to 5 by changing the appropriate environment variable:
ENV["LINES"] = 5;
Read the data in
We will use the file time_series_covid19_deaths_global.csv
from the JHU repository. This file contains the daily cumulative number of reported Covid-19 deaths for all countries. Some countries have their data broken down by province/state.
I uploaded it to our training cluster, so you can find it at ~/projects/def-sponsor00/shared/data/time_series_covid19_deaths_global.csv.
For those running Julia on their computer, you can clone the GitHub repo (or download and unzip it), then find the file time_series_covid19_deaths_global.csv in the csse_covid_19_data/csse_covid_19_time_series/ directory.
deaths_global = CSV.read("projects/def-sponsor00/shared/data/time_series_covid19_deaths_global.csv", DataFrame)
285 rows × 865 columns (omitted printing of 857 columns)
 | Province/State | Country/Region | Lat | Long | 1/22/20 | 1/23/20 | 1/24/20 | 1/25/20 |
---|---|---|---|---|---|---|---|---
 | String? | String | Float64? | Float64? | Int64 | Int64 | Int64 | Int64 |
1 | missing | Afghanistan | 33.9391 | 67.71 | 0 | 0 | 0 | 0 |
2 | missing | Albania | 41.1533 | 20.1683 | 0 | 0 | 0 | 0 |
3 | missing | Algeria | 28.0339 | 1.6596 | 0 | 0 | 0 | 0 |
4 | missing | Andorra | 42.5063 | 1.5218 | 0 | 0 | 0 | 0 |
5 | missing | Angola | -11.2027 | 17.8739 | 0 | 0 | 0 | 0 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
Explore the data
size(deaths_global)
(285, 865)
describe(deaths_global)
865 rows × 7 columns
 | variable | mean | min | median | max | nmissing | eltype |
---|---|---|---|---|---|---|---
 | Symbol | Union… | Any | Union… | Any | Int64 | Type |
1 | Province/State |  | Alberta |  | Zhejiang | 196 | Union{Missing, String} |
2 | Country/Region |  | Afghanistan |  | Zimbabwe | 0 | String |
3 | Lat | 20.1779 | -71.9499 | 21.694 | 71.7069 | 2 | Union{Missing, Float64} |
4 | Long | 22.3317 | -178.117 | 20.9394 | 178.065 | 2 | Union{Missing, Float64} |
5 | 1/22/20 | 0.0596491 | 0 | 0.0 | 17 | 0 | Int64 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
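Beyond size and describe, a few other inspection functions are commonly useful. A minimal sketch on a toy DataFrame (the columns here are invented for illustration):

```julia
using DataFrames

# Toy DataFrame with invented columns
df = DataFrame(a = 1:10, b = rand(10))

first(df, 3)        # first 3 rows
last(df, 3)         # last 3 rows
names(df)           # column names as a vector of strings
nrow(df), ncol(df)  # row and column counts separately
```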
Rename some variables to easier names
DataFrames.rename!(deaths_global, Dict(1 => Symbol("province"),
2 => Symbol("country")))
285 rows × 865 columns (omitted printing of 856 columns)
 | province | country | Lat | Long | 1/22/20 | 1/23/20 | 1/24/20 | 1/25/20 | 1/26/20 |
---|---|---|---|---|---|---|---|---|---
 | String? | String | Float64? | Float64? | Int64 | Int64 | Int64 | Int64 | Int64 |
1 | missing | Afghanistan | 33.9391 | 67.71 | 0 | 0 | 0 | 0 | 0 |
2 | missing | Albania | 41.1533 | 20.1683 | 0 | 0 | 0 | 0 | 0 |
3 | missing | Algeria | 28.0339 | 1.6596 | 0 | 0 | 0 | 0 | 0 |
4 | missing | Andorra | 42.5063 | 1.5218 | 0 | 0 | 0 | 0 | 0 |
5 | missing | Angola | -11.2027 | 17.8739 | 0 | 0 | 0 | 0 | 0 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
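Instead of column indices, rename! also accepts pairs of old and new names. A sketch on a toy two-column DataFrame mimicking our headers:

```julia
using DataFrames

# Toy single-row DataFrame with the original column names
df = DataFrame("Province/State" => ["Alberta"], "Country/Region" => ["Canada"])

# Pair each old name with its new name; rename! mutates df in place
rename!(df, "Province/State" => "province", "Country/Region" => "country")

names(df)  # ["province", "country"]
```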
Transform the data into long format
deaths_global_long = stack(deaths_global, Not(collect(1:4)),
variable_name = Symbol("date"),
value_name = Symbol("number_deaths"))
245,385 rows × 6 columns
 | province | country | Lat | Long | date | number_deaths |
---|---|---|---|---|---|---
 | String? | String | Float64? | Float64? | String | Int64 |
1 | missing | Afghanistan | 33.9391 | 67.71 | 1/22/20 | 0 |
2 | missing | Albania | 41.1533 | 20.1683 | 1/22/20 | 0 |
3 | missing | Algeria | 28.0339 | 1.6596 | 1/22/20 | 0 |
4 | missing | Andorra | 42.5063 | 1.5218 | 1/22/20 | 0 |
5 | missing | Angola | -11.2027 | 17.8739 | 1/22/20 | 0 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
Turn the date into the Date type
For this, we need to do multiple things:
1. Index the date column
2. Add 20 in front of the year through a regular expression to make it 4 digits
3. Convert into the Date type
4. Reassign this to the date column of our DataFrame
deaths_global_long.date = Date.(
replace.(
deaths_global_long[!, 5], # (`!` does not make a copy while indexing)
r"(.*)(..)$" => s"\g<1>20\2"),
"m/dd/yyyy");
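The substitution is easier to follow on a single date string. A minimal sketch of the same regular expression applied to one value:

```julia
using Dates

s = "1/22/20"

# (.*) captures everything up to the last two characters and (..) captures
# the 2-digit year; the substitution re-inserts "20" between the two groups
s4 = replace(s, r"(.*)(..)$" => s"\g<1>20\2")  # "1/22/2020"

Date(s4, "m/dd/yyyy")  # 2020-01-22
```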
This is what our new DataFrame looks like:
deaths_global_long
245,385 rows × 6 columns
 | province | country | Lat | Long | date | number_deaths |
---|---|---|---|---|---|---
 | String? | String | Float64? | Float64? | Date | Int64 |
1 | missing | Afghanistan | 33.9391 | 67.71 | 2020-01-22 | 0 |
2 | missing | Albania | 41.1533 | 20.1683 | 2020-01-22 | 0 |
3 | missing | Algeria | 28.0339 | 1.6596 | 2020-01-22 | 0 |
4 | missing | Andorra | 42.5063 | 1.5218 | 2020-01-22 | 0 |
5 | missing | Angola | -11.2027 | 17.8739 | 2020-01-22 | 0 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
Get cumulative totals by country and by date
To do this, we use a classic split-apply-combine workflow:
deaths_countries_grouped = groupby(deaths_global_long, [:country, :date])
GroupedDataFrame with 171339 groups based on keys: country, date
First Group (1 row): country = "Afghanistan", date = Date("2020-01-22")
 | province | country | Lat | Long | date | number_deaths |
---|---|---|---|---|---|---
 | String? | String | Float64? | Float64? | Date | Int64 |
1 | missing | Afghanistan | 33.9391 | 67.71 | 2020-01-22 | 0 |
⋮
Last Group (1 row): country = "Zimbabwe", date = Date("2022-05-31")
 | province | country | Lat | Long | date | number_deaths |
---|---|---|---|---|---|---
 | String? | String | Float64? | Float64? | Date | Int64 |
1 | missing | Zimbabwe | -19.0154 | 29.1549 | 2022-05-31 | 5503 |
deaths_countries = combine(deaths_countries_grouped, :number_deaths => sum)
171,339 rows × 3 columns
 | country | date | number_deaths_sum |
---|---|---|---
 | String | Date | Int64 |
1 | Afghanistan | 2020-01-22 | 0 |
2 | Albania | 2020-01-22 | 0 |
3 | Algeria | 2020-01-22 | 0 |
4 | Andorra | 2020-01-22 | 0 |
5 | Angola | 2020-01-22 | 0 |
⋮ | ⋮ | ⋮ | ⋮ |
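The same split-apply-combine pattern in miniature, on invented toy data:

```julia
using DataFrames

# Toy data: two "countries" with two observations each
df = DataFrame(country = ["CA", "CA", "FR", "FR"], deaths = [1, 2, 3, 4])

grouped = groupby(df, :country)            # split
totals = combine(grouped, :deaths => sum)  # apply + combine

totals.deaths_sum  # [3, 7]
```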
Index cumulative totals by date for Canada
deaths_canada = filter(:country => isequal("Canada"), deaths_countries)
861 rows × 3 columns
 | country | date | number_deaths_sum |
---|---|---|---
 | String | Date | Int64 |
1 | Canada | 2020-01-22 | 0 |
2 | Canada | 2020-01-23 | 0 |
3 | Canada | 2020-01-24 | 0 |
4 | Canada | 2020-01-25 | 0 |
5 | Canada | 2020-01-26 | 0 |
⋮ | ⋮ | ⋮ | ⋮ |
We can then drop the country column:
select!(deaths_canada, Not(1))
861 rows × 2 columns
 | date | number_deaths_sum |
---|---|---
 | Date | Int64 |
1 | 2020-01-22 | 0 |
2 | 2020-01-23 | 0 |
3 | 2020-01-24 | 0 |
4 | 2020-01-25 | 0 |
5 | 2020-01-26 | 0 |
⋮ | ⋮ | ⋮ |
Saving a Julia object
Let’s save our DataFrame deaths_canada into a file so that we can plot it in the next chapter. For this, we will use the BSON package, which serializes Julia objects to the binary BSON format.
BSON.@save "deaths_canada.bson" deaths_canada
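In a later session, the saved object can be restored with BSON.@load, which recreates the variable in the current scope (a sketch, assuming the file saved above exists):

```julia
using BSON
using DataFrames  # needed so the DataFrame type is known when loading

# @load recreates the variable deaths_canada in the current scope
BSON.@load "deaths_canada.bson" deaths_canada
```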
DataFrame cheatsheet
This example was far from exhaustive. The cheatsheet below will give you an idea of other operations possible on DataFrames: