Data frames
Working with tabular data in Julia
Julia was created with large computations in mind. Its main type for large collections of data is thus the array. The array is Julia’s equivalent to NumPy’s ndarray (the data collection object used for efficient computations in Python). It is also equivalent to PyTorch’s tensors (the data collection object used for deep learning in this Python library).
Julia’s equivalent to pandas' DataFrame (in Python) and data.frames or their variations such as tibbles and data.tables (in R) is provided by the DataFrames package.
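To make this concrete, here is a minimal sketch of building a small DataFrame from scratch (the column names and values are invented for illustration):

```julia
using DataFrames

# A DataFrame is built from named columns (vectors of equal length)
df = DataFrame(country = ["Canada", "France"], population_millions = [38.2, 67.5])

size(df)    # (2, 2)
names(df)   # ["country", "population_millions"]
df.country  # access a column as a vector
```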
DataFrame through an example
Let’s introduce DataFrame through an example typical of tabular data workflows using Covid-19 data from Johns Hopkins University. This data is open source, easy to access from a repository on GitHub, and is the data used in the JHU Covid-19 dashboard.
Load packages
We will need a number of external packages for this. Remember that external packages first need to be installed:
] add CSV DataFrames StatsPlots TimeSeries BSON
There is a version conflict between the latest CSV package and the version of Julia running in our JupyterHub. For this reason, we have to pin CSV to v0.8.5 (for those running Julia on their computer, this is not necessary):
] pin CSV@0.8.5
Then, you have to load them to make them accessible in the current session:
using CSV
using DataFrames
using Dates # from the standard Julia library
using BSON
Reduce printed output size of DataFrames
By default, the Jupyter Julia kernel prints the first 30 rows of DataFrames. Let’s reduce this to 5 by changing the appropriate environment variable:
ENV["LINES"] = 5;
Read the data in
We will use the file time_series_covid19_deaths_global.csv
from the JHU repository. This file contains the daily cumulative number of reported Covid-19 deaths for all countries. Some countries have their data broken down by province/state.
I uploaded it to our training cluster, so you can find it at ~/projects/def-sponsor00/shared/data/time_series_covid19_deaths_global.csv.
For those running Julia on their computer, you can clone the GitHub repo (or download and unzip it), then find the file time_series_covid19_deaths_global.csv in the csse_covid_19_data/csse_covid_19_time_series/ directory.
deaths_global = CSV.read("projects/def-sponsor00/shared/data/time_series_covid19_deaths_global.csv", DataFrame)
285 rows × 865 columns (omitted printing of 857 columns)
 | Province/State | Country/Region | Lat | Long | 1/22/20 | 1/23/20 | 1/24/20 | 1/25/20 |
---|---|---|---|---|---|---|---|---
 | String? | String | Float64? | Float64? | Int64 | Int64 | Int64 | Int64 |
1 | missing | Afghanistan | 33.9391 | 67.71 | 0 | 0 | 0 | 0 |
2 | missing | Albania | 41.1533 | 20.1683 | 0 | 0 | 0 | 0 |
3 | missing | Algeria | 28.0339 | 1.6596 | 0 | 0 | 0 | 0 |
4 | missing | Andorra | 42.5063 | 1.5218 | 0 | 0 | 0 | 0 |
5 | missing | Angola | -11.2027 | 17.8739 | 0 | 0 | 0 | 0 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
Explore the data
size(deaths_global)
(285, 865)
describe(deaths_global)
865 rows × 7 columns
 | variable | mean | min | median | max | nmissing | eltype |
---|---|---|---|---|---|---|---
 | Symbol | Union… | Any | Union… | Any | Int64 | Type |
1 | Province/State |  | Alberta |  | Zhejiang | 196 | Union{Missing, String} |
2 | Country/Region |  | Afghanistan |  | Zimbabwe | 0 | String |
3 | Lat | 20.1779 | -71.9499 | 21.694 | 71.7069 | 2 | Union{Missing, Float64} |
4 | Long | 22.3317 | -178.117 | 20.9394 | 178.065 | 2 | Union{Missing, Float64} |
5 | 1/22/20 | 0.0596491 | 0 | 0.0 | 17 | 0 | Int64 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
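Beyond size and describe, a few other inspection functions are commonly useful. A minimal sketch on a toy DataFrame (the columns here are invented for illustration):

```julia
using DataFrames

# Toy DataFrame with invented columns
df = DataFrame(a = 1:10, b = rand(10))

first(df, 3)        # first 3 rows
last(df, 3)         # last 3 rows
names(df)           # column names as a vector of strings
nrow(df), ncol(df)  # row and column counts separately
```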
Rename some variables to easier names
DataFrames.rename!(deaths_global, Dict(1 => Symbol("province"),
2 => Symbol("country")))
285 rows × 865 columns (omitted printing of 856 columns)
 | province | country | Lat | Long | 1/22/20 | 1/23/20 | 1/24/20 | 1/25/20 | 1/26/20 |
---|---|---|---|---|---|---|---|---|---
 | String? | String | Float64? | Float64? | Int64 | Int64 | Int64 | Int64 | Int64 |
1 | missing | Afghanistan | 33.9391 | 67.71 | 0 | 0 | 0 | 0 | 0 |
2 | missing | Albania | 41.1533 | 20.1683 | 0 | 0 | 0 | 0 | 0 |
3 | missing | Algeria | 28.0339 | 1.6596 | 0 | 0 | 0 | 0 | 0 |
4 | missing | Andorra | 42.5063 | 1.5218 | 0 | 0 | 0 | 0 | 0 |
5 | missing | Angola | -11.2027 | 17.8739 | 0 | 0 | 0 | 0 | 0 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
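Instead of column indices, rename! also accepts pairs of old and new names. A sketch on a toy two-column DataFrame mimicking our headers:

```julia
using DataFrames

# Toy single-row DataFrame with the original column names
df = DataFrame("Province/State" => ["Alberta"], "Country/Region" => ["Canada"])

# Pair each old name with its new name; rename! mutates df in place
rename!(df, "Province/State" => "province", "Country/Region" => "country")

names(df)  # ["province", "country"]
```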
Transform the data into long format
deaths_global_long = stack(deaths_global, Not(collect(1:4)),
variable_name = Symbol("date"),
value_name = Symbol("number_deaths"))
245,385 rows × 6 columns
 | province | country | Lat | Long | date | number_deaths |
---|---|---|---|---|---|---
 | String? | String | Float64? | Float64? | String | Int64 |
1 | missing | Afghanistan | 33.9391 | 67.71 | 1/22/20 | 0 |
2 | missing | Albania | 41.1533 | 20.1683 | 1/22/20 | 0 |
3 | missing | Algeria | 28.0339 | 1.6596 | 1/22/20 | 0 |
4 | missing | Andorra | 42.5063 | 1.5218 | 1/22/20 | 0 |
5 | missing | Angola | -11.2027 | 17.8739 | 1/22/20 | 0 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
Turn the date into the Date type
For this, we need to do multiple things:
1. Index the date column
2. Add 20 in front of the year through a regular expression to make it 4 digits
3. Convert into the Date type
4. Reassign this to the date column of our DataFrame
deaths_global_long.date = Date.(
replace.(
deaths_global_long[!, 5], # (`!` does not make a copy while indexing)
r"(.*)(..)$" => s"\g<1>20\2"),
"m/dd/yyyy");
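The substitution is easier to follow on a single date string. A minimal sketch of the same regular expression applied to one value:

```julia
using Dates

s = "1/22/20"

# (.*) captures everything up to the last two characters and (..) captures
# the 2-digit year; the substitution re-inserts "20" between the two groups
s4 = replace(s, r"(.*)(..)$" => s"\g<1>20\2")  # "1/22/2020"

Date(s4, "m/dd/yyyy")  # 2020-01-22
```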
This is what our new DataFrame looks like:
deaths_global_long
245,385 rows × 6 columns
 | province | country | Lat | Long | date | number_deaths |
---|---|---|---|---|---|---
 | String? | String | Float64? | Float64? | Date | Int64 |
1 | missing | Afghanistan | 33.9391 | 67.71 | 2020-01-22 | 0 |
2 | missing | Albania | 41.1533 | 20.1683 | 2020-01-22 | 0 |
3 | missing | Algeria | 28.0339 | 1.6596 | 2020-01-22 | 0 |
4 | missing | Andorra | 42.5063 | 1.5218 | 2020-01-22 | 0 |
5 | missing | Angola | -11.2027 | 17.8739 | 2020-01-22 | 0 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
Get cumulative totals by country and by date
To do this, we use a classic split-apply-combine workflow:
deaths_countries_grouped = groupby(deaths_global_long, [:country, :date])
GroupedDataFrame with 171339 groups based on keys: country, date
First Group (1 row): country = "Afghanistan", date = Date("2020-01-22")
 | province | country | Lat | Long | date | number_deaths |
---|---|---|---|---|---|---
 | String? | String | Float64? | Float64? | Date | Int64 |
1 | missing | Afghanistan | 33.9391 | 67.71 | 2020-01-22 | 0 |
⋮
Last Group (1 row): country = "Zimbabwe", date = Date("2022-05-31")
 | province | country | Lat | Long | date | number_deaths |
---|---|---|---|---|---|---
 | String? | String | Float64? | Float64? | Date | Int64 |
1 | missing | Zimbabwe | -19.0154 | 29.1549 | 2022-05-31 | 5503 |
deaths_countries = combine(deaths_countries_grouped, :number_deaths => sum)
171,339 rows × 3 columns
 | country | date | number_deaths_sum |
---|---|---|---
 | String | Date | Int64 |
1 | Afghanistan | 2020-01-22 | 0 |
2 | Albania | 2020-01-22 | 0 |
3 | Algeria | 2020-01-22 | 0 |
4 | Andorra | 2020-01-22 | 0 |
5 | Angola | 2020-01-22 | 0 |
⋮ | ⋮ | ⋮ | ⋮ |
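The same split-apply-combine pattern in miniature, on invented toy data:

```julia
using DataFrames

# Toy data: two "countries" with two observations each
df = DataFrame(country = ["CA", "CA", "FR", "FR"], deaths = [1, 2, 3, 4])

grouped = groupby(df, :country)            # split
totals = combine(grouped, :deaths => sum)  # apply + combine

totals.deaths_sum  # [3, 7]
```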
Index cumulative totals by date for Canada
deaths_canada = filter(:country => isequal("Canada"), deaths_countries)
861 rows × 3 columns
 | country | date | number_deaths_sum |
---|---|---|---
 | String | Date | Int64 |
1 | Canada | 2020-01-22 | 0 |
2 | Canada | 2020-01-23 | 0 |
3 | Canada | 2020-01-24 | 0 |
4 | Canada | 2020-01-25 | 0 |
5 | Canada | 2020-01-26 | 0 |
⋮ | ⋮ | ⋮ | ⋮ |
We can then drop the country column:
select!(deaths_canada, Not(1))
861 rows × 2 columns
 | date | number_deaths_sum |
---|---|---
 | Date | Int64 |
1 | 2020-01-22 | 0 |
2 | 2020-01-23 | 0 |
3 | 2020-01-24 | 0 |
4 | 2020-01-25 | 0 |
5 | 2020-01-26 | 0 |
⋮ | ⋮ | ⋮ |
Saving a Julia object
Let’s save our DataFrame deaths_canada into a file so that we can plot it in the next chapter. For this, we will use the BSON package, which serializes Julia objects to the binary BSON format.
BSON.@save "deaths_canada.bson" deaths_canada
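In a later session, the saved object can be restored with BSON.@load, which recreates the variable in the current scope (a sketch, assuming the file saved above exists):

```julia
using BSON
using DataFrames  # needed so the DataFrame type is known when loading

# @load recreates the variable deaths_canada in the current scope
BSON.@load "deaths_canada.bson" deaths_canada
```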
DataFrame cheatsheet
This example was far from exhaustive. The cheatsheet below will give you an idea of other operations possible on DataFrames: