Data frames

Last updated: January 12, 2023

Working with tabular data in Julia

Julia was created with large computations in mind. Its main type for large collections of data is thus the array. The array is Julia’s equivalent to NumPy’s ndarray (the data collection object used for efficient computations in Python). It is also equivalent to PyTorch’s tensors (the data collection object used for deep learning in this Python library).

Julia’s equivalent to pandas’ DataFrame (in Python) and to R’s data.frame and its variations (such as tibbles and data.tables) is provided by the DataFrames package.
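
To give you an idea of what this looks like, here is a minimal sketch of a DataFrame built by hand (the column names and values are made up for illustration; it assumes the DataFrames package is installed, as shown below):

using DataFrames

# a DataFrame is a collection of named columns of equal length
df = DataFrame(country = ["Canada", "France"], population_millions = [38.25, 67.75])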

DataFrame through an example

Let’s introduce DataFrame through an example typical of tabular data workflows using Covid-19 data from Johns Hopkins University. This data is open source, easy to access from a repository on GitHub, and is the data used in the JHU Covid-19 dashboard.

Load packages

We will need a number of external packages for this. Remember that external packages first need to be installed:

] add CSV DataFrames StatsPlots TimeSeries BSON

There is a version conflict between the latest CSV release and the version of Julia running on our JupyterHub. For this reason, we have to downgrade and pin CSV to v0.8.5 (for those running Julia on their own computer, this is not necessary):

] pin CSV@0.8.5

Then, you have to load them to make them accessible in the current session:

using CSV
using DataFrames
using Dates          # from the standard Julia library
using BSON

Reduce printed output size of DataFrames

By default, the Jupyter Julia kernel prints the first 30 rows of DataFrames. Let’s reduce this to 5 by changing the appropriate environment variable:

ENV["LINES"] = 5;

Read the data in

We will use the file time_series_covid19_deaths_global.csv from the JHU repository. This file contains the daily cumulative number of reported Covid-19 deaths for all countries. Some countries have their data broken down by province/state.

I uploaded it to our training cluster, so you can find it at:
~/projects/def-sponsor00/shared/data/time_series_covid19_deaths_global.csv.

For those running Julia on their computer, you can clone the GitHub repo (or download and unzip it), then find the file time_series_covid19_deaths_global.csv in the csse_covid_19_data/csse_covid_19_time_series/ directory.
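
Alternatively, you can read the file straight from GitHub without cloning anything (assuming your session has internet access; the URL below should point to the raw file in the JHU repository):

using Downloads   # standard library since Julia 1.6
url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv"
deaths_global = CSV.read(Downloads.download(url), DataFrame)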

deaths_global = CSV.read("projects/def-sponsor00/shared/data/time_series_covid19_deaths_global.csv", DataFrame)

285 rows × 865 columns (omitted printing of 857 columns)

      Province/State   Country/Region   Lat        Long       1/22/20   1/23/20   1/24/20   1/25/20
      String?          String           Float64?   Float64?   Int64     Int64     Int64     Int64
   1  missing          Afghanistan       33.9391    67.71     0         0         0         0
   2  missing          Albania           41.1533    20.1683   0         0         0         0
   3  missing          Algeria           28.0339     1.6596   0         0         0         0
   4  missing          Andorra           42.5063     1.5218   0         0         0         0
   5  missing          Angola           -11.2027    17.8739   0         0         0         0

Explore the data

size(deaths_global)
(285, 865)
describe(deaths_global)

865 rows × 7 columns

      variable         mean        min           median    max        nmissing   eltype
      Symbol           Union…      Any           Union…    Any        Int64      Type
   1  Province/State               Alberta                 Zhejiang   196        Union{Missing, String}
   2  Country/Region               Afghanistan             Zimbabwe   0          String
   3  Lat              20.1779     -71.9499      21.694    71.7069    2          Union{Missing, Float64}
   4  Long             22.3317     -178.117      20.9394   178.065    2          Union{Missing, Float64}
   5  1/22/20          0.0596491   0             0.0       17         0          Int64
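
A few other functions are useful for a first look at the data (a quick sketch; all of these come from DataFrames or Base):

names(deaths_global)[1:6]                  # column names (here, the first 6)
first(deaths_global, 3)                    # first 3 rows
last(deaths_global, 3)                     # last 3 rows
nrow(deaths_global), ncol(deaths_global)   # number of rows and columns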

Rename some variables to give them simpler names

DataFrames.rename!(deaths_global, Dict(1 => Symbol("province"),
                                       2 => Symbol("country")))

285 rows × 865 columns (omitted printing of 856 columns)

      province   country       Lat        Long       1/22/20   1/23/20   1/24/20   1/25/20   1/26/20
      String?    String        Float64?   Float64?   Int64     Int64     Int64     Int64     Int64
   1  missing    Afghanistan    33.9391    67.71     0         0         0         0         0
   2  missing    Albania        41.1533    20.1683   0         0         0         0         0
   3  missing    Algeria        28.0339     1.6596   0         0         0         0         0
   4  missing    Andorra        42.5063     1.5218   0         0         0         0         0
   5  missing    Angola        -11.2027    17.8739   0         0         0         0         0
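
For reference, with a recent DataFrames version the same renaming can be done by column name rather than by position (an equivalent form of the call above, not something to run in addition to it):

DataFrames.rename!(deaths_global, "Province/State" => :province,
                                  "Country/Region" => :country)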

Transform the data into long format

deaths_global_long = stack(deaths_global, Not(collect(1:4)),
                           variable_name = Symbol("date"),
                           value_name = Symbol("number_deaths"))

245,385 rows × 6 columns

      province   country       Lat        Long       date      number_deaths
      String?    String        Float64?   Float64?   String    Int64
   1  missing    Afghanistan    33.9391    67.71     1/22/20   0
   2  missing    Albania        41.1533    20.1683   1/22/20   0
   3  missing    Algeria        28.0339     1.6596   1/22/20   0
   4  missing    Andorra        42.5063     1.5218   1/22/20   0
   5  missing    Angola        -11.2027    17.8739   1/22/20   0
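
The reverse operation (going back to one column per date) is unstack. A minimal sketch, not needed in our workflow:

unstack(deaths_global_long, :date, :number_deaths)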

Turn the date into the Date type

For this, we need to do multiple things:

 1.  Index the date column
 2.  Add "20" in front of the year (using a regular expression) to make it four digits
 3.  Convert into the Date type
 4.  Reassign this to the date column of our DataFrame

Steps 2 and 3 need to be applied to all elements, so we use broadcasting with the dot notation.
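
To see what these two steps do, here is what they look like on a single element (an illustration only, not part of the workflow):

replace("1/22/20", r"(.*)(..)$" => s"\g<1>20\2")   # gives "1/22/2020"
Date("1/22/2020", "m/dd/yyyy")                     # gives 2020-01-22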

deaths_global_long.date = Date.(
    replace.(
        deaths_global_long[!, 5],       # (`!` does not make a copy while indexing)
        r"(.*)(..)$" => s"\g<1>20\2"),
    "m/dd/yyyy");

This is what our new DataFrame looks like:

deaths_global_long

245,385 rows × 6 columns

      province   country       Lat        Long       date         number_deaths
      String?    String        Float64?   Float64?   Date         Int64
   1  missing    Afghanistan    33.9391    67.71     2020-01-22   0
   2  missing    Albania        41.1533    20.1683   2020-01-22   0
   3  missing    Algeria        28.0339     1.6596   2020-01-22   0
   4  missing    Andorra        42.5063     1.5218   2020-01-22   0
   5  missing    Angola        -11.2027    17.8739   2020-01-22   0

Get cumulative totals by country and by date

To do this, we use a classic split-apply-combine workflow:

deaths_countries_grouped = groupby(deaths_global_long, [:country, :date])

GroupedDataFrame with 171339 groups based on keys: country, date

First Group (1 row): country = "Afghanistan", date = Date("2020-01-22")

      province   country       Lat        Long       date         number_deaths
      String?    String        Float64?   Float64?   Date         Int64
   1  missing    Afghanistan   33.9391    67.71      2020-01-22   0

Last Group (1 row): country = "Zimbabwe", date = Date("2022-05-31")

      province   country    Lat        Long       date         number_deaths
      String?    String     Float64?   Float64?   Date         Int64
   1  missing    Zimbabwe   -19.0154   29.1549    2022-05-31   5503

deaths_countries = combine(deaths_countries_grouped, :number_deaths => sum)

171,339 rows × 3 columns

      country       date         number_deaths_sum
      String        Date         Int64
   1  Afghanistan   2020-01-22   0
   2  Albania       2020-01-22   0
   3  Algeria       2020-01-22   0
   4  Andorra       2020-01-22   0
   5  Angola        2020-01-22   0
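
By default, combine names the new column after the source column and the function (hence number_deaths_sum). If you prefer to pick the name yourself, you can add a second pair (a small variation on the call above):

combine(deaths_countries_grouped, :number_deaths => sum => :number_deaths)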

Extract the cumulative totals by date for Canada

deaths_canada = filter(:country => isequal("Canada"), deaths_countries)

861 rows × 3 columns

      country   date         number_deaths_sum
      String    Date         Int64
   1  Canada    2020-01-22   0
   2  Canada    2020-01-23   0
   3  Canada    2020-01-24   0
   4  Canada    2020-01-25   0
   5  Canada    2020-01-26   0
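
The same filtering can also be written with an anonymous function over whole rows (an equivalent form, shown for reference):

deaths_canada = filter(row -> row.country == "Canada", deaths_countries)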

We can then drop the country column:

select!(deaths_canada, Not(1))

861 rows × 2 columns

      date         number_deaths_sum
      Date         Int64
   1  2020-01-22   0
   2  2020-01-23   0
   3  2020-01-24   0
   4  2020-01-25   0
   5  2020-01-26   0

Saving a Julia object

Let’s save our DataFrame deaths_canada into a file to plot it in the next chapter.

For this, we will use the BSON package, which saves Julia objects to BSON (binary JSON) files.

BSON.@save "deaths_canada.bson" deaths_canada
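
The object can later be loaded back into a session with the matching macro (this is what we will do in the next chapter):

BSON.@load "deaths_canada.bson" deaths_canada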

DataFrame cheatsheet

This example was far from exhaustive. The cheatsheet below will give you an idea of other operations possible on DataFrames:
