Data Joins and Transformations
Broadly, your objective while reading these chapters is to be able to identify data sets which have “messy” formats and determine a sequence of operations to transition the data into “tidy” format. To do this, you should master the following concepts:
- Determine what data format is necessary to generate a desired plot or statistical model.
- Understand the differences between “wide” and “long” format data and how to transition between the two structures.
- Understand relational data formats and how to use data joins to assemble data from multiple tables into a single table.
▶️ Watch Videos: 25 minutes
📖 Readings: 75 minutes
💻 Activities: 3
✅ Check-ins: 2
1 Part One: Tidy Data
The illustrations below are lifted from an excellent blog post (Lowndes and Horst 2020) about tidy data; they’re reproduced here because
- they’re beautiful and licensed CC BY 4.0, and
- they might be more memorable than the equivalent paragraphs of text without illustration.
Most of the time, data does not come in a format suitable for analysis. Spreadsheets are generally optimized for data entry or viewing, rather than for statistical analysis:
- Tables may be laid out for easy data entry, so that there are multiple observations in a single row
- It may be visually preferable to arrange columns of data to show multiple times or categories on the same row for easy comparison
When we analyze data, however, we care much more about the fundamental structure of observations: discrete units of data collection. Each observation may have several corresponding variables that may be measured simultaneously, but fundamentally each discrete data point is what we are interested in analyzing or plotting.
The structure of tidy data reflects this preference for keeping the data in a fundamental form: each observation is in its own row, and each observed variable is in its own column. This format is inherently rectangular, which is also important for statistical analysis - our methods are typically designed to work with matrices of data.
The preference for tidy data has practical implications: code is easier to reuse on tidy data, because the analysis can rely on a standardized set of tools rather than a custom tool built for each data analysis job.
In addition, standardized tools for data analysis mean that it is easier to collaborate with others: if everyone starts with the same set of assumptions about the dataset, you can borrow methods and tools from a collaborator’s analysis and easily apply them to your own dataset.
Tidy data makes it easier to collaborate with others and analyze new data using standardized workflows.
1.1 Examples: Messy Data
These datasets all display the same data: TB (Tuberculosis) cases documented by the WHO (World Health Organization) in Afghanistan, Brazil, and China, in 1999 and 2000. There are four variables: country, year, cases, and population, but each table has a different layout.
- For each of the data sets below, determine whether the table is tidy.
- If it is not, identify which rule(s) it violates.
- What would you have to do to the data to be able to compute a standardized TB infection rate per 100,000 people?
All of these data sets are “built-in” to the `tidyr` package!
Table 1

country | year | cases | population |
---|---|---|---|
Afghanistan | 1999 | 745 | 19987071 |
Afghanistan | 2000 | 2666 | 20595360 |
Brazil | 1999 | 37737 | 172006362 |
Brazil | 2000 | 80488 | 174504898 |
China | 1999 | 212258 | 1272915272 |
China | 2000 | 213766 | 1280428583 |

Here, each observation is a single row, each variable is a column, and everything is nicely arranged for e.g. regression or statistical analysis. We can easily compute another measure, such as cases per 100,000 population, by taking `cases / population * 100000` (this would define a new column).
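Since Table 1 is already tidy, that computation is a single step. A minimal sketch, assuming the dplyr package is also loaded (`table1` is the built-in tidyr copy of this table):

```r
library(tidyr)
library(dplyr)

# table1 ships with tidyr; the per-100,000 rate is one mutate() away
table1 |>
  mutate(rate = cases / population * 100000)
```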

Table 2

country | year | type | count |
---|---|---|---|
Afghanistan | 1999 | cases | 745 |
Afghanistan | 1999 | population | 19987071 |
Afghanistan | 2000 | cases | 2666 |
Afghanistan | 2000 | population | 20595360 |
Brazil | 1999 | cases | 37737 |
Brazil | 1999 | population | 172006362 |
Brazil | 2000 | cases | 80488 |
Brazil | 2000 | population | 174504898 |
China | 1999 | cases | 212258 |
China | 1999 | population | 1272915272 |
China | 2000 | cases | 213766 |
China | 2000 | population | 1280428583 |
Here, we have 4 columns again, but we now have 12 rows (instead of 6): the `type` column indicates which of the two numerical observations (cases or population) is recorded in that row, and the `count` column stores the value. This form of the data is more easily plotted in ggplot2 if we want to show trend lines for both cases and population, but computing per capita cases would be much more difficult in this form than in the arrangement in Table 1!
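If we did need the tidy layout of Table 1, one route back is to pivot the type/count pair out into separate columns. A minimal sketch using the built-in `table2`:

```r
library(tidyr)

# Spread the two values of `type` back out into cases and population columns
table2 |>
  pivot_wider(names_from = type, values_from = count)
```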

Table 3

country | year | rate |
---|---|---|
Afghanistan | 1999 | 745/19987071 |
Afghanistan | 2000 | 2666/20595360 |
Brazil | 1999 | 37737/172006362 |
Brazil | 2000 | 80488/174504898 |
China | 1999 | 212258/1272915272 |
China | 2000 | 213766/1280428583 |
This form has only 3 columns, because the rate variable (which is a character) stores both the case count and the population. We can’t do anything with this format as it stands, because we can’t do math on data stored as characters. However, this form might be easier to read and record for a human being.
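To make this layout usable, we would first split `rate` into its two pieces and convert them to numbers. A minimal sketch using the built-in `table3` (this uses `separate_wider_delim()`, which requires tidyr 1.3+; the older `separate()` with `convert = TRUE` is an alternative):

```r
library(tidyr)
library(dplyr)

# Split "cases/population" at the slash, then convert the pieces to numbers
table3 |>
  separate_wider_delim(rate, delim = "/", names = c("cases", "population")) |>
  mutate(across(c(cases, population), as.numeric))
```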

Table 4a

country | 1999 | 2000 |
---|---|---|
Afghanistan | 745 | 2666 |
Brazil | 37737 | 80488 |
China | 212258 | 213766 |

Table 4b

country | 1999 | 2000 |
---|---|---|
Afghanistan | 19987071 | 20595360 |
Brazil | 172006362 | 174504898 |
China | 1272915272 | 1280428583 |
In this form, we have two tables - one for cases (Table 4a) and one for population (Table 4b) - with each year’s observations in a separate column. This format is often found in separate sheets of an Excel workbook. To work with this data, we’ll need to transform each table so that there is a column indicating which year an observation is from, and then merge the two tables together by country and year.
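A minimal sketch of that sequence, using the built-in `table4a` (cases) and `table4b` (population):

```r
library(tidyr)
library(dplyr)

# Stack the year columns in each table...
cases_long <- table4a |>
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")

population_long <- table4b |>
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "population")

# ...then merge the two tables by country and year
left_join(cases_long, population_long, by = c("country", "year"))
```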

Table 5

country | century | year | rate |
---|---|---|---|
Afghanistan | 19 | 99 | 745/19987071 |
Afghanistan | 20 | 00 | 2666/20595360 |
Brazil | 19 | 99 | 37737/172006362 |
Brazil | 20 | 00 | 80488/174504898 |
China | 19 | 99 | 212258/1272915272 |
China | 20 | 00 | 213766/1280428583 |
Table 5 is very similar to Table 3, but the year has been separated into two columns - century, and year. This is more common with year, month, and day in separate columns (or date and time in separate columns), often to deal with the fact that spreadsheets don’t always handle dates the way you’d hope they would.
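A minimal sketch of re-assembling the year, using the built-in `table5` (the result is still a character column, and `rate` would still need to be split as in Table 3):

```r
library(tidyr)

# Paste century and year back together into a single four-digit year
table5 |>
  unite("year", century, year, sep = "")
```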
In a perfect world, all data would come in the right format for our needs, but this is often not the case. We will spend the next few weeks learning about how to use R to reformat our data to follow the tidy data framework and see why this is so important. By the end of this chapter, you will have the skills needed to wrangle and transform the most common “messy” data sets into “tidy” form.
The concept of tidy data is useful for mapping variables from the data set to elements of a graph, specifications of a model, or aggregated summaries. However, what counts as “tidy” format for one task might not be the right format for a different task, so it is important to consider the end goal when restructuring your data.
Part of this course is building the skills to map out the sequence of data operations that takes an original data set to the format (and output) you need.
2 Part Two: Reshaping Data
📖 Required Reading: R4DS Chapter 5 (Data tidying)
▶️ Required Video: Data Layouts – 7 minutes
▶️ Required Video: Pivoting Longer – 9 minutes
▶️ Required Video: Pivoting Wider – 9 minutes
✅ Check-in 4.1: Practice with Pivoting
Load in the cereal data set:
```r
library(liver)
data(cereal)
head(cereal)
```
```
                       name manuf type calories protein fat sodium fiber carbo
1                 100% Bran     N cold       70       4   1    130  10.0   5.0
2         100% Natural Bran     Q cold      120       3   5     15   2.0   8.0
3                  All-Bran     K cold       70       4   1    260   9.0   7.0
4 All-Bran with Extra Fiber     K cold       50       4   0    140  14.0   8.0
5            Almond Delight     R cold      110       2   2    200   1.0  14.0
6   Apple Cinnamon Cheerios     G cold      110       2   2    180   1.5  10.5
  sugars potass vitamins shelf weight cups   rating
1      6    280       25     3      1 0.33 68.40297
2      8    135        0     3      1 1.00 33.98368
3      5    320       25     3      1 0.33 59.42551
4      0    330       25     3      1 0.50 93.70491
5      8     -1       25     3      1 0.75 34.38484
6     10     70       25     1      1 0.75 29.50954
```
Question 1: Create a new dataset called `cereals_long` that has three columns:

- The `name` of the cereal
- A column called `Nutrient` with values `"protein"`, `"fat"`, or `"fiber"`
- A column called `Amount` with the corresponding amount of the nutrient

You are expected to use `pivot_longer()` to perform this operation!
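If you need a refresher on the pattern before attempting this, here is a minimal sketch on toy data (deliberately not the check-in solution):

```r
library(tidyr)

# Toy data, just to illustrate the pivot_longer() pattern
scores <- tibble(
  student = c("Ali", "Bo"),
  quiz1   = c(80, 90),
  quiz2   = c(85, 95)
)

scores |>
  pivot_longer(
    cols      = c(quiz1, quiz2),  # columns to stack
    names_to  = "quiz",           # new column holding the old column names
    values_to = "score"           # new column holding the values
  )
```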
3 Part Three: Joining data
📖 Required Reading: R4DS Chapter 21 (Joins)
Because these chapters do a much better job visualizing the concepts, I’ve chosen not to record additional videos.
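As a quick reminder before the check-in, here is a minimal sketch of the two most common join verbs from dplyr, using made-up toy data (not the check-in datasets):

```r
library(tidyr)
library(dplyr)

# Toy tables: every student has a major, but not every student has an advisor
students <- tibble(
  student = c("Ali", "Bo", "Cam"),
  major   = c("Statistics", "Biology", "Statistics")
)
advisors <- tibble(
  student = c("Ali", "Bo"),
  advisor = c("Dr. X", "Dr. Y")
)

# Keep every row of students; unmatched rows get NA for advisor
left_join(students, advisors, by = "student")

# Keep only students that appear in both tables
inner_join(students, advisors, by = "student")
```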
✅ Check-in 4.2: Practice with Joins
The following code creates three datasets / tibbles:
```r
library(tidyr)  # tibble() is re-exported by tidyr

# Build the three data sets used in Check-in 4.2
prof_info <- tibble(
  professor = c("Bodwin", "Theobold", "Robinson", "Mann", "Ruiz"),
  undergrad_school = c("Harvard", "Colorado Mesa University", "Winona State University",
                       "Carlton College", "Reed College"),
  grad_school = c("UNC", "Montana State University", "University of Nebraska-Lincoln",
                  "University of Michigan", "Oregon State University")
)

prof_course <- tibble(
  professor = c("Bodwin", "Robinson", "Theobold", "Mann", "Carlton"),
  Stat_331 = c(TRUE, TRUE, TRUE, TRUE, TRUE),
  Stat_330 = c(FALSE, TRUE, FALSE, FALSE, FALSE),
  Stat_431 = c(TRUE, TRUE, TRUE, TRUE, FALSE)
)

course_info <- tibble(
  course = c("Stat_331", "Stat_330", "Stat_431"),
  num_sections = c(8, 3, 1)
)
```
Here is what they look like once created:
```r
prof_info
```
```
# A tibble: 5 × 3
  professor undergrad_school         grad_school
  <chr>     <chr>                    <chr>
1 Bodwin    Harvard                  UNC
2 Theobold  Colorado Mesa University Montana State University
3 Robinson  Winona State University  University of Nebraska-Lincoln
4 Mann      Carlton College          University of Michigan
5 Ruiz      Reed College             Oregon State University
```

```r
prof_course
```
```
# A tibble: 5 × 4
  professor Stat_331 Stat_330 Stat_431
  <chr>     <lgl>    <lgl>    <lgl>
1 Bodwin    TRUE     FALSE    TRUE
2 Robinson  TRUE     TRUE     TRUE
3 Theobold  TRUE     FALSE    TRUE
4 Mann      TRUE     FALSE    TRUE
5 Carlton   TRUE     FALSE    FALSE
```

```r
course_info
```
```
# A tibble: 3 × 2
  course   num_sections
  <chr>           <dbl>
1 Stat_331            8
2 Stat_330            3
3 Stat_431            1
```
These datasets contain information about five Cal Poly professors, their educational history, the classes they are able to teach, and the number of sections of each class that need to be assigned.
a) Combine datasets 1 and 2 (`prof_info` and `prof_course`) to make this dataset:
```
professor undergrad_school         grad_school          Stat_331 Stat_330 Stat_431
Bodwin    Harvard                  UNC                  TRUE     FALSE    TRUE
Theobold  Colorado Mesa University Montana State Unive… TRUE     FALSE    TRUE
Robinson  Winona State University  University of Nebra… TRUE     TRUE     TRUE
Mann      Carlton College          University of Michi… TRUE     FALSE    TRUE
```
b) Combine datasets 1 and 2 (`prof_info` and `prof_course`) to make this dataset:
```
professor undergrad_school         grad_school          Stat_331 Stat_330 Stat_431
Bodwin    Harvard                  UNC                  TRUE     FALSE    TRUE
Theobold  Colorado Mesa University Montana State Unive… TRUE     FALSE    TRUE
Robinson  Winona State University  University of Nebra… TRUE     TRUE     TRUE
Mann      Carlton College          University of Michi… TRUE     FALSE    TRUE
Ruiz      Reed College             Oregon State Univer… NA       NA       NA
```
c) Combine datasets 2 and 3 (`prof_course` and `course_info`) to make this dataset:
```
professor course   can_teach num_sections
Bodwin    Stat_331 TRUE                 8
Bodwin    Stat_330 FALSE                3
Bodwin    Stat_431 TRUE                 1
Robinson  Stat_331 TRUE                 8
Robinson  Stat_330 TRUE                 3
Robinson  Stat_431 TRUE                 1
Theobold  Stat_331 TRUE                 8
Theobold  Stat_330 FALSE                3
Theobold  Stat_431 TRUE                 1
Mann      Stat_331 TRUE                 8
Mann      Stat_330 FALSE                3
Mann      Stat_431 TRUE                 1
Carlton   Stat_331 TRUE                 8
Carlton   Stat_330 FALSE                3
Carlton   Stat_431 FALSE                1
```