Data Joins and Transformations

Broadly, your objective while reading these chapters is to be able to identify data sets which have “messy” formats and determine a sequence of operations to transition the data into “tidy” format. To do this, you should master the following concepts:


▶️ Watch Videos: 25-minutes

📖 Readings: 75-minutes

💻 Activities: 3

✅ Check-ins: 2


1 Part One: Tidy Data

The illustrations below are lifted from an excellent blog post (Lowndes and Horst 2020) about tidy data; they’re reproduced here because

  1. they’re beautiful and licensed as CCA-4.0-by, and
  2. they might be more memorable than the equivalent paragraphs of text without illustration.

Most of the time, data does not come in a format suitable for analysis. Spreadsheets are generally optimized for data entry or viewing, rather than for statistical analysis:

  • Tables may be laid out for easy data entry, so that there are multiple observations in a single row
  • It may be visually preferable to arrange columns of data to show multiple times or categories on the same row for easy comparison

When we analyze data, however, we care much more about the fundamental structure of observations: discrete units of data collection. Each observation may have several corresponding variables that may be measured simultaneously, but fundamentally each discrete data point is what we are interested in analyzing or plotting.

The structure of tidy data reflects this preference for keeping the data in a fundamental form: each observation is in its own row, any observed variables are in single columns. This format is inherently rectangular, which is also important for statistical analysis - our methods are typically designed to work with matrices of data.

Stylized text providing an overview of Tidy Data. The top reads 'Tidy data is a standard way of mapping the meaning of a dataset to its structure. - Hadley Wickham.' On the left reads 'In tidy data: each variable forms a column; each observation forms a row; each cell is a single measurement.' There is an example table on the lower right with columns ‘id’, ‘name’ and ‘color’ with observations for different cats, illustrating tidy data structure.

Figure 1: Tidy data format, illustrated.

There are two sets of anthropomorphized data tables. The top group of three tables are all rectangular and smiling, with a shared speech bubble reading 'our columns are variables and our rows are observations!'. Text to the left of that group reads “The standard structure of tidy data means that 'tidy datasets are all alike…' The lower group of four tables are all different shapes, look ragged and concerned, and have different speech bubbles reading (from left to right) 'my column are values and my rows are variables', 'I have variables in columns AND in rows', 'I have multiple variables in a single column', and 'I don’t even KNOW what my deal is.' Next to the frazzled data tables is text '...but every messy dataset is messy in its own way. -Hadley Wickham.'

Figure 2: An illustration of the principle that every messy dataset is messy in its own way.

The preference for tidy data has several practical implications: it is easier to reuse code on tidy data, allowing for analysis using a standardized set of tools (rather than having to build a custom tool for each data analysis job).

On the left is a happy cute fuzzy monster holding a rectangular data frame with a tool that fits the data frame shape. On the workbench behind the monster are other data frames of similar rectangular shape, and neatly arranged tools that also look like they would fit those data frames. The workbench looks uncluttered and tidy. The text above the tidy workbench reads 'When working with tidy data, we can use the same tools in similar ways for different datasets…' On the right is a cute monster looking very frustrated, using duct tape and other tools to haphazardly tie data tables together, each in a different way. The monster is in front of a messy, cluttered workbench. The text above the frustrated monster reads '...but working with untidy data often means reinventing the wheel with one-time approaches that are hard to iterate or reuse.'

Figure 3: Tidy data is easier to manage because the same tools and approaches apply to multiple datasets.

In addition, standardized tools for data analysis means that it is easier to collaborate with others: if everyone starts with the same set of assumptions about the dataset, you can borrow methods and tools from a collaborator’s analysis and easily apply them to your own dataset.

Two happy looking round fuzzy monsters, each holding a similarly shaped wrench with the word wrangle' on it. Between their tools is held up a rectangular data table labeled 'TIDY.'

Figure 4: Collaboration with tidy data.

Cute fuzzy monsters putting rectangular data tables onto a conveyor belt. Along the conveyor belt line are different automated 'stations' that update the data, reading 'WRANGLE', 'VISUALIZE', and 'MODEL'. A monster at the end of the conveyor belt is carrying away a table that reads 'Complete analysis.'

Figure 5: Tidy data enables standardized workflows.

Tidy data makes it easier to collaborate with others and analyze new data using standardized workflows.

1.1 Examples: Messy Data

These datasets all display the same data: TB (Tuberculosis) cases documented by the WHO (World Health Organization) in Afghanistan, Brazil, and China, between 1999 and 2000. There are four variables: country, year, cases, and population, but each table has a different layout.

  • For each of the data set, determine whether each table is tidy.
    • If it is not, identify which rule(s) it violates.
  • What would you have to do to the data to be able to compute a standardized TB infection rate per 100,000 people?
Tip

All of these data sets are “built-in” to the tidyr package!

Table 1
country year cases population
Afghanistan 1999 745 19987071
Afghanistan 2000 2666 20595360
Brazil 1999 37737 172006362
Brazil 2000 80488 174504898
China 1999 212258 1272915272
China 2000 213766 1280428583

Here, each observation is a single row, each variable is a column, and everything is nicely arranged for e.g. regression or statistical analysis. We can easily compute another measure, such as cases per 100,000 population, by taking cases / population * 100000 (this would define a new column).

Table 2
country year type count
Afghanistan 1999 cases 745
Afghanistan 1999 population 19987071
Afghanistan 2000 cases 2666
Afghanistan 2000 population 20595360
Brazil 1999 cases 37737
Brazil 1999 population 172006362
Brazil 2000 cases 80488
Brazil 2000 population 174504898
China 1999 cases 212258
China 1999 population 1272915272
China 2000 cases 213766
China 2000 population 1280428583

Here, we have 4 columns again, but we now have 12 rows (instead of 6): one of the columns is an indicator of which of two numerical observations is recorded in that row; a second column stores the value (e.g., cases, population). This form of the data is more easily plotted in ggplot2, if we want to show trend lines for both cases and population, but computing per capita cases would be much more difficult in this form than in the arrangement in Table 1!

Table 3
country year rate
Afghanistan 1999 745/19987071
Afghanistan 2000 2666/20595360
Brazil 1999 37737/172006362
Brazil 2000 80488/174504898
China 1999 212258/1272915272
China 2000 213766/1280428583

This form has only 3 columns, because the rate variable (which is a character) stores both the case count and the population. We can’t do anything with this format as it stands, because we can’t do math on data stored as characters. However, this form might be easier to read and record for a human being.

Table 4a
country 1999 2000
Afghanistan 745 2666
Brazil 37737 80488
China 212258 213766
Table 4b
country 1999 2000
Afghanistan 19987071 20595360
Brazil 172006362 174504898
China 1272915272 1280428583

In this form, we have two tables - one for population, and one for cases. Each year’s observations are in a separate column. This format is often found in separate sheets of an Excel workbook. To work with this data, we’ll need to transform each table so that there is a column indicating which year an observation is from, and then merge the two tables together by country and year.

Table 5
country century year rate
Afghanistan 19 99 745/19987071
Afghanistan 20 00 2666/20595360
Brazil 19 99 37737/172006362
Brazil 20 00 80488/174504898
China 19 99 212258/1272915272
China 20 00 213766/1280428583

Table 5 is very similar to Table 3, but the year has been separated into two columns - century, and year. This is more common with year, month, and day in separate columns (or date and time in separate columns), often to deal with the fact that spreadsheets don’t always handle dates the way you’d hope they would.

In a perfect world, all data would come in the right format for our needs, but this is often not the case. We will spend the next few weeks learning about how to use R to reformat our data to follow the tidy data framework and see why this is so important. By the end of this chapter, you will have the skills needed to wrangle and transform the most common “messy” data sets into “tidy” form.

Do we always want our data in the same layout?

The concept of tidy data is useful for mapping variables from the data set to elements in a graph, specifications of a model, or aggregating to create summaries. However, what is considered to be “tidy data” format for one task, might not be in the correct “tidy data” format for a different task. It is important for you to consider the end goal when restructuring your data.

Part of this course is building the skills for you to be able to map your data operation steps from an original data set to the correct format (and output).

2 Part Two: Reshaping Data

📖 Required Reading: R4DS Chapter 5 (Data tidying)

Hexagonal sticker for the 'tidyr' package from RStudio. The design features a tidy broom sweeping across, symbolizing data cleaning and organization. The background is a soft gradient of blues, and the package name 'tidyr' is prominently displayed in white near the bottom.

▶️ Required Video: Data Layouts – 7 minutes

▶️ Required Video: Pivoting Longer – 9 minutes

▶️ Required Video: Pivoting Wider – 9 minutes

Check-in 4.1: Practice with Pivoting

Load in the cereal data set:

library(liver)

data(cereal)

head(cereal)
                       name manuf type calories protein fat sodium fiber carbo
1                 100% Bran     N cold       70       4   1    130  10.0   5.0
2         100% Natural Bran     Q cold      120       3   5     15   2.0   8.0
3                  All-Bran     K cold       70       4   1    260   9.0   7.0
4 All-Bran with Extra Fiber     K cold       50       4   0    140  14.0   8.0
5            Almond Delight     R cold      110       2   2    200   1.0  14.0
6   Apple Cinnamon Cheerios     G cold      110       2   2    180   1.5  10.5
  sugars potass vitamins shelf weight cups   rating
1      6    280       25     3      1 0.33 68.40297
2      8    135        0     3      1 1.00 33.98368
3      5    320       25     3      1 0.33 59.42551
4      0    330       25     3      1 0.50 93.70491
5      8     -1       25     3      1 0.75 34.38484
6     10     70       25     1      1 0.75 29.50954

Question 1: Create a new dataset called cereals_long, that has three columns:

  • The name of the cereal

  • A column called Nutrient with values "protein", "fat", or "fiber".

  • A column called Amount with the corresponding amount of the nutrient.

Caution

You are expected to use pivot_longer() to perform this operation!

3 Part Three: Joining data

📖 Required Reading: R4DS Chapter 21 (Joins)

Note

Because these chapters do a much better job visualizing the concepts, I’ve chosen not to record additional videos.

Check-in 4.2: Practice with Joins

The following code creates three datasets / tibbles:

prof_info <- tibble(
  professor = c("Bodwin", 
                "Theobold", 
                "Robinson",
                "Mann", 
                "Ruiz"),
  undergrad_school = c("Harvard", 
                       "Colorado Mesa University",
                       "Winona State University",
                       "Carlton College", 
                       "Reed College"),
  grad_school = c("UNC", 
                  "Montana State University", 
                  "University of Nebraska-Lincoln",
                  "University of Michigan", 
                  "Oregon State University")
)

prof_course <- tidyr::tibble(
  professor = c("Bodwin", 
                "Robinson", 
                "Theobold", 
                "Mann", 
                "Carlton"),
  Stat_331 = c(TRUE, 
               TRUE, 
               TRUE, 
               TRUE, 
               TRUE),
  Stat_330 = c(FALSE, 
               TRUE, 
               FALSE, 
               FALSE, 
               FALSE),
  Stat_431 = c(TRUE, 
               TRUE, 
               TRUE, 
               TRUE, 
               FALSE)
)

course_info <- tibble(
  course = c("Stat_331", 
             "Stat_330", 
             "Stat_431"),
  num_sections = c(8, 
                   3, 
                   1)
)

Here is what they look like once created:

prof_info
# A tibble: 5 × 3
  professor undergrad_school         grad_school                   
  <chr>     <chr>                    <chr>                         
1 Bodwin    Harvard                  UNC                           
2 Theobold  Colorado Mesa University Montana State University      
3 Robinson  Winona State University  University of Nebraska-Lincoln
4 Mann      Carlton College          University of Michigan        
5 Ruiz      Reed College             Oregon State University       
prof_course 
# A tibble: 5 × 4
  professor Stat_331 Stat_330 Stat_431
  <chr>     <lgl>    <lgl>    <lgl>   
1 Bodwin    TRUE     FALSE    TRUE    
2 Robinson  TRUE     TRUE     TRUE    
3 Theobold  TRUE     FALSE    TRUE    
4 Mann      TRUE     FALSE    TRUE    
5 Carlton   TRUE     FALSE    FALSE   
course_info
# A tibble: 3 × 2
  course   num_sections
  <chr>           <dbl>
1 Stat_331            8
2 Stat_330            3
3 Stat_431            1

These datasets contain information about five Cal Poly professors, their educational history, the classes they are able to teach, and the number of sections of each class that need to be assigned.

a) Combine datasets 1 and 2 to make this dataset:

professor undergrad_school         grad_school          Stat_331 Stat_330 Stat_431
Bodwin    Harvard                  UNC                  TRUE     FALSE    TRUE    
Theobold  Colorado Mesa University Montana State Unive… TRUE     FALSE    TRUE    
Robinson  Winona State University  University of Nebra… TRUE     TRUE     TRUE    
Mann      Carlton College          University of Michi… TRUE     FALSE    TRUE

b) Combine datasets 1 and 2 to make this dataset:

professor undergrad_school         grad_school          Stat_331 Stat_330 Stat_431
Bodwin    Harvard                  UNC                  TRUE     FALSE    TRUE    
Theobold  Colorado Mesa University Montana State Unive… TRUE     FALSE    TRUE    
Robinson  Winona State University  University of Nebra… TRUE     TRUE     TRUE  
Mann      Carlton College          University of Michi… TRUE     FALSE    TRUE    
Ruiz      Reed College             Oregon State Univer… NA       NA       NA

c) Combine datasets 2 and 3 to make this dataset:

professor course   can_teach num_sections
Bodwin    Stat_331 TRUE                 8
Bodwin    Stat_330 FALSE                3
Bodwin    Stat_431 TRUE                 1
Robinson  Stat_331 TRUE                 8
Robinson  Stat_330 TRUE                 3
Robinson  Stat_431 TRUE                 1
Theobold  Stat_331 TRUE                 8
Theobold  Stat_330 FALSE                3
Theobold  Stat_431 TRUE                 1
Mann      Stat_331 TRUE                 8
Mann      Stat_330 FALSE                3
Mann      Stat_431 TRUE                 1
Carlton   Stat_331 TRUE                 8
Carlton   Stat_330 FALSE                3
Carlton   Stat_431 FALSE                1

References

Lowndes, Julie, and Allison Horst. 2020. “Tidy Data for Efficiency, Reproducibility, and Collaboration.” Blog. Openscapes. https://www.openscapes.org/blog/2020/10/12/tidy-data//.