Data Joins + Pivots

Tuesday, October 15

Today we will discuss…

  • Discord & Tokens
  • Grade Expectations
    • Survey on Canvas
  • New Material
    • Pivoting data with tidyr
    • Joining data with dplyr
  • PA 4: Military Spending

Discord & Deadline Extension Tokens

If you answer or respond to another student's question on Discord, you will earn one additional "token" (a 3-day deadline extension request).

Grade Expectations

The course syllabus has been updated to reflect the grade criteria you all proposed.

You have until Sunday to vote on whether you agree with the proposed criteria, and to suggest ways these criteria could be revised.

Data Layouts

Tidy Data

Tidy data…

  • is rectangular.
  • has observations as rows and variables as columns.
  • has different formats for different tasks.
This image shows a visual representation of the three key concepts in tidy data: variables, observations, and values. It is divided into three sections. The first section on the left highlights vertical arrows pointing down each column of a table, labeled 'variables,' showing that each column represents a different variable (country, year, cases, population). The middle section has horizontal arrows pointing across rows of the same table, labeled 'observations,' indicating that each row is an observation or data point. The third section on the right has circles around individual data points within the table, labeled 'values,' indicating the actual values for each variable in the dataset.

Creating Tidy Data

We may need to transform our data into the version of tidy that best suits the task at hand.

An illustration of two round, fluffy characters, one green and one purple, holding a large table between them labeled 'TIDY.' The table is held by clamps on either side labeled 'WRANGLE' in orange and pink. The green character on the left is smiling with one arm raised, while the purple character on the right is cheering with both arms up. The table represents tidy data, and the clamps labeled 'WRANGLE' suggest data manipulation or preparation.

Image by Allison Horst

Cereal Data

library(liver)
data(cereal)
head(cereal)
name manuf type calories protein fat sodium fiber carbo sugars potass vitamins shelf weight cups rating
100% Bran N cold 70 4 1 130 10.0 5.0 6 280 25 3 1 0.33 68.40297
100% Natural Bran Q cold 120 3 5 15 2.0 8.0 8 135 0 3 1 1.00 33.98368
All-Bran K cold 70 4 1 260 9.0 7.0 5 320 25 3 1 0.33 59.42551
All-Bran with Extra Fiber K cold 50 4 0 140 14.0 8.0 0 330 25 3 1 0.50 93.70491
Almond Delight R cold 110 2 2 200 1.0 14.0 8 -1 25 3 1 0.75 34.38484
Apple Cinnamon Cheerios G cold 110 2 2 180 1.5 10.5 10 70 25 1 1 0.75 29.50954

Creating Tidy Data

Let’s say we want to look at mean cereal nutrients based on shelf.

Currently, the data are in a wide format – a separate column for each nutrient.

Transforming the data will make plotting easier.

Tidying the Cereals Data

Code
cereal_wide <- cereal |> 
  group_by(shelf) |> 
  summarise(
    # take the mean of every nutrient column in one call
    across(.cols = calories:vitamins, 
           .fns = ~ mean(.x)
           )
    )
shelf calories protein fat sodium fiber carbo sugars potass vitamins
1 102.5000 2.650000 0.60 176.2500 1.6850000 15.80000 4.800000 75.50000 20.00000
2 109.5238 1.904762 1.00 145.7143 0.9047619 13.61905 9.619048 57.80952 23.80952
3 107.7778 2.861111 1.25 158.6111 3.1388889 14.50000 6.527778 129.83333 35.41667
Code
my_colors <- c("calories_col" = "steelblue", "sugars_col" = "orange3")

cereal_wide |> 
  ggplot() +
  geom_point(mapping = aes(x = shelf, y = calories, color = "calories_col")) +
  geom_line(mapping = aes(x = shelf, y = calories, color = "calories_col")) + 
  geom_point(mapping = aes(x = shelf, y = sugars, color = "sugars_col")) +
  geom_line(mapping = aes(x = shelf, y = sugars, color = "sugars_col")) +
  scale_color_manual(values = my_colors, labels = names(my_colors)) +
  labs(x = "Shelf", y = "", subtitle = "Mean Amount", color = "Nutrient")

A line chart displaying the relationship between the mean amount of two nutrients ('calories_col' and 'sugars_col') and their corresponding shelf number (1, 2, 3). The y-axis represents the mean amount, ranging from 0 to 100, and the x-axis represents the shelf numbers (1, 2, 3). The blue line represents 'calories_col,' with a relatively flat trend around 90 across the shelves. The orange line represents 'sugars_col,' which remains close to zero across the same shelves. The legend on the right identifies the nutrient lines.

Code
cereal_long <- cereal |> 
  pivot_longer(cols = calories:vitamins,
               names_to = "Nutrient",
               values_to = "Amount") |> 
  group_by(shelf, Nutrient) |> 
  summarise(mean_amount = mean(Amount))
shelf Nutrient mean_amount
1 calories 102.5000000
1 carbo 15.8000000
1 fat 0.6000000
1 fiber 1.6850000
1 potass 75.5000000
1 protein 2.6500000
1 sodium 176.2500000
1 sugars 4.8000000
1 vitamins 20.0000000
2 calories 109.5238095
2 carbo 13.6190476
2 fat 1.0000000
2 fiber 0.9047619
2 potass 57.8095238
2 protein 1.9047619
2 sodium 145.7142857
2 sugars 9.6190476
2 vitamins 23.8095238
3 calories 107.7777778
3 carbo 14.5000000
3 fat 1.2500000
3 fiber 3.1388889
3 potass 129.8333333
3 protein 2.8611111
3 sodium 158.6111111
3 sugars 6.5277778
3 vitamins 35.4166667
Code
cereal_long |> 
  ggplot(mapping = aes(x = shelf, 
                       y = mean_amount, 
                       color = Nutrient)
         ) +
  geom_point() +
  geom_line() +
  labs(x = "Shelf", 
       y = "", 
       subtitle = "Mean Amount")

Pivoting Data

Two tables side by side, illustrating the difference between 'wide' and 'long' data formats. The left table labeled 'wide' has columns: 'id,' 'x,' 'y,' and 'z.' The 'id' column contains the values 1 and 2, and the 'x,' 'y,' and 'z' columns contain the values a, b, c, d, e, and f spread across the rows. The right table labeled 'long' has columns: 'id,' 'key,' and 'val.' The 'id' column has repeated values (1 and 2), the 'key' column contains 'x,' 'y,' and 'z,' and the 'val' column contains 'a' through 'f,' corresponding to the wide table but reorganized in a 'long' format.

A gif showing the visual transformation of the data pictured, going from the wide format to the long format. Above the image is the R code that would produce each data layout.
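The wide-to-long transformation pictured above takes only a few lines of R. A minimal sketch, rebuilding the figure's tables by hand:

library(dplyr)
library(tidyr)

# The wide table from the figure
wide <- tibble(
  id = c(1, 2),
  x  = c("a", "d"),
  y  = c("b", "e"),
  z  = c("c", "f")
)

# Pivot columns x, y, z into key/val pairs to get the long table
wide |> 
  pivot_longer(cols = x:z, names_to = "key", values_to = "val")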

Manual Method

Consider daily rainfall observed in SLO in January 2023.

  • The data is in a human-friendly form (like a calendar).
  • Each week has a row, and each day has a column.
A spreadsheet displaying a table with days of the week as column headers (Sunday to Saturday) and weeks numbered 1 to 5 in the first column. The table contains numerical values corresponding to each day of the week for each week. Some cells contain 'NA,' indicating missing values, while other cells contain numerical values like 0.12, 0.27, 4.26, and so on. Several cells contain zeros, and a few cells are left empty.

Data source

How would you manually convert this to long format?

Manual Method: Steps

  1. Keep the column Week.
  2. Create a new column Day_of_Week.
  3. Create a new column Rainfall (to hold the daily rainfall values).
  4. Now we have three columns – move the Sunday values over.
  5. Duplicate Weeks 1-5 and copy the Monday values over.
  6. Duplicate Weeks 1-5 and copy the Tuesday values over.
  7. Continue for the rest of the days of the week.

After step 5, the data look like this:

A table with three columns: 'Week,' 'Day_of_Week,' and 'Rainfall.' The 'Week' column contains numbers 1 to 5, while the 'Day_of_Week' column alternates between 'Sunday' and 'Monday.' The 'Rainfall' column lists numerical values for each week and day. For Sunday, the rainfall values are 0, 0.27, 0.34, 0, and 'NA' for weeks 1 through 5, respectively. For Monday, the rainfall values are 0.12, 4.26, 0.33, 0, and 'NA' for weeks 1 through 5, respectively.
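Written out as code, the manual approach is "copy one day's column, then stack" repeated seven times. A minimal sketch, assuming the slo_rainfall data frame used on the next slides (columns Week and Sunday through Saturday):

library(dplyr)

days <- c("Sunday", "Monday", "Tuesday", "Wednesday",
          "Thursday", "Friday", "Saturday")

# For each day: keep Week, record the day's name, and copy that
# day's rainfall column -- then stack the seven pieces.
rainfall_long <- bind_rows(
  lapply(days, \(day) {
    slo_rainfall |> 
      transmute(Week, Day_of_Week = day, Rainfall = .data[[day]])
  })
)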

Computational Approach

A table showing rainfall data for each day of the week across five weeks, with annotations explaining how to reshape the data from wide to long format. The table has columns for 'Week,' followed by the days of the week (Sunday to Saturday), and contains numerical values representing rainfall. There are three color-coded annotations: the 'Week' column is highlighted in green, labeled 'key' with the note 'keep this column, but repeat it many times.' The day names (Sunday through Saturday) are highlighted in red with the note 'turn into new column: Day_of_Week (repeat day of week many times).' The data cells are highlighted in blue with the note 'turn into new column: Rainfall (values for each week # and day of week).'

We can use pivot_longer() to turn a wide dataset into a long(er) dataset.

pivot_longer()

Take a wide dataset and turn it into a long dataset.

  • cols – specify the columns that should be pivoted.
    • Do not include the names of ID columns (columns that should not be pivoted).
  • names_to – the name of the new column containing the old column names.
  • values_to – the name of the new column containing the old column values.

pivot_longer()

slo_rainfall |> 
  pivot_longer(cols = Sunday:Saturday,
               names_to  = "Day_of_Week",
               values_to = "Daily_Rainfall")
Week Day_of_Week Daily_Rainfall
1 Sunday 0.00
1 Monday 0.12
1 Tuesday 0.00
1 Wednesday 1.58
1 Thursday 0.91
1 Friday 0.00
1 Saturday 0.05
2 Sunday 0.27
2 Monday 4.26
2 Tuesday 0.43
2 Wednesday 0.00
2 Thursday 0.00
2 Friday 0.16
2 Saturday 1.41
3 Sunday 0.34
3 Monday 0.33
3 Tuesday 0.00
3 Wednesday 0.00
3 Thursday 0.13
3 Friday 0.00
3 Saturday 0.00
4 Sunday 0.00
4 Monday 0.00
4 Tuesday 0.00
4 Wednesday 0.00
4 Thursday 0.00
4 Friday 0.00
4 Saturday NA
5 Sunday NA
5 Monday NA
5 Tuesday NA
5 Wednesday NA
5 Thursday NA
5 Friday NA
5 Saturday NA

pivot_wider()

Take a long dataset and turn it into a wide dataset.

  • id_cols – specify the column(s) that contain the ID for unique rows in the wide dataset.
  • names_from – the name of the column containing the new column names.
  • values_from – the name of the column containing the new column values.
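Before the cereal example, here is a minimal sketch of these three arguments on a toy long table (hypothetical data, not from the slides):

library(dplyr)
library(tidyr)

# One row per (id, key) pair
long <- tibble(
  id  = c(1, 1, 1, 2, 2, 2),
  key = c("x", "y", "z", "x", "y", "z"),
  val = c("a", "b", "c", "d", "e", "f")
)

# Each unique id becomes one row; key supplies the new column
# names, and val fills in their values.
long |> 
  pivot_wider(id_cols = id, names_from = key, values_from = val)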

pivot_wider()

Let’s say we calculate the mean amount of protein for cereals on each shelf and for each manuf.

mean_protein <- cereal |> 
  group_by(manuf, shelf) |> 
  summarize(mean_protein = mean(protein))
manuf shelf mean_protein
A 2 4.000000
G 1 3.000000
G 2 1.285714
G 3 2.666667
K 1 2.750000
K 2 2.142857
K 3 2.916667
N 1 2.666667
N 2 2.500000
N 3 4.000000
P 1 1.500000
P 2 1.000000
P 3 3.000000
Q 1 5.000000
Q 2 2.000000
Q 3 2.500000
R 1 2.000000
R 3 3.000000

pivot_wider()

We can make this dataset more easily readable…

mean_protein |> 
  arrange(shelf) |> 
  pivot_wider(id_cols = manuf,
              names_from = shelf,
              values_from = mean_protein)
manuf 1 2 3
G 3.000000 1.285714 2.666667
K 2.750000 2.142857 2.916667
N 2.666667 2.500000 4.000000
P 1.500000 1.000000 3.000000
Q 5.000000 2.000000 2.500000
R 2.000000 NA 3.000000
A NA 4.000000 NA

Better names in pivot_wider()

mean_protein |> 
  arrange(shelf) |> 
  pivot_wider(id_cols = manuf,
              names_from = shelf,
              values_from = mean_protein,
              names_prefix = "Shelf_")
manuf Shelf_1 Shelf_2 Shelf_3
G 3.000000 1.285714 2.666667
K 2.750000 2.142857 2.916667
N 2.666667 2.500000 4.000000
P 1.500000 1.000000 3.000000
Q 5.000000 2.000000 2.500000
R 2.000000 NA 3.000000
A NA 4.000000 NA
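This pivot is also reversible: pivot_longer() can strip the prefix back off. A sketch, assuming the wide table above was saved as protein_wide (a name not used on the slide):

protein_wide |> 
  pivot_longer(cols = starts_with("Shelf_"),
               names_to = "shelf",
               names_prefix = "Shelf_",    # strip the "Shelf_" prefix
               values_to = "mean_protein",
               values_drop_na = TRUE)      # drop the NAs pivot_wider() introduced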

Data Joins

Relational Data

Multiple, interconnected tables of data are called relational.

  • It is the relations between datasets, not just the individual datasets themselves, that are important.
A diagram depicting the relationships between various tables in a movie database. The tables and their columns are as follows. directors_genres: Contains 'director_id' (int), 'genre' (varchar), and 'prob' (float). Linked to the 'directors' table by 'director_id.' movies_directors: Contains 'director_id' (int) and 'movie_id' (int). Linked to both the 'directors' and 'movies' tables by 'director_id' and 'movie_id.' movies_genres: Contains 'movie_id' (int) and 'genre' (varchar). Linked to the 'movies' table by 'movie_id.' roles: Contains 'actor_id' (int), 'movie_id' (int), and 'role' (varchar). Linked to both the 'actors' and 'movies' tables by 'actor_id' and 'movie_id.' The following entity tables are represented at the bottom: directors: Contains 'id' (int), 'first_name' (varchar), and 'last_name' (varchar). movies: Contains 'id' (int), 'name' (varchar), 'year' (int), and 'rank' (float). actors: Contains 'id' (int), 'first_name' (varchar), 'last_name' (varchar), and 'gender' (char). Arrows represent relationships between the various tables, with foreign keys connecting them.

IMDb movie relational data

Data Joins

We can combine (join) data tables based on their relations.

Mutating Joins

Add variables from a new dataframe to observations in an existing dataframe.

full_join(), left_join(), right_join(), inner_join()

Filtering Joins

Filter observations based on values in a new dataframe.

semi_join(), anti_join()

Keys

A key uniquely identifies an observation in a data set.

  • To combine (join) two datasets, the key needs to be present in both.

A similar entity-relationship diagram (ERD) to the previous one, with the same tables and relationships between movie-related data. However, in this version, key columns in the tables are color-coded to highlight relationships. Orange highlights the 'director_id' in the 'directors_genres,' 'movies_directors,' and 'directors' tables, showing the connection between directors and their genres and movies. Blue highlights the 'movie_id' in the 'movies_directors,' 'movies_genres,' 'roles,' and 'movies' tables, showing the relationship between movies, genres, and roles. Green highlights the 'actor_id' in the 'roles' and 'actors' tables, showing the connection between actors and their roles in movies. Arrows continue to represent relationships between tables, following foreign key connections, now emphasized with the color coding.
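Before joining, it is worth verifying that a candidate key really is unique. A quick sketch using the directors table from the diagram:

# If id uniquely identifies directors, no value appears more than
# once, and this returns zero rows.
directors |> 
  count(id) |> 
  filter(n > 1)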

inner_join()

Keeps observations when their keys are present in both datasets.

This image shows two tables on the left, labeled 'x' and 'y.' The 'x' table contains two columns: a key column with values 1, 2, and 3, and a value column with 'x1,' 'x2,' and 'x3.' The 'y' table also has two columns: a key column with values 1, 2, and 4, and a value column with 'y1,' 'y2,' and 'y3.'

This table combines data from both 'x' and 'y.' based on the 'key' column. It contains three columns: 'key,' 'val_x,' and 'val_y.' For key values 1 and 2, the corresponding values from both 'x' and 'y' are shown ('x1' with 'y1' and 'x2' with 'y2'), while the third rows from both original tables are excluded due to the mismatch in key values.
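The pictured join translates directly to code. A minimal sketch, rebuilding the x and y tables by hand:

library(dplyr)

x <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y <- tibble(key = c(1, 2, 4), val_y = c("y1", "y2", "y3"))

# Only keys 1 and 2 appear in both tables, so only those rows survive.
inner_join(x, y, by = join_by(key))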

inner_join(): IMDb Example

directors_genres
director_id genre prob
429 Adventure 0.750000
429 Fantasy 0.750000
2931 Drama 0.714286
2931 Action 0.428571
11652 Sci-Fi 0.500000
11652 Action 0.500000
14927 Animation 1.000000
14927 Family 1.000000
15092 Comedy 0.545455
15092 Crime 0.545455
movies_directors
director_id movie_id
429 300229
9247 124110
11652 10920
11652 333856
14927 192017
15092 109093
15092 237431

directors_genres IDs: 429, 2931, 11652, 14927, 15092
movies_directors IDs: 429, 9247, 11652, 14927, 15092

inner_join(directors_genres, movies_directors)
director_id genre prob movie_id
429 Adventure 0.750000 300229
429 Fantasy 0.750000 300229
11652 Sci-Fi 0.500000 10920
11652 Sci-Fi 0.500000 333856
11652 Action 0.500000 10920
11652 Action 0.500000 333856
14927 Animation 1.000000 192017
14927 Family 1.000000 192017
15092 Comedy 0.545455 109093
15092 Comedy 0.545455 237431
15092 Crime 0.545455 109093
15092 Crime 0.545455 237431

IDs kept by the join: 429, 11652, 14927, 15092 (2931 and 9247 appear in only one table, so they are dropped)

inner_join(): IMDb Example

What if our key does not have the same name?

directors_genres
director_id genre prob
429 Adventure 0.750000
429 Fantasy 0.750000
2931 Drama 0.714286
2931 Action 0.428571
11652 Sci-Fi 0.500000
11652 Action 0.500000
14927 Animation 1.000000
14927 Family 1.000000
15092 Comedy 0.545455
15092 Crime 0.545455
directors
id first_name last_name
429 Andrew Adamson
9247 Zach Braff
11652 James (I) Cameron
14927 Ron Clements
15092 Ethan Coen
inner_join(directors_genres, 
           directors, 
           by = join_by(director_id == id))
director_id genre prob first_name last_name
429 Adventure 0.750000 Andrew Adamson
429 Fantasy 0.750000 Andrew Adamson
11652 Sci-Fi 0.500000 James (I) Cameron
11652 Action 0.500000 James (I) Cameron
14927 Animation 1.000000 Ron Clements
14927 Family 1.000000 Ron Clements
15092 Comedy 0.545455 Ethan Coen
15092 Crime 0.545455 Ethan Coen

Piping Joins

Remember: the dataset you pipe in becomes the first argument of the function you are piping into!

  • So if you are using a pipe, you will only be specifying the right dataset inside the join function.
inner_join(directors_genres, movies_directors)

…is equivalent to…

directors_genres |> 
  inner_join(movies_directors)

More Mutating Joins

  • left_join() – keep only (and all) observations present in the left dataset

  • right_join() – keep only (and all) observations present in the right dataset

  • full_join() – keep all observations present in either dataset

Four Venn diagrams illustrating different types of joins between two datasets, labeled 'x' and 'y.' inner_join(x, y): Shows two overlapping circles with only the intersection shaded, representing records that are common to both 'x' and 'y.' left_join(x, y): Shows two overlapping circles with the left circle ('x') fully shaded and the intersection shaded, representing all records from 'x' and the matching records from 'y.' right_join(x, y): Shows two overlapping circles with the right circle ('y') fully shaded and the intersection shaded, representing all records from 'y' and the matching records from 'x.' full_join(x, y): Shows two overlapping circles with both circles fully shaded, representing all records from both 'x' and 'y,' including those without matches.

More Mutating Joins

Which directors would remain for each of the following?

  • left_join(directors_genres, movies_directors)
  • right_join(directors_genres, movies_directors)
  • full_join(directors_genres, movies_directors)
directors_genres |> 
  distinct(director_id)
director_id
429
2931
11652
14927
15092
movies_directors |> 
  distinct(director_id)
director_id
429
9247
11652
14927
15092
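One way to check your answers (a sketch): run each join, then inspect which director_ids survive.

left_join(directors_genres, movies_directors) |> 
  distinct(director_id)   # the directors_genres IDs

right_join(directors_genres, movies_directors) |> 
  distinct(director_id)   # the movies_directors IDs

full_join(directors_genres, movies_directors) |> 
  distinct(director_id)   # the union of both ID sets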

Filtering Joins: semi_join()

Keeps observations when their keys are present in both datasets, but only keeps variables from the first dataset.

An illustration of a semi join between two tables, each with two columns. The table labeled 'x,' has a key column with values 1, 2, and 3 and a value column with 'x1,' 'x2,' and 'x3.' The table labeled 'y,' has a key column with values 1, 2, and 4 and a value column with 'y1,' 'y2,' and 'y4.'


An arrow points from the first two tables (x and y) toward the right where a single table is displayed. This table is the result of a semi join, where only the values of x that had a match in y are kept. The resulting table includes rows with matching key values from both 'x' and 'y' (keys 1 and 2), displaying only the 'x' values ('x1' and 'x2'). Rows with non-matching keys (3 and 4) are excluded.

Filtering Joins: semi_join()

directors_genres |> 
  semi_join(movies_directors)
director_id genre prob
429 Adventure 0.750000
429 Fantasy 0.750000
11652 Sci-Fi 0.500000
11652 Action 0.500000
14927 Animation 1.000000
14927 Family 1.000000
15092 Comedy 0.545455
15092 Crime 0.545455

movies_directors IDs: 429, 9247, 11652, 14927, 15092

directors_genres |>
  filter(director_id %in% movies_directors$director_id)
director_id genre prob
429 Adventure 0.750000
429 Fantasy 0.750000
11652 Sci-Fi 0.500000
11652 Action 0.500000
14927 Animation 1.000000
14927 Family 1.000000
15092 Comedy 0.545455
15092 Crime 0.545455

Filtering Joins: anti_join()

Removes observations whose keys are present in the second dataset, and only keeps variables from the first dataset.

Two tables on the left labeled 'x' and 'y,' each with two columns, show key-value pairs. The 'x' table contains a key column with values 1, 2, and 3 and a value column with 'x1,' 'x2,' and 'x3.' The 'y' table contains a key column with values 1, 2, and 4 and a value column with 'y1,' 'y2,' and 'y4.'


Filtering Joins: anti_join()

directors_genres |> 
  anti_join(movies_directors)
director_id genre prob
2931 Drama 0.714286
2931 Action 0.428571

movies_directors IDs: 429, 9247, 11652, 14927, 15092

directors_genres |>
  filter(!director_id %in% movies_directors$director_id)
director_id genre prob
2931 Drama 0.714286
2931 Action 0.428571

PA 4: Military Spending

Today you will be tidying messy data to explore the relationship between countries of the world and military spending.

This activity will require knowledge of:

  • function documentation
  • function arguments
  • locating missing values
  • character vectors
  • keys for joining two datasets
  • searching / iterating over multiple columns
  • pivoting data from wide to long
  • creating side-by-side boxplots
  • locating the median on a boxplot
  • estimating the variability from a boxplot

None of us have all these abilities. Each of us has some of these abilities.

dplyr Resources

Every group should have a dplyr cheatsheet!

On the Back: The Combine Tables section gives advice on joining two datasets

  • The “Filtering Join” section will be helpful when performing an anti_join()!

A picture of the dplyr cheatsheet, which contains helpful information on working with data in a variety of ways.

tidyr Resources

Every group should have a tidyr cheatsheet!

On the Front: The Reshape Data section gives advice on pivoting a dataset from wide to long

A picture of the tidyr cheatsheet, which contains helpful information on transforming data in a variety of ways.

Task Card

Every group should have a task card!

On the Front

  • the expectations of each role
  • the norms of collaborating

On the Back

  • code and pictures for performing an anti_join()
  • code and pictures for pivoting a dataset from wide to long
  • guidelines for formatting dplyr and tidyr code

Pair Programming Expectations

Developer

  • Reads prompt and ensures Coder understands what is being asked.
  • Types the code specified by the Coder into the Quarto document.
  • Runs the code provided by the Coder.
  • Works with Coder to debug the code.
  • Evaluates the output.
  • Works with Coder to write code comments.

Coder

  • Reads out instructions or prompts.
  • Directs the Developer what to type.
  • Talks with Developer about their ideas.
  • Manages resources (e.g., cheatsheets, textbook, slides).
  • Works with Developer to debug the code.
  • Works with Developer to write code comments.

Getting Started

First, both of you will do the following:

  • Join your Practice Activity workspace in Posit Cloud
    • You are in a new group, so you should have a new workspace!
  • Log in to Posit Cloud
  • Open the PA 4: Military Spending project
  • Open the PA4-dplyr.qmd file

Then, the partner who woke up the earliest today starts as the Developer (typing and listening to instructions from the Coder)!

  • The Coder does not type.
    • The collaborative editing feature should allow you to track what is being typed.
  • The Developer only types what they are told to type.

External Resources

During the Practice Activity, you are not permitted to use Google or ChatGPT for help.


You are permitted to use:

  • the dplyr cheatsheet,
  • the tidyr cheatsheet,
  • the course textbook, and
  • the course slides.

Submission

  • Each person will input their responses to Canvas Questions 1, 2, and 3 into the PA4 Canvas quiz.
  • The person who last occupied the role of Developer will download and submit the PA-4.html file for the group.
    • Only one submission per group!

Exit Ticket