Complex Data Pipelines

Asking more complex research questions

Background: Intro level RQs

The Big Five tidyverse verbs:
- arrange
- filter
- mutate
- group_by
- summarize
(Doesn’t have to be tidyverse syntax!)

Dataset 1: Fast food nutrition

Code

dat_ff

# A tibble: 515 × 17
   restaurant item      calories cal_fat total_fat sat_fat trans_fat cholesterol
   <chr>      <chr>        <dbl>   <dbl>     <dbl>   <dbl>     <dbl>       <dbl>
 1 Mcdonalds  Artisan …      380      60         7       2       0            95
 2 Mcdonalds  Single B…      840     410        45      17       1.5         130
 3 Mcdonalds  Double B…     1130     600        67      27       3           220
 4 Mcdonalds  Grilled …      750     280        31      10       0.5         155
 5 Mcdonalds  Crispy B…      920     410        45      12       0.5         120
 6 Mcdonalds  Big Mac        540     250        28      10       1            80
 7 Mcdonalds  Cheesebu…      300     100        12       5       0.5          40
 8 Mcdonalds  Classic …      510     210        24       4       0            65
 9 Mcdonalds  Double C…      430     190        21      11       1            85
10 Mcdonalds  Double Q…      770     400        45      21       2.5         175
# ℹ 505 more rows
# ℹ 9 more variables: sodium <dbl>, total_carb <dbl>, fiber <dbl>, sugar <dbl>,
#   protein <dbl>, vit_a <dbl>, vit_c <dbl>, calcium <dbl>, salad <chr>

Single-function RQs

Type of basic questions:

arrange What item has the most calories?
mutate Calculate the saturated fat to total fat ratio of each item.
summarize What is the median calorie item?
filter How many items have more than 1000 calories?
group_by Same questions, but within each restaurant.

Intro pipeline RQs

mutate -> arrange “What item has the highest saturated fat to total fat ratio?”
mutate -> filter “How many items have more than 50% of total fat in saturated form?”
group_by -> summarize -> filter “How many restaurants have at least one 2000 calorie item?”
any -> plot: “Make a boxplot of average saturated fat percentages for items in each restaurant.”

YOUR TURN

Choose one of your clean tabular datasets.

Jot down some research questions or tasks that are answerable with intro-level pipelines on that dataset.

03:00

Designing problems for complex pipelines

Making it complex

Ways to level up pipeline complexity of the RQs:

Changing the cases/rows (esp for visualization)
Operations inside mutate or with group_by
Many pivots, especially for summary tables
Multiple datasets and joins

Change the cases/rows

Summarizing by group: rows become restaurants not items.

Code

dat_ff |>
  group_by(restaurant) |>
  summarize(mean(calories))

# A tibble: 8 × 2
  restaurant  `mean(calories)`
  <chr>                  <dbl>
1 Arbys                   533.
2 Burger King             609.
3 Chick Fil-A             384.
4 Dairy Queen             520.
5 Mcdonalds               640.
6 Sonic                   632.
7 Subway                  503.
8 Taco Bell               444.

Change the cases/rows

Pivoting: rows become nutrition measurement for an item rather than items.

Code

dat_ff |>
  pivot_longer(calories:calcium,
               values_to = "amount",
               names_to = "nutritional_item")

# A tibble: 7,210 × 5
   restaurant item                             salad nutritional_item amount
   <chr>      <chr>                            <chr> <chr>             <dbl>
 1 Mcdonalds  Artisan Grilled Chicken Sandwich Other calories            380
 2 Mcdonalds  Artisan Grilled Chicken Sandwich Other cal_fat              60
 3 Mcdonalds  Artisan Grilled Chicken Sandwich Other total_fat             7
 4 Mcdonalds  Artisan Grilled Chicken Sandwich Other sat_fat               2
 5 Mcdonalds  Artisan Grilled Chicken Sandwich Other trans_fat             0
 6 Mcdonalds  Artisan Grilled Chicken Sandwich Other cholesterol          95
 7 Mcdonalds  Artisan Grilled Chicken Sandwich Other sodium             1110
 8 Mcdonalds  Artisan Grilled Chicken Sandwich Other total_carb           44
 9 Mcdonalds  Artisan Grilled Chicken Sandwich Other fiber                 3
10 Mcdonalds  Artisan Grilled Chicken Sandwich Other sugar                11
# ℹ 7,200 more rows

Change the cases/rows

Distinct: rows become restaurants rather than items.

Code

dat_ff |>
  distinct(restaurant, .keep_all = TRUE)

# A tibble: 8 × 17
  restaurant  item      calories cal_fat total_fat sat_fat trans_fat cholesterol
  <chr>       <chr>        <dbl>   <dbl>     <dbl>   <dbl>     <dbl>       <dbl>
1 Mcdonalds   "Artisan…      380      60         7       2         0          95
2 Chick Fil-A "Chargri…      430     144        16       8         0          85
3 Sonic       "Hatch G…      710     380        43      17         2         120
4 Arbys       "Arby's …      330     100        11       4         0          30
5 Burger King "America…     1550    1134       126      47         8         805
6 Dairy Queen "1/2 lb.…     1000     660        74      26         2         170
7 Subway      "6\" B.L…      320      80         9       4         0          20
8 Taco Bell   "1/2 lb.…      540     230        26       7         1          45
# ℹ 9 more variables: sodium <dbl>, total_carb <dbl>, fiber <dbl>, sugar <dbl>,
#   protein <dbl>, vit_a <dbl>, vit_c <dbl>, calcium <dbl>, salad <chr>

Do more with mutate

`group_by` |> `mutate`

What is the highest-calorie single item at each restaurant?

Code

dat_ff |>
  group_by(restaurant) |>
  mutate(
    max_calories = max(calories),
    prop_of_max_cal = calories/max_calories
  )

# A tibble: 515 × 19
# Groups:   restaurant [8]
   restaurant item      calories cal_fat total_fat sat_fat trans_fat cholesterol
   <chr>      <chr>        <dbl>   <dbl>     <dbl>   <dbl>     <dbl>       <dbl>
 1 Mcdonalds  Artisan …      380      60         7       2       0            95
 2 Mcdonalds  Single B…      840     410        45      17       1.5         130
 3 Mcdonalds  Double B…     1130     600        67      27       3           220
 4 Mcdonalds  Grilled …      750     280        31      10       0.5         155
 5 Mcdonalds  Crispy B…      920     410        45      12       0.5         120
 6 Mcdonalds  Big Mac        540     250        28      10       1            80
 7 Mcdonalds  Cheesebu…      300     100        12       5       0.5          40
 8 Mcdonalds  Classic …      510     210        24       4       0            65
 9 Mcdonalds  Double C…      430     190        21      11       1            85
10 Mcdonalds  Double Q…      770     400        45      21       2.5         175
# ℹ 505 more rows
# ℹ 11 more variables: sodium <dbl>, total_carb <dbl>, fiber <dbl>,
#   sugar <dbl>, protein <dbl>, vit_a <dbl>, vit_c <dbl>, calcium <dbl>,
#   salad <chr>, max_calories <dbl>, prop_of_max_cal <dbl>

YOUR TURN

Think of a measurement by category that you might want to add to every additional row of your dataset, for purposes of comparing.

Code

#| echo: false
countdown(minutes = 3)

03:00

String parsing

Find the median calories for a burger at each restaurant.

Code

dat_ff |>
  mutate(
    is_burger = str_detect(item, "burger")
  ) |>
  filter(is_burger) |>
  group_by(restaurant) |>
  summarize(median(calories))

# A tibble: 4 × 2
  restaurant  `median(calories)`
  <chr>                    <dbl>
1 Burger King                360
2 Dairy Queen                470
3 Mcdonalds                  300
4 Sonic                      800

Regular expressions

Optional: Level up this task with regular expressions!

Find the median calories for a CHEESE burger at each restaurant.

Code

dat_ff |>
  mutate(
    is_cheese_burger = str_detect(item, "[Ch]ee(z|se) ?[bB]urger")
  ) |>
  filter(is_cheese_burger) |>
  group_by(restaurant) |>
  summarize(median(calories))

# A tibble: 4 × 2
  restaurant  `median(calories)`
  <chr>                    <dbl>
1 Burger King                450
2 Dairy Queen                515
3 Mcdonalds                  365
4 Sonic                      800

YOUR TURN

Find a string column in your dataset, jot down 1-3 new non-string columns you might want to make by parsing that string.

03:00

Complex/unvectorized functions and mapping

For each item, give the amount of saturated fat or of trans fat, whichever is larger

Code

## This won't work!

dat_ff |>
  mutate(
    bad_fat = max(sat_fat, trans_fat)
  )

# A tibble: 515 × 18
   restaurant item      calories cal_fat total_fat sat_fat trans_fat cholesterol
   <chr>      <chr>        <dbl>   <dbl>     <dbl>   <dbl>     <dbl>       <dbl>
 1 Mcdonalds  Artisan …      380      60         7       2       0            95
 2 Mcdonalds  Single B…      840     410        45      17       1.5         130
 3 Mcdonalds  Double B…     1130     600        67      27       3           220
 4 Mcdonalds  Grilled …      750     280        31      10       0.5         155
 5 Mcdonalds  Crispy B…      920     410        45      12       0.5         120
 6 Mcdonalds  Big Mac        540     250        28      10       1            80
 7 Mcdonalds  Cheesebu…      300     100        12       5       0.5          40
 8 Mcdonalds  Classic …      510     210        24       4       0            65
 9 Mcdonalds  Double C…      430     190        21      11       1            85
10 Mcdonalds  Double Q…      770     400        45      21       2.5         175
# ℹ 505 more rows
# ℹ 10 more variables: sodium <dbl>, total_carb <dbl>, fiber <dbl>,
#   sugar <dbl>, protein <dbl>, vit_a <dbl>, vit_c <dbl>, calcium <dbl>,
#   salad <chr>, bad_fat <dbl>

Complex/unvectorized functions and mapping

For each item, give the amount of saturated fat or of trans fat, whichever is larger

Code

## This will!

dat_ff |>
  mutate(
    bad_fat = pmap_dbl(list(sat_fat, trans_fat), max)
  )

# A tibble: 515 × 18
   restaurant item      calories cal_fat total_fat sat_fat trans_fat cholesterol
   <chr>      <chr>        <dbl>   <dbl>     <dbl>   <dbl>     <dbl>       <dbl>
 1 Mcdonalds  Artisan …      380      60         7       2       0            95
 2 Mcdonalds  Single B…      840     410        45      17       1.5         130
 3 Mcdonalds  Double B…     1130     600        67      27       3           220
 4 Mcdonalds  Grilled …      750     280        31      10       0.5         155
 5 Mcdonalds  Crispy B…      920     410        45      12       0.5         120
 6 Mcdonalds  Big Mac        540     250        28      10       1            80
 7 Mcdonalds  Cheesebu…      300     100        12       5       0.5          40
 8 Mcdonalds  Classic …      510     210        24       4       0            65
 9 Mcdonalds  Double C…      430     190        21      11       1            85
10 Mcdonalds  Double Q…      770     400        45      21       2.5         175
# ℹ 505 more rows
# ℹ 10 more variables: sodium <dbl>, total_carb <dbl>, fiber <dbl>,
#   sugar <dbl>, protein <dbl>, vit_a <dbl>, vit_c <dbl>, calcium <dbl>,
#   salad <chr>, bad_fat <dbl>

YOUR TURN - Try an activity

Your city has just passed a law that fast food items need to contain a warning note if they have over 40% calories from fat, over 40% saturated fat, or any trans fats at all.

Write a function that creates a warning note based on an items nutritional information.

Then, use that function inside of mutate to add a warning label for each item.

As you work on this, make note of any pain points, confusions, or complexities that differentiate this task from an “Intro level” one.

10:00

Answer

Code

check_healthy_fat <- function(calories, cal_fat, total_fat, sat_fat, trans_fat) {
  
  cal_fat_pct <- cal_fat/calories
  sat_fat_pct <- sat_fat/total_fat
  
  warning <- ""
  
  if (!is.na(cal_fat_pct) & cal_fat_pct > 0.4) {
    warning <- paste(warning, "Calories from fat is high.")
  }
  
  if (!is.na(sat_fat_pct) & sat_fat_pct > 0.4) {
    warning <- paste(warning, "High saturated fat percent.")
  }
  
  if (!is.na(trans_fat) & trans_fat > 0) {
    warning <- paste(warning, "Contains trans fat.")
  }
  
  return(warning)
  
}

Answer

Code

dat_ff |>
  mutate(
    healthy_fat_warning = pmap_chr(list(calories, cal_fat, total_fat, sat_fat, trans_fat), check_healthy_fat)
  )

# A tibble: 515 × 18
   restaurant item      calories cal_fat total_fat sat_fat trans_fat cholesterol
   <chr>      <chr>        <dbl>   <dbl>     <dbl>   <dbl>     <dbl>       <dbl>
 1 Mcdonalds  Artisan …      380      60         7       2       0            95
 2 Mcdonalds  Single B…      840     410        45      17       1.5         130
 3 Mcdonalds  Double B…     1130     600        67      27       3           220
 4 Mcdonalds  Grilled …      750     280        31      10       0.5         155
 5 Mcdonalds  Crispy B…      920     410        45      12       0.5         120
 6 Mcdonalds  Big Mac        540     250        28      10       1            80
 7 Mcdonalds  Cheesebu…      300     100        12       5       0.5          40
 8 Mcdonalds  Classic …      510     210        24       4       0            65
 9 Mcdonalds  Double C…      430     190        21      11       1            85
10 Mcdonalds  Double Q…      770     400        45      21       2.5         175
# ℹ 505 more rows
# ℹ 10 more variables: sodium <dbl>, total_carb <dbl>, fiber <dbl>,
#   sugar <dbl>, protein <dbl>, vit_a <dbl>, vit_c <dbl>, calcium <dbl>,
#   salad <chr>, healthy_fat_warning <chr>

Non one-to-one Joins

Consider this additional dataset:

Code

dat_loc

# A tibble: 10,000 × 11
    ...1 address    categories city  country latitude longitude name  postalCode
   <dbl> <chr>      <chr>      <chr> <chr>      <dbl>     <dbl> <chr> <chr>     
 1     0 800 N Can… American … Thib… US          29.8     -90.8 SONI… 70301     
 2     1 800 N Can… Fast Food… Thib… US          29.8     -90.8 SONI… 70301     
 3     2 206 Wears… Fast Food… Pige… US          35.8     -83.6 Taco… 37863     
 4     3 3652 Park… Fast Food  Pige… US          35.8     -83.6 Arby… 37863     
 5     4 2118 Mt Z… Fast Food… Morr… US          33.6     -84.3 Stea… 30260     
 6     5 9768 Gran… Fast Food… Detr… US          42.4     -83.1 Wend… 48204     
 7     6 13600 W M… Fast Food… Detr… US          42.4     -83.2 Burg… 48235     
 8     7 4111 Ocea… Fast Food  Ocea… US          33.2    -117.  Chip… 92056     
 9     8 162 Old C… Fast Food… Rive… US          40.9     -72.7 7-El… 11901     
10     9 1407 S St… Fast Food… Mona… US          31.6    -103.  Dair… 79756     
# ℹ 9,990 more rows
# ℹ 2 more variables: province <chr>, websites <chr>

Complex joins

There are some complications in joining this data to our nutritional data…

The restaurants aren’t named consistently, e.g. “Sonic” vs “SONIC Drive-In”.
Sometimes there are multiple matches - “Dairy Queen”, “Wolf’s Dairy Queen”, “Dairy Heaven”, “Dairy Queen (Treat Only)”
Many restaurants appear in dat_loc but not dat_ff.
Each restaurant appears multiple times in both dat_loc and dat_ff.

Filtering Joins

anti_join can be used to check which keys don’t match:

Code

dat_ff |>
  anti_join(dat_loc,
            by = c("restaurant" = "name")) |>
  distinct(restaurant)

# A tibble: 1 × 1
  restaurant 
  <chr>      
1 Chick Fil-A

Mutating Joins

“Where can I get a burger?”

Code

dat_ff |>
  filter(str_detect(item, "burger")) |>
  distinct(restaurant) |>
  left_join(dat_loc,
            by = c("restaurant" = "name"))

# A tibble: 1,075 × 11
   restaurant  ...1 address          categories city  country latitude longitude
   <chr>      <dbl> <chr>            <chr>      <chr> <chr>      <dbl>     <dbl>
 1 Mcdonalds   8688 3926 52nd St     Fast Food  Keno… US          42.6     -87.9
 2 Mcdonalds   8950 116 E Sr 200     Fast Food  Yulee US          30.6     -81.6
 3 Mcdonalds   9034 3009 Hwy 190 W   Fast Food… Hamm… US          30.5     -90.5
 4 Mcdonalds   9080 200 Jonesboro Rd Fast Food… McDo… US          33.5     -84.2
 5 Sonic       1563 1211 S Union St  Fast Food… Opel… US          30.5     -92.1
 6 Sonic       4083 4403 S Tamarac … Fast Food  Denv… US          39.6    -105. 
 7 Sonic       4084 11658 E Colfax … Fast Food  Auro… US          39.7    -105. 
 8 Sonic       4140 1805 E William … Fast Food  Aust… US          30.2     -97.8
 9 Sonic       5186 103 Niagara St   Burger Jo… East… US          38.9     -90.1
10 Sonic       5187 1205 Parkway Dr  Fast Food… Blac… US          43.2    -112. 
# ℹ 1,065 more rows
# ℹ 3 more variables: postalCode <chr>, province <chr>, websites <chr>

Mutating Joins

Notice:

We used distinct on dat_ff first. This makes it a one-to-many join instead of a many-to-many. Why is this good/what if we didn’t?
dat_ff had 515 rows. dat_loc had 10000. The result of our pipeline has 1075. When a join is not one-to-one, you can’t predict the number of rows easily!
What if we had used full_join? right_join?

YOUR TURN

Look at your collection of data - where are there keys to match? Are any of them not one-to-one, or otherwise complex?

03:00

Multiple Pivots

A common pipeline with multiple pivots might be:

Do a calculation
Pivot longer
Join
Calculate some summary by groups
Pivot back wider

Example

What city has the most burgers?

Code

burger_counts <- dat_ff |> 
  mutate(
    is_burger = str_detect(item, "burger")
  ) |>
  group_by(restaurant) |> 
  count(is_burger) |>
  pivot_wider(names_from = is_burger,
              values_from = n,
              names_prefix = "burger_")

burger_counts

# A tibble: 8 × 3
# Groups:   restaurant [8]
  restaurant  burger_FALSE burger_TRUE
  <chr>              <int>       <int>
1 Arbys                 55          NA
2 Burger King           61           9
3 Chick Fil-A           27          NA
4 Dairy Queen           36           6
5 Mcdonalds             54           3
6 Sonic                 40          13
7 Subway                96          NA
8 Taco Bell            115          NA

Example

What city has the most burgers?

Code

burger_counts |>
  left_join(dat_loc,
            by = c("restaurant" = "name")) |>
  group_by(city) |>
  summarize(
    total_burgers = sum(burger_TRUE, na.rm = TRUE)
  ) |>
  arrange(desc(total_burgers))

# A tibble: 1,502 × 2
   city        total_burgers
   <chr>               <int>
 1 Los Angeles           108
 2 Houston                99
 3 Detroit                81
 4 Chicago                78
 5 Miami                  63
 6 Minneapolis            63
 7 Omaha                  63
 8 Dallas                 53
 9 Phoenix                48
10 Saint Paul             48
# ℹ 1,492 more rows

YOUR TURN

Sketch a summary table that might be interesting for your data, then reverse engineer the pivots and joins needed to achieve it.

05:00

Designing pipeline activities

“Transition” or “Scaffold” activities

Some suggested “scaffold” activities:

Sketch a plan
Reorder a pipeline
Fill in a blank

Sketch a plan

Begin with pen-and-paper only - no computer, no code
Plan out the steps needed to address the RQ
Include “sketches” of the how the dataset should look after each step

Reorder a pipeline

Provide all the pipeline elements, ask students to put them in proper order.

burger_counts |>

a) summarize(
    total_burgers = sum(burger_TRUE, na.rm = TRUE)
  )
b) group_by(city)
c) left_join(dat_loc,
            by = c("restaurant" = "name"))
d) arrange(desc(total_burgers))

Answer: c-b-a-d

Fill in a blank

This could mean fill in a blank in code to achieve a goal …

burger_counts |>
  left_join(dat_loc,
            by = c("restaurant" = "name")) |>
  ___________ |>
  summarize(
    total_burgers = sum(burger_TRUE, na.rm = TRUE)
  ) |>
  arrange(desc(total_burgers))

Fill in a blank

This could also mean fill in an intermediate dataset step:

[Original Data Sketch]
->
[Joined Data Sketch]
->
STUDENTS FILL IN THIS SKETCH
->
[Final output]

YOUR TURN

For your dataset, come up with some research questions that require a complicated pipeline to address.

Sketch out the steps needed to answer a research question.

Jot down some ideas about how you would make transitional student questions out of these - e.g., which steps would you show and which would you leave blank?

10:00

Complex Data Pipelines

Asking more complex research questions

Background: Intro level RQs

Dataset 1: Fast food nutrition

Single-function RQs

Intro pipeline RQs

YOUR TURN

Designing problems for complex pipelines

Making it complex

Change the cases/rows

Change the cases/rows

Change the cases/rows

Do more with mutate

group_by |> mutate

YOUR TURN

String parsing

Regular expressions

YOUR TURN

Complex/unvectorized functions and mapping

Complex/unvectorized functions and mapping

YOUR TURN - Try an activity

Answer

Answer

Non one-to-one Joins

Complex joins

Filtering Joins

Mutating Joins

Mutating Joins

YOUR TURN

Multiple Pivots

Example

Example

YOUR TURN

Designing pipeline activities

“Transition” or “Scaffold” activities

Sketch a plan

Reorder a pipeline

Fill in a blank

Fill in a blank

YOUR TURN

`group_by` |> `mutate`