Complex Data Pipelines

Asking more complex research questions

Background: Intro level RQs

  • The Big Five tidyverse verbs:

    • arrange
    • filter
    • mutate
    • group_by
    • summarize
  • (Doesn’t have to be tidyverse syntax!)

Dataset 1: Fast food nutrition

Code
dat_ff
# A tibble: 515 × 17
   restaurant item      calories cal_fat total_fat sat_fat trans_fat cholesterol
   <chr>      <chr>        <dbl>   <dbl>     <dbl>   <dbl>     <dbl>       <dbl>
 1 Mcdonalds  Artisan …      380      60         7       2       0            95
 2 Mcdonalds  Single B…      840     410        45      17       1.5         130
 3 Mcdonalds  Double B…     1130     600        67      27       3           220
 4 Mcdonalds  Grilled …      750     280        31      10       0.5         155
 5 Mcdonalds  Crispy B…      920     410        45      12       0.5         120
 6 Mcdonalds  Big Mac        540     250        28      10       1            80
 7 Mcdonalds  Cheesebu…      300     100        12       5       0.5          40
 8 Mcdonalds  Classic …      510     210        24       4       0            65
 9 Mcdonalds  Double C…      430     190        21      11       1            85
10 Mcdonalds  Double Q…      770     400        45      21       2.5         175
# ℹ 505 more rows
# ℹ 9 more variables: sodium <dbl>, total_carb <dbl>, fiber <dbl>, sugar <dbl>,
#   protein <dbl>, vit_a <dbl>, vit_c <dbl>, calcium <dbl>, salad <chr>

Single-function RQs

Type of basic questions:

  • arrange What item has the most calories?

  • mutate Calculate the saturated fat to total fat ratio of each item.

  • summarize What is the median calorie item?

  • filter How many items have more than 1000 calories?

  • group_by Same questions, but within each restaurant.

Intro pipeline RQs

  • mutate -> arrange “What item has the highest saturated fat to total fat ratio?”

  • mutate -> filter “How many items have more than 50% of total fat in saturated form?”

  • group_by -> summarize -> filter “How many restaurants have at least one 2000 calorie item?”

  • any -> plot: “Make a boxplot of average saturated fat percentages for items in each restaurant.”

YOUR TURN

Choose one of your clean tabular datasets.

Jot down some research questions or tasks that are answerable with intro-level pipelines on that dataset.

03:00

Designing problems for complex pipelines

Making it complex

Ways to level up pipeline complexity of the RQs:

  1. Changing the cases/rows (esp for visualization)

  2. Operations inside mutate or with group_by

  3. Many pivots, especially for summary tables

  4. Multiple datasets and joins

Change the cases/rows

  • Summarizing by group: rows become restaurants not items.
Code
dat_ff |>
  group_by(restaurant) |>
  summarize(mean(calories))
# A tibble: 8 × 2
  restaurant  `mean(calories)`
  <chr>                  <dbl>
1 Arbys                   533.
2 Burger King             609.
3 Chick Fil-A             384.
4 Dairy Queen             520.
5 Mcdonalds               640.
6 Sonic                   632.
7 Subway                  503.
8 Taco Bell               444.

Change the cases/rows

  • Pivoting: rows become nutrition measurement for an item rather than items.
Code
dat_ff |>
  pivot_longer(calories:calcium,
               values_to = "amount",
               names_to = "nutritional_item")
# A tibble: 7,210 × 5
   restaurant item                             salad nutritional_item amount
   <chr>      <chr>                            <chr> <chr>             <dbl>
 1 Mcdonalds  Artisan Grilled Chicken Sandwich Other calories            380
 2 Mcdonalds  Artisan Grilled Chicken Sandwich Other cal_fat              60
 3 Mcdonalds  Artisan Grilled Chicken Sandwich Other total_fat             7
 4 Mcdonalds  Artisan Grilled Chicken Sandwich Other sat_fat               2
 5 Mcdonalds  Artisan Grilled Chicken Sandwich Other trans_fat             0
 6 Mcdonalds  Artisan Grilled Chicken Sandwich Other cholesterol          95
 7 Mcdonalds  Artisan Grilled Chicken Sandwich Other sodium             1110
 8 Mcdonalds  Artisan Grilled Chicken Sandwich Other total_carb           44
 9 Mcdonalds  Artisan Grilled Chicken Sandwich Other fiber                 3
10 Mcdonalds  Artisan Grilled Chicken Sandwich Other sugar                11
# ℹ 7,200 more rows

Change the cases/rows

  • Distinct: rows become restaurants rather than items.
Code
dat_ff |>
  distinct(restaurant, .keep_all = TRUE)
# A tibble: 8 × 17
  restaurant  item      calories cal_fat total_fat sat_fat trans_fat cholesterol
  <chr>       <chr>        <dbl>   <dbl>     <dbl>   <dbl>     <dbl>       <dbl>
1 Mcdonalds   "Artisan…      380      60         7       2         0          95
2 Chick Fil-A "Chargri…      430     144        16       8         0          85
3 Sonic       "Hatch G…      710     380        43      17         2         120
4 Arbys       "Arby's …      330     100        11       4         0          30
5 Burger King "America…     1550    1134       126      47         8         805
6 Dairy Queen "1/2 lb.…     1000     660        74      26         2         170
7 Subway      "6\" B.L…      320      80         9       4         0          20
8 Taco Bell   "1/2 lb.…      540     230        26       7         1          45
# ℹ 9 more variables: sodium <dbl>, total_carb <dbl>, fiber <dbl>, sugar <dbl>,
#   protein <dbl>, vit_a <dbl>, vit_c <dbl>, calcium <dbl>, salad <chr>

Do more with mutate

group_by |> mutate

What is the highest-calorie single item at each restaurant?

Code
dat_ff |>
  group_by(restaurant) |>
  mutate(
    max_calories = max(calories),
    prop_of_max_cal = calories/max_calories
  ) 
# A tibble: 515 × 19
# Groups:   restaurant [8]
   restaurant item      calories cal_fat total_fat sat_fat trans_fat cholesterol
   <chr>      <chr>        <dbl>   <dbl>     <dbl>   <dbl>     <dbl>       <dbl>
 1 Mcdonalds  Artisan …      380      60         7       2       0            95
 2 Mcdonalds  Single B…      840     410        45      17       1.5         130
 3 Mcdonalds  Double B…     1130     600        67      27       3           220
 4 Mcdonalds  Grilled …      750     280        31      10       0.5         155
 5 Mcdonalds  Crispy B…      920     410        45      12       0.5         120
 6 Mcdonalds  Big Mac        540     250        28      10       1            80
 7 Mcdonalds  Cheesebu…      300     100        12       5       0.5          40
 8 Mcdonalds  Classic …      510     210        24       4       0            65
 9 Mcdonalds  Double C…      430     190        21      11       1            85
10 Mcdonalds  Double Q…      770     400        45      21       2.5         175
# ℹ 505 more rows
# ℹ 11 more variables: sodium <dbl>, total_carb <dbl>, fiber <dbl>,
#   sugar <dbl>, protein <dbl>, vit_a <dbl>, vit_c <dbl>, calcium <dbl>,
#   salad <chr>, max_calories <dbl>, prop_of_max_cal <dbl>

YOUR TURN

Think of a measurement by category that you might want to add to every additional row of your dataset, for purposes of comparing.

Code
#| echo: false
countdown(minutes = 3)
03:00

String parsing

Find the median calories for a burger at each restaurant.

Code
dat_ff |>
  mutate(
    is_burger = str_detect(item, "burger")
  ) |>
  filter(is_burger) |>
  group_by(restaurant) |>
  summarize(median(calories))
# A tibble: 4 × 2
  restaurant  `median(calories)`
  <chr>                    <dbl>
1 Burger King                360
2 Dairy Queen                470
3 Mcdonalds                  300
4 Sonic                      800

Regular expressions

Optional: Level up this task with regular expressions!

Find the median calories for a CHEESE burger at each restaurant.

Code
dat_ff |>
  mutate(
    is_cheese_burger = str_detect(item, "[Ch]ee(z|se) ?[bB]urger")
  ) |>
  filter(is_cheese_burger) |>
  group_by(restaurant) |>
  summarize(median(calories))
# A tibble: 4 × 2
  restaurant  `median(calories)`
  <chr>                    <dbl>
1 Burger King                450
2 Dairy Queen                515
3 Mcdonalds                  365
4 Sonic                      800

YOUR TURN

Find a string column in your dataset, jot down 1-3 new non-string columns you might want to make by parsing that string.

03:00

Complex/unvectorized functions and mapping

For each item, give the amount of saturated fat or of trans fat, whichever is larger

Code
## This won't work!

dat_ff |>
  mutate(
    bad_fat = max(sat_fat, trans_fat)
  )
# A tibble: 515 × 18
   restaurant item      calories cal_fat total_fat sat_fat trans_fat cholesterol
   <chr>      <chr>        <dbl>   <dbl>     <dbl>   <dbl>     <dbl>       <dbl>
 1 Mcdonalds  Artisan …      380      60         7       2       0            95
 2 Mcdonalds  Single B…      840     410        45      17       1.5         130
 3 Mcdonalds  Double B…     1130     600        67      27       3           220
 4 Mcdonalds  Grilled …      750     280        31      10       0.5         155
 5 Mcdonalds  Crispy B…      920     410        45      12       0.5         120
 6 Mcdonalds  Big Mac        540     250        28      10       1            80
 7 Mcdonalds  Cheesebu…      300     100        12       5       0.5          40
 8 Mcdonalds  Classic …      510     210        24       4       0            65
 9 Mcdonalds  Double C…      430     190        21      11       1            85
10 Mcdonalds  Double Q…      770     400        45      21       2.5         175
# ℹ 505 more rows
# ℹ 10 more variables: sodium <dbl>, total_carb <dbl>, fiber <dbl>,
#   sugar <dbl>, protein <dbl>, vit_a <dbl>, vit_c <dbl>, calcium <dbl>,
#   salad <chr>, bad_fat <dbl>

Complex/unvectorized functions and mapping

For each item, give the amount of saturated fat or of trans fat, whichever is larger

Code
## This will!

dat_ff |>
  mutate(
    bad_fat = pmap_dbl(list(sat_fat, trans_fat), max)
  )
# A tibble: 515 × 18
   restaurant item      calories cal_fat total_fat sat_fat trans_fat cholesterol
   <chr>      <chr>        <dbl>   <dbl>     <dbl>   <dbl>     <dbl>       <dbl>
 1 Mcdonalds  Artisan …      380      60         7       2       0            95
 2 Mcdonalds  Single B…      840     410        45      17       1.5         130
 3 Mcdonalds  Double B…     1130     600        67      27       3           220
 4 Mcdonalds  Grilled …      750     280        31      10       0.5         155
 5 Mcdonalds  Crispy B…      920     410        45      12       0.5         120
 6 Mcdonalds  Big Mac        540     250        28      10       1            80
 7 Mcdonalds  Cheesebu…      300     100        12       5       0.5          40
 8 Mcdonalds  Classic …      510     210        24       4       0            65
 9 Mcdonalds  Double C…      430     190        21      11       1            85
10 Mcdonalds  Double Q…      770     400        45      21       2.5         175
# ℹ 505 more rows
# ℹ 10 more variables: sodium <dbl>, total_carb <dbl>, fiber <dbl>,
#   sugar <dbl>, protein <dbl>, vit_a <dbl>, vit_c <dbl>, calcium <dbl>,
#   salad <chr>, bad_fat <dbl>

YOUR TURN - Try an activity

Your city has just passed a law that fast food items need to contain a warning note if they have over 40% calories from fat, over 40% saturated fat, or any trans fats at all.

Write a function that creates a warning note based on an items nutritional information.

Then, use that function inside of mutate to add a warning label for each item.

As you work on this, make note of any pain points, confusions, or complexities that differentiate this task from an “Intro level” one.

10:00

Answer

Code
check_healthy_fat <- function(calories, cal_fat, total_fat, sat_fat, trans_fat) {
  
  cal_fat_pct <- cal_fat/calories
  sat_fat_pct <- sat_fat/total_fat
  
  warning <- ""
  
  if (!is.na(cal_fat_pct) & cal_fat_pct > 0.4) {
    warning <- paste(warning, "Calories from fat is high.")
  }
  
  if (!is.na(sat_fat_pct) & sat_fat_pct > 0.4) {
    warning <- paste(warning, "High saturated fat percent.")
  }
  
  if (!is.na(trans_fat) & trans_fat > 0) {
    warning <- paste(warning, "Contains trans fat.")
  }
  
  return(warning)
  
}

Answer

Code
dat_ff |>
  mutate(
    healthy_fat_warning = pmap_chr(list(calories, cal_fat, total_fat, sat_fat, trans_fat), check_healthy_fat)
  )
# A tibble: 515 × 18
   restaurant item      calories cal_fat total_fat sat_fat trans_fat cholesterol
   <chr>      <chr>        <dbl>   <dbl>     <dbl>   <dbl>     <dbl>       <dbl>
 1 Mcdonalds  Artisan …      380      60         7       2       0            95
 2 Mcdonalds  Single B…      840     410        45      17       1.5         130
 3 Mcdonalds  Double B…     1130     600        67      27       3           220
 4 Mcdonalds  Grilled …      750     280        31      10       0.5         155
 5 Mcdonalds  Crispy B…      920     410        45      12       0.5         120
 6 Mcdonalds  Big Mac        540     250        28      10       1            80
 7 Mcdonalds  Cheesebu…      300     100        12       5       0.5          40
 8 Mcdonalds  Classic …      510     210        24       4       0            65
 9 Mcdonalds  Double C…      430     190        21      11       1            85
10 Mcdonalds  Double Q…      770     400        45      21       2.5         175
# ℹ 505 more rows
# ℹ 10 more variables: sodium <dbl>, total_carb <dbl>, fiber <dbl>,
#   sugar <dbl>, protein <dbl>, vit_a <dbl>, vit_c <dbl>, calcium <dbl>,
#   salad <chr>, healthy_fat_warning <chr>

Non one-to-one Joins

Consider this additional dataset:

Code
dat_loc
# A tibble: 10,000 × 11
    ...1 address    categories city  country latitude longitude name  postalCode
   <dbl> <chr>      <chr>      <chr> <chr>      <dbl>     <dbl> <chr> <chr>     
 1     0 800 N Can… American … Thib… US          29.8     -90.8 SONI… 70301     
 2     1 800 N Can… Fast Food… Thib… US          29.8     -90.8 SONI… 70301     
 3     2 206 Wears… Fast Food… Pige… US          35.8     -83.6 Taco… 37863     
 4     3 3652 Park… Fast Food  Pige… US          35.8     -83.6 Arby… 37863     
 5     4 2118 Mt Z… Fast Food… Morr… US          33.6     -84.3 Stea… 30260     
 6     5 9768 Gran… Fast Food… Detr… US          42.4     -83.1 Wend… 48204     
 7     6 13600 W M… Fast Food… Detr… US          42.4     -83.2 Burg… 48235     
 8     7 4111 Ocea… Fast Food  Ocea… US          33.2    -117.  Chip… 92056     
 9     8 162 Old C… Fast Food… Rive… US          40.9     -72.7 7-El… 11901     
10     9 1407 S St… Fast Food… Mona… US          31.6    -103.  Dair… 79756     
# ℹ 9,990 more rows
# ℹ 2 more variables: province <chr>, websites <chr>

Complex joins

There are some complications in joining this data to our nutritional data…

  • The restaurants aren’t named consistently, e.g. “Sonic” vs “SONIC Drive-In”.

  • Sometimes there are multiple matches - “Dairy Queen”, “Wolf’s Dairy Queen”, “Dairy Heaven”, “Dairy Queen (Treat Only)”

  • Many restaurants appear in dat_loc but not dat_ff.

  • Each restaurant appears multiple times in both dat_loc and dat_ff.

Filtering Joins

anti_join can be used to check which keys don’t match:

Code
dat_ff |>
  anti_join(dat_loc,
            by = c("restaurant" = "name")) |>
  distinct(restaurant)
# A tibble: 1 × 1
  restaurant 
  <chr>      
1 Chick Fil-A

Mutating Joins

“Where can I get a burger?”

Code
dat_ff |>
  filter(str_detect(item, "burger")) |>
  distinct(restaurant) |>
  left_join(dat_loc,
            by = c("restaurant" = "name"))
# A tibble: 1,075 × 11
   restaurant  ...1 address          categories city  country latitude longitude
   <chr>      <dbl> <chr>            <chr>      <chr> <chr>      <dbl>     <dbl>
 1 Mcdonalds   8688 3926 52nd St     Fast Food  Keno… US          42.6     -87.9
 2 Mcdonalds   8950 116 E Sr 200     Fast Food  Yulee US          30.6     -81.6
 3 Mcdonalds   9034 3009 Hwy 190 W   Fast Food… Hamm… US          30.5     -90.5
 4 Mcdonalds   9080 200 Jonesboro Rd Fast Food… McDo… US          33.5     -84.2
 5 Sonic       1563 1211 S Union St  Fast Food… Opel… US          30.5     -92.1
 6 Sonic       4083 4403 S Tamarac … Fast Food  Denv… US          39.6    -105. 
 7 Sonic       4084 11658 E Colfax … Fast Food  Auro… US          39.7    -105. 
 8 Sonic       4140 1805 E William … Fast Food  Aust… US          30.2     -97.8
 9 Sonic       5186 103 Niagara St   Burger Jo… East… US          38.9     -90.1
10 Sonic       5187 1205 Parkway Dr  Fast Food… Blac… US          43.2    -112. 
# ℹ 1,065 more rows
# ℹ 3 more variables: postalCode <chr>, province <chr>, websites <chr>

Mutating Joins

Notice:

  • We used distinct on dat_ff first. This makes it a one-to-many join instead of a many-to-many. Why is this good/what if we didn’t?

  • dat_ff had 515 rows. dat_loc had 10000. The result of our pipeline has 1075. When a join is not one-to-one, you can’t predict the number of rows easily!

  • What if we had used full_join? right_join?

YOUR TURN

Look at your collection of data - where are there keys to match? Are any of them not one-to-one, or otherwise complex?

03:00

Multiple Pivots

A common pipeline with multiple pivots might be:

  1. Do a calculation
  2. Pivot longer
  3. Join
  4. Calculate some summary by groups
  5. Pivot back wider

Example

What city has the most burgers?

Code
burger_counts <- dat_ff |> 
  mutate(
    is_burger = str_detect(item, "burger")
  ) |>
  group_by(restaurant) |> 
  count(is_burger) |>
  pivot_wider(names_from = is_burger,
              values_from = n,
              names_prefix = "burger_")

burger_counts
# A tibble: 8 × 3
# Groups:   restaurant [8]
  restaurant  burger_FALSE burger_TRUE
  <chr>              <int>       <int>
1 Arbys                 55          NA
2 Burger King           61           9
3 Chick Fil-A           27          NA
4 Dairy Queen           36           6
5 Mcdonalds             54           3
6 Sonic                 40          13
7 Subway                96          NA
8 Taco Bell            115          NA

Example

What city has the most burgers?

Code
burger_counts |>
  left_join(dat_loc,
            by = c("restaurant" = "name")) |>
  group_by(city) |>
  summarize(
    total_burgers = sum(burger_TRUE, na.rm = TRUE)
  ) |>
  arrange(desc(total_burgers))
# A tibble: 1,502 × 2
   city        total_burgers
   <chr>               <int>
 1 Los Angeles           108
 2 Houston                99
 3 Detroit                81
 4 Chicago                78
 5 Miami                  63
 6 Minneapolis            63
 7 Omaha                  63
 8 Dallas                 53
 9 Phoenix                48
10 Saint Paul             48
# ℹ 1,492 more rows

YOUR TURN

Sketch a summary table that might be interesting for your data, then reverse engineer the pivots and joins needed to achieve it.

05:00

Designing pipeline activities

“Transition” or “Scaffold” activities

Some suggested “scaffold” activities:

  1. Sketch a plan

  2. Reorder a pipeline

  3. Fill in a blank

Sketch a plan

  • Begin with pen-and-paper only - no computer, no code

  • Plan out the steps needed to address the RQ

  • Include “sketches” of the how the dataset should look after each step

Reorder a pipeline

Provide all the pipeline elements, ask students to put them in proper order.

burger_counts |>

a) summarize(
    total_burgers = sum(burger_TRUE, na.rm = TRUE)
  )
b) group_by(city)
c) left_join(dat_loc,
            by = c("restaurant" = "name"))
d) arrange(desc(total_burgers))
  

Answer: c-b-a-d

Fill in a blank

This could mean fill in a blank in code to achieve a goal …

burger_counts |>
  left_join(dat_loc,
            by = c("restaurant" = "name")) |>
  ___________ |>
  summarize(
    total_burgers = sum(burger_TRUE, na.rm = TRUE)
  ) |>
  arrange(desc(total_burgers))

Fill in a blank

This could also mean fill in an intermediate dataset step:

[Original Data Sketch]
->
[Joined Data Sketch]
->
STUDENTS FILL IN THIS SKETCH
->
[Final output]

YOUR TURN

For your dataset, come up with some research questions that require a complicated pipeline to address.

Sketch out the steps needed to answer a research question.

Jot down some ideas about how you would make transitional student questions out of these - e.g., which steps would you show and which would you leave blank?

10:00