Extending Data Wrangling Verbs

Thursday, October 10

Today we will…

  • Debrief PA 3
  • Debrief Lab 2 & Challenge 2
  • Discuss Lab 2 Peer Review
  • Outline Grade Expectations for Syllabus
  • New Material
    • Extend dplyr verbs to have more functionality
  • Lab 3: Teacher Evaluations

PA 3: Identify the Mystery College

Efficient Coding

For the final problem, many groups’ code looked like this:

colleges_clean |> 
  filter(REGION == 7) |>
  filter(ADM_RATE > median(ADM_RATE)) |>
  filter(TUITION_DIFF != 0) |>
  filter(SAT_AVG %% 2 == 1) |>
  filter(STABBR != "ID") |>
  filter(UGDS < 10000) |>
  filter(STABBR != "MT") |> 
  slice_min(order_by = ADM_RATE) 


Is there a more “efficient” way to do this?
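One possible answer, sketched on a small made-up data frame (the `toy` tibble and its values are assumptions for illustration, not the colleges data): filter() accepts several conditions separated by commas and keeps the rows where all of them are TRUE. One caution: a summary such as median() is computed on the rows entering that particular filter() call, so folding the median condition into the same call as the REGION condition changes which median is used.

```r
library(dplyr)

# Hypothetical stand-in for colleges_clean (values invented for illustration)
toy <- tibble(
  REGION   = c(7, 7, 7, 7, 3, 3),
  STABBR   = c("WY", "ID", "CO", "MT", "CA", "CA"),
  ADM_RATE = c(0.90, 0.80, 0.70, 0.60, 0.10, 0.20)
)

# Chained version: one condition per filter() call
chained <- toy |>
  filter(REGION == 7) |>
  filter(ADM_RATE > median(ADM_RATE)) |>
  filter(STABBR != "ID") |>
  filter(STABBR != "MT")

# Condensed version: comma-separated conditions are combined with AND.
# REGION is filtered first so the median is still taken over region 7 only.
condensed <- toy |>
  filter(REGION == 7) |>
  filter(ADM_RATE > median(ADM_RATE),
         !STABBR %in% c("ID", "MT"))

# Everything in one call: the median is now taken over ALL rows of toy,
# so this is not guaranteed to match the chained version.
all_in_one <- toy |>
  filter(REGION == 7,
         ADM_RATE > median(ADM_RATE),
         !STABBR %in% c("ID", "MT"))
```

On this toy data, `chained` and `condensed` keep the same single row, while `all_in_one` keeps an extra row because the overall median is lower than the region 7 median.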

A few words about drop_na()

  • Easy tool to remove missing values
  • Unilaterally removes any row with a missing value for any variable
  • But you can specify what columns it should look at for missing values!
colleges_clean |> 
  drop_na(REGION, 
          ADM_RATE, 
          TUITION_DIFF, 
          SAT_AVG, 
          STABBR, 
          UGDS)
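A minimal sketch of the difference between the two behaviors, using a made-up three-row tibble (the column `NOTES` and all values are hypothetical):

```r
library(dplyr)
library(tidyr)

# Hypothetical data: one missing value in a column we care about (ADM_RATE)
# and one in a column we do not care about (NOTES)
toy <- tibble(
  INSTNM   = c("A", "B", "C"),
  ADM_RATE = c(0.5, NA, 0.7),
  NOTES    = c(NA, "ok", "ok")
)

# No columns named: any row with a missing value anywhere is removed
all_cols <- toy |>
  drop_na()

# Column named: only rows missing ADM_RATE are removed
one_col <- toy |>
  drop_na(ADM_RATE)
```

Here `all_cols` keeps only college C, while `one_col` keeps A and C.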

Lab 2 & Challenge 2

Grading / Feedback & Revisions

  • Each question will earn a score of “Success” or “Growing”.
    • Questions marked “Growing” will receive feedback on how to improve your solution.
    • These questions can be resubmitted for additional feedback.
  • Earning a “Success” doesn’t necessarily mean your solution is without error.
    • You may still receive feedback on how to improve your solution.
    • These questions cannot be resubmitted for additional feedback.
  • Revisions for Lab 2 & Challenge 2 are due next Friday (October 18th)
    • You must submit your revised HTML to the original Lab 2 / Challenge 2 assignment portal.
    • You must include reflections on what you learned by completing the revisions.

Lab 2 Growing Points

  • Q1: Loading in data with the here() function
    • In this class, we use a project-oriented workflow to read in the data.
    • We do not specify relative paths to read in our data.
  • Q5: Adding transparency
    • Only variables being mapped to aesthetics are inserted into the aes() function!
    • Hard coded values (e.g., alpha = 0.5) belong outside the aes() function.
  • Report formatting: No messages or warnings should be output in the HTML document!
    • Use code chunk options and execute options to suppress these messages and warnings.
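The Q5 point about transparency can be sketched as follows. The data frame `toy` is invented for illustration; the pattern is that a variable (here `group`) is mapped inside aes(), while a constant (here alpha = 0.5) is set as a plain argument of the geom.

```r
library(ggplot2)

# Hypothetical data, invented for illustration
toy <- data.frame(
  x     = c(1, 2, 3),
  y     = c(2, 4, 6),
  group = c("a", "b", "a")
)

# group is a variable, so it is mapped inside aes();
# alpha is a hard-coded constant, so it sits outside aes()
p <- ggplot(data = toy,
            mapping = aes(x = x, y = y, color = group)) +
  geom_point(alpha = 0.5)
```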

Don’t make people tilt their head

Peer Code Review

Each of you was assigned one student’s lab to provide feedback on their code formatting.

Your feedback is to be provided in the comment box!

What feedback would you give?

ggplot(data = surveys, mapping = aes(x=hindfoot_length,y= weight)) +  
  geom_jitter(alpha=.2,color='tomato')+ facet_wrap(~species)+geom_boxplot(outlier.shape = NA)+labs(
    title ='Weight to hindfoot comparison'
  )+ xlab('length (mm)')+ylab('Weight(g)')

Grade Expectations

Defining Grades in 331

  • A: Superior Attainment of Course Objectives

  • B: Good Attainment of Course Objectives

  • C: Acceptable Attainment of Course Objectives

  • D: Poor Attainment of Course Objectives

We need to define criteria for each of these grades based on the four objectives of this course—learning targets, revising thinking, extending thinking, collaborating with peers.

Extending dplyr verbs

Example Data set – Cereal

library(tidyverse) # provides glimpse(), the dplyr verbs, and ggplot()
library(liver)     # provides the cereal data
data(cereal)

glimpse(cereal)
Rows: 77
Columns: 16
$ name     <fct> 100% Bran, 100% Natural Bran, All-Bran, All-Bran with Extra F…
$ manuf    <fct> N, Q, K, K, R, G, K, G, R, P, Q, G, G, G, G, R, K, K, G, K, N…
$ type     <fct> cold, cold, cold, cold, cold, cold, cold, cold, cold, cold, c…
$ calories <int> 70, 120, 70, 50, 110, 110, 110, 130, 90, 90, 120, 110, 120, 1…
$ protein  <int> 4, 3, 4, 4, 2, 2, 2, 3, 2, 3, 1, 6, 1, 3, 1, 2, 2, 1, 1, 3, 3…
$ fat      <int> 1, 5, 1, 0, 2, 2, 0, 2, 1, 0, 2, 2, 3, 2, 1, 0, 0, 0, 1, 3, 0…
$ sodium   <int> 130, 15, 260, 140, 200, 180, 125, 210, 200, 210, 220, 290, 21…
$ fiber    <dbl> 10.0, 2.0, 9.0, 14.0, 1.0, 1.5, 1.0, 2.0, 4.0, 5.0, 0.0, 2.0,…
$ carbo    <dbl> 5.0, 8.0, 7.0, 8.0, 14.0, 10.5, 11.0, 18.0, 15.0, 13.0, 12.0,…
$ sugars   <int> 6, 8, 5, 0, 8, 10, 14, 8, 6, 5, 12, 1, 9, 7, 13, 3, 2, 12, 13…
$ potass   <int> 280, 135, 320, 330, -1, 70, 30, 100, 125, 190, 35, 105, 45, 1…
$ vitamins <int> 25, 0, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25…
$ shelf    <int> 3, 3, 3, 3, 3, 1, 2, 3, 1, 3, 2, 1, 2, 3, 2, 1, 1, 2, 2, 3, 2…
$ weight   <dbl> 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.33, 1.00, 1.00, 1…
$ cups     <dbl> 0.33, 1.00, 0.33, 0.50, 0.75, 0.75, 1.00, 0.75, 0.67, 0.67, 0…
$ rating   <dbl> 68.40297, 33.98368, 59.42551, 93.70491, 34.38484, 29.50954, 3…

Count with count()

How many cereals does each manuf have in this dataset?

cereal |> 
  group_by(manuf) |> 
  count()
manuf   n
A       1
G      22
K      23
N       6
P       9
Q       8
R       8
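As a side note, count() can also take the grouping variable directly. A small sketch on made-up data (the `toy` tibble is hypothetical) comparing the two spellings:

```r
library(dplyr)

# Hypothetical miniature of the cereal data
toy <- tibble(manuf = c("G", "K", "K", "N", "G", "K"))

# Pattern from the slide: group first, then count the rows in each group
grouped <- toy |>
  group_by(manuf) |>
  count()

# Shortcut: name the variable inside count()
direct <- toy |>
  count(manuf)
```

Both give the same counts. The one difference is that `grouped` is still grouped by manuf afterwards, while `direct` is not.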

group_by() + slice()

For each manuf, find the cereal with the most fiber.

cereal |> 
  group_by(manuf) |> 
  slice_max(order_by = fiber)


name manuf type calories protein fat sodium fiber carbo sugars potass vitamins shelf weight cups rating
Maypo A hot 100 4 1 0 0.0 16 3 95 25 2 1.00 1.00 54.85092
Total Raisin Bran G cold 140 3 1 190 4.0 15 14 230 100 3 1.50 1.00 28.59278
All-Bran with Extra Fiber K cold 50 4 0 140 14.0 8 0 330 25 3 1.00 0.50 93.70491
100% Bran N cold 70 4 1 130 10.0 5 6 280 25 3 1.00 0.33 68.40297
Post Nat. Raisin Bran P cold 120 3 1 200 6.0 11 14 260 25 3 1.33 0.67 37.84059
Quaker Oatmeal Q hot 100 5 2 0 2.7 -1 -1 110 0 1 1.00 0.67 50.82839
Bran Chex R cold 90 2 1 200 4.0 15 6 125 25 1 1.00 0.67 49.12025
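One detail worth knowing: slice_max() keeps ties by default (with_ties = TRUE), so a group can return more than one row. A sketch on made-up data (names and values in `toy` are hypothetical):

```r
library(dplyr)

# Hypothetical data: the two "G" cereals are tied for the most fiber
toy <- tibble(
  manuf = c("G", "G", "K", "K", "K"),
  name  = c("g1", "g2", "k1", "k2", "k3"),
  fiber = c(4, 4, 1, 14, 9)
)

# Default behavior: tied rows are all returned
ties_kept <- toy |>
  group_by(manuf) |>
  slice_max(order_by = fiber)

# with_ties = FALSE returns exactly one row per group
ties_dropped <- toy |>
  group_by(manuf) |>
  slice_max(order_by = fiber, with_ties = FALSE)
```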

Multiple Variables in slice()

Find the 3 cereals with the highest fiber and potass.

  • If you are ordering by multiple variables, wrap them in a data.frame!
cereal |> 
  slice_max(order_by = data.frame(fiber, potass),
            n = 3)


name manuf type calories protein fat sodium fiber carbo sugars potass vitamins shelf weight cups rating
All-Bran with Extra Fiber K cold 50 4 0 140 14 8 0 330 25 3 1 0.50 93.70491
100% Bran N cold 70 4 1 130 10 5 6 280 25 3 1 0.33 68.40297
All-Bran K cold 70 4 1 260 9 7 5 320 25 3 1 0.33 59.42551
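To see what ordering by a data frame means, here is a sketch on made-up data (the `toy` tibble is hypothetical): rows are ranked by the first variable, and the second variable only breaks ties in the first. Sorting with arrange() and taking the top rows should give the same answer.

```r
library(dplyr)

# Hypothetical data: "a" and "b" tie on fiber, "c" has low fiber but high potass
toy <- tibble(
  name   = c("a", "b", "c", "d"),
  fiber  = c(5, 5, 3, 1),
  potass = c(100, 200, 900, 50)
)

# Order by fiber first, then by potass within equal fiber
top2 <- toy |>
  slice_max(order_by = data.frame(fiber, potass),
            n = 2)

# The same idea written as a sort followed by taking the first rows
top2_sorted <- toy |>
  arrange(desc(fiber), desc(potass)) |>
  slice_head(n = 2)
```

Cereal "c" is not selected even though it has the most potass, because fiber is compared first.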

Discretize with if_else()

For each cereal, label the potass as “high” or “low”.

if_else(<CONDITION>, <TRUE OUTPUT>, <FALSE OUTPUT>)

cereal |> 
  mutate(po_category = if_else(potass <= 100, 
                               "low", 
                               "high"),
         .after = potass)
name manuf type calories protein fat sodium fiber carbo sugars potass po_category vitamins shelf weight cups rating
100% Bran N cold 70 4 1 130 10.0 5.0 6 280 high 25 3 1.00 0.33 68.40297
100% Natural Bran Q cold 120 3 5 15 2.0 8.0 8 135 high 0 3 1.00 1.00 33.98368
All-Bran K cold 70 4 1 260 9.0 7.0 5 320 high 25 3 1.00 0.33 59.42551
All-Bran with Extra Fiber K cold 50 4 0 140 14.0 8.0 0 330 high 25 3 1.00 0.50 93.70491
Almond Delight R cold 110 2 2 200 1.0 14.0 8 -1 low 25 3 1.00 0.75 34.38484
Apple Cinnamon Cheerios G cold 110 2 2 180 1.5 10.5 10 70 low 25 1 1.00 0.75 29.50954
Apple Jacks K cold 110 2 0 125 1.0 11.0 14 30 low 25 2 1.00 1.00 33.17409
Basic 4 G cold 130 3 2 210 2.0 18.0 8 100 low 25 3 1.33 0.75 37.03856
Bran Chex R cold 90 2 1 200 4.0 15.0 6 125 high 25 1 1.00 0.67 49.12025
Bran Flakes P cold 90 3 0 210 5.0 13.0 5 190 high 25 3 1.00 0.67 53.31381
Cap'n'Crunch Q cold 120 1 2 220 0.0 12.0 12 35 low 25 2 1.00 0.75 18.04285
Cheerios G cold 110 6 2 290 2.0 17.0 1 105 high 25 1 1.00 1.25 50.76500
Cinnamon Toast Crunch G cold 120 1 3 210 0.0 13.0 9 45 low 25 2 1.00 0.75 19.82357
Clusters G cold 110 3 2 140 2.0 13.0 7 105 high 25 3 1.00 0.50 40.40021
Cocoa Puffs G cold 110 1 1 180 0.0 12.0 13 55 low 25 2 1.00 1.00 22.73645
Corn Chex R cold 110 2 0 280 0.0 22.0 3 25 low 25 1 1.00 1.00 41.44502
Corn Flakes K cold 100 2 0 290 1.0 21.0 2 35 low 25 1 1.00 1.00 45.86332
Corn Pops K cold 110 1 0 90 1.0 13.0 12 20 low 25 2 1.00 1.00 35.78279
Count Chocula G cold 110 1 1 180 0.0 12.0 13 65 low 25 2 1.00 1.00 22.39651
Cracklin' Oat Bran K cold 110 3 3 140 4.0 10.0 7 160 high 25 3 1.00 0.50 40.44877
Cream of Wheat (Quick) N hot 100 3 0 80 1.0 21.0 0 -1 low 0 2 1.00 1.00 64.53382
Crispix K cold 110 2 0 220 1.0 21.0 3 30 low 25 3 1.00 1.00 46.89564
Crispy Wheat & Raisins G cold 100 2 1 140 2.0 11.0 10 120 high 25 3 1.00 0.75 36.17620
Double Chex R cold 100 2 0 190 1.0 18.0 5 80 low 25 3 1.00 0.75 44.33086
Froot Loops K cold 110 2 1 125 1.0 11.0 13 30 low 25 2 1.00 1.00 32.20758
Frosted Flakes K cold 110 1 0 200 1.0 14.0 11 25 low 25 1 1.00 0.75 31.43597
Frosted Mini-Wheats K cold 100 3 0 0 3.0 14.0 7 100 low 25 2 1.00 0.80 58.34514
Fruit & Fibre Dates; Walnuts; and Oats P cold 120 3 2 160 5.0 12.0 10 200 high 25 3 1.25 0.67 40.91705
Fruitful Bran K cold 120 3 0 240 5.0 14.0 12 190 high 25 3 1.33 0.67 41.01549
Fruity Pebbles P cold 110 1 1 135 0.0 13.0 12 25 low 25 2 1.00 0.75 28.02576
Golden Crisp P cold 100 2 0 45 0.0 11.0 15 40 low 25 1 1.00 0.88 35.25244
Golden Grahams G cold 110 1 1 280 0.0 15.0 9 45 low 25 2 1.00 0.75 23.80404
Grape Nuts Flakes P cold 100 3 1 140 3.0 15.0 5 85 low 25 3 1.00 0.88 52.07690
Grape-Nuts P cold 110 3 0 170 3.0 17.0 3 90 low 25 3 1.00 0.25 53.37101
Great Grains Pecan P cold 120 3 3 75 3.0 13.0 4 100 low 25 3 1.00 0.33 45.81172
Honey Graham Ohs Q cold 120 1 2 220 1.0 12.0 11 45 low 25 2 1.00 1.00 21.87129
Honey Nut Cheerios G cold 110 3 1 250 1.5 11.5 10 90 low 25 1 1.00 0.75 31.07222
Honey-comb P cold 110 1 0 180 0.0 14.0 11 35 low 25 1 1.00 1.33 28.74241
Just Right Crunchy Nuggets K cold 110 2 1 170 1.0 17.0 6 60 low 100 3 1.00 1.00 36.52368
Just Right Fruit & Nut K cold 140 3 1 170 2.0 20.0 9 95 low 100 3 1.30 0.75 36.47151
Kix G cold 110 2 1 260 0.0 21.0 3 40 low 25 2 1.00 1.50 39.24111
Life Q cold 100 4 2 150 2.0 12.0 6 95 low 25 2 1.00 0.67 45.32807
Lucky Charms G cold 110 2 1 180 0.0 12.0 12 55 low 25 2 1.00 1.00 26.73451
Maypo A hot 100 4 1 0 0.0 16.0 3 95 low 25 2 1.00 1.00 54.85092
Muesli Raisins; Dates; & Almonds R cold 150 4 3 95 3.0 16.0 11 170 high 25 3 1.00 1.00 37.13686
Muesli Raisins; Peaches; & Pecans R cold 150 4 3 150 3.0 16.0 11 170 high 25 3 1.00 1.00 34.13976
Mueslix Crispy Blend K cold 160 3 2 150 3.0 17.0 13 160 high 25 3 1.50 0.67 30.31335
Multi-Grain Cheerios G cold 100 2 1 220 2.0 15.0 6 90 low 25 1 1.00 1.00 40.10596
Nut&Honey Crunch K cold 120 2 1 190 0.0 15.0 9 40 low 25 2 1.00 0.67 29.92429
Nutri-Grain Almond-Raisin K cold 140 3 2 220 3.0 21.0 7 130 high 25 3 1.33 0.67 40.69232
Nutri-grain Wheat K cold 90 3 0 170 3.0 18.0 2 90 low 25 3 1.00 1.00 59.64284
Oatmeal Raisin Crisp G cold 130 3 2 170 1.5 13.5 10 120 high 25 3 1.25 0.50 30.45084
Post Nat. Raisin Bran P cold 120 3 1 200 6.0 11.0 14 260 high 25 3 1.33 0.67 37.84059
Product 19 K cold 100 3 0 320 1.0 20.0 3 45 low 100 3 1.00 1.00 41.50354
Puffed Rice Q cold 50 1 0 0 0.0 13.0 0 15 low 0 3 0.50 1.00 60.75611
Puffed Wheat Q cold 50 2 0 0 1.0 10.0 0 50 low 0 3 0.50 1.00 63.00565
Quaker Oat Squares Q cold 100 4 1 135 2.0 14.0 6 110 high 25 3 1.00 0.50 49.51187
Quaker Oatmeal Q hot 100 5 2 0 2.7 -1.0 -1 110 high 0 1 1.00 0.67 50.82839
Raisin Bran K cold 120 3 1 210 5.0 14.0 12 240 high 25 2 1.33 0.75 39.25920
Raisin Nut Bran G cold 100 3 2 140 2.5 10.5 8 140 high 25 3 1.00 0.50 39.70340
Raisin Squares K cold 90 2 0 0 2.0 15.0 6 110 high 25 3 1.00 0.50 55.33314
Rice Chex R cold 110 1 0 240 0.0 23.0 2 30 low 25 1 1.00 1.13 41.99893
Rice Krispies K cold 110 2 0 290 0.0 22.0 3 35 low 25 1 1.00 1.00 40.56016
Shredded Wheat N cold 80 2 0 0 3.0 16.0 0 95 low 0 1 0.83 1.00 68.23588
Shredded Wheat 'n'Bran N cold 90 3 0 0 4.0 19.0 0 140 high 0 1 1.00 0.67 74.47295
Shredded Wheat spoon size N cold 90 3 0 0 3.0 20.0 0 120 high 0 1 1.00 0.67 72.80179
Smacks K cold 110 2 1 70 1.0 9.0 15 40 low 25 2 1.00 0.75 31.23005
Special K K cold 110 6 0 230 1.0 16.0 3 55 low 25 1 1.00 1.00 53.13132
Strawberry Fruit Wheats N cold 90 2 0 15 3.0 15.0 5 90 low 25 2 1.00 1.00 59.36399
Total Corn Flakes G cold 110 2 1 200 0.0 21.0 3 35 low 100 3 1.00 1.00 38.83975
Total Raisin Bran G cold 140 3 1 190 4.0 15.0 14 230 high 100 3 1.50 1.00 28.59278
Total Whole Grain G cold 100 3 1 200 3.0 16.0 3 110 high 100 3 1.00 1.00 46.65884
Triples G cold 110 2 1 250 0.0 21.0 3 60 low 25 3 1.00 0.75 39.10617
Trix G cold 110 1 1 140 0.0 13.0 12 25 low 25 2 1.00 1.00 27.75330
Wheat Chex R cold 100 3 1 230 3.0 17.0 3 115 high 25 1 1.00 0.67 49.78744
Wheaties G cold 100 3 1 200 3.0 17.0 3 110 high 25 1 1.00 1.00 51.59219
Wheaties Honey Gold G cold 110 2 1 200 1.0 16.0 8 60 low 25 1 1.00 0.75 36.18756

What do you think .after does?
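Notice that cereals with potass = -1 (the code for a missing value) end up labeled “low” in the table above. A hedged sketch of one way to avoid that, using made-up rows (the `toy` tibble is hypothetical): convert -1 to NA with na_if() first, so that if_else() returns NA for those cereals.

```r
library(dplyr)

# Hypothetical rows: the second cereal has potass coded as missing (-1)
toy <- tibble(
  name   = c("c1", "c2", "c3"),
  potass = c(280L, -1L, 70L)
)

labelled <- toy |>
  # Replace the -1 code with a real missing value
  mutate(potass = na_if(potass, -1L)) |>
  # A missing condition produces a missing label
  mutate(po_category = if_else(potass <= 100,
                               "low",
                               "high"))
```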

Re-level with case_when()

For each manufacturer, change the manuf code to the name of the manufacturer.

A series of if-else statements.

cereal |> 
  mutate(manuf = case_when(manuf == "A" ~ "American Home Food Products", 
                           manuf == "G" ~ "General Mills", 
                           manuf == "K" ~ "Kelloggs", 
                           manuf == "N" ~ "Nabisco", 
                           manuf == "P" ~ "Post", 
                           manuf == "Q" ~ "Quaker Oats", 
                           manuf == "R" ~ "Ralston Purina")
         )

Does this code create a new variable or change an existing variable?

name manuf type calories protein fat sodium fiber carbo sugars potass vitamins shelf weight cups rating
100% Bran Nabisco cold 70 4 1 130 10.0 5.0 6 280 25 3 1.00 0.33 68.40297
100% Natural Bran Quaker Oats cold 120 3 5 15 2.0 8.0 8 135 0 3 1.00 1.00 33.98368
All-Bran Kelloggs cold 70 4 1 260 9.0 7.0 5 320 25 3 1.00 0.33 59.42551
All-Bran with Extra Fiber Kelloggs cold 50 4 0 140 14.0 8.0 0 330 25 3 1.00 0.50 93.70491
Almond Delight Ralston Purina cold 110 2 2 200 1.0 14.0 8 -1 25 3 1.00 0.75 34.38484
Apple Cinnamon Cheerios General Mills cold 110 2 2 180 1.5 10.5 10 70 25 1 1.00 0.75 29.50954
Apple Jacks Kelloggs cold 110 2 0 125 1.0 11.0 14 30 25 2 1.00 1.00 33.17409
Basic 4 General Mills cold 130 3 2 210 2.0 18.0 8 100 25 3 1.33 0.75 37.03856
Bran Chex Ralston Purina cold 90 2 1 200 4.0 15.0 6 125 25 1 1.00 0.67 49.12025
Bran Flakes Post cold 90 3 0 210 5.0 13.0 5 190 25 3 1.00 0.67 53.31381
Cap'n'Crunch Quaker Oats cold 120 1 2 220 0.0 12.0 12 35 25 2 1.00 0.75 18.04285
Cheerios General Mills cold 110 6 2 290 2.0 17.0 1 105 25 1 1.00 1.25 50.76500
Cinnamon Toast Crunch General Mills cold 120 1 3 210 0.0 13.0 9 45 25 2 1.00 0.75 19.82357
Clusters General Mills cold 110 3 2 140 2.0 13.0 7 105 25 3 1.00 0.50 40.40021
Cocoa Puffs General Mills cold 110 1 1 180 0.0 12.0 13 55 25 2 1.00 1.00 22.73645
Corn Chex Ralston Purina cold 110 2 0 280 0.0 22.0 3 25 25 1 1.00 1.00 41.44502
Corn Flakes Kelloggs cold 100 2 0 290 1.0 21.0 2 35 25 1 1.00 1.00 45.86332
Corn Pops Kelloggs cold 110 1 0 90 1.0 13.0 12 20 25 2 1.00 1.00 35.78279
Count Chocula General Mills cold 110 1 1 180 0.0 12.0 13 65 25 2 1.00 1.00 22.39651
Cracklin' Oat Bran Kelloggs cold 110 3 3 140 4.0 10.0 7 160 25 3 1.00 0.50 40.44877
Cream of Wheat (Quick) Nabisco hot 100 3 0 80 1.0 21.0 0 -1 0 2 1.00 1.00 64.53382
Crispix Kelloggs cold 110 2 0 220 1.0 21.0 3 30 25 3 1.00 1.00 46.89564
Crispy Wheat & Raisins General Mills cold 100 2 1 140 2.0 11.0 10 120 25 3 1.00 0.75 36.17620
Double Chex Ralston Purina cold 100 2 0 190 1.0 18.0 5 80 25 3 1.00 0.75 44.33086
Froot Loops Kelloggs cold 110 2 1 125 1.0 11.0 13 30 25 2 1.00 1.00 32.20758
Frosted Flakes Kelloggs cold 110 1 0 200 1.0 14.0 11 25 25 1 1.00 0.75 31.43597
Frosted Mini-Wheats Kelloggs cold 100 3 0 0 3.0 14.0 7 100 25 2 1.00 0.80 58.34514
Fruit & Fibre Dates; Walnuts; and Oats Post cold 120 3 2 160 5.0 12.0 10 200 25 3 1.25 0.67 40.91705
Fruitful Bran Kelloggs cold 120 3 0 240 5.0 14.0 12 190 25 3 1.33 0.67 41.01549
Fruity Pebbles Post cold 110 1 1 135 0.0 13.0 12 25 25 2 1.00 0.75 28.02576
Golden Crisp Post cold 100 2 0 45 0.0 11.0 15 40 25 1 1.00 0.88 35.25244
Golden Grahams General Mills cold 110 1 1 280 0.0 15.0 9 45 25 2 1.00 0.75 23.80404
Grape Nuts Flakes Post cold 100 3 1 140 3.0 15.0 5 85 25 3 1.00 0.88 52.07690
Grape-Nuts Post cold 110 3 0 170 3.0 17.0 3 90 25 3 1.00 0.25 53.37101
Great Grains Pecan Post cold 120 3 3 75 3.0 13.0 4 100 25 3 1.00 0.33 45.81172
Honey Graham Ohs Quaker Oats cold 120 1 2 220 1.0 12.0 11 45 25 2 1.00 1.00 21.87129
Honey Nut Cheerios General Mills cold 110 3 1 250 1.5 11.5 10 90 25 1 1.00 0.75 31.07222
Honey-comb Post cold 110 1 0 180 0.0 14.0 11 35 25 1 1.00 1.33 28.74241
Just Right Crunchy Nuggets Kelloggs cold 110 2 1 170 1.0 17.0 6 60 100 3 1.00 1.00 36.52368
Just Right Fruit & Nut Kelloggs cold 140 3 1 170 2.0 20.0 9 95 100 3 1.30 0.75 36.47151
Kix General Mills cold 110 2 1 260 0.0 21.0 3 40 25 2 1.00 1.50 39.24111
Life Quaker Oats cold 100 4 2 150 2.0 12.0 6 95 25 2 1.00 0.67 45.32807
Lucky Charms General Mills cold 110 2 1 180 0.0 12.0 12 55 25 2 1.00 1.00 26.73451
Maypo American Home Food Products hot 100 4 1 0 0.0 16.0 3 95 25 2 1.00 1.00 54.85092
Muesli Raisins; Dates; & Almonds Ralston Purina cold 150 4 3 95 3.0 16.0 11 170 25 3 1.00 1.00 37.13686
Muesli Raisins; Peaches; & Pecans Ralston Purina cold 150 4 3 150 3.0 16.0 11 170 25 3 1.00 1.00 34.13976
Mueslix Crispy Blend Kelloggs cold 160 3 2 150 3.0 17.0 13 160 25 3 1.50 0.67 30.31335
Multi-Grain Cheerios General Mills cold 100 2 1 220 2.0 15.0 6 90 25 1 1.00 1.00 40.10596
Nut&Honey Crunch Kelloggs cold 120 2 1 190 0.0 15.0 9 40 25 2 1.00 0.67 29.92429
Nutri-Grain Almond-Raisin Kelloggs cold 140 3 2 220 3.0 21.0 7 130 25 3 1.33 0.67 40.69232
Nutri-grain Wheat Kelloggs cold 90 3 0 170 3.0 18.0 2 90 25 3 1.00 1.00 59.64284
Oatmeal Raisin Crisp General Mills cold 130 3 2 170 1.5 13.5 10 120 25 3 1.25 0.50 30.45084
Post Nat. Raisin Bran Post cold 120 3 1 200 6.0 11.0 14 260 25 3 1.33 0.67 37.84059
Product 19 Kelloggs cold 100 3 0 320 1.0 20.0 3 45 100 3 1.00 1.00 41.50354
Puffed Rice Quaker Oats cold 50 1 0 0 0.0 13.0 0 15 0 3 0.50 1.00 60.75611
Puffed Wheat Quaker Oats cold 50 2 0 0 1.0 10.0 0 50 0 3 0.50 1.00 63.00565
Quaker Oat Squares Quaker Oats cold 100 4 1 135 2.0 14.0 6 110 25 3 1.00 0.50 49.51187
Quaker Oatmeal Quaker Oats hot 100 5 2 0 2.7 -1.0 -1 110 0 1 1.00 0.67 50.82839
Raisin Bran Kelloggs cold 120 3 1 210 5.0 14.0 12 240 25 2 1.33 0.75 39.25920
Raisin Nut Bran General Mills cold 100 3 2 140 2.5 10.5 8 140 25 3 1.00 0.50 39.70340
Raisin Squares Kelloggs cold 90 2 0 0 2.0 15.0 6 110 25 3 1.00 0.50 55.33314
Rice Chex Ralston Purina cold 110 1 0 240 0.0 23.0 2 30 25 1 1.00 1.13 41.99893
Rice Krispies Kelloggs cold 110 2 0 290 0.0 22.0 3 35 25 1 1.00 1.00 40.56016
Shredded Wheat Nabisco cold 80 2 0 0 3.0 16.0 0 95 0 1 0.83 1.00 68.23588
Shredded Wheat 'n'Bran Nabisco cold 90 3 0 0 4.0 19.0 0 140 0 1 1.00 0.67 74.47295
Shredded Wheat spoon size Nabisco cold 90 3 0 0 3.0 20.0 0 120 0 1 1.00 0.67 72.80179
Smacks Kelloggs cold 110 2 1 70 1.0 9.0 15 40 25 2 1.00 0.75 31.23005
Special K Kelloggs cold 110 6 0 230 1.0 16.0 3 55 25 1 1.00 1.00 53.13132
Strawberry Fruit Wheats Nabisco cold 90 2 0 15 3.0 15.0 5 90 25 2 1.00 1.00 59.36399
Total Corn Flakes General Mills cold 110 2 1 200 0.0 21.0 3 35 100 3 1.00 1.00 38.83975
Total Raisin Bran General Mills cold 140 3 1 190 4.0 15.0 14 230 100 3 1.50 1.00 28.59278
Total Whole Grain General Mills cold 100 3 1 200 3.0 16.0 3 110 100 3 1.00 1.00 46.65884
Triples General Mills cold 110 2 1 250 0.0 21.0 3 60 25 3 1.00 0.75 39.10617
Trix General Mills cold 110 1 1 140 0.0 13.0 12 25 25 2 1.00 1.00 27.75330
Wheat Chex Ralston Purina cold 100 3 1 230 3.0 17.0 3 115 25 1 1.00 0.67 49.78744
Wheaties General Mills cold 100 3 1 200 3.0 17.0 3 110 25 1 1.00 1.00 51.59219
Wheaties Honey Gold General Mills cold 110 2 1 200 1.0 16.0 8 60 25 1 1.00 0.75 36.18756
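What happens to a code that matches none of the conditions? A sketch on made-up data (the code "Z" is invented, and the `.default` argument assumes dplyr 1.1.0 or later): unmatched values become NA unless a default is supplied.

```r
library(dplyr)

# Hypothetical data: "Z" is a made-up code that no condition matches
toy <- tibble(manuf = c("G", "K", "Z"))

relabelled <- toy |>
  mutate(
    # No default: the unmatched "Z" becomes NA
    manuf_name = case_when(manuf == "G" ~ "General Mills",
                           manuf == "K" ~ "Kelloggs"),
    # With .default: the unmatched "Z" keeps its original code
    manuf_safe = case_when(manuf == "G" ~ "General Mills",
                           manuf == "K" ~ "Kelloggs",
                           .default = manuf)
  )
```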

Calculate a Summary Statistic for Many Columns

For each type of cereal, calculate the mean nutrient levels.

cereal |> 
  group_by(type) |> 
  summarize(mean_cal = mean(calories), 
            mean_protein = mean(protein), 
            mean_fat = mean(fat), 
            mean_sodium = mean(sodium), 
            mean_fiber = mean(fiber), 
            mean_carbs = mean(carbo), 
            mean_sugars = mean(sugars), 
            mean_potassium = mean(potass)
            )

Does this seem like the most efficient way we could do this?

Summarize multiple columns with across()

Within the summarize() function, we use the across() function, with two main arguments:

  • .cols – to specify the columns to apply functions to.
  • .fns – to specify the function(s) to apply.

Inside .fns, we write lambda functions of the form ~ <FUN_NAME>(.x, <ARGS>), where .x is a placeholder (alias) for each column being passed into the function.
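A small sketch of the lambda syntax on made-up data (the `toy` tibble is hypothetical). The formula style, a base R anonymous function (R 4.1 or later), and a bare function name should all produce the same summary when no extra arguments are needed.

```r
library(dplyr)

# Hypothetical miniature of the cereal data
toy <- tibble(
  type     = c("cold", "cold", "hot"),
  calories = c(70, 120, 100),
  protein  = c(4, 3, 5)
)

# Formula-style lambda: .x stands for each selected column in turn
formula_style <- toy |>
  group_by(type) |>
  summarize(across(.cols = calories:protein,
                   .fns = ~ mean(.x)))

# Base R anonymous function
anon_style <- toy |>
  group_by(type) |>
  summarize(across(.cols = calories:protein,
                   .fns = \(x) mean(x)))

# Bare function name (no way to pass extra arguments here)
bare_style <- toy |>
  group_by(type) |>
  summarize(across(.cols = calories:protein,
                   .fns = mean))
```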

For each type of cereal, calculate the mean nutrient levels.

cereal |> 
  group_by(type) |> 
  summarise(
    across(.cols = calories:potass, 
           .fns = ~ mean(.x)
           )
    )
type  calories  protein   fat       sodium     fiber     carbo    sugars     potass
cold  107.1622  2.486486  1.013513  165.06757  2.189189  14.7027  7.1756757  97.21622
hot   100.0000  4.000000  1.000000   26.66667  1.233333  12.0000  0.6666667  68.00000

If missing values were present, we would need to remove them when calculating the mean!

cereal |> 
  group_by(type) |> 
  summarise(
    across(.cols = calories:potass, 
           .fns = ~ mean(.x, na.rm = TRUE)
           )
    )
type  calories  protein   fat       sodium     fiber     carbo    sugars     potass
cold  107.1622  2.486486  1.013513  165.06757  2.189189  14.7027  7.1756757  97.21622
hot   100.0000  4.000000  1.000000   26.66667  1.233333  12.0000  0.6666667  68.00000
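In cereal, missing values are coded as -1 rather than NA, so na.rm = TRUE only has something to remove once those codes are converted. A sketch of that two-step idea, using the sugars and potass values of the three hot cereals shown earlier (the `toy` tibble itself is a hypothetical stand-in):

```r
library(dplyr)

# sugars and potass for the three hot cereals, with -1 marking a missing value
toy <- tibble(
  type   = c("hot", "hot", "hot"),
  sugars = c(3, 0, -1),
  potass = c(95, -1, 110)
)

# Step 1: turn the -1 codes into real missing values in every nutrient column
toy_na <- toy |>
  mutate(across(.cols = sugars:potass,
                .fns = ~ na_if(.x, -1)))

# Step 2a: without na.rm, one NA makes the whole mean NA
means_with_na <- toy_na |>
  group_by(type) |>
  summarise(across(.cols = sugars:potass,
                   .fns = ~ mean(.x)))

# Step 2b: with na.rm = TRUE, the mean uses only the observed values
means_na_rm <- toy_na |>
  group_by(type) |>
  summarise(across(.cols = sugars:potass,
                   .fns = ~ mean(.x, na.rm = TRUE)))
```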

Getting Fancy with Names

.names

A glue specification that describes how to name the output columns. This can use {.col} to stand for the selected column name, and {.fn} to stand for the name of the function being applied. The default (NULL) is equivalent to "{.col}" for the single function case and "{.col}_{.fn}" for the case where a list is used for .fns.


cereal |> 
  group_by(type) |> 
  summarise(
    across(.cols = calories:potass, 
           .fns = ~ mean(.x), 
           .names = "mean_{.col}"
           )
    )
type mean_calories mean_protein mean_fat mean_sodium mean_fiber mean_carbo mean_sugars mean_potass
cold 107.1622 2.486486 1.013513 165.06757 2.189189 14.7027 7.1756757 97.21622
hot 100.0000 4.000000 1.000000 26.66667 1.233333 12.0000 0.6666667 68.00000
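The {.fn} piece of the glue specification becomes useful when .fns is a named list. A sketch on made-up data (the `toy` tibble and the list names `mean` and `sd` are chosen for illustration):

```r
library(dplyr)

# Hypothetical miniature of the cereal data, two rows per type
toy <- tibble(
  type     = c("cold", "cold", "hot", "hot"),
  calories = c(70, 120, 100, 100),
  protein  = c(4, 3, 5, 3)
)

# The names of the list entries fill {.fn}; the column names fill {.col}
summary_tbl <- toy |>
  group_by(type) |>
  summarise(
    across(.cols = calories:protein,
           .fns = list(mean = ~ mean(.x),
                       sd   = ~ sd(.x)),
           .names = "{.fn}_{.col}"
           )
    )
```

The resulting columns are named mean_calories, sd_calories, mean_protein, and sd_protein.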

Piping into ggplot()

Plot the mean protein per cup for each manuf.

cereal |> 
  mutate(manuf = case_when(manuf == "A" ~ "American Home Food Products", 
                           manuf == "G" ~ "General Mills", 
                           manuf == "K" ~ "Kelloggs", 
                           manuf == "N" ~ "Nabisco", 
                           manuf == "P" ~ "Post", 
                           manuf == "Q" ~ "Quaker Oats", 
                           manuf == "R" ~ "Ralston Purina")) |> 
  filter(type == "cold") |> 
  mutate(pro_per_cup = protein / cups) |> 
  group_by(manuf) |> 
  summarise(mean_pro_per_cup = mean(pro_per_cup)) |> 
  ggplot(mapping = aes(y = manuf, 
                       x = mean_pro_per_cup, 
                       shape = manuf)) +
  geom_point(show.legend = FALSE,
             size = 6) +
  labs(subtitle = "for 74 cold breakfast cereals",
       title = "Mean Grams of Protein per Cup", 
       x = "", 
       y = "") +
  theme_bw() +
  theme(plot.title = element_text(size = 24),
        plot.subtitle = element_text(size = 18),
        axis.text = element_text(size = 22)
        ) +
  scale_x_continuous(limits = c(0, 10))

Creating a Game Plan

Just like when creating graphics with ggplot, wrangling data with dplyr involves thinking through many steps and writing many layers of code.

  • To help you think through a wrangling problem, I strongly encourage you to create a game plan before you start writing code.

This might involve…

  • a sketch or flowchart.
  • a list of dplyr verbs and variable names.
  • annotating the head of the dataframe.

What are the median grams of sugar per shelf and the number of cereals per shelf, once we drop the missing values (coded as sugars = -1)?


The person with the nearest birthday: explain out loud to your neighbor how you would do this manipulation.

cereal |> 
  select(sugars, shelf) |> 
  filter(sugars != -1) |> 
  group_by(shelf) |> 
  summarise(med_sugars = median(sugars),
            num_shelf = n()
            )
# A tibble: 3 × 3
  shelf med_sugars num_shelf
  <int>      <dbl>     <int>
1     1          3        19
2     2         12        21
3     3          6        36

Lab 3 & Challenge 3

Exploring teacher evaluations during COVID-19

To do…

  • Lab 3: Student Evaluations of Teaching
    • Due Sunday, 10/13 at 11:59pm
  • Challenge 3: Extending Teaching Evaluation Investigations
    • Due Sunday, 10/13 at 11:59pm
  • Read Chapter 4: Data Joins and Transformations
    • Check-in 4.1 + 4.2 due Tuesday 10/15 at 12pm