dplyr Review

Caution: Optional Content

This module consists of readings that review material typically taught in STAT 331. You may be able to skip portions of this reading. It is your responsibility to decide which areas you need to review before diving into STAT 541.

Answer the following questions to see if you can safely skip this section.

  1. In essence, a data.frame is simply a special list - with a few extra restrictions on the list format.

Think about the datasets you have already worked with. Which of the following restrictions on a list do you think are needed for the list to be a data.frame? (Select all that apply)

  1. The elements of the list must all be vectors of the same length.
  2. The elements of the list must all be the same data type.
  3. The elements of the list must all have no missing values.
  4. The elements of the list must all have names.
  2. Tibbles are described as “opinionated” data frames. Which of the following are true about a tibble’s behavior? (Select all that apply)
  1. tibbles only print the first 10 rows of a dataset
  2. tibbles allow for non-syntactic variable names, like :)
  3. tibbles never convert strings to factors
  4. tibbles create row names

If you had a hard time answering these questions, I would recommend reviewing Section 1.1.

  3. Match each base R code excerpt to the associated dplyr verb.
  • arrange()

  • summarize()

  • group_by() + summarize()

  1. penguins[order(penguins$bill_length_mm) , ]

  2. penguins[penguins$species == "Adelie", ]

  3. aggregate(bill_length_mm ~ species, data = penguins, FUN = mean)

  4. transform(penguins, mass_ratio = body_mass_g / flipper_length_mm)

  5. penguins$species

  6. mean(penguins$bill_length_mm[penguins$species == "Adelie"], na.rm = TRUE)

If you had a hard time answering this question, I would recommend reviewing Section 1.2.

  4. Suppose we would like to study how the ratio of bill length to bill depth varies across the different penguin species. Arrange the following steps into an order that accomplishes this goal (assuming the steps are connected with a |> or a %>%).
# a
arrange(avg_bill_ratio)

# b
group_by(species)

# c
penguins 

# d
summarize(
    avg_bill_ratio = mean(bill_ratio, na.rm = TRUE)
    )
  
# e
mutate(
    bill_ratio = bill_length_mm / bill_depth_mm
    )

If you had a hard time answering this question, I would recommend reviewing Section 1.3.

dplyr

You should feel comfortable using:

  • The five main dplyr verbs:

    • filter()

    • arrange()

    • select()

    • mutate()

    • summarize()

  • Incorporating group_by() to perform groupwise operations

  • Chaining together data wrangling operations with the pipe operator (|> or %>%)
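As a quick refresher, the sketch below chains the five verbs with group_by() using the native pipe. The data frame toy_penguins and all of its values are made up for illustration (the column names only mimic the penguins data), and the summary column names avg_mass_kg and n_birds are likewise hypothetical.

```r
library(dplyr)

# Made-up toy data; column names mimic the penguins dataset
toy_penguins <- data.frame(
  species        = c("Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap"),
  island         = c("Dream", "Biscoe", "Biscoe", "Biscoe", "Dream"),
  bill_length_mm = c(39, 41, 46, 50, 49),
  body_mass_g    = c(3700, 3900, 5000, 5400, 3800)
)

mass_summary <- toy_penguins |>
  filter(body_mass_g > 3750) |>                  # keep rows meeting a condition
  select(species, body_mass_g) |>                # keep only the needed columns
  mutate(body_mass_kg = body_mass_g / 1000) |>   # add a derived column
  group_by(species) |>                           # later verbs work within species
  summarize(
    avg_mass_kg = mean(body_mass_kg),
    n_birds     = n()
  ) |>
  arrange(desc(avg_mass_kg))                     # heaviest species first

mass_summary
```

Because summarize() follows group_by(), the result has one row per species rather than one row overall.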

Data Structures

Choose one of these two options:

Required Video
Required Reading

In addition, read the following section from the first edition of R for Data Science:

Required Reading
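To make the "a data.frame is a special list" idea concrete, here is a minimal sketch you can run in the console. The objects df and tb and their contents are hypothetical examples, not taken from the readings.

```r
library(tibble)

# A data.frame is stored as a list of equal-length column vectors
df <- data.frame(id = 1:3, name = c("a", "b", "c"))

is.list(df)          # TRUE
typeof(df)           # "list"
length(df)           # counts columns, just like the length of a list
sapply(df, length)   # every column has the same number of elements
sapply(df, class)    # columns are allowed to have different types

# A tibble keeps a non-syntactic column name exactly as written
tb <- tibble(`:)` = 1:3, name = c("a", "b", "c"))
names(tb)
class(tb$name)       # strings stay character vectors
```

Non-syntactic names such as :) must be wrapped in backticks whenever you refer to them, for example tb$`:)`.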

Data Wrangling with dplyr

If you had a hard time answering Question 3, I would recommend reviewing this content.

Required Reading
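If the mapping between base R and dplyr still feels shaky, it can help to run both versions side by side and confirm they agree. The sketch below does this for sorting and for a grouped mean, using a made-up toy_penguins data frame (hypothetical values, with column names that mimic the penguins data).

```r
library(dplyr)

# Made-up toy data; column names mimic the penguins dataset
toy_penguins <- data.frame(
  species        = c("Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap"),
  bill_length_mm = c(39, 41, 46, 50, 49)
)

# Sorting rows: base R indexing with order() versus arrange()
base_sorted  <- toy_penguins[order(toy_penguins$bill_length_mm), ]
dplyr_sorted <- arrange(toy_penguins, bill_length_mm)

# Mean per group: aggregate() versus group_by() + summarize()
base_means <- aggregate(bill_length_mm ~ species, data = toy_penguins, FUN = mean)
dplyr_means <- toy_penguins |>
  group_by(species) |>
  summarize(bill_length_mm = mean(bill_length_mm))

# Both approaches should produce the same numbers
all.equal(base_sorted$bill_length_mm, dplyr_sorted$bill_length_mm)
all.equal(base_means$bill_length_mm, dplyr_means$bill_length_mm)
```

The toy data has no missing values, so na.rm = TRUE is not needed here; with real data the two approaches handle NA differently by default, which is worth checking before comparing results.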

The Pipe Operator

If you had a hard time answering Question 4, I would recommend also reviewing this content.

Required Video
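As a small illustration of what the pipe does, the sketch below writes the same computation three ways: nested calls, the native pipe, and the magrittr pipe. The vector x is a hypothetical example, and the native pipe assumes R 4.1 or later.

```r
library(dplyr)  # loading dplyr makes the magrittr pipe %>% available

x <- c(1, 4, 9)

# Nested calls are read from the inside out
nested <- sum(sqrt(x))

# The native pipe passes the left-hand side as the first argument
native <- x |> sqrt() |> sum()

# The magrittr pipe behaves the same way for simple chains like this one
magrittr_style <- x %>% sqrt() %>% sum()

c(nested, native, magrittr_style)
```

All three versions compute the same value; the piped forms simply let you read the steps from left to right in the order they happen.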