`dplyr` Review

Optional Content

This module consists of readings that review material typically taught in STAT 331. You may be able to skip portions of this reading; it is your responsibility to decide which areas you need to review before diving into STAT 541.

Answer the following questions to see if you can safely skip this section.

In essence, a data.frame is a special kind of list, with a few extra restrictions on its format.
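To see the "special list" claim in action, here is a minimal sketch in base R. The data frame `df` and its columns `x` and `y` are hypothetical, made up only for illustration.

```r
# A small hypothetical data frame, used only for illustration
df <- data.frame(x = 1:3, y = 4:6)

is.list(df)   # TRUE: a data.frame passes the list check
typeof(df)    # "list": the underlying storage type is a list
class(df)     # "data.frame": the class attribute adds the extra behavior

# As with any list, length() counts the elements (here, the columns)
length(df)    # 2

# List-style extraction pulls out a single column as a vector
df[["y"]]     # 4 5 6
```

Because each column is an element of the list, list tools such as `[[` and `lapply()` work on a data.frame column by column.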

Think about the datasets you have already worked with. Which of the following restrictions on a list do you think are needed for the list to be a data.frame? (Select all that apply)

The elements of the list must all be vectors of the same length.
The elements of the list must all be the same data type.
The elements of the list must all have no missing values.
The elements of the list must all have names.

Tibbles are described as “opinionated” data frames. Which of the following are true about a tibble’s behavior? (Select all that apply)

tibbles only print the first 10 rows of a dataset
tibbles allow for non-syntactic variable names, like `:)`
tibbles never convert strings to factors
tibbles create row names

If you had a hard time answering these questions, I would recommend reviewing Section 1.1.

Match each base R code excerpt to the associated dplyr verb.

filter()
select()
mutate()

arrange()
summarize()
group_by() + summarize()

penguins[order(penguins$bill_length_mm) , ]
penguins[penguins$species == "Adelie", ]
aggregate(bill_length_mm ~ species, data = penguins, FUN = mean)
transform(penguins, mass_ratio = body_mass_g / flipper_length_mm)
penguins$species
mean(penguins[penguins$species == "Adelie", ]$bill_length_mm, na.rm = TRUE)

If you had a hard time answering this question, I would recommend reviewing Section 1.2.

Suppose we would like to study how the ratio of bill length to bill depth varies across the different penguin species. Arrange the following steps into an order that accomplishes this goal (assuming the steps are connected with a |> or a %>%).

# a
arrange(avg_bill_ratio)

# b
group_by(species)

# c
penguins 

# d
summarize(
    avg_bill_ratio = mean(bill_ratio, na.rm = TRUE)
    )
  
# e
mutate(
    bill_ratio = bill_length_mm / bill_depth_mm
    )

If you had a hard time answering this question, I would recommend reviewing Section 1.3.

`dplyr`

You should feel comfortable using:

- The five main dplyr verbs:
  - filter()
  - arrange()
  - select()
  - mutate()
  - summarize()
- Incorporating group_by() to perform groupwise operations
- Chaining together data wrangling operations with the pipe operator (|> or %>%)
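
As a quick self-check on the skills listed above, here is a minimal sketch that chains all five verbs plus group_by() in one pipeline. It uses the built-in `mtcars` dataset rather than `penguins`; the `hp > 100` cutoff and the names `wt_kg`, `avg_mpg`, `avg_wt_kg`, and `fuel_summary` are arbitrary choices made for illustration. It assumes dplyr is installed and R is version 4.1 or later (for the native pipe).

```r
library(dplyr)

fuel_summary <- mtcars |>
  # filter(): keep only the rows that meet a condition
  filter(hp > 100) |>
  # mutate(): add a column (wt is recorded in thousands of pounds)
  mutate(wt_kg = wt * 1000 * 0.4536) |>
  # select(): keep only the columns needed downstream
  select(cyl, mpg, wt_kg) |>
  # group_by() + summarize(): one row of summaries per cylinder count
  group_by(cyl) |>
  summarize(
    avg_mpg   = mean(mpg),
    avg_wt_kg = mean(wt_kg)
  ) |>
  # arrange(): sort groups from highest to lowest average mpg
  arrange(desc(avg_mpg))

fuel_summary
```

If each step reads naturally to you, and you could swap `|>` for `%>%` without changing the result, you are likely comfortable with this material.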