Writing Data Frame & Plot Functions in R

This week, we’re going to learn how to write our own dplyr-like functions: functions that behave the way dplyr verbs do. To do this, we’ll dive into some of the deeper ideas that power the tidyverse, especially how dplyr functions interpret and evaluate code. These ideas, collectively known as tidy evaluation, are what make the tidyverse so expressive and intuitive for common data-analysis tasks, but they can also make programming with it more challenging.

We’re moving from using tidyverse functions for “common” tasks to building our own functions for the “less common” ones. By the end, you’ll understand not only how to use dplyr, but also how to extend it.

By the end of the week you should have a grasp of:

What the “embracing” ({{ }}) operator is
How to use the {{ }} operator in data frame functions
Why we need to use the {{ }} operator when writing data frame functions
What “data masking” is
What functions use data masking
What “tidy selection” is
What functions use tidy selection
How to use across() in functions
How to use pick() in functions

📖 Readings: 60-90 minutes

📽 Watch Videos: 20 minutes

✅ Preview Activities: 3

1 Writing Data Frame Functions in R

Now that we’ve learned, at a high level, how to write functions, we’re going to take this a step further and learn how to work with “tidy evaluation.”

Since there isn’t a single chapter dedicated to writing data frame functions, I’ve compiled a set of resources for each topic I want us to cover. First, let’s learn some of the “big picture” ideas of writing data frame functions.

📖 Required Reading: R4DS – Functions

Read only Section 3 – Data Frame Functions!

✅ Check-in 7.1: Writing Data Frame Functions

Data Structure

Note: for these questions, I am assuming the flight data has a structure similar to the nycflights data from the openintro R package.

head(openintro::nycflights)

Question 1: Fill in the code below to write a function that finds all flights that were cancelled or delayed by more than a user-supplied number of hours:

filter_severe <- function(df, hours) { 
  df |> 
    filter(dep_delay _____)
}


nycflights |> 
  filter_severe(hours = 2)

Question 2: Fill in the code below to write a function that converts a user-supplied variable that uses clock time (e.g., dep_time, arr_time, etc.) into a decimal time (i.e., hours + (minutes / 60)).

standardize_time <- function(df, time_var) {
  df |> 
    # Times are stored as 2008 for 8:08pm
    mutate( {{ time_var }} := 
              ## Grab first two numbers for hour
                str_sub(
                  ____, 
                  start = ____, 
                  end = ____) +  
              ## Grab second two numbers for minutes
                str_sub(
                  ____, 
                  start = ____, 
                  end = ____) / 60
            )
  
}

nycflights |> 
  standardize_time(arr_time)

2 Tidy Selection & Data Masking

Now that we’ve learned how to write functions that operate on entire data frames, we’re going to take this a step further and look more closely at tidy evaluation—the system that powers how many tidyverse functions interpret and evaluate your code.

At a high level:

Data masking is used in functions that compute with variables, such as arrange(), filter(), and summarise(). Inside these functions, you can refer to column names directly (e.g., filter(mpg > 20)) because dplyr temporarily “masks” the data frame’s columns so they behave like regular variables.
Tidy selection is used in functions that select variables, such as select() and across(). These functions let you choose columns using helpers like starts_with(), where(is.numeric), or vectors of column names.
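To make the distinction concrete, here is a small sketch using the built-in mtcars and iris data sets. The object names (efficient, by_name, and so on) are only illustrative, and the example assumes a reasonably recent version of dplyr is installed.

```r
library(dplyr)

# Data masking: mpg is a column of mtcars, not an object in the workspace,
# yet it can be used inside these verbs like an ordinary variable.
efficient <- mtcars |> filter(mpg > 20)
overall   <- mtcars |> summarise(mean_mpg = mean(mpg))

# Tidy selection: describe *which* columns you want, by name,
# by a pattern in the name, or by a property of the column.
by_name   <- mtcars |> select(c(vs, am, gear))
by_prefix <- mtcars |> select(starts_with("d"))   # disp and drat
by_type   <- iris   |> select(where(is.numeric))  # drops the Species factor
```

In the first pair of calls you compute with the values stored in a column; in the second group you only say which columns to keep.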

A quick way to tell which system is in play:

If the function accepts a vector of column names (e.g., select(mtcars, c(vs, am, gear))), it uses tidy selection.
If it does not accept a vector of column names and instead works on column values (e.g., filter(mtcars, mpg > 20)), it uses data masking.

Understanding which type of tidy evaluation a function uses will help you write your own functions that behave just like tidyverse ones.
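As a rough illustration of why this matters when you wrap a data-masking verb in your own function, the sketch below compares two hypothetical helpers, mean_of_naive() and mean_of(). Neither name comes from dplyr; they are made up for this example, which uses the built-in mtcars data.

```r
library(dplyr)

# Without embracing: `var` is looked up as written. It is not a column of df,
# so R falls back to the function argument and tries to evaluate `hp` in the
# caller's environment, where no such object exists.
mean_of_naive <- function(df, var) {
  df |> summarise(mean = mean(var))
}

# With embracing: {{ var }} tells summarise() to use the column the caller named.
mean_of <- function(df, var) {
  df |> summarise(mean = mean({{ var }}))
}

mean_of(mtcars, hp)

# In a fresh session the naive version errors, so we catch the error here
# instead of letting it stop the script.
naive_result <- tryCatch(
  mean_of_naive(mtcars, hp),
  error = function(e) "error"
)
```

The only difference between the two helpers is the pair of curly braces around var, which is what lets the caller write a bare column name.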

Let’s read this guide on tidy evaluation, which breaks down what tidy selection and data masking are and how they are used in dplyr functions.

📖 Programming with dplyr: Tidy Selection & Data Masking

If you are interested in a video on tidy evaluation, here is a talk by Jenny Bryan.

Important

I do want to note that this video is from 2019 and some things have changed since then. Namely, we used to need the enquo() function (paired with !!) to inject variable names into dplyr functions, whereas we now use embracing with {{ }}. 🤗
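If you run into the older style in the video, the following sketch shows how it lines up with embracing. The helper names max_of_old() and max_of_new() are invented for this comparison, and the example again uses the built-in mtcars data.

```r
library(dplyr)

# Older pattern: capture the caller's expression, then unquote it.
max_of_old <- function(df, var) {
  var <- enquo(var)                   # capture what the caller typed
  df |> summarise(max = max(!!var))   # inject it into the data-masked call
}

# Current pattern: embracing does both steps at once.
max_of_new <- function(df, var) {
  df |> summarise(max = max({{ var }}))
}

old_result <- max_of_old(mtcars, hp)
new_result <- max_of_new(mtcars, hp)
```

Both helpers should give the same summary, so code you see written with enquo() and !! can usually be read as if it used {{ }}.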

✅ Check-in 7.2: Tidy Evaluation & Data Masking

Question 1: Suppose you want to write a function that selects specific columns from a data frame using a character vector of column names stored in cols (e.g., cols = c("species", "bill_length_mm")). Which of the following code snippets will correctly select only those columns inside a function using tidy selection?

my_select <- function(df, cols) {
  # your code here
}

## Option A
df %>% select(cols)

## Option B
df %>% select(all_of(cols))

## Option C
df %>% select({{ cols }})

## Option D
df %>% select(.data[[cols]])

Question 2: You’re writing a function that filters rows in a data frame based on a column specified by the user. The function should take the data frame (df), a column name (var), and a numeric threshold (threshold), and return all rows where that column’s value is greater than the threshold. The function call should look something like: my_filter(penguins, bill_length_mm, 50)

my_filter <- function(df, var, threshold) {
  # your code here
}

## Option A
df %>% filter(var > threshold)

## Option B
df %>% filter(.data[[var]] > threshold)

## Option C
df %>% filter({{ var }} > threshold)

## Option D
df %>% filter(df[[var]] > threshold)

Question 3: For each of the following functions, determine whether the function uses data masking or tidy selection:

across()
count()
distinct()
group_by()
rename()
select()

Question 4: Suppose I wanted to write a function that added a standardized version of a specified (numeric) column into a data frame. Which of the following functions would accomplish this task?

## Option A
add_std_col <- function(df, var){
  mutate(df, 
         std_col = scale({{var}}, center = TRUE, scale = TRUE)
         )
}

## Option B
add_std_col <- function(df, var){
  mutate(df, 
         std_col = scale(var, center = TRUE, scale = TRUE)
         )
}

## Option C
add_std_col <- function(df, var){
  mutate(df, 
         std_col := scale({{var}}, center = TRUE, scale = TRUE)
         ) 
}

## Option D
add_std_col <- function(df, var){
  mutate(df, 
         std_col := scale(var, center = TRUE, scale = TRUE)
         )
}

Question 5: Suppose I wanted to extend my function from Question 4 into a function that standardizes every numeric column in a specified data frame. Which of the following functions would accomplish this task?

## Option A
standardize_df <- function(df){
  mutate(df, 
         across(.cols = where(is.numeric), 
                .fns = ~ scale(.x, center = TRUE, scale = TRUE), 
                .names = "{.col}_std"
         )
         )
}

## Option B
standardize_df <- function(df){
  mutate(df, 
         across(.cols = pick(where(is.numeric)), 
                .fns = ~ scale(.x, center = TRUE, scale = TRUE), 
                .names = "{.col}_std"
         )
         )
}

## Option C
standardize_df <- function(df){
  mutate(df,
         std = scale(pick(where(is.numeric)), 
                     center = TRUE, 
                     scale = TRUE)
         )
}

## Option D
standardize_df <- function(df){
  mutate(df,
         across(.cols = where(is.numeric),
                .fns = ~ scale(pick(where(is.numeric)), 
                               center = TRUE, 
                               scale = TRUE),
                .names = "{.col}_std"
         )
         )
}