Writing Data Frame & Plot Functions in R

The second half of this week’s coursework focuses on writing functions that work with data frames.

By the end of the week you should have a grasp of:


▶️ Watch Videos: 20 minutes

📖 Readings: 60-75 minutes

✅ Preview Activities: 1 (broken into two parts)


1 Writing Data Frame Functions in R

📖 Required Reading: R4DS – Functions

Read only Section 3 (Data frame functions)!

1.1 Tidy Evaluation

Writing functions that work with data frames and call on the functions we’ve become used to (e.g., filter(), select(), summarise()) requires that we learn about tidy evaluation. To write these functions you will need to know whether the function you are trying to incorporate uses data masking or tidy selection.

At a high level, data masking is used in functions like arrange(), filter(), and summarise() that compute with variables, whereas tidy selection is used in functions like select() and rename() that select variables.

Your intuition about which functions use tidy evaluation should be good for many of these functions. If you can input c(var1, var2, var3) into the function (e.g., select(mtcars, c(vs, am, gear))), then the function uses tidy selection! If you cannot input c(var1, var2, var3) into the function, then the function is performing computations on the data and uses data masking.
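As a rough sketch of the difference, the two helpers below embrace a user-supplied argument, first inside a data-masking function and then inside a tidy-selection function. The helper names summarise_mean() and pick_columns() are my own, made up for illustration, and the example assumes the dplyr package is installed.

```r
library(dplyr)

# Data masking: summarise() computes with the variable, so we embrace it.
# (summarise_mean() is a hypothetical helper, for illustration only)
summarise_mean <- function(df, var) {
  df |> 
    summarise(mean = mean({{ var }}, na.rm = TRUE))
}

# Tidy selection: select() picks variables, so a c(...) of column names
# can be passed straight through the braces.
# (pick_columns() is a hypothetical helper, for illustration only)
pick_columns <- function(df, vars) {
  df |> 
    select({{ vars }})
}

mtcars |> 
  summarise_mean(mpg)

mtcars |> 
  pick_columns(c(vs, am, gear))
```

Notice that the embracing syntax is the same in both helpers. What differs is what the user is allowed to pass in: an expression to compute with, or a selection of columns.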

If you are interested in learning more about tidy evaluation, I would highly recommend:

  • this video by Jenny Bryan
    • Note that this video is from 2019 and some things have changed since then. Namely, we used to need the enquo() function to inject
      variable names into dplyr functions, whereas we now embrace them with {{ }}. 🤗
  • this vignette for tidy evaluation with dplyr
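To make the note about enquo() concrete, here is a minimal sketch of the same function written both ways. The function names group_mean_old() and group_mean_new() are hypothetical, and I am assuming dplyr and rlang are installed. The older style shown is the enquo() plus !! pattern.

```r
library(dplyr)

# Older style: capture each argument with enquo(), then inject it with !!
# (group_mean_old() is a hypothetical name, for illustration only)
group_mean_old <- function(df, group_var, mean_var) {
  group_var <- rlang::enquo(group_var)
  mean_var <- rlang::enquo(mean_var)
  
  df |> 
    group_by(!!group_var) |> 
    summarise(mean = mean(!!mean_var, na.rm = TRUE))
}

# Current style: embrace the argument with {{ }} in a single step
# (group_mean_new() is a hypothetical name, for illustration only)
group_mean_new <- function(df, group_var, mean_var) {
  df |> 
    group_by({{ group_var }}) |> 
    summarise(mean = mean({{ mean_var }}, na.rm = TRUE))
}

mtcars |> 
  group_mean_new(cyl, mpg)
```

Both versions should return the same summary table. The embraced version simply skips the separate capture step.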

✅ Check-in 7.2: Writing Data Frame Functions

Data Structure

Note: for Questions 1 & 2, I am assuming the flight data has a similar structure to the nycflights data from the openintro R package.

head(openintro::nycflights)
# A tibble: 6 × 16
   year month   day dep_time dep_delay arr_time arr_delay carrier tailnum flight
  <int> <int> <int>    <int>     <dbl>    <int>     <dbl> <chr>   <chr>    <int>
1  2013     6    30      940        15     1216        -4 VX      N626VA     407
2  2013     5     7     1657        -3     2104        10 DL      N3760C     329
3  2013    12     8      859        -1     1238        11 DL      N712TW     422
4  2013     5    14     1841        -4     2122       -34 DL      N914DL    2391
5  2013     7    21     1102        -3     1230        -8 9E      N823AY    3652
6  2013     1     1     1817        -3     2008         3 AA      N3AXAA     353
# ℹ 6 more variables: origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>

Question 1: Fill in the code below to write a function that finds all flights that were cancelled or delayed by more than a user-supplied number of hours:

filter_severe <- function(df, hours) { 
  df |> 
    filter(dep_delay _____)
}


nycflights |> 
  filter_severe(hours = 2)

Question 2: Fill in the code below to write a function that converts the user-supplied variable that uses clock time (e.g., dep_time, arr_time, etc.) into a decimal time (i.e., hours + (minutes / 60)).

standardize_time <- function(df, time_var) {
  df |> 
    # Times are stored as integers, e.g., 2008 means 20:08 (8:08 pm)
    mutate( {{ time_var }} := 
              ## Integer division by 100 gives the hour; this also works
              ## for three-digit times such as 940 (9:40 am)
              ({{ time_var }} %/% 100) +  
              ## The remainder after dividing by 100 gives the minutes
              ({{ time_var }} %% 100) / 60
            )
}

nycflights |> 
  standardize_time(arr_time)

Question 3: For each of the following functions determine if the function uses data-masking or tidy-selection:

  • distinct()
  • count()
  • group_by()
  • select()
  • rename_with()
  • across()

2 Writing Plotting Functions in R

📖 Required Reading: R4DS – Functions

Read only Section 4 (Plot functions)!

✅ Check-in 7.2: Writing Data Frame Functions

Question 4: Fill in the code below to build a rich plotting function which:

  • draws a scatterplot given a dataset and x and y variables,
  • adds a line of best fit (i.e., a linear model with no standard errors),
  • adds a title.

scatterplot <- function(df, x_var, y_var) {
  label <- rlang::englue("A scatterplot of _____ and _____, including a line of best fit.")
  
  df |> 
    ggplot(mapping = aes(x = _____, 
                         y = _____
                         )
           ) + 
    geom_point() + 
    geom_smooth(method = "lm", _____) +
    labs(title = _____)
}
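
For a comparable pattern that is not the answer to Question 4, here is a minimal sketch of a histogram function in the style of the required reading. It assumes the ggplot2 and rlang packages are installed. The function name histogram() and the binwidth argument are chosen for illustration. Inside rlang::englue(), a user-supplied variable is embraced with {{ }}, while an ordinary value is interpolated with single braces.

```r
library(ggplot2)

# A sketch of a plotting function with an automatically generated title.
# (histogram() is an illustrative helper, not part of ggplot2)
histogram <- function(df, var, binwidth) {
  # {{ var }} inserts the variable's name; {binwidth} inserts its value
  label <- rlang::englue("A histogram of {{var}} with binwidth {binwidth}")
  
  df |> 
    ggplot(mapping = aes(x = {{ var }})) + 
    geom_histogram(binwidth = binwidth) + 
    labs(title = label)
}

p <- histogram(mtcars, mpg, binwidth = 2)
p
```

The same two ideas, embracing inside aes() and building the label with englue(), carry over to the scatterplot function above.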