Lab 6 - Spicy

Performing Many Different Versions of an Analysis

This assignment will challenge your function writing abilities. I’m not going to lie, these functions are difficult but well within your reach. I do, however, want to recognize that not everyone is interested in being a “virtuoso” with their function writing. So, there are two options for this week’s lab:

  1. and complete the “Alternative Lab 6”.

Setting the Stage

My number one use case for writing functions and iteration / looping is to perform some exploration or modeling repeatedly for different “tweaked” versions. For example, our broad goal might be to fit a linear regression model to our data. However, there are often multiple choices that we have to make in practice:

  • Keep missing values or fill them in (imputation)?
  • Filter out outliers in one or more variables?

We can map these choices to arguments in a custom model-fitting function:

  • impute: TRUE or FALSE
  • remove_outliers: TRUE or FALSE

A function that implements the analysis and allows for variation in these choices:

fit_model <- function(df, impute, remove_outliers, mod) {
    if (impute) {
        df <- some_imputation_function(df)
    }
    
    if (remove_outliers) {
        df <- function_for_removing_outliers(df)
    }
    
    lm(mod, data = df)
}

Helper Functions

Exercise 1: Write a function that removes outliers in a dataset. The user should be able to supply the dataset, the variables to remove outliers from, and a threshold on the number of SDs away from the mean used to define outliers. Hint 1: You will need to calculate a z-score to filter the values! Hint 2: You might want to consider specifying a default value (e.g., 3) for sd_thresh.

Testing Your Function!

## Testing how your function handles multiple input variables
remove_outliers(diamonds, 
                price, 
                x, 
                y, 
                z)

## Testing how your function handles an input that isn't numeric
remove_outliers(diamonds, 
                price, 
                color)

## Testing how your function handles a non-default sd_thresh
remove_outliers(diamonds, 
                price,
                x, 
                y, 
                z, 
                sd_thresh = 2)

Exercise 2: Write a function that imputes missing values for numeric variables in a dataset. The user should be able to supply the dataset, the variables to impute values for, and a function to use when imputing. Hint 1: You will need to use across() to apply your function, since the user can input multiple variables. Hint 2: The replace_na() function is helpful here!

Testing Your Function!

## Testing how your function handles multiple input variables
impute_missing(nycflights13::flights, 
               arr_delay, 
               dep_delay) 

## Testing how your function handles an input that isn't numeric
impute_missing(nycflights13::flights, 
               arr_delay, 
               carrier)

## Testing how your function handles a non-default impute_fun
impute_missing(nycflights13::flights, 
               arr_delay, 
               dep_delay, 
               impute_fun = median)

Primary Function

Exercise 3: Write a fit_model() function that fits a specified linear regression model for a specified dataset. The function should:

  • allow the user to specify if outliers should be removed (TRUE or FALSE)
  • allow the user to specify if missing observations should be imputed (TRUE or FALSE)

If either option is TRUE, your function should call your remove_outliers() or impute_missing() functions to modify the data before the regression model is fit.

Testing Your Function!

fit_model(
  diamonds,
  mod_formula = price ~ carat + cut,
  remove_outliers = TRUE,
  impute_missing = TRUE,
  price, 
  carat
)

Iteration

In the diamonds dataset, we want to understand the relationship between price and size (carat). We want to explore variation along two choices:

  1. The variables included in the model. We’ll explore 3 sets of variables:

    • No further variables (just price and carat)
    • Adjusting for cut
    • Adjusting for cut and clarity
    • Adjusting for cut, clarity, and color
  2. Whether or not to impute missing values

  3. Whether or not to remove outliers in the carat variable (we’ll define outliers as cases whose carat is over 3 SDs away from the mean).

Parameters

First, we need to define the set of parameters we want to iterate the fit_model() function over. The tidyr package has a useful function called crossing() that is useful for generating argument combinations. For each argument, we specify all possible values for that argument and crossing() generates all combinations. Note that you can create a list of formula objects in R with c(y ~ x1, y ~ x1 + x2).

df_arg_combos <- crossing(
    impute = c(TRUE, FALSE),
    remove_outliers = c(TRUE, FALSE), 
    mod = c(y ~ x1, 
            y ~ x1 + x2)
)
df_arg_combos

Exercise 4: Use crossing() to create the data frame of argument combinations for our analyses.

Iterating Over the Parameters

We’ve arrived at the final step!

Exercise 5: Use pmap() from purrr to apply the fit_model() function to every combination of arguments from `diamonds.