add_or_subtract <- function(first_num, second_num = 2, type = "add") {
  if (type == "add") {
    first_num + second_num
  } else if (type == "subtract") {
    first_num - second_num
  } else {
    stop("Please choose `add` or `subtract` as the type.")
  }
}
Writing Functions in R
I’ve included quite a few talks in this coursework because (1) they do a great job discussing topics related to function design, and (2) they were given by my coding heroes (cough, Jenny Bryan 🤩).
It’s not critical for you to sit in front of a computer when listening to these talks, so I might recommend going on a walk and listening to one! The weather is too nice to sit in front of a computer all day!
Basics of Functions
If you do not recall the basics of writing functions, or if you want a quick refresher, watch the video below.
🎥 Recommended Video: How to Write a Function in R
📖 Recommended Reading: R for Data Science: Vector Functions
Anatomy of a Function
Let’s establish some vocabulary moving forward. Consider the very simple function add_or_subtract(), defined at the top of this page:
The function name is chosen by whoever writes the function:
add_or_subtract
The required arguments are the ones for which no default value is supplied:
first_num
The optional arguments are the ones for which a default value is supplied:
second_num = 2, type = "add"
The body of the function is all the code inside the definition. This code will be run in the environment of the function, rather than in the global environment. This means that code in the body of the function does not have the power to alter anything outside the function¹:
if (type == "add") {
  first_num + second_num
} else if (type == "subtract") {
  first_num - second_num
} else {
  stop("Please choose `add` or `subtract` as the type.")
}
The return values of the function are the possible objects that get returned:
first_num + second_num, first_num - second_num
When we use a function in code, this is referred to as a function call.
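If you want to see these pieces for yourself, base R lets you pull a function apart. The sketch below repeats the definition so it runs on its own, then uses formals() and body() to look at the arguments and the body. The call at the end uses made-up inputs chosen only for illustration.

```r
# Same function as in the text, repeated so this example is self-contained.
add_or_subtract <- function(first_num, second_num = 2, type = "add") {
  if (type == "add") {
    first_num + second_num
  } else if (type == "subtract") {
    first_num - second_num
  } else {
    stop("Please choose `add` or `subtract` as the type.")
  }
}

# Argument names, in the order they were defined
names(formals(add_or_subtract))

# Default values of the optional arguments
formals(add_or_subtract)$second_num
formals(add_or_subtract)$type

# The body is itself an R object that you can print and inspect
body(add_or_subtract)

# A function call, with all three arguments given explicitly
add_or_subtract(10, 4, type = "add")
```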
Question 1: What will be returned by each of the following?
add_or_subtract(5, 6, type = "subtract")
add_or_subtract("orange")
add_or_subtract(5, 6, type = "multiply")
add_or_subtract("orange", type = "multiply")
Choose from:
- 1
- -1
- 30
- An error defined in the function add_or_subtract
- An error defined in a different function, that is called from inside add_or_subtract
Question 2:
Consider the following code:
first_num <- 5
second_num <- 3
result <- 8
result <- add_or_subtract(first_num, second_num = 4)
result_2 <- add_or_subtract(first_num)
In your Global Environment, what is the value of…
first_num
second_num
result
result_2
Good Function Design
Most likely, you have so far only written functions for your own convenience. (Or for assignments, of course!) We are now going to be designing functions for other people to use and possibly even edit. This means we need to put some thought into the design of the function.
🎥 Recommended Video: What Makes a Good Function
🎥 Recommended Video: Code Smells and Feels
Designing functions is somewhat subjective, but there are a few principles that apply:
- Choose good, descriptive names.
  - Your function name should describe what it does, and usually involves a verb.
  - Your argument names should be simple and/or descriptive.
  - Names of variables in the body of the function should be descriptive.
- Output should be very predictable.
  - Your function should always return the same object type, no matter what input it gets.
  - Your function should expect certain objects or object types as input, and give errors when it does not get them.
  - Your function should give errors or warnings for common mistakes.
  - Default values of arguments should only be used when there is a clear common choice.
- The body of the function should be easy to read.
  - Code should use good style principles.
  - There should be occasional comments to explain the purpose of the steps.
  - Complicated steps, or steps that are repeated many times, should be written into separate functions (sometimes called helper functions).
- Functions should be self-contained.
  - They should not rely on any information besides what is given as input. (Relying on other functions is fine, though!)
  - They should not alter the Global Environment. (Do not put library() statements inside functions!)
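Here is one way these principles might look in practice. This is only a sketch: rescale_01() is a hypothetical function invented for this example, and the choice to return zeros for constant input is an assumption, not a rule.

```r
# `rescale_01` is a hypothetical example, not part of the course materials.
# It rescales a numeric vector so its smallest value is 0 and its largest is 1.
rescale_01 <- function(values) {
  # Fail early, with a clear message, when the input type is unexpected
  if (!is.numeric(values)) {
    stop("`values` must be a numeric vector.")
  }

  # Missing values are ignored when finding the smallest and largest value
  value_range <- range(values, na.rm = TRUE)

  # Warn about a common mistake: constant input cannot be rescaled
  if (value_range[1] == value_range[2]) {
    warning("All values are identical; returning zeros.")
    return(rep(0, length(values)))
  }

  # Every successful path returns a numeric vector of the same length
  (values - value_range[1]) / (value_range[2] - value_range[1])
}

rescaled <- rescale_01(c(2, 4, 6))
rescaled
```

Notice that the function only uses what it is given as input, and it never touches the Global Environment.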
Question 3: Identify the relevant issues for the following function:
doStuff <- function(x) {
  if (is.numeric(x)) {
    y <<- x + 1
    return(y)
  } else if (is.character(x)) {
    result <<- list(msg = paste("Hello", x))
    return(result)
  } else {
    warning("Unsupported input type")
    return(NULL)
  }
}
- the body of the function does not follow good style principles
- the argument names are not descriptive
- the function returns a different object type for different inputs
- function does not give errors when unexpected object types are input
- the names of variables in the body of the function are not descriptive
- the function doesn’t use
- the function modifies the global environment
- the function’s name does not describe what it does
Debugging Functions
Suppose you’ve done it: You’ve written the most glorious, beautiful, well-designed function of all time. It’s many lines long, and it relies on several sub-functions.
You run it and - it doesn’t work.
How can you track down exactly where in your complicated functions something went wrong?
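To make the problem concrete, here is a tiny sketch with one function that calls another. Both function names are hypothetical, invented for this example. The code captures the error with tryCatch() so you can inspect it in a script; the comments at the end list the interactive base R tools you could try at the console instead.

```r
# Hypothetical helper: the error is raised in here
double_it <- function(x) {
  if (!is.numeric(x)) {
    stop("`x` must be numeric.")
  }
  x * 2
}

# Hypothetical outer function: this is the one the user actually calls
double_then_add_one <- function(x) {
  double_it(x) + 1
}

# Capture the error object instead of halting the script
err <- tryCatch(double_then_add_one("orange"), error = function(e) e)

conditionMessage(err)  # the text of the error message
conditionCall(err)     # the call in which stop() was triggered

# At the console, you could instead try:
#   double_then_add_one("orange"); traceback()   # chain of calls behind the last error
#   debugonce(double_it)                          # step through the next call of double_it()
#   browser()                                     # placed inside a function body to pause there
```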
🎥 Recommended Video: Object of Type ‘closure’ is Not Subsettable
Question 4: What does using the traceback approach to debugging NOT tell you?
- The function call that triggered the error.
- The sub-function where the error actually occurred.
- The value of the argument or object that caused the error.
- The text of the full error message.
Question 5: Which of the following is NOT a disadvantage of using browser()?
- You can’t insert it into existing functions.
- You can’t view variables in the function environment when it is running.
- You have to remember to take it out of your code when you are done with it.
- You have to run your code line-by-line until you find the error.
Question 6: What is the most fun pronunciation of debugonce()?
- “Debug Once”
- “Debut Gonky”
- “Debugoncé” like “Beyoncé”
Advanced Details
As this is an Advanced course, let’s take a moment to talk about two quirky details of how R handles functions.
Objects of Type Closure
In R, functions are objects. That is, creating a function is not fundamentally different from creating a vector or a data frame.
Here we store the vector 1, 2, 3 in the object named a:
a <- 1:3
a
[1] 1 2 3
Here we store the procedure “add one plus one” in the object named a:
a <- function() {
  1 + 1
}
a
function ()
{
1 + 1
}
For some strange reason, there is a specific term in R for “an object that is a function”—closure. Have you ever gotten this error?
a[1]
Error in a[1]: object of type 'closure' is not subsettable
I bet you have! What happened here is that we tried to take a subset of the vector a. But a is a function, not a vector, so this doesn’t work! If you’ve encountered this error in the wild, it’s probably because you tried to reference a non-existent object. However, you used an object name that happens to also be an existing function.
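You can check this for yourself with typeof(). The short sketch below assumes nothing beyond base R. It also shows that built-in primitives such as sum() report a different type, so “closure” really describes ordinary functions written in R code.

```r
a <- function() {
  1 + 1
}

typeof(a)        # the type R reports for an ordinary function
is.function(a)   # functions are objects you can test like any other
typeof(sum)      # primitives such as sum() report a different type

# Subsetting a function fails; tryCatch() captures the error instead of halting
subset_attempt <- tryCatch(a[1], error = function(e) e)
conditionMessage(subset_attempt)
```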
Question 7: What is the most likely cause of the error message Error in x[1] : object of type 'closure' is not subsettable?
- Trying to access an element of a list using parentheses () instead of square brackets []
- Attempting to subset a numeric vector using the wrong index type
- Trying to extract an element from a function using square bracket notation
- Passing a missing argument to a function
Lazy Evaluation
Like most people, R tries to avoid doing any unnecessary work. When you “give” a value to an argument of a function, R does a quick check to make sure you haven’t done anything too crazy, like forgotten a parenthesis. Then it says, “Yep, looks like R code to me!” and moves on with its life. Only when that argument is actually used does R try to run the code.
Consider the following obvious problem:
mean('orange')
Warning in mean.default("orange"): argument is not numeric or logical:
returning NA
[1] NA
Now consider the following function:
silly_function <- function(x) {
  cat("I am silly!")
}
What do you think will happen when we run:
silly_function(
x = mean("orange")
)
Seems like it should trigger that same warning, right? But wait! Try it out for yourself.
The function silly_function() doesn’t use the x argument. Thus, R was “lazy” and never even bothered to try to run mean("orange"), so we never get the warning. 🙀
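The same thing happens even when the argument would cause a full-blown error. In this sketch, both function names are hypothetical. The first function never touches x, while the second uses base R’s force() to make R evaluate x right away.

```r
# Never uses `x`, so the code supplied for `x` is never run
ignore_argument <- function(x) {
  "x was never evaluated"
}

# force() makes R evaluate `x` immediately
use_argument <- function(x) {
  force(x)
  "x was evaluated"
}

# No error here, even though stop() is supplied as the argument
lazy_result <- ignore_argument(stop("This code never runs!"))

# Here the argument is used, so the error is triggered (and captured by tryCatch)
eager_result <- tryCatch(
  use_argument(stop("This code does run!")),
  error = function(e) conditionMessage(e)
)

lazy_result
eager_result
```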
Question 8: In R, when exactly does the evaluation of a function argument occur?
- Immediately when the function is called
- Only if and when the argument’s value is actually used within the function body
- When the function is compiled
- Only after all other arguments have been evaluated
Non-Standard Evaluation and Tunnelling
Suppose you want to write a function that takes a dataset, a categorical variable, and a quantitative variable, and returns the means by group.
You might think to yourself, “Easy!” and write something like this:
means_by_group <- function(dataset, cat_var, quant_var) {
  dataset %>%
    group_by(cat_var) %>%
    summarize(means = mean(quant_var, na.rm = TRUE))
}
Okay, let’s run it!
means_by_group(penguins,
cat_var = species,
quant_var = bill_length_mm)
Error in `group_by()`:
! Must group by variables found in `.data`.
✖ Column `cat_var` is not found.
Dagnabbit! The function tried to group the data by a variable named cat_var, but the dataset penguins doesn’t have any variables named cat_var!
What happened here is that the function group_by() uses non-standard evaluation. This means it accepts a special kind of input: an unquoted variable name. Notice that we say group_by(species), not group_by("species"). There are no quotation marks, because species is a variable name, not a string. In the means_by_group() function, R sees the unquoted variable cat_var and tries to use it as an input in group_by(), not realizing that we actually meant to pass along the variable name species into the function.
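You can see the same mix-up in miniature using only base R. This is an analogy, not how dplyr is actually implemented: the sketch uses substitute() to capture the code that was typed for an argument, and both function names are hypothetical.

```r
# Captures the code typed for `x` as text, without ever evaluating it
show_expression <- function(x) {
  deparse(substitute(x))
}

# Works even though no object named `species` exists
direct <- show_expression(species)

# A wrapper that passes its argument along, like our first means_by_group()
wrapper <- function(cat_var) {
  show_expression(cat_var)
}

# The inner function sees the name `cat_var`, not `species`
wrapped <- wrapper(species)

direct
wrapped
```

The inner function only ever sees the name that the wrapper used, which is the same reason group_by() went looking for a column called cat_var.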
📖 Recommended Reading: R for Data Science: Data Frame Functions
To solve this conundrum, we use a trick called tunneling to “force” the unquoted name species through to the function group_by(). It looks like this:
means_by_group <- function(dataset, cat_var, quant_var) {
  dataset %>%
    group_by({{cat_var}}) %>%
    summarize(
      means = mean({{quant_var}})
    )
}
Note: The tunnel, or “curly-curly” operator, {{ }}, is from the tidyverse package rlang.
Now everything works!
means_by_group(penguins,
cat_var = species,
quant_var = bill_length_mm
)
# A tibble: 3 × 2
species means
<fct> <dbl>
1 Adelie NA
2 Chinstrap 48.8
3 Gentoo NA
Okay, now let’s group by both species and sex:
means_by_group(penguins,
cat_var = c(species, sex),
quant_var = bill_length_mm)
Error in `group_by()`:
ℹ In argument: `c(species, sex)`.
Caused by error:
! `c(species, sex)` must be size 344 or 1, not 688.
Oh no! What now?! When c(species, sex) is put inside {{ }} within the function, R actually runs the code c(species, sex). This combines the columns for those two variables into one long vector. What we really meant by c(species, sex) is “group by both species and sex.”
To fix this, we need the pick() function to get R to see {{ cat_var }} as a list of separate variables (like the way select() works).
means_by_group <- function(df, cat_var, quant_var) {
  df %>%
    group_by(pick({{ cat_var }})) %>%
    summarize(mean = mean({{ quant_var }}))
}
Now it’s back to working!
means_by_group(penguins,
cat_var = c(species, sex),
quant_var = bill_length_mm
)
# A tibble: 8 × 3
# Groups: species [3]
species sex mean
<fct> <fct> <dbl>
1 Adelie female 37.3
2 Adelie male 40.4
3 Adelie <NA> NA
4 Chinstrap female 46.6
5 Chinstrap male 51.1
6 Gentoo female 45.6
7 Gentoo male 49.5
8 Gentoo <NA> NA
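If you’d like to try this pattern without the penguins data, here is a sketch using the built-in mtcars dataset. It assumes dplyr 1.1.0 or later is installed (that is when pick() was added). The function name means_by_group_demo() is hypothetical, and the na.rm = TRUE and .groups = "drop" settings are choices made for this example rather than part of the function above.

```r
library(dplyr)

# Hypothetical variant of the function above, written for illustration only
means_by_group_demo <- function(df, cat_var, quant_var) {
  df %>%
    group_by(pick({{ cat_var }})) %>%
    summarize(
      # na.rm = TRUE ignores missing values when computing each mean
      mean = mean({{ quant_var }}, na.rm = TRUE),
      # return an ungrouped result
      .groups = "drop"
    )
}

# One grouping variable
one_group <- means_by_group_demo(mtcars, cat_var = cyl, quant_var = mpg)

# Two grouping variables
two_groups <- means_by_group_demo(mtcars, cat_var = c(cyl, am), quant_var = mpg)

one_group
two_groups
```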
Question 9: Create a new version of dplyr::count() that shows proportions in addition to sample sizes. The function should be able to handle counting by multiple variables.
Footnotes
1. There are ways to cheat your way around this, but we will avoid them!