Iterating Over Functions
In this unit, you'll review iteration: repeatedly performing the same function on different inputs.
Iteration in R tends to look a bit different from iteration in other programming languages. Much of it we get for free! For example, if you want to double a numeric vector x in R, you can just write 2 * x, whereas in many other languages you would need to explicitly double each element of x using some sort of for loop.
In R, there are generally two methods for iteration: for() loops and functionals. We will start with a review of for() loops before hopping over to functionals.
for loops
In R, for loops have the following general structure:
for (i in some_vector) {
  # Code to do stuff with i
}
some_vector can be any vector, including:
- An indexing vector: 1:3
- A character vector: c("group1", "group2", "group3")
- A vector of any other class
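# groups is assumed to hold the character vector listed above
groups <- c("group1", "group2", "group3")
for (g in groups) {
  print(g)
}
[1] "group1"
[1] "group2"
[1] "group3"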
for() loop Indices
The seq_along() function generates an integer sequence from 1 to the length of the vector supplied. A nice feature of seq_along() is that it generates an empty iteration vector if the vector you’re iterating over itself has length 0.
Closely related to seq_along() is seq_len(). While seq_along(x) generates an integer sequence from 1 to length(x), seq_len(x) treats x itself as the length and generates an integer sequence from 1 to x.
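As a small illustration of the difference, the sketch below compares both functions with the common 1:length(x) idiom. The vector names x and empty are made up for this example.

```r
x <- c("a", "b", "c")
empty <- character(0)

seq_along(x)        # 1 2 3: one index per element of x
seq_len(3)          # 1 2 3: the argument is itself the length
seq_len(length(x))  # equivalent to seq_along(x)

seq_along(empty)    # integer(0): a loop over this runs zero times
1:length(empty)     # 1 0: a loop over this would run twice

for (i in seq_along(empty)) {
  print(i)          # never reached, because there is nothing to iterate over
}
```

This is why seq_along(x) is usually a safer iteration vector than 1:length(x) when x might be empty.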
Dataframe Indices
seq_len() is useful for iterating over the rows of a data frame because seq_along() would iterate over columns:
small_data <- tibble(a = 1:2,
                     b = 2:3,
                     c = 4:5)
small_data
# A tibble: 2 × 3
      a     b     c
  <int> <int> <int>
1     1     2     4
2     2     3     5
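To make the contrast concrete, here is a minimal sketch using the same small_data tibble. It is re-created so the block runs on its own, and it assumes the tibble package is installed. The row_sums vector is a made-up name for illustration.

```r
library(tibble)

small_data <- tibble(a = 1:2, b = 2:3, c = 4:5)

seq_along(small_data)      # 1 2 3: one index per column
seq_len(nrow(small_data))  # 1 2: one index per row

# Loop over the rows and store the sum of each row
row_sums <- vector("numeric", length = nrow(small_data))
for (i in seq_len(nrow(small_data))) {
  row_sums[i] <- sum(unlist(small_data[i, ]))
}
row_sums                   # 7 10
```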
Storing Objects
Often we’ll want to store output created during a for loop. We can create storage containers with the vector() function:
char_storage <- vector("character",
                       length = 3)
char_storage
[1] "" "" ""
num_storage <- vector("numeric",
                      length = 3)
num_storage
[1] 0 0 0
list_storage <- vector("list",
                       length = 3)
list_storage
[[1]]
NULL
[[2]]
NULL
[[3]]
NULL
Our loop can then store objects created at each iteration stage:
for (i in seq_len(3)) {
  char_storage[i] <- str_c("Number: ", i)
  num_storage[i] <- 2 * i
  list_storage[[i]] <- i # Note the [[ for subsetting here
}
char_storage
[1] "Number: 1" "Number: 2" "Number: 3"
num_storage
[1] 2 4 6
list_storage
[[1]]
[1] 1
[[2]]
[1] 2
[[3]]
[1] 3
Initial Object Size
Notice that we initialized char_storage, num_storage, and list_storage to all be the same length as our iteration vector (seq_len(3)). Technically, this wasn’t necessary. We could have been much lazier when we initialized these objects: list_storage <- NA. Why didn’t we do this?
When you don't initialize an object at a specific size (e.g., length = 3), R has to grow that object at every stage of the for() loop, which can mean allocating a larger object and copying the existing elements into it. This really slows down your for() loop!
Here is a comparison of the run time between two functions that both store the index of the iteration. The first function (do_stuff_allocate) sets the initial object size (vector(length = reps)) before the loop is run. The second function (do_stuff_tackon) does not set an initial object size and instead initializes results as a vector of length 1 (results <- NA).
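The definitions of these two functions are not shown here, so the following is only a sketch of what they might look like, based on the description above. The loop bodies, the "integer" storage mode, and the reps value used for timing are all assumptions.

```r
# Pre-allocates the full results vector before looping
do_stuff_allocate <- function(reps) {
  results <- vector("integer", length = reps)
  for (i in seq_len(reps)) {
    results[i] <- i
  }
  results
}

# Starts from a length-1 vector and grows it by one element per iteration
do_stuff_tackon <- function(reps) {
  results <- NA
  for (i in seq_len(reps)) {
    results[i] <- i
  }
  results
}

# Both approaches return the same values; only the run time differs
identical(do_stuff_allocate(10), do_stuff_tackon(10))

# Rough timing with base R (results vary by machine and R version)
system.time(do_stuff_allocate(1e5))
system.time(do_stuff_tackon(1e5))
```

A dedicated benchmarking package would give more stable numbers than system.time(), which is used here only to keep the sketch free of extra dependencies.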
Let’s see how these two functions compare in their run time:
| Statistic | do_stuff_allocate | do_stuff_tackon |
|---|---|---|
| min | 12,751 µs | 403,317 µs |
| mean | 23,755 µs | 828,469 µs |
| median | 14,596 µs | 559,630 µs |
| max | 867,519 µs | 11,678,973 µs |
Lesson: Whenever you know the final size in advance, initialize your storage object at that size.
- Write a for()-loop that prints the even numbers from 1:20. Can you produce the same output with the seq() function?
- Write a for()-loop that iterates over the month.name vector (built in to base R) and stores a character vector of output containing strings like “Month 1: January”, “Month 2: February”. Can you produce the same output with str_c() only?
- Write a for()-loop that stores the class() (type) of every column in the mpg data frame.
Iteration with Functionals
A functional is a function that takes a function as an input and returns a vector as output. - Hadley Wickham
purrr is a tidyverse package that provides several useful functions for iteration. The main advantages of purrr include:
- Improved readability of R code
- Reduction in the “overhead” of writing a for()-loop (creating storage containers and writing the for (i in ...))
In purrr, we can use the family of map() functions to apply a function to each element of a list or vector. We can think of this as mapping an input (a list or vector) to a new output via a function. The purrr cheatsheet has graphical representations of how these functions work.
- map() returns a list
- map_chr() returns a character vector
- map_lgl() returns a logical vector
- map_int() returns an integer vector
- map_dbl() returns a numeric (double) vector
- map_vec() returns a vector of another type (like dates or factors)
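A quick sketch of how the suffix determines the output type, assuming purrr (version 1.0.0 or later, for map_vec()) and R 4.1 or later (for the \(v) shorthand). The input list vals and the functions applied to it are made up for illustration.

```r
library(purrr)

vals <- list(a = 1:3, b = c(2.5, 4.5), c = 10)

map(vals, length)          # a list with one integer per element
map_int(vals, length)      # named integer vector: a = 3, b = 2, c = 1
map_dbl(vals, mean)        # named double vector: a = 2, b = 3.5, c = 10
map_lgl(vals, is.integer)  # named logical vector: a = TRUE, b = FALSE, c = FALSE
map_chr(vals, class)       # named character vector: "integer", "numeric", "numeric"

# map_vec() keeps richer vector types, such as dates
map_vec(vals, \(v) as.Date("2024-01-01") + length(v))
```

In each case the function applied must return exactly one value of the promised type per element; otherwise the typed variants raise an error.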
A Single Output
To get the class() of each data frame column, map_chr() is the function we want because the class of a variable is a string (e.g., "logical").
map_chr(mpg, .f = class)
manufacturer        model        displ         year          cyl        trans
 "character"  "character"    "numeric"    "integer"    "integer"  "character"
         drv          cty          hwy           fl        class
 "character"    "integer"    "integer"  "character"  "character"
The first input of map() functions must be a list or an atomic vector. A dataframe is a special type of list, where the columns are the different elements of the list (e.g., mpg[["manufacturer"]]). map_chr() therefore iterates over the columns (elements) of the mpg dataframe (list).
Let’s get the class of each variable in diamonds:
map_chr(diamonds, .f = class)
Error in `map_chr()`:
ℹ In index: 2.
ℹ With name: cut.
Caused by error:
! Result must be length 1, not 2.
Multiple Outputs
What happened!? map_chr() was expecting to create a character vector with one element per column in diamonds. But something happened in column 2 with the cut variable. Let’s figure out what happened:
class(diamonds$cut)
[1] "ordered" "factor"
Ah! cut has two classes. In this case, map() (which returns a list) is the best option because some variables have multiple classes:
map(diamonds, .f = class)
$carat
[1] "numeric"
$cut
[1] "ordered" "factor"
$color
[1] "ordered" "factor"
$clarity
[1] "ordered" "factor"
$depth
[1] "numeric"
$table
[1] "numeric"
$price
[1] "integer"
$x
[1] "numeric"
$y
[1] "numeric"
$z
[1] "numeric"
The error we encountered with map_chr() is a nice feature of purrr because it requires us to be very sure of the type of output we are getting. Failing loudly is vastly preferable to getting unexpected outputs silently!
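For contrast, here is a sketch of the silent behaviour this protects against, using base R's sapply(). The data frame df is hypothetical and only stands in for a dataset like diamonds; purrr is assumed to be installed.

```r
library(purrr)

# Hypothetical data frame: one plain numeric column and one ordered factor
df <- data.frame(x = c(1.5, 2.5, 3.5))
df$size <- factor(c("S", "M", "L"), levels = c("S", "M", "L"), ordered = TRUE)

# sapply() silently returns a list instead of a character vector
# as soon as one column has more than one class
out_base <- sapply(df, class)
is.list(out_base)  # TRUE

# map_chr() refuses to guess and raises an error instead
out_purrr <- tryCatch(
  map_chr(df, class),
  error = function(e) "map_chr() failed"
)
out_purrr          # "map_chr() failed"
```

Code written downstream of sapply() may keep running with the wrong type of object, whereas the map_chr() error points straight at the offending column.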
Combining with Tidy Selection
We can combine map_*() functions with tidy selection for some powerful variable summaries that require much less code than for() loops.
diamonds %>%
  select(where(is.numeric)) %>%
  map_dbl(.f = mean)
       carat        depth        table        price            x            y
   0.7979397   61.7494049   57.4571839 3932.7997219    5.7311572    5.7345260
           z
   3.5387338
diamonds %>%
  select(!where(is.numeric)) %>%
  map_int(.f = n_distinct)
    cut   color clarity
      5       7       8
- Using a map function, reproduce this table, which indicates whether a given column is numeric.
carat cut color clarity depth table price x y z
TRUE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
- Using a map function, reproduce this table, which indicates how many levels are included in each categorical variable.
cut color clarity
5 7 8
Multiple Inputs
purrr also offers the map2() and pmap() family of functions that take multiple inputs and loop over them simultaneously. The purrr cheatsheet provides nice graphical representations of how these functions work.
For all the examples below, I’m going to work with this dataset of strings:
# A tibble: 3 × 3
string pattern replacement
<chr> <chr> <chr>
1 apple p P
2 banana n N
3 cherry h H
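The code that builds this dataset is not shown. One way to construct a tibble matching the printed output, assuming the tibble package is installed, is:

```r
library(tibble)

# Values copied from the printed tibble above
string_data <- tibble(
  string      = c("apple", "banana", "cherry"),
  pattern     = c("p", "n", "h"),
  replacement = c("P", "N", "H")
)

string_data
```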
Two Inputs
The str_detect() function takes two arguments: a string and a pattern to detect. The function returns logical values (TRUE, FALSE) indicating whether the pattern was detected in the string. Let’s use this to see how the map2_lgl() function works!
map2_lgl(
  .x = string_data$string,
  .y = string_data$pattern,
  .f = str_detect
)
[1] TRUE TRUE TRUE
This should look a bit different: we are passing string_data$string and string_data$pattern directly to map2_lgl() rather than piping (|>) the dataframe into the function.
If we had tried to use the pipe operator, we would have gotten the following error message:
string_data |>
  map2_lgl(
    .x = string,
    .y = pattern,
    .f = str_detect
  )
Error: object 'string' not found
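This error occurs because map2_lgl() does not use data masking. The pipe hands the entire string_data dataframe to map2_lgl(), but string and pattern are still evaluated as ordinary objects, and no objects with those names exist outside the dataframe. Looking at the documentation for map2(), the .x and .y arguments should be specified as a pair of vectors, not as column names of a dataframe.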
If we wanted to use the pipe operator, we would need to combine map2_lgl() with a data masking function (e.g., mutate()) that allows us to reference variable names as inputs to functions:
string_data |>
  mutate(found = map2_lgl(
    .x = string,
    .y = pattern,
    .f = str_detect
    )
  )
# A tibble: 3 × 4
string pattern replacement found
<chr> <chr> <chr> <lgl>
1 apple p P TRUE
2 banana n N TRUE
3 cherry h H TRUE
Three or More Inputs
Now that we’ve conquered two inputs, let’s try three! The str_replace_all() function takes three inputs: a string, a pattern to look for, and a replacement pattern (to use when the pattern is found).
string_data
# A tibble: 3 × 3
string pattern replacement
<chr> <chr> <chr>
1 apple p P
2 banana n N
3 cherry h H
pmap_chr(string_data, .f = str_replace_all)
[1] "aPPle" "baNaNa" "cHerry"
Note how the column names in string_data exactly match the argument names in str_replace_all(). The iteration that is happening is across rows, and the multiple arguments in str_replace_all() are being matched by name. So, the first row is effectively running str_replace_all(string = "apple", pattern = "p", replacement = "P"), and similarly for the second and third row.
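Because the matching is by name, the order of the columns should not matter. The sketch below checks this with a shuffled copy of the dataset. The tibble shuffled is made up for illustration, with values copied from the printed string_data, and the block assumes purrr, stringr, and tibble are installed.

```r
library(purrr)
library(stringr)
library(tibble)

# Same values as string_data, but with the columns in a different order
shuffled <- tibble(
  replacement = c("P", "N", "H"),
  string      = c("apple", "banana", "cherry"),
  pattern     = c("p", "n", "h")
)

# Each row becomes a call like
# str_replace_all(replacement = "P", string = "apple", pattern = "p")
result <- pmap_chr(shuffled, .f = str_replace_all)
result  # "aPPle" "baNaNa" "cHerry"
```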
What if the column names didn’t match? Well, we would need to take a similar approach to what we did with map2():
string_data <- string_data |>
  rename(word = string,
         look_for = pattern,
         replace_with = replacement)
pmap_chr(
  .l = list(string = string_data$word,
            pattern = string_data$look_for,
            replacement = string_data$replace_with),
  .f = str_replace_all
)
[1] "aPPle" "baNaNa" "cHerry"
The main difference here is there is one argument (.l) where we specify the inputs to the function (instead of .x and .y). This argument is a list, where the elements of the list should take on the same names as the function arguments.
Similar to before, if I wanted to use pmap() to modify the word column of the dataset, I would need to pair it with mutate():
string_data |>
  mutate(word = pmap_chr(
    .l = list(string = word,
              pattern = look_for,
              replacement = replace_with),
    .f = str_replace_all
    )
  )
# A tibble: 3 × 3
word look_for replace_with
<chr> <chr> <chr>
1 aPPle p P
2 baNaNa n N
3 cHerry h H
- The function str_c() concatenates strings. Create a small example that uses map2_chr() to combine two character vectors element by element. Each element of the first vector should be combined with the corresponding element of the second vector, separated by a space.
- The function str_sub() takes three arguments: string, start, and end. Create a small tibble containing the three inputs required for str_sub(). Note that the names of these columns must be the same as the names of the arguments to str_sub()!
  - a string,
  - a starting position, and
  - an ending position.
- Now use pmap_chr() to apply str_sub() row-by-row to the tibble created above.