Iterating Over Functions

In this unit, you’ll review iteration: repeatedly performing the same function on different inputs.

Iteration in R tends to look a bit different from other programming languages. Much of iteration we get for free because many R operations are vectorized! For example, if you want to double a numeric vector x in R, you can just write 2 * x, whereas in many other languages you would need to explicitly double each element of x using some sort of for loop.
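As a small illustration of the point above, the sketch below doubles a vector both ways. The vector x and its values are made up for the example.

```r
# Hypothetical example vector
x <- c(1.5, 2, 10)

# Vectorized: R applies the multiplication to every element
doubled_vectorized <- 2 * x

# Explicit loop, as many other languages would require
doubled_loop <- vector("numeric", length = length(x))
for (i in seq_along(x)) {
  doubled_loop[i] <- 2 * x[i]
}

# Both approaches give the same result
identical(doubled_vectorized, doubled_loop)
```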

In R, there are generally two methods for iteration: for() loops and functionals. We will start with a review of for() loops before hopping over to functionals.

for loops

In R, for loops have the following general structure:

for (i in some_vector) {
    # Code to do stuff with i
}

some_vector can be any vector, including:

  • An indexing vector: 1:3
  • A character vector: c("group1", "group2", "group3")
  • A vector of any other class

groups <- c("group1", "group2", "group3")

for (i in 1:3) {
    print(groups[i])
}
[1] "group1"
[1] "group2"
[1] "group3"
for (g in groups) {
    print(g)
}
[1] "group1"
[1] "group2"
[1] "group3"

for() loop Indices

The seq_along() function generates an integer sequence from 1 to the length of the vector supplied. A nice feature of seq_along() is that it generates an empty iteration vector if the vector you’re iterating over itself has length 0.

seq_along(groups)
[1] 1 2 3
no_groups <- c()
seq_along(no_groups)
integer(0)
for (i in seq_along(groups)) {
    print(groups[i])
}
[1] "group1"
[1] "group2"
[1] "group3"
for (i in seq_along(no_groups)) {
    print(no_groups[i])
}
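To see why the empty case matters, here is a minimal sketch comparing seq_along() with the common 1:length(x) idiom. The counters n_bad and n_good are hypothetical names used only to record how many times each loop body runs.

```r
no_groups <- c()

# 1:length(x) counts DOWN from 1 to 0 when x is empty
1:length(no_groups)
#> [1] 1 0

# seq_along(x) is empty, so a loop over it runs zero times
seq_along(no_groups)
#> integer(0)

n_bad <- 0
for (i in 1:length(no_groups)) n_bad <- n_bad + 1     # body runs twice

n_good <- 0
for (i in seq_along(no_groups)) n_good <- n_good + 1  # body never runs

c(n_bad = n_bad, n_good = n_good)
```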

Closely related to seq_along() is seq_len(). While seq_along(x) generates an integer sequence from 1 to length(x), seq_len(x) takes x itself to be a length:

seq_len(length(groups))
[1] 1 2 3
seq_len(length(no_groups))
integer(0)
for (i in seq_len(length(groups))) {
    print(groups[i])
}
[1] "group1"
[1] "group2"
[1] "group3"
for (i in seq_len(length(no_groups))) {
    print(no_groups[i])
}

Dataframe Indices

seq_len() is useful for iterating over the rows of a data frame because seq_along() would iterate over columns:

small_data <- tibble(a = 1:2, 
                     b = 2:3, 
                     c = 4:5)
small_data
# A tibble: 2 × 3
      a     b     c
  <int> <int> <int>
1     1     2     4
2     2     3     5
for (col in seq_along(small_data)) {
    print(col)
}
[1] 1
[1] 2
[1] 3
for (r in seq_len(nrow(small_data))) {
    print(r)
}
[1] 1
[1] 2
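The row indices become useful once they are used to pull out each row. The sketch below assumes a base data.frame called small_df (a stand-in with the same values as small_data, so it runs without the tidyverse) and stores one sum per row.

```r
# Base data frame standing in for small_data (same values)
small_df <- data.frame(a = 1:2, b = 2:3, c = 4:5)

# One storage slot per row
row_sums <- vector("numeric", length = nrow(small_df))

for (r in seq_len(nrow(small_df))) {
  # small_df[r, ] is a one-row data frame; unlist() flattens it to a vector
  row_sums[r] <- sum(unlist(small_df[r, ]))
}

row_sums
```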

Storing Objects

Often we’ll want to store output created during a for loop. We can create storage containers with the vector() function:

char_storage <- vector("character", 
                       length = 3)
char_storage
[1] "" "" ""
num_storage <- vector("numeric", 
                      length = 3)
num_storage
[1] 0 0 0
list_storage <- vector("list", 
                       length = 3)
list_storage
[[1]]
NULL

[[2]]
NULL

[[3]]
NULL

Our loop can then store objects created at each iteration stage:

for (i in seq_len(3)) {
    char_storage[i] <- str_c("Number: ", i)
    num_storage[i] <- 2*i
    list_storage[[i]] <- i # Note the [[ for subsetting here
}

char_storage
[1] "Number: 1" "Number: 2" "Number: 3"
num_storage
[1] 2 4 6
list_storage
[[1]]
[1] 1

[[2]]
[1] 2

[[3]]
[1] 3

Initial Object Size

Notice that we initialized char_storage, num_storage, and list_storage to all be the same length as our iteration vector (seq_len(3)). Technically, this wasn’t necessary. We could have been much lazier when we initialized these objects: list_storage <- NA. Why didn’t we do this?

When you don’t initialize an object to a specific size (e.g., length = 3), R has to grow that object at every iteration of the for() loop. Growing a vector typically means allocating a new, longer vector and copying the existing elements into it, which really slows down your for() loop!

Here is a comparison of the run time between two functions that both store the index of the iteration. The first function (do_stuff_allocate) sets the initial object size (vector(length = reps)) before the loop is run. The second function (do_stuff_tackon) does not set an initial object size and instead initializes results as a vector of length 1 (results <- NA).

do_stuff_allocate <- function(reps) {
  results <- vector(length = reps)
  for (i in seq_len(reps)) {
    results[i] <- i
  }
  return(results)
}

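do_stuff_tackon <- function(reps) {
  # Starting from NA leaves a leading NA in the result (length reps + 1)
  results <- NA
  for (i in seq_len(reps)) {
    # c() creates a new, longer vector on every iteration
    results <- c(results, i)
  }
  return(results)
}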

Let’s see how these two functions compare in their run time:

Statistic   do_stuff_allocate   do_stuff_tackon
min                 12,751 µs        403,317 µs
mean                23,755 µs        828,469 µs
median              14,596 µs        559,630 µs
max                867,519 µs     11,678,973 µs

Lesson: You should always initialize your object at the size you want it to be.
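A rough way to try this comparison yourself is sketched below using base R's system.time(). The function names fill_preallocated and fill_grown and the value of reps are made up for this illustration; they are not the functions or settings used to produce the timings above, and the numbers you see will depend on your machine.

```r
# Preallocate: the full-length vector is created once, before the loop
fill_preallocated <- function(reps) {
  results <- vector("integer", length = reps)
  for (i in seq_len(reps)) {
    results[i] <- i
  }
  results
}

# Grow: c() builds a new, longer vector on every iteration
fill_grown <- function(reps) {
  results <- integer(0)
  for (i in seq_len(reps)) {
    results <- c(results, i)
  }
  results
}

reps <- 20000  # arbitrary size chosen for illustration

# Both approaches give the same answer...
identical(fill_preallocated(reps), fill_grown(reps))

# ...but the elapsed times differ
system.time(fill_preallocated(reps))
system.time(fill_grown(reps))
```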

Check In

  1. Write a for()-loop that prints the even numbers from 1:20.

  2. Can you produce the same output with the seq() function?

  3. Write a for()-loop that iterates over the month.name vector (built into base R) and stores a character vector of output containing strings like “Month 1: January”, “Month 2: February”.

  4. Can you produce the same output with str_c() only?

  5. Write a for()-loop that stores the class() (type) of every column in the mpg data frame.

Iteration with Functionals

A functional is a function that takes a function as an input and returns a vector as output. - Hadley Wickham

Required Reading

purrr is a tidyverse package that provides several useful functions for iteration. The main advantages of purrr include:

  • Improved readability of R code
  • Reduction in the “overhead” in writing a for()-loop (creating storage containers and writing the for (i in ...))
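To make the "overhead" point concrete, here is a minimal side-by-side sketch. The list scores is a made-up input; map_dbl() is introduced just below.

```r
library(purrr)

# Hypothetical input: a list of numeric vectors
scores <- list(a = c(1, 2, 3), b = c(10, 20), c = 5)

# for() loop: create storage, loop over indices, fill in each slot
means_loop <- vector("numeric", length = length(scores))
for (i in seq_along(scores)) {
  means_loop[i] <- mean(scores[[i]])
}

# Functional: one call, and the output type is stated in the function name
means_map <- map_dbl(scores, mean)

means_loop
means_map   # same values, and the names of scores are kept
```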

In purrr, we can use the family of map() functions to apply a function to each element of a list or vector. We can think of this as mapping an input (a list or vector) to a new output via a function. The purrr cheatsheet has graphical representations of how these functions work.

  • map() returns a list
  • map_chr() returns a character vector
  • map_lgl() returns a logical vector
  • map_int() returns an integer vector
  • map_dbl() returns a numeric (double) vector
  • map_vec() returns a vector of another type (like dates or factors)
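The sketch below runs each variant on the same made-up input so the return types can be compared directly. It assumes purrr 1.0.0 or later (for map_vec()) and R 4.1 or later (for the \(v) anonymous function shorthand).

```r
library(purrr)

x <- c(1, 4, 9)  # made-up input

as_list    <- map(x, sqrt)                       # list of three doubles
as_double  <- map_dbl(x, sqrt)                   # double vector
as_chr     <- map_chr(x, \(v) paste0("v", v))    # character vector
as_logical <- map_lgl(x, \(v) v > 3)             # logical vector
as_integer <- map_int(x, \(v) length(v))         # integer vector
as_date    <- map_vec(x, \(v) as.Date("2024-01-01") + v)  # Date vector

# Inspect the class of each result
map_chr(
  list(as_list, as_double, as_chr, as_logical, as_integer, as_date),
  \(out) class(out)[1]
)
```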

A Single Output

To get the class() of each data frame column, map_chr() is the function we want because the class of a variable is a string (e.g., "logical").

map_chr(mpg, .f = class)
manufacturer        model        displ         year          cyl        trans 
 "character"  "character"    "numeric"    "integer"    "integer"  "character" 
         drv          cty          hwy           fl        class 
 "character"    "integer"    "integer"  "character"  "character" 
Note: List Input

The first input of map() functions must be a list or an atomic vector. A dataframe is a special type of list, where the columns are the different elements of the list (e.g., mpg[["manufacturer"]]). map_chr() iterates over the columns (elements) of the mpg dataframe (list).

Let’s get the class of each variable in diamonds:

map_chr(diamonds, .f = class)
Error in `map_chr()`:
ℹ In index: 2.
ℹ With name: cut.
Caused by error:
! Result must be length 1, not 2.

Multiple Outputs

What happened!? map_chr() was expecting to create a character vector with one element per column in diamonds. But something happened in column 2 with the cut variable. Let’s figure out what happened:

class(diamonds$cut)
[1] "ordered" "factor" 

Ah! cut has two classes. In this case, map() (which returns a list) is the best option because some variables have multiple classes:

map(diamonds, .f = class)
$carat
[1] "numeric"

$cut
[1] "ordered" "factor" 

$color
[1] "ordered" "factor" 

$clarity
[1] "ordered" "factor" 

$depth
[1] "numeric"

$table
[1] "numeric"

$price
[1] "integer"

$x
[1] "numeric"

$y
[1] "numeric"

$z
[1] "numeric"

The error we encountered with map_chr() is a nice feature of purrr because it requires us to be very sure of the type of output we are getting. Failing loudly is vastly preferable to getting unexpected outputs silently!
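If one string per column is still what we want, one option is to collapse multiple classes into a single string before returning them. This is only a sketch: toy is a made-up data frame (used instead of diamonds so the example runs with purrr alone), and "/" is an arbitrary separator.

```r
library(purrr)

# Small made-up data frame with one ordered factor column
toy <- data.frame(size = c(1.2, 3.4, 5.6))
toy$grade <- factor(c("low", "mid", "high"),
                    levels = c("low", "mid", "high"),
                    ordered = TRUE)

# Collapse each column's classes so every result has length 1
class_labels <- map_chr(toy, \(col) paste(class(col), collapse = "/"))
class_labels
```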

Combining with Tidy Selection

We can combine map_*() functions with tidy selection for some powerful variable summaries that require much less code than for() loops.

diamonds %>% 
  select(where(is.numeric)) %>% 
  map_dbl(.f = mean)
       carat        depth        table        price            x            y 
   0.7979397   61.7494049   57.4571839 3932.7997219    5.7311572    5.7345260 
           z 
   3.5387338 
diamonds %>% 
  select(!where(is.numeric)) %>% 
  map_int(.f = n_distinct)
    cut   color clarity 
      5       7       8 
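When the function needs extra arguments, one common approach is to wrap it in an anonymous function. The sketch below uses a made-up tibble called measurements with a missing value, and passes na.rm = TRUE to mean().

```r
library(dplyr)
library(purrr)

# Made-up data with a missing value
measurements <- tibble(
  id     = c("a", "b", "c"),
  height = c(1.5, NA, 2.5),
  weight = c(10, 20, 30)
)

# Keep the numeric columns, then average each one while ignoring NAs
column_means <- measurements |>
  select(where(is.numeric)) |>
  map_dbl(\(col) mean(col, na.rm = TRUE))

column_means
```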
Check In
  1. Using a map function, reproduce this table, which indicates whether a given column is numeric.
  carat     cut   color clarity   depth   table   price       x       y       z 
   TRUE   FALSE   FALSE   FALSE    TRUE    TRUE    TRUE    TRUE    TRUE    TRUE 
  2. Using a map function, reproduce this table, which indicates how many levels are included in each categorical variable.
    cut   color clarity 
      5       7       8 

Multiple Inputs

purrr also offers the map2() and pmap() family of functions that take multiple inputs and loop over them simultaneously. The purrr cheatsheet provides nice graphical representations of how these functions work.

For all the examples below, I’m going to work with this dataset of strings:

string_data <- tibble(
    string = c("apple", "banana", "cherry"),
    pattern = c("p", "n", "h"),
    replacement = c("P", "N", "H")
)

string_data
# A tibble: 3 × 3
  string pattern replacement
  <chr>  <chr>   <chr>      
1 apple  p       P          
2 banana n       N          
3 cherry h       H          

Two Inputs

The str_detect() function takes two arguments: a string and a pattern to detect. The function returns logical values (TRUE, FALSE) indicating whether the pattern was detected in the string. Let’s use this to see how the map2_lgl() function works!

map2_lgl(
  .x = string_data$string,
  .y = string_data$pattern,
  .f = str_detect
)
[1] TRUE TRUE TRUE
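One way to picture what map2_lgl() is doing is to write the equivalent for() loop, which indexes both vectors with the same counter. In this sketch the vectors strings and patterns are made up, and the last pattern ("z") is chosen so that one result is FALSE.

```r
library(purrr)
library(stringr)

strings  <- c("apple", "banana", "cherry")
patterns <- c("p", "n", "z")   # "z" is a made-up pattern that does not match

# map2_lgl() walks both vectors in parallel...
found_map2 <- map2_lgl(strings, patterns, str_detect)

# ...which matches indexing both vectors with one loop counter
found_loop <- vector("logical", length = length(strings))
for (i in seq_along(strings)) {
  found_loop[i] <- str_detect(strings[i], patterns[i])
}

found_map2
found_loop
```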

This should look a bit different: we are using string_data$string to input the variables we want into map2_lgl() rather than piping (|>) the variables into the function.

If we had tried to use the pipe operator, we would have gotten the following error message:

string_data |> 
  map2_lgl(
    .x = string,
    .y = pattern,
    .f = str_detect
    )
Error: object 'string' not found

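This error occurs because map2_lgl() is not a data-masking function: string and pattern are looked up as ordinary objects in your environment, not as columns of string_data, and no such objects exist. The pipe passes the entire string_data dataframe into map2_lgl() as a single unnamed argument; it does not make the columns available by name. Looking at the documentation for map2(), the .x and .y arguments should be specified as a pair of vectors, not dataframes.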

If we wanted to use the pipe operator, we would need to pair map2_lgl() with a data-masking function (e.g., mutate()) that allows us to reference variable names as inputs to functions:

string_data |> 
  mutate(found = map2_lgl(
    .x = string,
    .y = pattern,
    .f = str_detect
    )
  )
# A tibble: 3 × 4
  string pattern replacement found
  <chr>  <chr>   <chr>       <lgl>
1 apple  p       P           TRUE 
2 banana n       N           TRUE 
3 cherry h       H           TRUE 

Three or More Inputs

Now that we’ve conquered two inputs, let’s try three! The str_replace_all() function takes three inputs: a string, a pattern to look for, and a replacement pattern (to use when the pattern is found).

string_data
# A tibble: 3 × 3
  string pattern replacement
  <chr>  <chr>   <chr>      
1 apple  p       P          
2 banana n       N          
3 cherry h       H          
pmap_chr(string_data, .f = str_replace_all)
[1] "aPPle"  "baNaNa" "cHerry"

Note how the column names in string_data exactly match the argument names in str_replace_all(). The iteration that is happening is across rows, and the multiple arguments in str_replace_all() are being matched by name. So, the first row is effectively running str_replace_all(string = "apple", pattern = "p", replacement = "P"), and similarly for the second and third row.

What if the column names didn’t match? Well, we would need to take a similar approach to what we did with map2():

string_data <- string_data |> 
  rename(word = string, 
         look_for = pattern, 
         replace_with = replacement)

pmap_chr(
  .l = list(string = string_data$word,
       pattern = string_data$look_for,
       replacement = string_data$replace_with),
  .f = str_replace_all
)
[1] "aPPle"  "baNaNa" "cHerry"

The main difference here is there is one argument (.l) where we specify the inputs to the function (instead of .x and .y). This argument is a list, where the elements of the list should take on the same names as the function arguments.
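Names are not the only way to line up the inputs. As I understand the pmap() documentation, an unnamed list is matched to the function's arguments by position instead. A minimal sketch, using standalone vectors (words, look_for, replace_with) defined here for illustration:

```r
library(purrr)
library(stringr)

words        <- c("apple", "banana", "cherry")
look_for     <- c("p", "n", "h")
replace_with <- c("P", "N", "H")

# Unnamed list: elements are passed to str_replace_all() by position,
# so the order must be string, pattern, replacement
replaced <- pmap_chr(list(words, look_for, replace_with), str_replace_all)
replaced
```

Matching by name is usually the safer choice, because it does not break if the order of the list elements changes.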

Similar to before, if I wanted to use pmap() to modify the word column of the dataset, I would need to pair it with mutate():

string_data |> 
  mutate(word = pmap_chr(
  .l = list(string = word,
       pattern = look_for,
       replacement = replace_with),
  .f = str_replace_all
  )
)
# A tibble: 3 × 3
  word   look_for replace_with
  <chr>  <chr>    <chr>       
1 aPPle  p        P           
2 baNaNa n        N           
3 cHerry h        H           
Check In
  1. The function str_c() concatenates strings. Create a small example that uses map2_chr() to combine two character vectors element by element. Each element of the first vector should be combined with the corresponding element of the second vector, separated by a space.
string_data <- tibble(
  first  = c("apple", "banana", "cherry"),
  second = c("pie", "bread", "jam")
)
  2. The function str_sub() takes three arguments: string, start, and end. Create a small tibble containing the three inputs required for str_sub():
  • a string,
  • a starting position, and
  • an ending position.
  Note that the names of these columns must be the same as the names of the arguments to str_sub()!
  3. Now use pmap_chr() to apply str_sub() row-by-row to the tibble created above.