Iterating Over Functions

In this unit, you’ll learn tools for iteration—repeatedly performing the same action on different objects. Iteration in R generally tends to look a bit different from other programming languages. Much of iteration we get for free! For example, if you want to double a numeric vector x in R, you can just write 2 * x, whereas in many other languages you would need to explicitly double each element of x using some sort of for loop.

In R, there are generally two methods for iteration—for() loops and functionals. We will start with a review of for() loops before hopping over to functionals.

for loops

In R, for loops have the following general structure:

for (i in some_vector) {
    # Code to do stuff with i
}

some_vector can be any vector, including:

  • An indexing vector: 1:3
  • A character vector: c("group1", "group2", "group3")
  • A vector of any other class
groups <- c("group1", "group2", "group3")

for (i in 1:3) {
    print(groups[i])
}
[1] "group1"
[1] "group2"
[1] "group3"
for (g in groups) {
    print(g)
}
[1] "group1"
[1] "group2"
[1] "group3"

The seq_along() function generates an integer sequence from 1 to the length of the vector supplied. A nice feature of seq_along() is that it generates an empty iteration vector if the vector you’re iterating over itself has length 0.

seq_along(groups)
[1] 1 2 3
no_groups <- c()
seq_along(no_groups)
integer(0)
for (i in seq_along(groups)) {
    print(groups[i])
}
[1] "group1"
[1] "group2"
[1] "group3"
for (i in seq_along(no_groups)) {
    print(no_groups[i])
}

Closely related to seq_along() is seq_len(). While seq_along(x) generates an integer sequence from 1 to length(x), seq_len(x) takes x itself to be a length:

seq_len(3)
[1] 1 2 3
seq_len(0)
integer(0)
for (i in seq_len(length(groups))) {
    print(groups[i])
}
[1] "group1"
[1] "group2"
[1] "group3"
for (i in seq_len(length(no_groups))) {
    print(no_groups[i])
}

seq_len() is useful for iterating over the rows of a data frame because seq_along() would iterate over columns:

small_data <- tibble(a = 1:2, 
                     b = 2:3, 
                     c = 4:5)

for (col in small_data) {
    print(col)
}
[1] 1 2
[1] 2 3
[1] 4 5
for (r in seq_len(nrow(small_data))) {
    print(r)
}
[1] 1
[1] 2

Often we’ll want to store output created during a for loop. We can create storage containers with the vector() function:

char_storage <- vector("character", 
                       length = 3)
char_storage
[1] "" "" ""
num_storage <- vector("numeric", 
                      length = 3)
num_storage
[1] 0 0 0
list_storage <- vector("list", 
                       length = 3)
list_storage
[[1]]
NULL

[[2]]
NULL

[[3]]
NULL
for (i in seq_len(3)) {
    char_storage[i] <- str_c("Number: ", i)
    num_storage[i] <- 2*i
    list_storage[[i]] <- i # Note the [[ for subsetting here
}

char_storage
[1] "Number: 1" "Number: 2" "Number: 3"
num_storage
[1] 2 4 6
list_storage
[[1]]
[1] 1

[[2]]
[1] 2

[[3]]
[1] 3

Exercises

Write for()-loops that do each of the following:

  1. Prints the even numbers from 1:20.
  • Produce the same output with the seq() function!
Solutions
for (i in seq_len(10)) {
    print(2*i)
}
[1] 2
[1] 4
[1] 6
[1] 8
[1] 10
[1] 12
[1] 14
[1] 16
[1] 18
[1] 20
seq(from = 2, to = 20, by = 2)
 [1]  2  4  6  8 10 12 14 16 18 20
  1. Iterates over the month.name vector (built-in to base R) and stores a character vector of output containing strings like “Month 1: January”, “Month 2: February”.
  • Produce the same output with str_c() only!
Solutions
month_strings <- vector("character", length = length(month.name))

for (i in seq_along(month.name)) {
    month_strings[i] <- str_c("Month ", i, ": ", month.name[i])
}
month_strings
 [1] "Month 1: January"   "Month 2: February"  "Month 3: March"    
 [4] "Month 4: April"     "Month 5: May"       "Month 6: June"     
 [7] "Month 7: July"      "Month 8: August"    "Month 9: September"
[10] "Month 10: October"  "Month 11: November" "Month 12: December"
str_c("Month ", 1:12, ": ", month.name)
 [1] "Month 1: January"   "Month 2: February"  "Month 3: March"    
 [4] "Month 4: April"     "Month 5: May"       "Month 6: June"     
 [7] "Month 7: July"      "Month 8: August"    "Month 9: September"
[10] "Month 10: October"  "Month 11: November" "Month 12: December"
  1. Store the class() (type) of every column in the mtcars data frame.
Solution
col_classes <- vector("character", 
                      ncol(mtcars)
                      )

# Data frames are **lists** of columns, so this loop iterates over the columns
for (i in seq_along(mtcars)) {
    col_classes[i] <- class(mtcars[[i]])
}

Iteration with Functionals

A functional is a function that takes a function as an input and returns a vector as output. - Hadley Wickham

đź“– Required Reading: Functionals

purrr is a tidyverse package that provides several useful functions for iteration. The main advantages of purrr include:

  • Improved readability of R code
  • Reduction in the “overhead” in writing a for loop (creating storage containers and writing the for (i in ...))

In purrr, we can use the family of map() functions to apply a function to each element of a list or vector. We can think of this as mapping an input (a list or vector) to a new output via a function. Let’s look at the purrr cheatsheet to look at graphical representations of how these functions work.

  • map() returns a list
  • map_chr() returns a character vector
  • map_lgl() returns a logical vector
  • map_int() returns an integer vector
  • map_dbl() returns a numeric vector
  • map_vec() returns a vector of a different (non-atomic) type (like dates)

To get the class() of each data frame column, map_chr() is sensible because classes are strings:

map_chr(mtcars, .f = class)
      mpg       cyl      disp        hp      drat        wt      qsec        vs 
"numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
       am      gear      carb 
"numeric" "numeric" "numeric" 

Let’s get the class of each variable in diamonds:

map_chr(diamonds, class)
Error in `map_chr()`:
ℹ In index: 2.
ℹ With name: cut.
Caused by error:
! Result must be length 1, not 2.

What happened!? map_chr() was expecting to create a character vector with one element per element (column) in diamonds. But something happened in column 2 with the cut variable. Let’s figure out what happened:

class(diamonds$cut)
[1] "ordered" "factor" 

Ah! cut has two classes. In this case, map() (which returns a list) is the best option because some variables have multiple classes:

map(diamonds, class)
$carat
[1] "numeric"

$cut
[1] "ordered" "factor" 

$color
[1] "ordered" "factor" 

$clarity
[1] "ordered" "factor" 

$depth
[1] "numeric"

$table
[1] "numeric"

$price
[1] "integer"

$x
[1] "numeric"

$y
[1] "numeric"

$z
[1] "numeric"

The error we encountered with map_chr() is a nice feature of purrr because it allows us to be very sure of the type of output we are getting. Failing loudly is vastly preferable to getting unexpected outputs silently because we can catch errors earlier!

We can combine map_*() functions with tidy selection for some powerful variable summaries that require much less code than for() loops.

diamonds %>% 
  select(where(is.numeric)) %>% 
  map_dbl(.f = mean)
       carat        depth        table        price            x            y 
   0.7979397   61.7494049   57.4571839 3932.7997219    5.7311572    5.7345260 
           z 
   3.5387338 
diamonds %>% 
  select(!where(is.numeric)) %>% 
  map_int(.f = n_distinct)
    cut   color clarity 
      5       7       8 

Exercises

  1. Write a function called summ_stats() that takes a numeric vector x as input and returns the mean, median, standard deviation, and IQR as a data frame. You can use tibble() to create the data frame.
    • Example: tibble(a = 1:2, b = 2:3) creates a data frame with variables a and b.
Solution
summ_stats <- function(x) {
    tibble(
        mean = mean(x, na.rm = TRUE),
        median = median(x, na.rm = TRUE),
        sd = sd(x, na.rm = TRUE),
        iqr = IQR(x, na.rm = TRUE)
    ) 
}
  1. Use map() to apply your summ_stats() function to the numeric columns in the diamonds dataset.
Tip

Look up the bind_rows() documentation from dplyr to combine summary statistics for all quantitative variables into one data frame. The .id argument will be especially helpful in adding the variable names!

Consider pivoting !

The output would be easier for the reader to understand if the summary statistics were rows and the variable names were columns.

Solution
diamonds_num <- diamonds %>% 
  select(where(is.numeric))

map(diamonds_num, .f = summ_stats) %>% 
  bind_rows(.id = "variable") %>% 
  pivot_longer(cols = -variable, 
               names_to = "statistic", 
               values_to = "value") %>% 
  pivot_wider(names_from = variable, values_from = value)
# A tibble: 4 Ă— 8
  statistic carat depth table price     x     y     z
  <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 mean      0.798 61.7  57.5  3933.  5.73  5.73 3.54 
2 median    0.7   61.8  57    2401   5.7   5.71 3.53 
3 sd        0.474  1.43  2.23 3989.  1.12  1.14 0.706
4 iqr       0.64   1.5   3    4374.  1.83  1.82 1.13 
  1. Write a for() loop to achieve the same result. Which do you prefer in terms of ease of code writing and readability?
Solution
diamonds_summ_stats2 <- vector("list", length = ncol(diamonds_num))

for (i in seq_along(diamonds_num)) {
    diamonds_summ_stats2[[i]] <- summ_stats(diamonds_num[[i]])
}

diamonds_summ_stats2 %>%
    set_names(colnames(diamonds_num)) %>%
    bind_rows(.id = "variable")
# A tibble: 7 Ă— 5
  variable     mean  median       sd     iqr
  <chr>       <dbl>   <dbl>    <dbl>   <dbl>
1 carat       0.798    0.7     0.474    0.64
2 depth      61.7     61.8     1.43     1.5 
3 table      57.5     57       2.23     3   
4 price    3933.    2401    3989.    4374.  
5 x           5.73     5.7     1.12     1.83
6 y           5.73     5.71    1.14     1.82
7 z           3.54     3.53    0.706    1.13

Multiple Inputs

purrr also offers the pmap() family of functions that take multiple inputs and loops over them simultaneously. Let’s look at the purrr cheatsheet to look at graphical representations of how these functions work.

string_data <- tibble(
    string = c("apple", "banana", "cherry"),
    pattern = c("p", "n", "h"),
    replacement = c("P", "N", "H")
)
string_data
# A tibble: 3 Ă— 3
  string pattern replacement
  <chr>  <chr>   <chr>      
1 apple  p       P          
2 banana n       N          
3 cherry h       H          
pmap_chr(string_data, .f = str_replace_all)
[1] "aPPle"  "baNaNa" "cHerry"

Note how the column names in string_data exactly match the argument names in str_replace_all(). The iteration that is happening is across rows, and the multiple arguments in str_replace_all() are being matched by name.

We can also use pmap() to specify variations in some arguments but leave some arguments constant across the iterations:

string_data <- tibble(
    pattern = c("p", "n", "h"),
    replacement = c("P", "N", "H")
)

pmap_chr(string_data, str_replace_all, string = "ppp nnn hhh")
[1] "PPP nnn hhh" "ppp NNN hhh" "ppp nnn HHH"

Exercises

  1. Create 2 small examples that show how pmap() works with str_sub(). Your examples should:
  • Use different arguments for string, start, and end
  • Use different arguments for start and end but a fixed string
Solution
string_data <- tibble(
    string = c("apple", "banana", "cherry"),
    start = c(1, 2, 4),
    end = c(2, 3, 5)
)

pmap_chr(string_data, str_sub)
[1] "ap" "an" "rr"
string_data <- tibble(
    start = c(1, 2, 4),
    end = c(2, 3, 5)
)
pmap_chr(string_data, str_sub, string = "abcde")
[1] "ab" "bc" "de"