Loops and Iteration

After this lesson, you should be able to:


▶️ Watch Videos: 10 minutes

📖 Readings: 45-60 minutes

✅ Check-ins: 2


1 Iteration across data frame columns with across()

Often we will have to perform the same data wrangling on many variables (e.g., rounding numbers):

diamonds %>%
    mutate(
        carat = round(carat, 1),
        x = round(x, 1),
        y = round(y, 1),
        z = round(z, 1)
    )
# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1   0.2 Ideal     E     SI2      61.5    55   326   4     4     2.4
 2   0.2 Premium   E     SI1      59.8    61   326   3.9   3.8   2.3
 3   0.2 Good      E     VS1      56.9    65   327   4     4.1   2.3
 4   0.3 Premium   I     VS2      62.4    58   334   4.2   4.2   2.6
 5   0.3 Good      J     SI2      63.3    58   335   4.3   4.3   2.8
 6   0.2 Very Good J     VVS2     62.8    57   336   3.9   4     2.5
 7   0.2 Very Good I     VVS1     62.3    57   336   4     4     2.5
 8   0.3 Very Good H     SI1      61.9    55   337   4.1   4.1   2.5
 9   0.2 Fair      E     VS2      65.1    61   337   3.9   3.8   2.5
10   0.2 Very Good H     VS1      59.4    61   338   4     4     2.4
# ℹ 53,930 more rows

dplyr provides the across() function for performing these repeated function calls:

# Option 1: Create our own named function
round_to_one <- function(x) {
    round(x, digits = 1)
}
diamonds %>% 
    mutate(across(.cols = c(carat, x, y, z), 
                  .fns = round_to_one
                  )
           )
# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1   0.2 Ideal     E     SI2      61.5    55   326   4     4     2.4
 2   0.2 Premium   E     SI1      59.8    61   326   3.9   3.8   2.3
 3   0.2 Good      E     VS1      56.9    65   327   4     4.1   2.3
 4   0.3 Premium   I     VS2      62.4    58   334   4.2   4.2   2.6
 5   0.3 Good      J     SI2      63.3    58   335   4.3   4.3   2.8
 6   0.2 Very Good J     VVS2     62.8    57   336   3.9   4     2.5
 7   0.2 Very Good I     VVS1     62.3    57   336   4     4     2.5
 8   0.3 Very Good H     SI1      61.9    55   337   4.1   4.1   2.5
 9   0.2 Fair      E     VS2      65.1    61   337   3.9   3.8   2.5
10   0.2 Very Good H     VS1      59.4    61   338   4     4     2.4
# ℹ 53,930 more rows
# Option 2: Use an "anonymous" or "lambda" function that isn't named
diamonds %>% 
    mutate(across(.cols = c(carat, x, y, z), 
                  .fns = function(x) {round(x, digits = 1)} 
                  )
           )
# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1   0.2 Ideal     E     SI2      61.5    55   326   4     4     2.4
 2   0.2 Premium   E     SI1      59.8    61   326   3.9   3.8   2.3
 3   0.2 Good      E     VS1      56.9    65   327   4     4.1   2.3
 4   0.3 Premium   I     VS2      62.4    58   334   4.2   4.2   2.6
 5   0.3 Good      J     SI2      63.3    58   335   4.3   4.3   2.8
 6   0.2 Very Good J     VVS2     62.8    57   336   3.9   4     2.5
 7   0.2 Very Good I     VVS1     62.3    57   336   4     4     2.5
 8   0.3 Very Good H     SI1      61.9    55   337   4.1   4.1   2.5
 9   0.2 Fair      E     VS2      65.1    61   337   3.9   3.8   2.5
10   0.2 Very Good H     VS1      59.4    61   338   4     4     2.4
# ℹ 53,930 more rows
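Two shorthand forms for anonymous functions also show up often inside across(). The sketch below is only an illustration: it uses a small made-up tibble (toy, which is not part of this lesson's data) and assumes R 4.1 or later for the \(v) syntax. Under those assumptions, all three calls should produce the same result.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- tibble(id = c("a", "b"),
              x = c(1.234, 5.678),
              y = c(9.876, 5.432))

# Full anonymous function
res_full <- toy %>%
  mutate(across(.cols = c(x, y), .fns = function(v) round(v, digits = 1)))

# Backslash shorthand (assumes R >= 4.1)
res_backslash <- toy %>%
  mutate(across(.cols = c(x, y), .fns = \(v) round(v, digits = 1)))

# Formula shorthand, where .x stands for the current column
res_formula <- toy %>%
  mutate(across(.cols = c(x, y), .fns = ~ round(.x, digits = 1)))

identical(res_full, res_backslash)
identical(res_full, res_formula)
```

Which form to use is mostly a matter of taste; the formula form is the one that appears in the check-in code later in this reading.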

When we look at the documentation for across(), we see that the .cols argument specifies which variables we want to transform, and it has a <tidy-select> tag. This means that the syntax we use for .cols follows the rules we learned about last week!
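As a quick reminder of what <tidy-select> allows, here is a minimal sketch with a made-up tibble (toy and its column names are hypothetical). It shows three ways of choosing columns for .cols: by name pattern, by a range of adjacent columns, and by column type.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- tibble(
  label  = c("a", "b"),
  len_mm = c(4.04, 3.96),
  wid_mm = c(2.43, 2.31),
  count  = c(3L, 5L)
)

# Select columns by a name pattern
by_suffix <- toy %>%
  mutate(across(.cols = ends_with("_mm"), .fns = ~ .x / 10))

# Select a range of adjacent columns
by_range <- toy %>%
  mutate(across(.cols = len_mm:wid_mm, .fns = ~ .x / 10))

# Select columns with a predicate on their type
type_max <- toy %>%
  summarize(across(.cols = where(is.double), .fns = max))

identical(by_suffix, by_range)
type_max
```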

Learn More

If you are interested in seeing more examples of the across() function, navigate back to the across() documentation and read through the Examples section at the bottom. Click the “Run examples” link to view the output for all the examples.

✅: Check-in 8.1: Connecting across() with pivot_wider() and pivot_longer()

  1. Fill in the code below to convert all numeric columns in the diamonds dataset into character columns.
diamonds |> 
  mutate(across(.cols = ____, 
                .fns = ____)
         )
  2. Fill in the code below to transform the x, y, and z columns so that the units of millimeters are displayed (e.g., “4.0 mm”).
diamonds %>%
    mutate(
      across(.cols = ____, 
             .fns = ~ str_c(____, "mm", sep = " ")
             )
      )
  3. Fill in the code below to accomplish #2 using pivot_longer() followed by pivot_wider().
diamonds %>%
  # Add a unique identifier for each row
  # Needed because there is an x, y, z for each combination of carat, cut, color, clarity
  mutate(row_id = row_number()) %>%  
  pivot_longer(cols = ____, 
               names_to = "dimension",
               values_to = "value") %>%
  mutate(
    ____ = str_c(____, "mm", sep = " ")
         ) %>%
  pivot_wider(____ = "dimension", 
              values_from = "value") %>%
  select(-row_id) 
  4. Group diamonds by cut, clarity, and color, then count the number of observations and compute the mean of each numeric column.

  5. What happens if you use a list of functions in across(), but don’t name them? How is the output named?

1.1 Performing Multiple Operations

What if we wanted to perform multiple transformations on each of many variables?

Within the different values of diamond cut, let’s summarize the mean, median, and standard deviation of the numeric variables. When we look at the .fns argument in the across() documentation, we see that we can provide a list of functions:

diamonds %>%
    group_by(cut) %>% 
    summarize(across(.cols = where(is.numeric), 
                     .fns = list(mean = mean, 
                                 med = median, 
                                 sd = sd)
                     )
              )
# A tibble: 5 × 22
  cut     carat_mean carat_med carat_sd depth_mean depth_med depth_sd table_mean
  <ord>        <dbl>     <dbl>    <dbl>      <dbl>     <dbl>    <dbl>      <dbl>
1 Fair         1.05       1       0.516       64.0      65      3.64        59.1
2 Good         0.849      0.82    0.454       62.4      63.4    2.17        58.7
3 Very G…      0.806      0.71    0.459       61.8      62.1    1.38        58.0
4 Premium      0.892      0.86    0.515       61.3      61.4    1.16        58.7
5 Ideal        0.703      0.54    0.433       61.7      61.8    0.719       56.0
# ℹ 14 more variables: table_med <dbl>, table_sd <dbl>, price_mean <dbl>,
#   price_med <dbl>, price_sd <dbl>, x_mean <dbl>, x_med <dbl>, x_sd <dbl>,
#   y_mean <dbl>, y_med <dbl>, y_sd <dbl>, z_mean <dbl>, z_med <dbl>,
#   z_sd <dbl>
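The output columns above follow the default pattern of column name, underscore, function name. If a different pattern is wanted, across() has a .names argument that takes a glue-style template. The following is a minimal sketch on a made-up tibble (toy is hypothetical):

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- tibble(
  group = c("a", "a", "b", "b"),
  x = c(1, 3, 5, 9),
  y = c(2, 4, 6, 8)
)

toy_summary <- toy %>%
  group_by(group) %>%
  summarize(across(.cols = where(is.numeric),
                   .fns = list(mean = mean, sd = sd),
                   # {.fn} is the name given in the list, {.col} the column name
                   .names = "{.fn}_of_{.col}"))

names(toy_summary)
```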

What does the list of functions look like? What is the structure of this list object?

list_of_fcts <- list(mean = mean, 
                     med = median, 
                     sd = sd)
list_of_fcts
$mean
function (x, ...) 
UseMethod("mean")
<bytecode: 0x10e01f778>
<environment: namespace:base>

$med
function (x, na.rm = FALSE, ...) 
UseMethod("median")
<bytecode: 0x11f290f78>
<environment: namespace:stats>

$sd
function (x, na.rm = FALSE) 
sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x), 
    na.rm = na.rm))
<bytecode: 0x11f78e3c0>
<environment: namespace:stats>
str(list_of_fcts)
List of 3
 $ mean:function (x, ...)  
 $ med :function (x, na.rm = FALSE, ...)  
 $ sd  :function (x, na.rm = FALSE)  
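Because each entry of this list is an ordinary function, it can be pulled out and called like any other function. The sketch below (the values vector is made up for illustration) calls two entries directly and then loops over the whole list, which is roughly the idea behind what across() does with a list of functions.

```r
list_of_fcts <- list(mean = mean,
                     med = median,
                     sd = sd)

# Made-up values, used only for illustration
values <- c(2, 4, 4, 10)

list_of_fcts$mean(values)      # same as mean(values)
list_of_fcts[["med"]](values)  # same as median(values)

# Apply every function in the list to the same vector
results <- numeric(length(list_of_fcts))
names(results) <- names(list_of_fcts)
for (i in seq_along(list_of_fcts)) {
  results[i] <- list_of_fcts[[i]](values)
}
results
```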

Let’s explore lists a bit more…

Review of Lists

A list is a 1-dimensional data structure that has no restrictions on what type of content is stored within it. A list is a “vector”, but it is not an atomic vector - that is, it does not necessarily contain things that are all the same type.

mylist <- list(
    logicals = c(TRUE, TRUE, FALSE, FALSE, TRUE), 
    numeric_vec = 1:12, 
    third_thing = letters[1:2]
    )

mylist
$logicals
[1]  TRUE  TRUE FALSE FALSE  TRUE

$numeric_vec
 [1]  1  2  3  4  5  6  7  8  9 10 11 12

$third_thing
[1] "a" "b"

List components may have names (or not), be homogeneous (or not), have the same length (or not).

Indexing

Indexing necessarily differs between R and Python, and since the list types are also somewhat different (e.g., lists cannot be named in Python), we will treat list indexing in the two languages separately.

  • [Image: a pepper shaker containing several individual paper packets of pepper.] An unusual pepper shaker which we’ll call pepper.

  • [Image: a pepper shaker containing a single individual paper packet of pepper.] When a list is indexed with single brackets, pepper[1], the return value is always a list containing the selected element(s).

  • [Image: a single individual paper packet of pepper, no longer contained within a pepper shaker.] When a list is indexed with double brackets, pepper[[1]], the return value is the selected element.

  • [Image: a pile of pepper, free from any containment structures.] To actually access the pepper, we have to use double indexing and index both the list object and the sub-object, as in pepper[[1]][[1]].

Figure 1: The types of indexing are made most memorable with a fantastic visual example from Grolemund and Wickham (2017), which I have repeated here.

There are 3 ways to index a list:

  • With single square brackets, just like we index atomic vectors. In this case, the return value is always a list.
mylist[1]
$logicals
[1]  TRUE  TRUE FALSE FALSE  TRUE
mylist[2]
$numeric_vec
 [1]  1  2  3  4  5  6  7  8  9 10 11 12
mylist[c(TRUE, FALSE, TRUE)]
$logicals
[1]  TRUE  TRUE FALSE FALSE  TRUE

$third_thing
[1] "a" "b"
  • With double square brackets. In this case, the return value is the thing stored at the specified position in the list, but you can only extract one entry of the main list at a time. You can also extract entries by name.
mylist[[1]]
[1]  TRUE  TRUE FALSE FALSE  TRUE
mylist[["third_thing"]]
[1] "a" "b"
  • Using x$name. This is equivalent to using x[["name"]]. Note that this does not work on unnamed entries in the list.
mylist$third_thing
[1] "a" "b"

To access an individual element inside one of the list’s components, we have to use double-indexing:

mylist[["third_thing"]][[1]]
[1] "a"
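One way to see why the distinction between single and double brackets matters is to check what kind of object each one returns, and what you can then do with it. A small sketch using the same mylist as above:

```r
mylist <- list(
  logicals = c(TRUE, TRUE, FALSE, FALSE, TRUE),
  numeric_vec = 1:12,
  third_thing = letters[1:2]
)

class(mylist[1])    # "list": the packet is still inside a shaker
class(mylist[[1]])  # "logical": the packet itself

# Functions that expect an atomic vector need the double-bracket version
sum(mylist[[1]])    # counts the TRUE values
# sum(mylist[1])    # would error, because sum() cannot add up a list
```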

2 Vectorized Functions

The functions we’ve used thus far (round_to_one(), mean(), median(), sd()) all share a specific quality: they are vectorized. That is, by default, these functions operate on whole vectors of values rather than on a single value. This is a feature that applies to atomic vectors (and we don’t even think about it):

x <- seq(from = -4, to = 12, by = 0.5)

abs(x)
 [1]  4.0  3.5  3.0  2.5  2.0  1.5  1.0  0.5  0.0  0.5  1.0  1.5  2.0  2.5  3.0
[16]  3.5  4.0  4.5  5.0  5.5  6.0  6.5  7.0  7.5  8.0  8.5  9.0  9.5 10.0 10.5
[31] 11.0 11.5 12.0

Notice how the abs() function found the absolute value of each element of x without having to loop over each element? In programming languages that don’t have implicit support for vectorized computations, this process might instead look like:

x <- seq(from = -4, to = 12, by = 0.5)

for(i in seq_along(x)){
  x[i] <- abs(x[i])
}

x
 [1]  4.0  3.5  3.0  2.5  2.0  1.5  1.0  0.5  0.0  0.5  1.0  1.5  2.0  2.5  3.0
[16]  3.5  4.0  4.5  5.0  5.5  6.0  6.5  7.0  7.5  8.0  8.5  9.0  9.5 10.0 10.5
[31] 11.0 11.5 12.0
for()-loop refresher

If you would like a refresher on how for-loops work, I would recommend watching this video: iteration with for()-loops (10 minutes) and / or reading the for()-loops section of R for Data Science.

For atomic vectors, applying a function to each element happens by default; with a list, however, we need to be a bit more explicit (because the elements passed into the function may not all be the same type).
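As a small sketch of what "being explicit" can look like, the loop below visits each component of the mylist object from the list review and records its length. Note that calling length() on the list itself only reports the number of components.

```r
mylist <- list(
  logicals = c(TRUE, TRUE, FALSE, FALSE, TRUE),
  numeric_vec = 1:12,
  third_thing = letters[1:2]
)

# length() applied to the whole list counts its components
length(mylist)

# To get the length of each component, visit them one at a time
n_items <- integer(length(mylist))
names(n_items) <- names(mylist)
for (i in seq_along(mylist)) {
  n_items[i] <- length(mylist[[i]])
}
n_items
```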

2.1 Is every function vectorized?

Short answer: no. There are occasions where you either can’t or choose not to write a vectorized function. For example, if your function uses an if() statement, it cannot operate on a whole vector, because the condition inside if() must be a single TRUE or FALSE. Take the pos_neg_zero() function below:

pos_neg_zero <- function(x){
  stopifnot(is.numeric(x))
  
  if(x > 0){
    return("Greater than 0!")
  } else if (x < 0){
    return("Less than 0!")
  } else {
    return("Equal to 0!")
  }
}

When I call the pos_neg_zero() function on a vector I receive an error:

x <- seq(from = -4, to = 4, by = 1)

pos_neg_zero(x)
Error in if (x > 0) {: the condition has length > 1

This error means that the if(x > 0) condition can only be checked for something of length 1. So, to use this function on the vector x, you would need to apply the function individually to each element:

result <- rep(NA, 
              length(x)
              )

for(i in seq_along(x)){
  result[i] <- pos_neg_zero(x[i])
}

result
[1] "Less than 0!"    "Less than 0!"    "Less than 0!"    "Less than 0!"   
[5] "Equal to 0!"     "Greater than 0!" "Greater than 0!" "Greater than 0!"
[9] "Greater than 0!"
Vector initialization

Note two things about the result vector that stores the output of calling pos_neg_zero() on each element of x. First, like C++ and Java, R requires an object to exist before you can assign into its elements, which is why I had to create result before the for()-loop rather than inside it. Second, when I initialized the result vector I made it the size I wanted, rather than iteratively making it larger and larger (which makes operations incredibly slow).
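To make the contrast concrete, here is a minimal sketch of the two approaches (the variable names squares_prealloc and squares_grown are made up for illustration). Both give the same answer; the difference is that the second version builds a new, longer vector on every iteration, which becomes costly for long inputs.

```r
x <- seq(from = -4, to = 4, by = 1)

# Pre-allocated: the vector has its final size before the loop starts
squares_prealloc <- numeric(length(x))
for (i in seq_along(x)) {
  squares_prealloc[i] <- x[i]^2
}

# Grown: c() creates a new, longer vector on every iteration
squares_grown <- c()
for (i in seq_along(x)) {
  squares_grown <- c(squares_grown, x[i]^2)
}

identical(squares_prealloc, squares_grown)
```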

Yes, I could have written a “better” function that used a vectorized function (e.g., case_when()) instead of a non-vectorized one (e.g., if()).

pos_neg_zero <- function(x){
  stopifnot(is.numeric(x))
  
  state <- case_when(x > 0 ~ "Greater than 0!", 
                     x < 0 ~ "Less than 0!", 
                     .default = "Equal to 0!")
  return(state)
}

When I call this function on the vector x, I no longer receive an error:

pos_neg_zero(x)
[1] "Less than 0!"    "Less than 0!"    "Less than 0!"    "Less than 0!"   
[5] "Equal to 0!"     "Greater than 0!" "Greater than 0!" "Greater than 0!"
[9] "Greater than 0!"

That’s because case_when() is vectorized!

2.2 When can’t you vectorize your function?

It is not always the case that we can write a “better” vectorized function. For example, let’s suppose we are interested in finding the datatype of each column in a data frame. The typeof() function can tell us the datatype of a specific column in the penguins data frame:

typeof(penguins$species)
[1] "integer"
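The "integer" result may look surprising for a column of species names. A factor is stored as integer codes plus a set of labels, and typeof() reports the storage type rather than the class. A small sketch with a made-up factor (not the penguins data) shows the difference:

```r
# Made-up factor, used only for illustration
species <- factor(c("Adelie", "Gentoo", "Adelie"))

typeof(species)      # "integer": how the values are stored
class(species)       # "factor": how R treats the object
as.integer(species)  # 1 2 1: the underlying codes
levels(species)      # "Adelie" "Gentoo": the labels for those codes
```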

But I want the datatype of every column in the penguins data frame! Applying the typeof() function to penguins itself returns the storage type of the data frame (a list), not the datatypes of its columns.

typeof(penguins)
[1] "list"

What can you do? Well, we could rely on our old CS 101 friend, the for()-loop:

# Pre-allocate one slot per column
data_type <- rep(NA, 
                 length.out = ncol(penguins)
                 )

# Visit each column in turn and record its storage type
for(i in seq_along(penguins)){
  data_type[i] <- typeof(penguins[[i]])
}

## Getting a nicely formatted table!
tibble(column = names(penguins), 
       type = data_type) %>% 
  pivot_wider(names_from = column, 
              values_from = type) %>% 
  knitr::kable()
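| species | island  | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex     | year    |
|:--------|:--------|:---------------|:--------------|:------------------|:------------|:--------|:--------|
| integer | integer | double         | double        | integer           | integer     | integer | integer |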

In R, for()-loops are not as important as they are in other languages because R is a functional programming language. In fact, we would prefer not to use for()-loops as they do not take advantage of R’s functional programming. Take, for example, our friend across(), which we talked about at the beginning of this reading:

penguins %>% 
  summarise(
    across(
      .cols = everything(), 
      .fns = ~sum(is.na(.x))
      )
    ) %>% 
  knitr::kable()
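| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|--------:|-------:|---------------:|--------------:|------------------:|------------:|----:|-----:|
|       0 |      0 |              2 |             2 |                 2 |           2 |  11 |    0 |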

The across() function looks like an “ordinary” function: it applies the specified function(s) to the specified columns. However, when you look at the source code for across(), you will find a for()-loop:

for (j in seq_fns) {
  fn <- fns[[j]]
  out[[k]] <- fn(col, ...)
  k <- k + 1L
  }

This shows you that it is possible to include for()-loops in a function, and call that function instead of using the for()-loop directly.
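As a sketch of that idea, the column-type loop from earlier can be wrapped in a function. The name col_types() is made up for this illustration, and the built-in iris data frame is used so the example runs without extra packages.

```r
# Hypothetical helper: hides the for()-loop behind a function call
col_types <- function(df) {
  stopifnot(is.data.frame(df))

  # Pre-allocate one slot per column and label the slots
  types <- character(ncol(df))
  names(types) <- names(df)

  for (i in seq_along(df)) {
    types[i] <- typeof(df[[i]])
  }

  types
}

# The caller never sees the loop
col_types(iris)
```

Once the loop lives inside a function, the same call works on any data frame.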

3 Functional Programming

Yes, it might take some time to get used to the idea of having a for()-loop built into a function, but it’s worth the investment. In the rest of this coursework, you’ll learn about and use the purrr¹ package, which houses functions that eliminate the need for many common for()-loops.

The apply family of functions in base R (apply(), lapply(), tapply(), etc.) solves a similar problem, but purrr has more consistent behavior, which makes it easier to learn. We will not be working with the base functions in this course.

Comparison of base R and purrr

The goal of using purrr functions instead of for() loops is to allow you to break common list manipulation challenges into independent pieces:

  • How can you solve the problem for a single element of your object (e.g., vector, data frame, list)?

  • Once you’ve solved that problem, purrr takes care of generalizing your solution to every element in the object.

  • If you’re working on a complex problem, how can you break the problem down into bite-sized pieces that each take one step closer to a solution? With purrr, you get lots of small pieces that you can compose together with the pipe.

I believe this structure makes it easier to solve complex problems, while also making your code easier to understand.
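Here is a minimal sketch of that two-step workflow. The scores list and the count_missing() helper are made up for illustration; map_int() is one of the purrr functions covered in the required reading below.

```r
library(purrr)

# Hypothetical toy list, used only for illustration
scores <- list(a = c(1, NA, 3),
               b = c(NA, NA),
               c = 4:6)

# Step 1: solve the problem for a single element
count_missing <- function(v) {
  sum(is.na(v))
}
count_missing(scores$a)

# Step 2: let purrr apply that solution to every element
n_missing <- map_int(scores, count_missing)
n_missing
```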

3.1 Reading, Videos & Tutorial

📖 Required Reading: R4DS – The map() Functions

Yes, you should be reading the first edition of R4DS, not the second edition.

📖 Optional Reading: Advanced R - Functionals

If you want to learn more about the concept of functionals!

▶️ Required Video: Iteration with the map() family (6 minutes)

purrr cheatsheet

A cheatsheet for purrr functions can be found here.

✅: Check-in 8.2: Working with the map() Functions

  1. Fill in the correct map functions to:
  • Compute the mean of every column in mtcars.
____(.x = mtcars, .f = mean)
  • Determine the type of each column in the nycflights dataset (from the openintro package).
____(.x = nycflights, .f = typeof)
  • Compute the number of unique values in each column of the penguins dataset (from the palmerpenguins package).
____(.x = penguins, .f = n_distinct)
  • Determine whether or not each column in the penguins dataset is a factor.
____(.x = penguins, .f = is.factor)
  2. Last week we discussed the challenge of standardizing many columns in a data frame. For example, if we wanted to standardize a numeric variable to be centered at the mean and scaled by the standard deviation, we could use the following function:
standardize <- function(vec) {
  stopifnot(is.numeric(vec))
  
  # Center with mean
  deviations <- vec - mean(vec, na.rm = TRUE)
  # Scale with standard deviation
  newdata <- deviations / sd(vec, na.rm = TRUE)
  
  return(newdata)
}

Why does the following return a vector of NAs?

penguins |>
  mutate(
    body_mass_g = map_dbl(body_mass_g, standardize)
  )
     (a) Because body_mass_g needs to be passed to standardize() as an argument
     (b) Because mutate() operates on rows, so map_dbl() is supplying standardize() with one row of body_mass_g at a time
     (c) Because map_dbl() only takes one input, so you need to use map2_dbl() instead
     (d) Because there is no function named standardize(), so it cannot be applied to the body_mass_g column
     (e) body_mass_g is not a data frame so it is not a valid argument for map_dbl()
  3. Thus far in the course, we have used the across() function to apply the same function to multiple columns. For example, if we wanted to apply the standardize() function from above to every numeric column, we could use the following code:
penguins %>% 
  mutate(across(.cols = where(is.numeric), 
                .fns = standardize)
         )
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <dbl>       <dbl>
 1 Adelie  Torgersen         -0.883         0.784            -1.42      -0.563 
 2 Adelie  Torgersen         -0.810         0.126            -1.06      -0.501 
 3 Adelie  Torgersen         -0.663         0.430            -0.421     -1.19  
 4 Adelie  Torgersen         NA            NA                NA         NA     
 5 Adelie  Torgersen         -1.32          1.09             -0.563     -0.937 
 6 Adelie  Torgersen         -0.847         1.75             -0.776     -0.688 
 7 Adelie  Torgersen         -0.920         0.329            -1.42      -0.719 
 8 Adelie  Torgersen         -0.865         1.24             -0.421      0.590 
 9 Adelie  Torgersen         -1.80          0.480            -0.563     -0.906 
10 Adelie  Torgersen         -0.352         1.54             -0.776      0.0602
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <dbl>

Which of the following map functions would return the same output?

## Option (a)
penguins |> 
  map_at(.at = c("bill_length_mm", 
                 "bill_depth_mm", 
                 "flipper_length_mm", 
                 "body_mass_g"), 
         .f = standardize)

## Option (b)
penguins |> 
  map_at(.at = c("bill_length_mm", 
                 "bill_depth_mm", 
                 "flipper_length_mm", 
                 "body_mass_g"), 
         .f = standardize) %>% 
  bind_cols()

## Option (c)
penguins |> 
  map_if(.p = is.numeric, .f = standardize) 

## Option (d)
penguins |> 
  map_if(.p = is.numeric, .f = standardize) %>% 
  bind_cols()

References

Grolemund, Garrett, and Hadley Wickham. 2017. R for Data Science. 1st ed. O’Reilly Media. https://r4ds.had.co.nz/.

Footnotes

  1. I fully support more R packages being cat themed.