Writing More Efficient Code

In this module, you’ll learn a bit about how to make your code more efficient and more human readable.

Code Efficiency

Most of the work of making your code more efficient (and more readable) can be summarized in two principles:

  1. avoid larger calculations and memory demands, and
  2. avoid repeating slow steps

Let’s explore how we can write code that follows each of these principles.

Avoid larger calculations and memory demands

In many of our analyses, we will find ourselves performing repeated tasks on the same, smaller data frame. For example, consider the following dataset:

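library(tidyverse)       # glimpse(), filter(), %>%, ggplot(), map_chr()
library(tictoc)          # tic() and toc()
library(fivethirtyeight) # congress_age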

glimpse(congress_age)
Rows: 18,635
Columns: 13
$ congress   <int> 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80,…
$ chamber    <chr> "house", "house", "house", "house", "house", "house", "hous…
$ bioguide   <chr> "M000112", "D000448", "S000001", "E000023", "L000296", "G00…
$ firstname  <chr> "Joseph", "Robert", "Adolph", "Charles", "William", "James"…
$ middlename <chr> "Jefferson", "Lee", "Joachim", "Aubrey", NA, "A.", "Joseph"…
$ lastname   <chr> "Mansfield", "Doughton", "Sabath", "Eaton", "Lewis", "Galla…
$ suffix     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ birthday   <date> 1861-02-09, 1863-11-07, 1866-04-04, 1868-03-29, 1868-09-22…
$ state      <chr> "TX", "NC", "IL", "NJ", "KY", "PA", "CA", "NY", "WI", "MA",…
$ party      <chr> "D", "D", "D", "R", "R", "R", "R", "D", "R", "R", "D", "R",…
$ incumbent  <lgl> TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, TRU…
$ termstart  <date> 1947-01-03, 1947-01-03, 1947-01-03, 1947-01-03, 1947-01-03…
$ age        <dbl> 86, 83, 81, 79, 78, 78, 78, 77, 76, 76, 75, 74, 74, 73, 73,…

Suppose we want to do some exploratory analysis on these data, summarizing and visualizing the ages of senators. Let’s use the tic() and toc() functions from the tictoc package to check how much time it takes to run this code:

tic()

congress_age %>%
  filter(chamber == "senate") %>%
  summarize(mean(age))

congress_age %>%
  filter(chamber == "senate") %>%
  slice_max(n = 5, order_by = age)

congress_age %>%
  filter(chamber == "senate") %>%
  ggplot() + 
  geom_histogram(mapping = aes(x = age))
toc()
0.136 sec elapsed

All three of these are reasonable things to do, and they can’t all be done in a single pipeline. But wait, we just made R filter the data down to senators three separate times! Here is how long that filtering step takes on its own:

tic()
  congress_age %>%
    filter(chamber == "senate")
toc()
0.011 sec elapsed

Instead, how about we filter the data once, save it into an object, and call on that object:

tic()

senate <- congress_age %>%
  filter(chamber == "senate")

senate %>%
  summarize(mean(age))

senate %>%
  slice_max(n = 5, order_by = age)

senate %>%
  ggplot() + 
  geom_histogram(mapping = aes(x = age))
toc()
0.099 sec elapsed

Not only did saving the senator data cut down on the runtime of our code, it made the code easier to understand! It is much clearer to the reader that each of the data exploration steps refers to the senate dataset.
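
tic() and toc() can also hand back a timing instead of only printing it, which makes it easier to compare two versions of the same analysis. The sketch below assumes the dplyr and tictoc packages are installed and uses the built-in mtcars data as a stand-in for congress_age; the timer label and the object names are illustrative only.

```r
library(dplyr)
library(tictoc)

# Start a labelled timer; the label only appears in the printed message
tic("filter once, then reuse")

# Filter a single time and store the result (mtcars stands in for congress_age)
four_cyl <- mtcars %>%
  filter(cyl == 4)

# Both summaries reuse the stored subset instead of filtering again
avg_mpg <- four_cyl %>%
  summarize(mean_mpg = mean(mpg))

best_mpg <- four_cyl %>%
  slice_max(n = 2, order_by = mpg)

# toc() invisibly returns the start and stop times, so we can keep them
timing <- toc(quiet = TRUE)
elapsed_sec <- unname(timing$toc - timing$tic)
elapsed_sec
```

Timings this short vary from run to run, so treat any single reading as approximate.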

Check In
  1. What functions help us assess how long it takes our code to run?

  2. How would you change the code below to make it more efficient?

penguins |> 
  group_by(species) |>
  summarize(mean_bill_len = mean(bill_length_mm, na.rm = TRUE))

penguins |> 
  group_by(species) |>
  slice_min(order_by = bill_length_mm)

penguins |> 
  group_by(species) |> 
  mutate(
    scaled_bill_length = (bill_length_mm - mean(bill_length_mm, na.rm = TRUE)) /
      sd(bill_length_mm, na.rm = TRUE)
    )

Avoid repeating slow steps

Avoid for-loops

for() loops can be painfully slow in R. If you absolutely must iterate over a process, rely on the map() or apply() function families. These functions offer the same versatility as a for() loop but are (1) more human readable, and (2) often much faster, because the iteration itself is carried out in C.
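
To make the comparison concrete, here is a minimal sketch of one task written three ways: counting the missing values in each column of the built-in airquality data. It assumes the purrr package is installed; the object names are illustrative.

```r
library(purrr)

# for() loop: pre-allocate the result, then fill it one column at a time
n_missing_loop <- integer(ncol(airquality))
for (i in seq_along(airquality)) {
  n_missing_loop[i] <- sum(is.na(airquality[[i]]))
}
names(n_missing_loop) <- names(airquality)

# map_int(): apply the same function to every column, returning integers
n_missing_map <- map_int(airquality, function(col) sum(is.na(col)))

# vapply(): the base R counterpart, with the output type declared up front
n_missing_vapply <- vapply(
  airquality,
  function(col) sum(is.na(col)),
  integer(1)
)

n_missing_map
```

The loop needs three pieces of bookkeeping (pre-allocation, indexing, and naming) that the map_int() and vapply() versions handle on their own.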

Note: Why choose map() instead of apply()?

The map() and apply() functions typically run at close to the same speed, so the strongest arguments in favor of the map() family are its consistent syntax and its consistent output types.

The map() function always returns a list, and the map_XXX() variants always return a vector of a specified data type. By contrast, apply() can return a vector, an array, or a list; lapply() always returns a list; and sapply() can return a vector, a matrix, or a list, depending on what the applied function returns.
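
The difference in output consistency can be seen in a small sketch. It assumes the purrr package is installed; the scores list is made up for illustration.

```r
library(purrr)

scores <- list(a = c(1, 2, 3), b = c(4, 5, 6))

# map() always returns a list, whatever the applied function returns
map_means <- map(scores, mean)

# map_dbl() always returns a double vector, or stops with an error
map_dbl_means <- map_dbl(scores, mean)

# sapply() simplifies when it can, so its shape depends on the function
sapply_means <- sapply(scores, mean)    # named numeric vector
sapply_ranges <- sapply(scores, range)  # 2 x 2 matrix

# map_dbl() refuses a length-2 result rather than silently changing shape
map_dbl_ranges <- tryCatch(
  map_dbl(scores, range),
  error = function(e) "error"
)
```

With sapply(), swapping mean for range changes the result from a vector to a matrix; with map_dbl(), the same swap produces an error, which is usually easier to catch.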

Here is a motivating example demonstrating the difference in run time between two functions:

  1. a function which loops over every column in a data frame to extract the data type of that column
  2. a function which uses map_chr() to apply the class() function to every column of a data frame
loop_func <- function(df){
  
  # initialize the type object
  type <- rep(NA, ncol(df))
  
  for(i in seq_len(ncol(df))) {
    
    # store the class of the ith column of df into the ith entry of type 
    type[i] <- class(df[,i])
  }
  
  return(type)
}

map_func <- function(df){
  
  type <- map_chr(df, class)
  
  return(type)
}

Okay, now that we’ve made these two functions, let’s see how they compare in their run time. Below, we’ve made a simple, albeit large, data frame. It has five rows and 100,000 columns, and every column is a vector of 1s.

dat <- as.data.frame(
  matrix(1,
         nrow = 5,
         ncol = 100000)
  )

Let’s use tic() and toc() again to compare the run time for these two functions:

tic()
loop_func(dat)
toc()
3.769 sec elapsed
tic()
map_func(dat)
toc()
0.417 sec elapsed

Looking at the output, we see substantial differences in the run time of these two functions. The loop function took about 3.8 seconds whereas the map function took less than half a second.
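
A speed comparison only means something if both approaches give the same answer, so it can be worth checking that first on a small input. The sketch below assumes the purrr package is installed; small_df is a hypothetical data frame, and the two code paths mirror loop_func() and map_func() rather than calling them.

```r
library(purrr)

# A small, hypothetical data frame with one column of each basic type
small_df <- data.frame(
  id = 1:3,
  name = c("a", "b", "c"),
  score = c(1.5, 2, 3.25),
  passed = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Loop version, mirroring loop_func()
loop_types <- rep(NA, ncol(small_df))
for (i in seq_len(ncol(small_df))) {
  loop_types[i] <- class(small_df[, i])
}

# map_chr() version, mirroring map_func()
map_types <- map_chr(small_df, class)

# map_chr() keeps the column names, so drop them before comparing
identical(loop_types, unname(map_types))
```

The only difference between the two results is that map_chr() carries the column names along, which the loop version does not.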

Check In
  1. Why might we prefer using map() functions instead of writing a for() loop?
     1. map() automatically makes the code run faster in all cases.
     2. map() makes code more readable, concise, and often faster.
     3. map() can only be used with numeric data, unlike loops.
     4. map() replaces all possible uses of loops in programming.
  2. Which map() function should be used in place of this for() loop?
numbers <- c(4, 9, 16, 25)

results <- vector("numeric", length(numbers))

for (i in seq_along(numbers)) {
  results[i] <- sqrt(numbers[i])
}
     1. map()
     2. map_dbl()
     3. map_chr()
     4. walk()

Use vectorized functions

Better even than map() or apply() is not to iterate at all! Writing vectorized functions is tough, but do-able in many cases. Here’s an example:

Required Video
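
As a small written illustration of vectorization (a sketch that is not taken from the video), the code below converts heights from centimeters to inches twice: once element by element, and once with a single vectorized division. The heights_cm values are made up.

```r
heights_cm <- c(150, 162.5, 171, 180.3)

# Element-by-element version: one division per pass through the loop
heights_in_loop <- numeric(length(heights_cm))
for (i in seq_along(heights_cm)) {
  heights_in_loop[i] <- heights_cm[i] / 2.54
}

# Vectorized version: the division is applied to the whole vector at once
heights_in_vec <- heights_cm / 2.54

# Both versions produce the same values
identical(heights_in_loop, heights_in_vec)
```

Arithmetic operators such as / already work on whole vectors, so no explicit iteration is needed here.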
Check In
  1. Which of the following functions are vectorized?
     1. sqrt()
     2. if()
     3. ifelse()
     4. mean()
  2. Why does the code below produce an error?
x <- c(1, 2, 3, 4, 16, 25)

f <- function(x) {
  if (x > 0) x else 0
}

f(x)
Error in if (x > 0) x else 0: the condition has length > 1