Taming the Curly Braces: Writing Your Own Tidyverse Functions

Tuesday, November 4

Today we are going to do a hands-on coding activity! We will create our own versions of the table() and prop.table() base R functions.

This will help us…

  • explore lazy evaluation
  • learn more about the {{ }} operator
  • figure out when we need to use functions like all_of() and pick()
  • write more complex functions
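
As a quick preview of lazy evaluation: R only evaluates a function argument when the function body actually uses it. The sketch below uses a made-up function, lazy_demo(), purely for illustration.

```r
# lazy_demo() is a hypothetical function used only to illustrate lazy evaluation
lazy_demo <- function(x, y) {
  # y is never used in the body, so R never evaluates it
  x * 2
}

# The second argument would throw an error if it were evaluated,
# but the call succeeds because y is never touched
result <- lazy_demo(5, stop("this is never evaluated"))
result
```

Because arguments are not evaluated up front, a function can capture the expression you typed (e.g., island) and decide later where to evaluate it. That is the behavior the tidyverse functions in today's activity build on.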

Last Week…

We learned about writing functions!

Specifically, we learned about:

  • function syntax
  • optional & required arguments
  • input validation
  • last expression returns
add_something <- function(x, something = 2){
  stopifnot(is.numeric(x), is.numeric(something))
  
  x + something
}


x <- 15:22
add_something(x)
[1] 17 18 19 20 21 22 23 24
add_something(x, something = "dog")
Error in add_something(x, something = "dog"): is.numeric(something) is not TRUE

Writing Data Frame Functions

Moving Beyond Vectors

This week, we’re writing functions that take a data frame and variable names as arguments.

These functions can be incredibly powerful, but they require us to learn some interesting details about how some of the functions we’ve grown very accustomed to (e.g., select(), mutate(), group_by()) work “behind the scenes.”


We are going to use a hands-on activity to explore these concepts!

Open the “Tidy Eval” Colab Notebook Posted on Canvas

In the Week 8 Module, navigate to the Lecture Activity section.

Click on the Tidy Eval Colab Notebook link.

Make a copy of the notebook (like you do for Practice Activities)!


Goal #1

Recreate the table() function in R

Let’s Explore the table() Function First

Let’s start with one categorical variable.

table(penguins$species)

   Adelie Chinstrap    Gentoo 
      152        68       124 


Function Design

What do you notice about the layout of the table?

Let’s Explore the table() Function First

Okay, let’s add a second categorical variable.

table(penguins$species, penguins$island)
           
            Biscoe Dream Torgersen
  Adelie        44    56        52
  Chinstrap      0    68         0
  Gentoo       124     0         0


Function Design

What do you notice about the layout of the table?

Designing a tidy_table() Function

Based on this exploration, it seems like our function should have the following qualities:

  • accept a data frame
  • accept variable names as inputs
  • pivot the output to a wide format

Writing dplyr & tidyr Code to Accomplish the Task

Using the penguins data, write dplyr code (not table()) and tidyr code which will:

  • count() the number of penguins for each species and island
  • pivot the table to a wide format
  • replace NA values with 0s

A Working Solution

penguins |>
  count(species, island) |>
  pivot_wider(names_from = island, 
              values_from = n, 
              values_fill = 0)
# A tibble: 3 × 4
  species   Biscoe Dream Torgersen
  <fct>      <int> <int>     <int>
1 Adelie        44    56        52
2 Chinstrap      0    68         0
3 Gentoo       124     0         0
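
To see why values_fill matters, here is a small sketch with a made-up counts table (toy_counts is an assumption for illustration, not the penguins data). A combination that never occurs has no row in the long table, so it becomes NA when pivoted unless we supply a fill value.

```r
library(dplyr)
library(tidyr)

# Made-up counts: species "B" was never observed on island "Y"
toy_counts <- tibble(
  species = c("A", "A", "B"),
  island  = c("X", "Y", "X"),
  n       = c(2L, 1L, 3L)
)

# Without values_fill, the missing B/Y combination shows up as NA
wide_na <- toy_counts |>
  pivot_wider(names_from = island, values_from = n)

# With values_fill = 0, the missing combination is reported as a count of 0
wide_filled <- toy_counts |>
  pivot_wider(names_from = island, values_from = n, values_fill = 0)

wide_na
wide_filled
```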

Now let’s generalize

Now that we have a working example, let’s try and generalize our code.

df |> 
  count(var1, var2) |> 
  pivot_wider(names_from = var2, 
              values_from = n, 
              values_fill = 0)

This works, but maybe we should be more specific about what variables go where…

df |> 
  count(row_var, col_var) |> 
  pivot_wider(names_from = col_var, 
              values_from = n, 
              values_fill = 0)

Let’s make a function

tidy_table <- function(df, col_var, row_var){
  
  df |> 
    count(col_var, row_var) |> 
    pivot_wider(names_from = col_var, 
                values_from = n, 
                values_fill = 0)
}

Copy the tidy_table() function in your Colab notebook!


Let’s try it out!

tidy_table(df = penguins, 
           col_var = island, 
           row_var = species)
Error in `count()`:
! Must group by variables found in `.data`.
Column `col_var` is not found.
Column `row_var` is not found.

Indirection


The tidyverse functions use either “tidy selection” or “data masking.” Both of these features make common tasks easier at the cost of making less common tasks harder.

Data Masking – count()

Blurs the line between the two different meanings of the word “variable”:

  • env-variables – “programming” variables that live in an environment.
    • These are typically created using <-.
  • data-variables – “statistical” variables that live in a data frame.
    • These come from data files or are created by manipulating existing variables.

When a data-variable is passed in through a function argument (here, col_var and row_var), you need to embrace the argument with {{ }}.

count({{ col_var }}, {{ row_var }})
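
Here is a minimal sketch of the two kinds of variables colliding. The data frame (toy) and both helper functions are made up for illustration. Without the embrace, var is not a column, so R falls back to the env-variable and goes looking for an object called value outside the data frame. With the embrace, the column the user typed is used.

```r
library(dplyr)

# Made-up data for illustration
toy <- tibble(group = c("a", "a", "b"), value = c(1, 3, 10))

# Hypothetical helper WITHOUT {{ }}: `value` is looked up outside the data frame
mean_broken <- function(df, var) {
  df |> summarize(avg = mean(var))
}

# Hypothetical helper WITH {{ }}: `value` is looked up inside the data frame
mean_embraced <- function(df, var) {
  df |> summarize(avg = mean({{ var }}))
}

# In a clean session there is no object named `value`, so this call errors;
# tryCatch() just records that fact instead of stopping the script
broken_result <- tryCatch(mean_broken(toy, value), error = function(e) "error")

embraced_result <- mean_embraced(toy, value)
embraced_result
```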

Tidy Select – pivot_wider()

In the case of our function, the name of the column we want to use is stored in an intermediate variable (e.g., col_var = island).

When you have a data-variable (island) stored in an env-variable that is a function argument (col_var), you embrace the argument by surrounding it in doubled braces.

pivot_wider(names_from = {{ col_var }}, 
            values_from = n, 
            values_fill = 0)
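
The same embrace works in other tidy-select functions, not just pivot_wider(). As a small sketch (toy and keep_cols() are made up for illustration), here it is with select():

```r
library(dplyr)

# Made-up data for illustration
toy <- tibble(id = 1:3, height = c(1.5, 1.6, 1.7), weight = c(50, 60, 70))

# keep_cols() is a hypothetical helper: `cols` holds whatever selection the user typed
keep_cols <- function(df, cols) {
  df |> select({{ cols }})
}

# Unquoted column names are passed straight through to select()
kept <- keep_cols(toy, c(id, weight))
names(kept)
```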

Our Updated Function

tidy_table <- function(df, col_var, row_var){
  df |> 
    count({{ col_var }}, {{ row_var }}) |> 
    pivot_wider(names_from = {{ col_var }}, 
                values_from = n, 
                values_fill = 0)
}

Update your tidy_table() function in your Colab notebook!


Let’s give it another go!

tidy_table(df = penguins, 
           col_var = island, 
           row_var = species)
# A tibble: 3 × 4
  species   Biscoe Dream Torgersen
  <fct>      <int> <int>     <int>
1 Adelie        44    56        52
2 Gentoo       124     0         0
3 Chinstrap      0    68         0

What if only one variable was input?

tidy_table(df = penguins, species)
# A tibble: 1 × 3
  Adelie Chinstrap Gentoo
   <int>     <int>  <int>
1    152        68    124

Which argument is species being inserted into?

Argument order matters!

tidy_table <- function(df, row_var, col_var){
  df |> 
    count({{ col_var }}, {{ row_var }}) |> 
    pivot_wider(names_from = {{ col_var }}, 
                values_from = n, 
                values_fill = 0)
}


tidy_table(df = penguins, species)
Error in `pivot_wider()`:
! Must select at least one item.
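
To see how R matched the arguments, we can sketch a tiny helper that reports which arguments were never supplied. The function which_missing() is hypothetical and only mirrors the argument order above. Named arguments are matched first, then unnamed arguments fill the remaining parameters in order.

```r
# which_missing() is a hypothetical helper with the same argument order as above
which_missing <- function(df, row_var, col_var) {
  # missing() checks whether an argument was supplied, without evaluating it
  c(row_var = missing(row_var), col_var = missing(col_var))
}

# df is matched by name, so the unnamed `species` lands in row_var
matched <- which_missing(df = NULL, species)
matched
```

So species filled row_var and col_var was left empty, which is why pivot_wider() had nothing to select.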

Arguments that are absolutely necessary should come first!

tidy_table <- function(df, col_var, row_var){
  df |> 
    count({{ col_var }}, {{ row_var }}) |> 
    pivot_wider(names_from = {{ col_var }}, 
                values_from = n, 
                values_fill = 0)
}

To be more defensive you could check if row_var is missing()

tidy_table <- function(df, col_var, row_var){
  
  if(missing(row_var)){
    df |> 
      count({{ col_var }}) |> 
      pivot_wider(names_from = {{ col_var }}, 
                  values_from = n, 
                  values_fill = 0)
  } else {
    df |> 
      count({{ col_var }}, {{ row_var }}) |> 
      pivot_wider(names_from = {{ col_var }}, 
                  values_from = n, 
                  values_fill = 0)
  }
}

What if we wanted to use quoted variable names?

tidy_table(df = penguins, 
           row_var = "species", 
           col_var = "island")
Error in `pivot_wider()`:
! Can't select columns that don't exist.
✖ Column `island` doesn't exist.

What’s going on?

penguins |> 
    count("species", "island")
# A tibble: 1 × 3
  `"species"` `"island"`     n
  <chr>       <chr>      <int>
1 species     island       344

We need some helper functions!

  • For tidy selection, we need to use all_of()
pivot_wider(names_from = all_of(col_var), 
            values_from = n, 
            values_fill = 0)
  • For data masking, we need to combine all_of() with pick()
count(
  pick(
    all_of( 
      c(row_var, col_var)
      )
    )
  )
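
Here is a small sketch of both helpers outside of a function, using a made-up data frame (toy) and a character vector of column names (vars). It assumes dplyr 1.1.0 or later, since that is when pick() was introduced.

```r
library(dplyr)

# Made-up data for illustration
toy <- tibble(
  species = c("A", "A", "B", "B", "B"),
  island  = c("X", "Y", "X", "X", "Y")
)

# Column names stored as strings in an env-variable
vars <- c("species", "island")

# Tidy selection: all_of() turns the strings into a column selection
selected <- toy |> select(all_of(vars))

# Data masking: pick() performs the tidy selection inside count()
counted <- toy |> count(pick(all_of(vars)))
counted
```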

A Character Vector Function

quote_table <- function(df, row_var, col_var){
  df |> 
    count(
      pick(
        all_of(
          c(row_var, col_var)
          )
        )
      ) |> 
    pivot_wider(names_from = all_of(col_var), 
                values_from = n, 
                values_fill = 0)
}

Add the quote_table() function in your Colab notebook!


Did it work???

quote_table(df = penguins, 
            row_var = "species", 
            col_var = "island")
# A tibble: 3 × 4
  species   Biscoe Dream Torgersen
  <fct>      <int> <int>     <int>
1 Adelie        44    56        52
2 Chinstrap      0    68         0
3 Gentoo       124     0         0
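
One payoff of accepting strings is that column names can come from a character vector, so we can loop over them. A minimal sketch, assuming a made-up data frame (toy) and using base R's lapply():

```r
library(dplyr)

# Made-up data for illustration
toy <- tibble(
  species = c("A", "A", "B", "B", "B"),
  island  = c("X", "Y", "X", "X", "Y")
)

vars <- c("species", "island")

# Build a one-way count table for each column name in vars
one_way <- lapply(vars, function(v) count(toy, pick(all_of(v))))
names(one_way) <- vars

one_way$species
one_way$island
```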

Let’s take a 5-minute break!

Goal #2

Recreate the prop.table() function in R

Let’s Explore the prop.table() Function First

Let’s start with one categorical variable.

table(penguins$species) |> 
  prop.table()

   Adelie Chinstrap    Gentoo 
0.4418605 0.1976744 0.3604651 


Function Design

What do you notice about the proportions?

Let’s Explore the prop.table() Function First

Okay, let’s add a second categorical variable.

table(penguins$species, penguins$island) |> 
  prop.table()
           
               Biscoe     Dream Torgersen
  Adelie    0.1279070 0.1627907 0.1511628
  Chinstrap 0.0000000 0.1976744 0.0000000
  Gentoo    0.3604651 0.0000000 0.0000000

Function Design

What do you notice about the proportions?

An Optional Argument

The prop.table() function has an optional margin argument.

table(penguins$species, penguins$island) |> 
  prop.table(margin = 1)
           
               Biscoe     Dream Torgersen
  Adelie    0.2894737 0.3684211 0.3421053
  Chinstrap 0.0000000 1.0000000 0.0000000
  Gentoo    1.0000000 0.0000000 0.0000000

Function Design

What do you notice about the proportions?

Designing a tidy_prop_table() Function

Based on this exploration, it seems like our function should have the following qualities:

  • accept a data frame
  • accept variable names as inputs
  • calculate joint or marginal proportions for each group
  • pivot the output to a wide format

Writing dplyr Code to Accomplish the Task

Using the penguins data, write dplyr code (not table() or prop.table()) which will:

  • count() the number of penguins for each species and island
  • add a column for the joint proportion of each group

A Working Solution

This gives joint proportions for the entire table.

penguins |> 
  count(species, island) |> 
  mutate(prop = n / sum(n)) 
# A tibble: 5 × 4
  species   island        n  prop
  <fct>     <fct>     <int> <dbl>
1 Adelie    Biscoe       44 0.128
2 Adelie    Dream        56 0.163
3 Adelie    Torgersen    52 0.151
4 Chinstrap Dream        68 0.198
5 Gentoo    Biscoe      124 0.360

What if I wanted marginal proportions for each species? (i.e., within a species, the proportions should add to 1)

Marginal Proportions for species

penguins |> 
  count(species, island) |> 
  group_by(species) |> 
  mutate(prop = n / sum(n)) 
# A tibble: 5 × 4
# Groups:   species [3]
  species   island        n  prop
  <fct>     <fct>     <int> <dbl>
1 Adelie    Biscoe       44 0.289
2 Adelie    Dream        56 0.368
3 Adelie    Torgersen    52 0.342
4 Chinstrap Dream        68 1    
5 Gentoo    Biscoe      124 1    

Notice that there is still a grouping variable?

What should I add to my code?

Much better!

penguins |> 
  count(species, island) |> 
  group_by(species) |> 
  mutate(prop = n / sum(n)) |> 
  ungroup()
# A tibble: 5 × 4
  species   island        n  prop
  <fct>     <fct>     <int> <dbl>
1 Adelie    Biscoe       44 0.289
2 Adelie    Dream        56 0.368
3 Adelie    Torgersen    52 0.342
4 Chinstrap Dream        68 1    
5 Gentoo    Biscoe      124 1    
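
If your installed dplyr is version 1.1.0 or later, the .by argument is an alternative worth knowing about: it groups for that one mutate() call only, so the result comes back ungrouped. A sketch on made-up counts (toy_counts is an assumption for illustration, not the penguins data):

```r
library(dplyr)

# Made-up counts for illustration
toy_counts <- tibble(
  species = c("A", "A", "B"),
  island  = c("X", "Y", "X"),
  n       = c(1L, 3L, 4L)
)

# .by groups by species for this mutate() only; no ungroup() is needed afterwards
props <- toy_counts |>
  mutate(prop = n / sum(n), .by = species)

props
is_grouped_df(props)
```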

What about pivoting?

For this table, we don’t care about the counts. Let’s add some code that:

  • removes the column of counts
  • pivots the table to a wide format
  • replaces NA values with 0s

A Working Solution

penguins |> 
  count(species, island) |> 
  group_by(species) |> 
  mutate(prop = n / sum(n)) |> 
  ungroup() |> 
  select(-n) |> 
  pivot_wider(names_from = island, 
              values_from = prop, 
              values_fill = 0)
# A tibble: 3 × 4
  species   Biscoe Dream Torgersen
  <fct>      <dbl> <dbl>     <dbl>
1 Adelie     0.289 0.368     0.342
2 Chinstrap  0     1         0    
3 Gentoo     1     0         0    

Now let’s generalize!

df |> 
  count(row_var, col_var) |> 
  group_by(col_var) |> 
  mutate(prop = n / sum(n)) |> 
  ungroup() |> 
  select(-n) |> 
  pivot_wider(names_from = col_var,
              values_from = prop, 
              values_fill = 0)

Let’s make a function

tidy_prop_table <- function(df, col_var, row_var){
  
  df |> 
    count({{ row_var }}, {{ col_var }}) |> 
    group_by({{ col_var }}) |> 
    mutate(prop = n / sum(n)) |> 
    ungroup() |> 
    select(-n) |> 
    pivot_wider(names_from = {{ col_var }},
                values_from = prop, 
                values_fill = 0)
  
}

Let’s try it out!

tidy_prop_table(df = penguins, 
                col_var = island, 
                row_var = species)
# A tibble: 3 × 4
  species   Biscoe Dream Torgersen
  <fct>      <dbl> <dbl>     <dbl>
1 Adelie     0.262 0.452         1
2 Chinstrap  0     0.548         0
3 Gentoo     0.738 0             0

Because we grouped by col_var, the proportions within each column add to 1. What if I wanted to get marginal proportions for the row_var instead?


The margin argument of prop.table() has the following behavior:

  • when no margin is specified the proportions are joint
  • when margin = 1 the proportions are conditional on the rows
  • when margin = 2 the proportions are conditional on the columns
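
A quick way to check these three behaviors is to see which sums come out to 1. The sketch below uses a small made-up matrix (tab) rather than the penguins data.

```r
# Made-up 2 x 2 table of counts for illustration
tab <- matrix(c(1, 3, 2, 4), nrow = 2,
              dimnames = list(c("r1", "r2"), c("c1", "c2")))

joint  <- prop.table(tab)              # every cell divided by the grand total
by_row <- prop.table(tab, margin = 1)  # every cell divided by its row total
by_col <- prop.table(tab, margin = 2)  # every cell divided by its column total

sum(joint)       # the whole table sums to 1
rowSums(by_row)  # each row sums to 1
colSums(by_col)  # each column sums to 1
```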

A Pseudocode Design

if( margin is missing){
  calculate joint proportions
} 

else if(margin is rows){
  calculate marginal proportions based on row variable
}

else {
  calculate marginal proportions based on column variable
}

Moving into R Code

tidy_prop_table <- function(df, col_var, row_var, margin = NULL){

  if(is.null(margin)){
    # Default to joint proportions: each count divided by the grand total
    df |>
      count({{ row_var }}, {{ col_var }}) |>
      mutate(prop = n / sum(n)) |>
      ungroup() |>
      select(-n) |>
      pivot_wider(names_from = {{ col_var }},
                  values_from = prop,
                  values_fill = 0)
  }
  else if(margin == "row"){
    # Proportions within each level of the row variable (rows add to 1)
    df |>
      count({{ row_var }}, {{ col_var }}) |>
      group_by({{ row_var }}) |>
      mutate(prop = n / sum(n)) |>
      ungroup() |>
      select(-n) |>
      pivot_wider(names_from = {{ col_var }},
                  values_from = prop,
                  values_fill = 0)
  }
  else{
    # Any other margin value (e.g., "col"): proportions within each level
    # of the column variable (columns add to 1)
    df |>
      count({{ row_var }}, {{ col_var }}) |>
      group_by({{ col_var }}) |>
      mutate(prop = n / sum(n)) |>
      ungroup() |>
      select(-n) |>
      pivot_wider(names_from = {{ col_var }},
                  values_from = prop,
                  values_fill = 0)
  }

}
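
Notice that any margin value other than "row" falls into the last branch, so a typo would quietly return column proportions. One possible variation, offered as a sketch and not as the version we use in class, validates margin with base R's match.arg() and only changes the grouping step. The function name tidy_prop_table_checked(), the "joint" label, and the toy data are all assumptions made for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical variation: margin must be one of "joint", "row", or "col"
tidy_prop_table_checked <- function(df, col_var, row_var,
                                    margin = c("joint", "row", "col")) {
  # match.arg() picks "joint" by default and errors on an unknown value
  margin <- match.arg(margin)

  counts <- df |> count({{ row_var }}, {{ col_var }})

  # Only the grouping differs between the three kinds of proportions
  grouped <- switch(margin,
    joint = counts,
    row   = counts |> group_by({{ row_var }}),
    col   = counts |> group_by({{ col_var }})
  )

  grouped |>
    mutate(prop = n / sum(n)) |>
    ungroup() |>
    select(-n) |>
    pivot_wider(names_from = {{ col_var }},
                values_from = prop,
                values_fill = 0)
}

# Made-up data for illustration
toy <- tibble(
  species = c("A", "A", "B", "B", "B"),
  island  = c("X", "Y", "X", "X", "Y")
)

joint  <- tidy_prop_table_checked(toy, col_var = island, row_var = species)
by_row <- tidy_prop_table_checked(toy, col_var = island, row_var = species,
                                  margin = "row")
by_col <- tidy_prop_table_checked(toy, col_var = island, row_var = species,
                                  margin = "col")
```

The trade-off is a slightly less literal translation of the pseudocode in exchange for less repeated code and an error when margin is misspelled.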

How’d we do?

Joint Proportions

tidy_prop_table(df = penguins, 
                col_var = species, 
                row_var = island)
# A tibble: 3 × 4
  island    Adelie Gentoo Chinstrap
  <fct>      <dbl>  <dbl>     <dbl>
1 Biscoe     0.128  0.360     0    
2 Dream      0.163  0         0.198
3 Torgersen  0.151  0         0    

How’d we do?

Marginal Proportions – Rows

tidy_prop_table(df = penguins, 
                col_var = species, 
                row_var = island, 
                margin = "row")
# A tibble: 3 × 4
  island    Adelie Gentoo Chinstrap
  <fct>      <dbl>  <dbl>     <dbl>
1 Biscoe     0.262  0.738     0    
2 Dream      0.452  0         0.548
3 Torgersen  1      0         0    

How’d we do?

Marginal Proportions – Columns

tidy_prop_table(df = penguins, 
                col_var = species, 
                row_var = island, 
                margin = "col")
# A tibble: 3 × 4
  island    Adelie Gentoo Chinstrap
  <fct>      <dbl>  <dbl>     <dbl>
1 Biscoe     0.289      1         0
2 Dream      0.368      0         1
3 Torgersen  0.342      0         0

What questions do we have?