Taming the Curly Braces: Writing Your Own Tidyverse Functions

Tuesday, November 4

Today we are going to do a hands-on coding activity! We will create our own versions of the table() and prop.table() base R functions.

This will help us…

explore lazy evaluation
learn more about the {{ }} operator
figure out when we need to use functions like all_of() and pick()
write more complex functions

Last Week…

We learned about writing functions!

Specifically, we learned about:

function syntax
optional & required arguments

input validation
last expression returns

add_something <- function(x, something = 2){
  stopifnot(is.numeric(x), is.numeric(something))
  
  x + something
}

x <- 15:22

add_something(x)

[1] 17 18 19 20 21 22 23 24

add_something(x, something = "dog")

Error in add_something(x, something = "dog"): is.numeric(something) is not TRUE

Writing Data Frame Functions

Moving Beyond Vectors

This week, we’re writing functions that take a data frame and variable names as arguments.

These functions can be incredibly powerful, but they require us to learn some interesting details about how some of the functions we’ve grown very accustomed to (e.g., select(), mutate(), group_by()) work “behind the scenes.”

We are going to use a hands-on activity to explore these concepts!

Open the “Tidy Eval” Colab Notebook Posted on Canvas

In the Week 8 Module, navigate to the Lecture Activity section.

Click on the Tidy Eval Colab Notebook link.

Make a copy of the notebook (like you do for Practice Activities)!

Goal #1

Recreate the table() function in R

Let’s Explore the `table()` Function First

Let’s start with one categorical variable.

table(penguins$species)


   Adelie Chinstrap    Gentoo 
      152        68       124

Function Design

What do you notice about the layout of the table?

Let’s Explore the `table()` Function First

Okay, let’s add a second categorical variable.

table(penguins$species, penguins$island)

           
            Biscoe Dream Torgersen
  Adelie        44    56        52
  Chinstrap      0    68         0
  Gentoo       124     0         0

Function Design

What do you notice about the layout of the table?

Designing a `tidy_table()` Function

Based on this exploration, it seems like our function should have the following qualities:

accept a data frame
accept variable names as inputs
pivot the output to a wide format

Writing dplyr & tidyr Code to Accomplish the Task

Using the penguins data, write dplyr code (not table()) and tidyr code which will:

count() the number of penguins for each species and island
pivot the table to a wide format
replace NA values with 0s

A Working Solution

penguins |>
  count(species, island) |>
  pivot_wider(names_from = island, 
              values_from = n, 
              values_fill = 0)

# A tibble: 3 × 4
  species   Biscoe Dream Torgersen
  <fct>      <int> <int>     <int>
1 Adelie        44    56        52
2 Chinstrap      0    68         0
3 Gentoo       124     0         0

Now let’s generalize

Now that we have a working example, let’s try and generalize our code.

df |> 
  count(var1, var2) |> 
  pivot_wider(names_from = var2, 
              values_from = n, 
              values_fill = 0)

This works, but maybe we should be more specific about what variables go where…

df |> 
  count(row_var, col_var) |> 
  pivot_wider(names_from = col_var, 
              values_from = n, 
              values_fill = 0)

Let’s make a function

tidy_table <- function(df, col_var, row_var){
  
  df |> 
    count(col_var, row_var) |> 
    pivot_wider(names_from = col_var, 
                values_from = n, 
                values_fill = 0)
}

Copy the tidy_table() function in your Colab notebook!

Let’s try it out!

tidy_table(df = penguins, 
           col_var = island, 
           row_var = species)

Error in `count()`:
! Must group by variables found in `.data`.
Column `col_var` is not found.
Column `row_var` is not found.

Indirection

The tidyverse functions use either “tidy selection” or “data masking.” Both of these features make common tasks easier at the cost of making less common tasks harder.

Data Masking – `count()`

Data masking blurs the line between the two different meanings of the word “variable”:

env-variables – “programming” variables that live in an environment
- These are typically created using <-.
data-variables – “statistical” variables that live in a data frame.
- These come from data files or are created by manipulating existing variables.

When a data-variable is passed to your function through an argument (an env-variable), you need to embrace the argument with {{ }}.

count({{ col_var }}, {{ row_var }})
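A minimal sketch of embracing inside a data-masking function, assuming dplyr is installed. The toy data frame and the count_by() helper are made up for illustration and are not part of the activity.

```r
library(dplyr)

# Invented toy data: five shapes with a color each.
toy <- data.frame(
  shape = c("circle", "circle", "square", "square", "square"),
  color = c("red", "blue", "red", "red", "blue")
)

# Hypothetical helper: `var` is an env-variable (a function argument)
# that carries a data-variable (a column of `df`).
count_by <- function(df, var) {
  df |>
    count({{ var }})
}

# `shape` is passed unquoted and looked up inside `toy`.
shape_counts <- count_by(toy, shape)
shape_counts
```

Without the doubled braces, count() would look for a column literally named var.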

Tidy Select – `pivot_wider()`

In the case of our function, the name of the column we want to use is stored in an intermediate variable, the function argument (e.g., col_var = island).

When you have a data-variable (island) stored in an env-variable that is a function argument (col_var), you embrace the argument by surrounding it in doubled braces.

pivot_wider(names_from = {{ col_var }}, 
            values_from = n, 
            values_fill = 0)
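A small sketch of the same idea for a tidy-select argument, assuming dplyr and tidyr are installed. The pre-counted toy data frame and the spread_by() helper are hypothetical names used only for this example.

```r
library(dplyr)
library(tidyr)

# Invented, already-counted data (one row per shape/color pair).
toy_counts <- data.frame(
  shape = c("circle", "circle", "square", "square"),
  color = c("blue", "red", "blue", "red"),
  n = c(1L, 1L, 1L, 2L)
)

# Hypothetical helper: `col_var` is embraced so that pivot_wider()
# selects the column the caller named, not a column called "col_var".
spread_by <- function(df, col_var) {
  df |>
    pivot_wider(names_from = {{ col_var }},
                values_from = n,
                values_fill = 0)
}

wide <- spread_by(toy_counts, color)
wide
```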

Our Updated Function

tidy_table <- function(df, col_var, row_var){
  df |> 
    count({{ col_var }}, {{ row_var }}) |> 
    pivot_wider(names_from = {{ col_var }}, 
                values_from = n, 
                values_fill = 0)
}

Update your tidy_table() function in your Colab notebook!

Let’s give it another go!

tidy_table(df = penguins, 
           col_var = island, 
           row_var = species)

# A tibble: 3 × 4
  species   Biscoe Dream Torgersen
  <fct>      <int> <int>     <int>
1 Adelie        44    56        52
2 Gentoo       124     0         0
3 Chinstrap      0    68         0

What if only one variable was input?

tidy_table(df = penguins, species)

# A tibble: 1 × 3
  Adelie Chinstrap Gentoo
   <int>     <int>  <int>
1    152        68    124

Which argument is species being inserted into?
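One way to check this yourself is a tiny base R function with the same argument order. This is only a sketch: which_supplied() is a hypothetical helper that reports which arguments the caller filled in. It also shows lazy evaluation, since species is never defined and yet nothing fails.

```r
# Hypothetical helper with the same argument order as tidy_table().
# It reports which of the two variable arguments were supplied.
which_supplied <- function(df, col_var, row_var) {
  c(col_var = !missing(col_var), row_var = !missing(row_var))
}

# `species` does not exist as an object, but R only evaluates an
# argument when the function body actually uses its value.
supplied <- which_supplied(df = NULL, species)
supplied
```

Because df is matched by name, the unnamed species fills the next open position, which is col_var.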

Argument order matters!

tidy_table <- function(df, row_var, col_var){
  df |> 
    count({{ col_var }}, {{ row_var }}) |> 
    pivot_wider(names_from = {{ col_var }}, 
                values_from = n, 
                values_fill = 0)
}

tidy_table(df = penguins, species)

Error in `pivot_wider()`:
! Must select at least one item.

Arguments that are absolutely necessary should come first!

tidy_table <- function(df, col_var, row_var){
  df |> 
    count({{ col_var }}, {{ row_var }}) |> 
    pivot_wider(names_from = {{ col_var }}, 
                values_from = n, 
                values_fill = 0)
}

To be more defensive you could check if `row_var` is `missing()`

tidy_table <- function(df, col_var, row_var){
  
  if(missing(row_var)){
    df |> 
      count({{ col_var }}) |> 
      pivot_wider(names_from = {{ col_var }}, 
                  values_from = n, 
                  values_fill = 0)
  }
  else {
    df |> 
      count({{ col_var }}, {{ row_var }}) |> 
      pivot_wider(names_from = {{ col_var }}, 
                  values_from = n, 
                  values_fill = 0)
  }
}

What if we wanted to use quoted variable names?

tidy_table(df = penguins, 
           row_var = "species", 
           col_var = "island")

Error in `pivot_wider()`:
! Can't select columns that don't exist.
✖ Column `island` doesn't exist.

What’s going on?

penguins |> 
    count("species", "island")

# A tibble: 1 × 3
  `"species"` `"island"`     n
  <chr>       <chr>      <int>
1 species     island       344
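The same behavior can be reproduced on a smaller scale. In a data-masking function a string is just a constant value, so every row falls into one group. A sketch, assuming dplyr is installed; the toy data frame is invented for illustration.

```r
library(dplyr)

# Invented toy data with a single column.
toy <- data.frame(
  shape = c("circle", "circle", "square", "square", "square")
)

# The string is treated as a constant, not as a column name,
# so all five rows land in a single group.
quoted <- toy |>
  count("shape")

# The bare name refers to the column, giving one row per shape.
unquoted <- toy |>
  count(shape)

quoted
unquoted
```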

We need some helper functions!

For tidy selection, we need to use all_of()

pivot_wider(names_from = all_of(col_var), 
                values_from = n, 
                values_fill = 0)

For data masking, we need to combine all_of() with pick()

count(
  pick(
    all_of( 
      c(row_var, col_var)
      )
    )
  )
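A quick sketch of the two helpers working together outside of a function, assuming dplyr 1.1.0 or later (the release that introduced pick()). The toy data frame is invented, and row_var and col_var are ordinary character vectors here.

```r
library(dplyr)

# Invented toy data.
toy <- data.frame(
  shape = c("circle", "circle", "square", "square", "square"),
  color = c("red", "blue", "red", "red", "blue")
)

# Column names stored as strings (env-variables).
row_var <- "shape"
col_var <- "color"

# all_of() turns the strings into a selection; pick() hands the selected
# columns to the data-masking function count().
by_string <- toy |>
  count(pick(all_of(c(row_var, col_var))))

# The usual unquoted version, for comparison.
by_name <- toy |>
  count(shape, color)

by_string
```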

A Character Vector Function

quote_table <- function(df, row_var, col_var){
  df |> 
    count(
      pick(
        all_of(
          c(row_var, col_var)
          )
        )
      ) |> 
    pivot_wider(names_from = all_of(col_var), 
                values_from = n, 
                values_fill = 0)
}

Add the quote_table() function in your Colab notebook!

Did it work???

quote_table(df = penguins, 
            row_var = "species", 
            col_var = "island")

# A tibble: 3 × 4
  species   Biscoe Dream Torgersen
  <fct>      <int> <int>     <int>
1 Adelie        44    56        52
2 Chinstrap      0    68         0
3 Gentoo       124     0         0

Let’s take a 5-minute break!

Goal #2

Recreate the prop.table() function in R

Let’s Explore the `prop.table()` Function First

Let’s start with one categorical variable.

table(penguins$species) |> 
  prop.table()


   Adelie Chinstrap    Gentoo 
0.4418605 0.1976744 0.3604651

Function Design

What do you notice about the proportions?

Let’s Explore the `prop.table()` Function First

Okay, let’s add a second categorical variable.

table(penguins$species, penguins$island) |> 
  prop.table()

           
               Biscoe     Dream Torgersen
  Adelie    0.1279070 0.1627907 0.1511628
  Chinstrap 0.0000000 0.1976744 0.0000000
  Gentoo    0.3604651 0.0000000 0.0000000

Function Design

What do you notice about the proportions?

An Optional Argument

The prop.table() function has an optional margin argument.

table(penguins$species, penguins$island) |> 
  prop.table(margin = 1)

           
               Biscoe     Dream Torgersen
  Adelie    0.2894737 0.3684211 0.3421053
  Chinstrap 0.0000000 1.0000000 0.0000000
  Gentoo    1.0000000 0.0000000 0.0000000

Function Design

What do you notice about the proportions?

Designing a `tidy_prop_table()` Function

Based on this exploration, it seems like our function should have the following qualities:

accept a data frame
accept variable names as inputs
calculate joint or marginal proportions for each group
pivot the output to a wide format

Writing dplyr Code to Accomplish the Task

Using the penguins data, write dplyr code (not table() or prop.table()) which will:

count() the number of penguins for each species and island
add a column for the joint proportion of each group

A Working Solution

This gives joint proportions for the entire table.

penguins |> 
  count(species, island) |> 
  mutate(prop = n / sum(n))

# A tibble: 5 × 4
  species   island        n  prop
  <fct>     <fct>     <int> <dbl>
1 Adelie    Biscoe       44 0.128
2 Adelie    Dream        56 0.163
3 Adelie    Torgersen    52 0.151
4 Chinstrap Dream        68 0.198
5 Gentoo    Biscoe      124 0.360

What if I wanted marginal proportions for each species? (i.e., within a species, the proportions should add to 1)

Marginal Proportions for `species`

penguins |> 
  count(species, island) |> 
  group_by(species) |> 
  mutate(prop = n / sum(n))

# A tibble: 5 × 4
# Groups:   species [3]
  species   island        n  prop
  <fct>     <fct>     <int> <dbl>
1 Adelie    Biscoe       44 0.289
2 Adelie    Dream        56 0.368
3 Adelie    Torgersen    52 0.342
4 Chinstrap Dream        68 1    
5 Gentoo    Biscoe      124 1

Notice that there is still a grouping variable?

What should I add to my code?

Much better!

penguins |> 
  count(species, island) |> 
  group_by(species) |> 
  mutate(prop = n / sum(n)) |> 
  ungroup()

# A tibble: 5 × 4
  species   island        n  prop
  <fct>     <fct>     <int> <dbl>
1 Adelie    Biscoe       44 0.289
2 Adelie    Dream        56 0.368
3 Adelie    Torgersen    52 0.342
4 Chinstrap Dream        68 1    
5 Gentoo    Biscoe      124 1
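As a possible alternative, recent versions of dplyr (1.1.0 and later) let mutate() group for a single step through its .by argument, so no ungroup() is needed afterwards. The sketch below compares the two approaches on an invented toy data frame; it is not the approach used in the rest of the activity.

```r
library(dplyr)

# Invented toy data.
toy <- data.frame(
  shape = c("circle", "circle", "square", "square", "square"),
  color = c("red", "blue", "red", "red", "blue")
)

# The approach from the slides: group, compute, then ungroup.
grouped <- toy |>
  count(shape, color) |>
  group_by(shape) |>
  mutate(prop = n / sum(n)) |>
  ungroup()

# Per-operation grouping: the grouping applies to this mutate() only.
per_operation <- toy |>
  count(shape, color) |>
  mutate(prop = n / sum(n), .by = shape)

per_operation
```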

What about pivoting?

For this table, we don’t care about the counts. Let’s add some code that:

removes the column of counts
pivots the table to a wide format
replaces NA values with 0s

A Working Solution

penguins |> 
  count(species, island) |> 
  group_by(species) |> 
  mutate(prop = n / sum(n)) |> 
  ungroup() |> 
  select(-n) |> 
  pivot_wider(names_from = island, 
              values_from = prop, 
              values_fill = 0)

# A tibble: 3 × 4
  species   Biscoe Dream Torgersen
  <fct>      <dbl> <dbl>     <dbl>
1 Adelie     0.289 0.368     0.342
2 Chinstrap  0     1         0    
3 Gentoo     1     0         0

Now let’s generalize!

df |> 
  count(row_var, col_var) |> 
  group_by(col_var) |> 
  mutate(prop = n / sum(n)) |> 
  ungroup() |> 
  select(-n) |> 
  pivot_wider(names_from = col_var,
              values_from = prop, 
              values_fill = 0)

Let’s make a function

tidy_prop_table <- function(df, col_var, row_var){
  
  df |> 
    count({{ row_var }}, {{ col_var }}) |> 
    group_by({{ col_var }}) |> 
    mutate(prop = n / sum(n)) |> 
    ungroup() |> 
    select(-n) |> 
    pivot_wider(names_from = {{ col_var }},
                values_from = prop, 
                values_fill = 0)
  
}

Let’s try it out!

tidy_prop_table(df = penguins, 
                col_var = island, 
                row_var = species)

# A tibble: 3 × 4
  species   Biscoe Dream Torgersen
  <fct>      <dbl> <dbl>     <dbl>
1 Adelie     0.262 0.452         1
2 Chinstrap  0     0.548         0
3 Gentoo     0.738 0             0

What if I wanted to get marginal proportions for the `row_var`?

The margin argument of prop.table() has the following behavior:

when no margin is specified the proportions are joint
when margin = 1 the proportions are conditional on the rows
when margin = 2 the proportions are conditional on the columns
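These three behaviors can be checked directly in base R. The sketch below types in the species-by-island counts from the table() output shown earlier, so it runs without the penguins data; the object names are arbitrary.

```r
# Counts copied from the table() output earlier in the slides.
# matrix() fills column by column, so each line below is one island.
counts <- matrix(
  c(44,  0, 124,   # Biscoe
    56, 68,   0,   # Dream
    52,  0,   0),  # Torgersen
  nrow = 3,
  dimnames = list(
    species = c("Adelie", "Chinstrap", "Gentoo"),
    island  = c("Biscoe", "Dream", "Torgersen")
  )
)

joint  <- prop.table(counts)              # whole table sums to 1
by_row <- prop.table(counts, margin = 1)  # each row sums to 1
by_col <- prop.table(counts, margin = 2)  # each column sums to 1

sum(joint)
rowSums(by_row)
colSums(by_col)
```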

A Pseudocode Design

if( margin is missing){
  calculate joint proportions
} 

else if(margin is rows){
  calculate marginal proportions based on row variable
}

else {
  calculate marginal proportions based on column variable
}

Moving into R Code

tidy_prop_table <- function(df, col_var, row_var, margin = NULL){

  # Default to joint proportions
  if(is.null(margin)){
    df |>
      count({{ row_var }}, {{ col_var }}) |>
      mutate(prop = n / sum(n)) |>
      ungroup() |>
      select(-n) |>
      pivot_wider(names_from = {{ col_var }},
                  values_from = prop,
                  values_fill = 0)
  }
  else if(margin == "row"){
    df |>
      count({{ row_var }}, {{ col_var }}) |>
      group_by({{ row_var }}) |>
      mutate(prop = n / sum(n)) |>
      ungroup() |>
      select(-n) |>
      pivot_wider(names_from = {{ col_var }},
                  values_from = prop,
                  values_fill = 0)
  }
  else{
    df |>
      count({{ row_var }}, {{ col_var }}) |>
      group_by({{ col_var }}) |>
      mutate(prop = n / sum(n)) |>
      ungroup() |>
      select(-n) |>
      pivot_wider(names_from = {{ col_var }},
                  values_from = prop,
                  values_fill = 0)
  }

}
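The three branches above differ only in the grouping step, so one possible refactor is to count once, group conditionally, and share the rest of the pipeline. The sketch below is a hypothetical variant, not the version from the activity: tidy_prop_sketch() is an invented name, and it swaps the NULL default for a "joint" option checked with match.arg(), so a misspelled margin raises an error instead of silently falling through to the column case. It assumes dplyr and tidyr are installed and uses an invented toy data frame.

```r
library(dplyr)
library(tidyr)

# Hypothetical variant of tidy_prop_table() with a validated margin.
tidy_prop_sketch <- function(df, col_var, row_var,
                             margin = c("joint", "row", "col")) {
  # Picks "joint" by default and errors on anything not in the list.
  margin <- match.arg(margin)

  counts <- df |>
    count({{ row_var }}, {{ col_var }})

  # Only the grouping differs between the three kinds of proportions.
  if (margin == "row") {
    counts <- counts |> group_by({{ row_var }})
  } else if (margin == "col") {
    counts <- counts |> group_by({{ col_var }})
  }

  counts |>
    mutate(prop = n / sum(n)) |>
    ungroup() |>
    select(-n) |>
    pivot_wider(names_from = {{ col_var }},
                values_from = prop,
                values_fill = 0)
}

# Invented toy data.
toy <- data.frame(
  shape = c("circle", "circle", "square", "square", "square"),
  color = c("red", "blue", "red", "red", "blue")
)

joint  <- tidy_prop_sketch(toy, col_var = color, row_var = shape)
by_row <- tidy_prop_sketch(toy, col_var = color, row_var = shape,
                           margin = "row")
by_col <- tidy_prop_sketch(toy, col_var = color, row_var = shape,
                           margin = "col")
by_row
```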

How’d we do?

Joint Proportions

tidy_prop_table(df = penguins, 
                col_var = species, 
                row_var = island)

# A tibble: 3 × 4
  island    Adelie Gentoo Chinstrap
  <fct>      <dbl>  <dbl>     <dbl>
1 Biscoe     0.128  0.360     0    
2 Dream      0.163  0         0.198
3 Torgersen  0.151  0         0

How’d we do?

Marginal Proportions – Rows

tidy_prop_table(df = penguins, 
                col_var = species, 
                row_var = island, 
                margin = "row")

# A tibble: 3 × 4
  island    Adelie Gentoo Chinstrap
  <fct>      <dbl>  <dbl>     <dbl>
1 Biscoe     0.262  0.738     0    
2 Dream      0.452  0         0.548
3 Torgersen  1      0         0

How’d we do?

Marginal Proportions – Columns

tidy_prop_table(df = penguins, 
                col_var = species, 
                row_var = island, 
                margin = "col")

# A tibble: 3 × 4
  island    Adelie Gentoo Chinstrap
  <fct>      <dbl>  <dbl>     <dbl>
1 Biscoe     0.289      1         0
2 Dream      0.368      0         1
3 Torgersen  0.342      0         0

What questions do we have?
