Data Wrangling with dplyr

Tuesday, September 30

Today we will…

  • Warm-up for dplyr practice activity (40-minutes)
  • Set-up for the practice activity (5-minutes)
    • Review pair programming norms
  • Take a 5-minute break
  • Find your partner!
  • Complete the practice activity (60-minutes)

Data Wrangling with dplyr

A picture of a set of Mathlink blocks on a desk. The blocks are connected together forming three rows and six columns. The columns are red, orange, yellow, green, blue, and purple. Each block within a row and colum has a shape in the center (e.g., triangle, square hexagon).

Data Comes First!

Every function in dplyr has the data as the first argument. You can choose whether to:

declare your data as the first argument of the function

filter(.data = colleges, REGION == 5) |>
  mutate(TUITION_DIFF = TUITIONFEE_OUT - TUITIONFEE_IN)

or

pipe your data into the first argument of the function

colleges |>
  filter(REGION == 5) |>
  mutate(TUITION_DIFF = TUITIONFEE_OUT - TUITIONFEE_IN)

Data Frame

A picture of a set of Mathlink blocks on a desk. The blocks are connected together forming three rows and six columns. The columns are red, orange, yellow, green, blue, and purple. Each block within a row and colum has a shape in the center (e.g., triangle, square hexagon).

Data Frame

A picture of a set of Mathlink blocks on a desk. The blocks are connected together forming three rows and six columns. The columns are red, orange, yellow, green, blue, and purple. Each block within a row and colum has a shape in the center (e.g., triangle, square hexagon).

How many rows?

Data Frame

A picture of a set of Mathlink blocks on a desk. The blocks are connected together forming three rows and six columns. The columns are red, orange, yellow, green, blue, and purple. Each block within a row and colum has a shape in the center (e.g., triangle, square hexagon).

How many columns?

Observations

Here is one observation:

A picture of one row of the previous set of Mathlink blocks. The set only has one of each color (column): red, orange, yellow, green, blue, and purple.

Looking at the column green, how many sides does the observation have?

filter()

filter()

include rows based on one or more logical statements

With Your Data Frame

filter it (only include rows where)

  1. the red column only includes observations with three sides (triangles)

OR

  1. the green column only includes observations with more than four sides (pentagons, hexagons)

How would we write this as R code?

the red column only includes observations with three sides (triangles) OR the green column only includes observations with more than four sides (pentagons, hexagons)

A table of example logical operations (comparisons) that can be made between two objects a and b.

A table of boolean operations and (&), or (|) and not (!)

How would we write this as R code?



data |>
  filter(red == 3 | 
         green > 4)


“Take the data and then filter it to only include red observations with 3 sides or green observations with 4 or more sides.”

Changing from OR to AND



What if we wanted observations where the red column had three sides AND the green column had four or more sides?

How would the code change?

Default is AND



data |>
  filter(red == 3,
         green > 4)



the default in filter() is the AND condition

Why not use &?

If a , is equivalent to an & why not just use a &?

Well, with a lot of &s your code can get hard to read…

data |>
  filter(red == 3 & green > 4 & orange >= 3)


So instead we use commas.

data |>
  filter(red == 3, 
         green > 4, 
         orange >= 3)

Reset your Data Frame!

select()

select()

include columns based on one or more logical statements

With Your Data Frame



data |>
  select(red, 
         yellow, 
         green)

With Your Data Frame



data |>
  select(-green)

Reset your Data Frame!

mutate()

mutate()

create new columns or change existing columns

With Your Data Frame



data |>
  mutate(
    purple = c(4, 4, 5)
        )

if_else(condition, true, false)

  • condition is a logical test (or combination of logical tests)
  • true is the value output if the logical test is found to be TRUE
  • false is the value output if the logical test is found to be FALSE

With Your Data Frame



data |>
  mutate(
    orange = if_else(blue == 6, 4, 3)
         )

With Your Data Frame



data |>
  mutate(
    orange = if_else(blue == 6, 4, 3),
    green = orange + 1
    )

arrange()

Organize the rows of the data in order of a particular variable.

With Your Data Frame



data |>
  arrange(yellow)

What order does arrange() use as default?

arrange(): Descending Order

Default is ascending order…

…but can add desc() to get descending order!

data |>
  arrange(
    desc(yellow)
    )

arrange() + filter()

These functions implicitly arrange the data before slicing it (selecting rows).

  • slice_min() – select rows with the lowest value(s) of a variable
  • slice_max() – select rows with the highest value(s) of a variable

With Your Data Frame



data |>
  slice_max(red, n = 2)

Reset your Data Frame!

summarize()

summarize()

compute a table of summaries

With Your Data Frame



data |>
  summarize(
    max_purple = max(purple)
    )

With Your Data Frame



data |>
  summarize(
    max_purple = max(purple),
    max_blue = max(blue),
    min_red = min(red)
    )

group_by()

group_by()

put rows into groups based on values in column(s)

With Your Data Frame



data |>
  group_by(blue) |>
  summarize(
    max_red = max(red)
    )

With Your Data Frame



data |>
  group_by(blue) |>
  summarize(
    max_red = max(red),
    min_yellow = min(yellow)
    )

Reset your Data Frame!

With Your Data Frame



data |>
  group_by(orange, purple) |>
  summarize(
    min_blue = min(blue)
    )

Combine operations!

data |>
  filter(blue > 3) |>
  select(red, yellow, blue)|>
  mutate(green = blue - 1)

PA 3: Identify the Mystery College

Today you will use the dplyr package to clean some data and then use that cleaned data to figure out what college Ephelia has been accepted to.

A picture of a college building covered in green ivy. The building is reminescent of gothic architecture, with a brick exterior and many windows.

Finding Ephelia’s College

This activity will require knowledge of:

  • debugging code errors
  • function syntax
  • logical comparisons
  • identifying what actions need to be taken
  • locating what dplyr verb(s) can accomplish this task
  • chaining steps together with the pipe operator

This image is a comic illustration titled 'debugging' featuring a series of 10 round green characters with various facial expressions and captions describing the stages of debugging code. The first character is smiling confidently with the caption 'I got this.' The second character has a neutral, confused expression with the caption 'Huh. Really thought that was it.' The third character looks puzzled and silent, captioned '(...).' The fourth character looks annoyed, with the caption 'Fine. Restarting.' The fifth character is angry and blushing, captioned 'OH WTF.' The sixth character is having a meltdown, looking wild-eyed and frazzled with the caption 'Zombie meltdown.' The seventh character looks exhausted and sleepy, drooling a bit, captioned with '...' The eighth character has a lightbulb above its head and a hopeful expression, captioned 'A NEW HOPE!' The ninth character looks determined, typing on a keyboard with a caption of '[insert awesome theme song].' Finally, the tenth character is joyful and celebrating with confetti, captioned 'I love CODING' The illustration humorously captures the emotional rollercoaster of debugging code.

None of us have all these abilities. Each of us has some of these abilities.

dplyr Resources

Every of you should have a dplyr cheatsheet!

On the Front

  • Column 1: grouped summaries (group_by() + summarize())
  • Column 2: filter()ing values with logical comparisons
  • Column 3: select()ing and mutate()ing variables

On the Back

  • Column 2: summary functions you might like to use with summarize()

A picture of the dplyr cheatsheet, which contains helpful information on working with data in a variety of ways.

Pair Programming Expectations

A diagram shows a collaborative software development process in four stages arranged in a cycle. At the top, a woman speaks with the label 'VOCALIZE.' To the right, she points to a diagram with the label 'EXPLAIN.' At the bottom, a man types on a laptop with the label 'IMPLEMENT.' On the left, a computer monitor displays a bug symbol with the label 'DEBUG.' Arrows connect the stages in a loop: Vocalize → Explain → Implement → Debug → back to Vocalize.

Task Card



Every group should have a task card! These cards remind you of the expectations for each role and the collaborative norms we agree to.

Opening the Practice Activity in Google Colab

The Computer needs to:

  • Click on the Practice Activity 3 link from Canvas
  • Log-in to your Google account
  • Make a copy of the Colab notebook

A screenshot of the options provided when you click on the File pane within Google Colab. The option to save a copy in Drive is highlighted, to demonstrate how each student needs to make a copy of the notebook before sharing it with their partner.

Sharing with Your Partner

Once you have your copy, you need to:

  • Share your copy with your partner’s Google account
  • Make sure the Coder can open the file
  • The Computer should plug their laptop into the monitor
  • The Coder should close their computer
  • Get started!

A screenshot of the options provided when you click on the Share pane (in the upper right corner) within Google Colab. The user has typed in a Gmail address to share it with Laura Smith, who will receive an email with a link to the document when the Notify option is checked.

Why are we switching?

Throughout the activity you will swap roles—the Computer will become the Coder, and the Coder will become the Computer.


We are alternating roles so everyone:

  • has the chance to apply their coding skills
  • practices talking through their code
  • has the opportunity to explain their thinking

External Resources

During the Practice Activity, you are not permitted to use Google or ChatGPT for help.


You are permitted to use:

  • the dplyr cheatsheet,
  • the course textbook,
  • any notes you’ve taken, and
  • the course slides.

Submission

Submit the full name of the college Ephelia will attend to the Canvas Quiz.

  • Each person will input the full name of the college in the PA3 Canvas quiz.
  • The person who last occupied the role of Computer will print the notebook as a PDF and submit the PDF for the group.
    • Only one submission per group!
  • Complete the collaboration survey

5-minute break

Team Assignments - 9am

The partner whose birthday is closest to January 1st starts as the Computer, making a copy of the PA Colab notebook!

Team Assignments - 12pm

The partner whose birthday is closest to January 1st starts as the Computer, making a copy of the PA Colab notebook!