Using stringr to Work with Strings

Tuesday, October 14

Today we will…

  • Layout for Week 5
  • New material
    • String variables
    • Functions for working with strings
    • Regular expressions
  • PA 5.1: Scrambled Message

Week 5 Layout

  • Today: Strings with stringr
    • Practice Activity: Decoding a Message
  • Thursday: Dates with lubridate
    • Practice Activity: Jewel Heist
  • Lab Assignment: Solving a Murder Mystery
    • Using dplyr + stringr + lubridate

String Variables

What is a string?

A string is a sequence of characters stored as a single value.

There is a difference between…

…a string (many characters, one object)…

and

…a character vector (vector of strings).

my_string <- "Hi, my name is Bond!"
my_string
my_vector <- c("Hi", "my", "name", "is", "Bond")
my_vector
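One way to see the difference is to compare length() with str_length(). This is a minimal sketch that reuses the two objects above; the names n_string, n_vector, chars_string, and chars_vector are made up for illustration.

```r
library(stringr)

my_string <- "Hi, my name is Bond!"
my_vector <- c("Hi", "my", "name", "is", "Bond")

# length() counts the elements of a vector, not the characters
n_string <- length(my_string)   # 1: one string
n_vector <- length(my_vector)   # 5: five strings

# str_length() counts the characters inside each element
chars_string <- str_length(my_string)   # 20
chars_vector <- str_length(my_vector)   # 2 2 4 2 4
```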

Strings in a Data Frame

We’ve encountered a lot of strings before in the datasets we’ve worked with.

  • penguins
    • species
    • island
    • sex
  • colleges
    • STABBR
    • INSTNM
  • military
    • Country


Until now, we’ve taken the values of these string variables for granted, but today we’re going to learn how to use regular expressions to look for and/or modify specific values!

Strings in a Data Frame

For the colleges dataset:

a string is:

colleges$INSTNM[214]


a character vector is:

colleges$INSTNM 


stringr

Common tasks

  • Identify strings containing a particular pattern.
  • Remove or replace a pattern.
  • Edit a string (e.g., make it lowercase).

Note

  • The stringr package loads with tidyverse.
  • All functions are of the form str_xxx().

string =

None of the stringr functions have a .data = argument! These functions only accept a character vector (string =) as an input.

# This does NOT work: str_detect() has no data argument
str_detect(data = colleges, 
           string = INSTNM, 
           pattern = "California")


So, these functions will need to be combined with functions from dplyr to work with a dataset!
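A minimal sketch of two calls that do work. The data frame toy_colleges is a made-up stand-in for colleges (only the INSTNM column is assumed), and the names in it are for illustration only.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for the colleges data
toy_colleges <- data.frame(
  INSTNM = c("California State University-Fresno",
             "Reed College",
             "University of California-Davis")
)

# Option 1: pull the column out as a character vector
in_ca <- str_detect(toy_colleges$INSTNM, pattern = "California")

# Option 2: call str_detect() inside a dplyr verb,
# where the column name INSTNM can be used directly
ca_rows <- toy_colleges |> 
  filter(str_detect(INSTNM, pattern = "California"))
```

In both cases the first argument of str_detect() is a character vector, never the data frame itself.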

pattern =

The pattern argument appears in many stringr functions.

  • The pattern must be supplied inside quotes.
str_detect(colleges$INSTNM, 
           pattern = "Polytechnic")

str_remove(colleges$INSTNM, 
           pattern = "(University|College)")

# "^u" matches a lowercase u at the start of a name
str_replace(colleges$INSTNM, 
            pattern = "^u", 
            replacement = "U")


Let’s talk more about what some of these symbols mean.

Regular Expressions

“Regular expressions are a very terse language that allow you to describe patterns in strings.”

R for Data Science

Regular Expressions

…are tricky!

  • There are lots of new symbols to keep straight.
  • There are a lot of cases to think through.


We’re going to focus on:

  • anchors
  • quantifiers
  • character classes
  • groups

Anchor Characters: ^ $

^ – looks at the beginning of a string.

str_subset(colleges$INSTNM, 
           pattern = "^California State")

$ – looks at the end of a string.

str_subset(colleges$INSTNM, 
           pattern = "State University$")
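A small sketch of how the anchors change what is returned. The vector toy_names is made up for illustration and is not taken from the colleges data.

```r
library(stringr)

# Hypothetical institution names
toy_names <- c("California State University-Chico",
               "Humboldt State University",
               "University of California-Berkeley")

# ^ only matches when the name starts with the pattern
starts_csu <- str_subset(toy_names, pattern = "^California State")

# $ only matches when the name ends with the pattern
ends_su <- str_subset(toy_names, pattern = "State University$")

# No anchor: the pattern can appear anywhere in the name
anywhere <- str_subset(toy_names, pattern = "California")
```

Here starts_csu keeps only the Chico name, ends_su keeps only the Humboldt name, and anywhere keeps both names that mention California.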

Quantifier Characters: + *

+ – occurs 1 or more times

str_subset(colleges$INSTNM, 
           pattern = "St\\.+")

* – occurs 0 or more times

str_subset(colleges$INSTNM, 
           pattern = "\\s*-\\s*")

Quantifier Characters: {}

{n} – occurs exactly n times

str_subset(colleges$INSTNM, 
           pattern = "[A-Z]{4}")

{n,m} – occurs between n and m times

str_subset(colleges$INSTNM, 
           pattern = "[A-Z]{4,6}")

Want at least 4? {4,}
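The quantifiers can be compared on a small made-up vector (toy_names is hypothetical, not from the colleges data). This is only a sketch of how +, *, {n}, and {n,} differ.

```r
library(stringr)

# Hypothetical names
toy_names <- c("UCLA", "MIT", "St. Mary's", "Stanford")

# {4}: four capital letters in a row
four_caps <- str_subset(toy_names, pattern = "[A-Z]{4}")

# {3,}: three or more capital letters in a row
three_plus <- str_subset(toy_names, pattern = "[A-Z]{3,}")

# +: "St" followed by at least one literal period
st_plus <- str_subset(toy_names, pattern = "St\\.+")

# *: "St" followed by zero or more periods, so "Stanford" also matches
st_star <- str_subset(toy_names, pattern = "St\\.*")
```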

Character Groups: ()

  • () creates a group of characters
  • We can specify “either” / “or” within a group using |.
str_subset(colleges$INSTNM, 
           pattern = "(T|t)ech")
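The parentheses control how far the | reaches. A short sketch on a made-up vector (toy_names is hypothetical) shows what changes when the group is left out.

```r
library(stringr)

# Hypothetical names
toy_names <- c("Georgia Tech", "Caltech", "Texas A&M")

# With a group: "T" or "t", followed by "ech"
grouped <- str_subset(toy_names, pattern = "(T|t)ech")

# Without a group: either "T" on its own or "tech",
# so any name containing a capital T matches
ungrouped <- str_subset(toy_names, pattern = "T|tech")

# A character class gives the same result as the group here
with_class <- str_subset(toy_names, pattern = "[Tt]ech")
```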

Character Classes: []

[ ] – matches any one character from the set or range inside the brackets.

  • [A-Z] or [:upper:] matches any capital letter.
  • [a-z] or [:lower:] matches any lowercase letter.
  • [A-Za-z] or [:alpha:] matches any letter
  • [0-9] or [:digit:] matches any digit
  • [:alnum:] matches any alphanumeric character
  • [:punct:] matches any punctuation character
str_subset(colleges$INSTNM, 
           pattern = "[A-Z]{4}")

str_subset(colleges$INSTNM, 
           pattern = "[:digit:]+")

str_subset(colleges$INSTNM, 
           pattern = "[:punct:]{1,}")
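A brief sketch of character classes in action, including why the range [A-z] is best avoided: it also covers the handful of symbols that sit between "Z" and "a" in the character table, such as the underscore. The strings used here are made up for illustration.

```r
library(stringr)

# [A-z] also covers symbols such as "_" that fall between "Z" and "a"
underscore_wide <- str_detect("_", pattern = "[A-z]")      # TRUE

# [A-Za-z] covers letters only
underscore_safe <- str_detect("_", pattern = "[A-Za-z]")   # FALSE

# A class plus a quantifier pulls out a run of digits
room_number <- str_extract("Room_101", pattern = "[:digit:]+")   # "101"
```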

Excluding Characters

[^ ] – specifies characters not to match on

str_subset(colleges$INSTNM, 
           pattern = "[^y]$")

Beware: a ^ doesn’t always mean “not”


Starts with “University”

str_subset(colleges$INSTNM, 
           pattern = "^University")

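Does not start with “University”

str_subset(colleges$INSTNM, 
           pattern = "^University", 
           negate = TRUE)

Note: "^[^University]" would not do this. Inside [ ], the letters form a set, so that pattern keeps names whose first character is anything other than U, n, i, v, e, r, s, t, or y.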
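A character class always matches a single character, never a whole word. The sketch below compares "^[^University]" with the negate = TRUE argument of str_subset() on a made-up vector (toy_names is hypothetical).

```r
library(stringr)

# Hypothetical names
toy_names <- c("University of Utah",
               "Utah State University",
               "Yale University")

# First character must not be any of U, n, i, v, e, r, s, t, y,
# so every name starting with "U" is dropped
class_negation <- str_subset(toy_names, pattern = "^[^University]")

# Keep the names that do not start with the word "University"
word_negation <- str_subset(toy_names, 
                            pattern = "^University", 
                            negate = TRUE)
```

Only word_negation keeps "Utah State University", which starts with "U" but not with "University".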

Let’s use these expressions!

Detecting Patterns

str_detect()

Returns a logical vector indicating whether the pattern was found in each element of the supplied vector.

my_vector <- c("Hello,", 
               "my name is", 
               "Bond", 
               "James Bond")


str_detect(my_vector, pattern = "Bond")
  • Pairs well with filter()
  • Works with summarize()
    • sum (to get total matches)
    • mean (to get proportion of matches)

str_detect() with filter()

Which colleges in the dataset have “Polytechnic” in their name?

colleges |> 
  filter(str_detect(INSTNM, pattern = "Polytechnic"))


str_detect() with summarize()

How many colleges in the colleges dataset have “Polytechnic” in their name?

colleges |> 
  summarize(
    count_polytech = sum(
      str_detect(INSTNM, pattern = "Polytechnic")
      ) 
    )
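Because TRUE counts as 1 and FALSE as 0, sum() gives the number of matches and mean() gives the proportion. A minimal sketch on a made-up data frame (toy_colleges and its four names are hypothetical):

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for the colleges data
toy_colleges <- data.frame(
  INSTNM = c("California Polytechnic State University-San Luis Obispo",
             "Reed College",
             "Rensselaer Polytechnic Institute",
             "Pomona College")
)

poly_summary <- toy_colleges |> 
  summarize(
    # number of names containing "Polytechnic"
    count_polytech = sum(str_detect(INSTNM, pattern = "Polytechnic")),
    # proportion of names containing "Polytechnic"
    prop_polytech = mean(str_detect(INSTNM, pattern = "Polytechnic"))
  )
```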

Replace / Remove Patterns

str_replace()

replace the first matched pattern in each string

str_replace(my_vector, 
            pattern = "Bond", 
            replacement = "Franco")

str_replace() with mutate()

Make capitalization of “University” consistent

colleges |> 
  mutate(INSTNM = str_replace(INSTNM, 
                              pattern = "university", 
                              replacement = "University")
         )

str_remove()

remove the first matched pattern in each string

str_remove(my_vector, 
           pattern = "Bond")


Related Functions

str_remove(x, pattern) is a special case of str_replace(x, pattern, replacement = "").
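Both functions act on the first match only. When every match should change, stringr also provides str_replace_all() and str_remove_all(). A short sketch on a made-up string (quote is hypothetical):

```r
library(stringr)

quote <- "Bond, James Bond"

# First match only
first_replaced <- str_replace(quote, pattern = "Bond", replacement = "Franco")
first_removed  <- str_remove(quote, pattern = "Bond")

# Every match
all_replaced <- str_replace_all(quote, pattern = "Bond", replacement = "Franco")
all_removed  <- str_remove_all(quote, pattern = "Bond")
```

Here first_replaced is "Franco, James Bond", while all_replaced is "Franco, James Franco".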

str_remove() with mutate()

Remove “College” or “University” at the end of each name so only the main institution name remains.

# \\s* also removes any space before the word; $ anchors the match to the end
colleges |> 
  mutate(INSTNM = str_remove(INSTNM, 
                             pattern = "\\s*(College|University)$"
                             )
         )

Find the Length of a String

str_length()

returns the number of characters in each string

colleges |> 
  mutate(
    name_length = str_length(INSTNM)
         ) |> 
  select(INSTNM, name_length)

Change the Length

shorten or lengthen a string to a specified length

str_sub() extracts the part of each string between a starting and an ending position.

colleges |> 
  mutate(short_name = str_sub(INSTNM, 
                              start = 1,
                              end = 8)
         )

str_pad() pads every string to at least a given length (width); strings that are already longer are left unchanged.

colleges |> 
  mutate(long_name = str_pad(INSTNM, 
                             width = 20, 
                             pad = "_", 
                             side = "both")
         )
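A quick sketch of what these functions return on two made-up names (toy_names is hypothetical). It also uses str_trunc(), a stringr function not shown above, as one possible way to cap the length of long strings.

```r
library(stringr)

# Hypothetical names: one short (12 characters), one long (39 characters)
toy_names <- c("Reed College",
               "California Polytechnic State University")

# Characters 1 through 8 of each name
short_name <- str_sub(toy_names, start = 1, end = 8)

# Pad to at least 20 characters; the long name is returned unchanged
long_name <- str_pad(toy_names, width = 20, pad = "_", side = "both")

# Cap each name at 20 characters (the ellipsis counts toward the width)
capped_name <- str_trunc(toy_names, width = 20)
```

Here short_name is "Reed Col" and "Californ", and the padded short name is "____Reed College____".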

Modify Characters

Edit Capitalization of Strings

Convert letters in a string to a specific capitalization format.

str_to_lower() converts all letters in a string to lowercase.

colleges |> 
  mutate(INSTNM = str_to_lower(INSTNM))

str_to_upper() converts all letters in a string to uppercase.

colleges |> 
  mutate(INSTNM = str_to_upper(INSTNM))

str_to_title() converts the first letter of each word to uppercase.

colleges |> 
  mutate(INSTNM = str_to_title(INSTNM))

Handling Whitespace

str_trim()

removes whitespace from start and end of string

colleges |> 
  mutate(
    INSTNM = str_trim(INSTNM, side = "both")
         )
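str_trim() leaves whitespace in the middle of a string alone. If repeated internal spaces also need cleaning, str_squish() is one option. A small sketch on a made-up string (messy is hypothetical):

```r
library(stringr)

messy <- "  Cal   Poly  "

# Removes whitespace at the start and end only
trimmed <- str_trim(messy, side = "both")

# Removes whitespace on one side only
left_trimmed <- str_trim(messy, side = "left")

# Trims both ends and collapses internal runs of spaces to one space
squished <- str_squish(messy)
```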

Joining Strings

str_c()

join multiple strings into a single character vector

colleges |> 
  mutate(
    address = str_c(CITY, STABBR, ZIP, sep = ", ")
         )

Note

Similar to paste() and paste0(), but stricter: str_c() returns NA when any of the pieces is missing, rather than pasting in the text "NA".
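A minimal sketch of how str_c() and paste() treat a missing value differently. The city and state values are made up for illustration.

```r
library(stringr)

# With complete inputs, str_c() and paste() agree
full_c     <- str_c("San Luis Obispo", "CA", sep = ", ")
full_paste <- paste("San Luis Obispo", "CA", sep = ", ")

# With a missing piece, str_c() returns NA ...
missing_c <- str_c("Davis", NA, sep = ", ")

# ... while paste() turns the NA into the text "NA"
missing_paste <- paste("Davis", NA, sep = ", ")
```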

Tips for String Success

  • Refer to the stringr cheatsheet

  • Refer to your handout!

  • Remember that str_xxx functions need the first argument to be a vector of strings, not a dataset!

    • You will use these functions inside dplyr verbs like filter() or mutate().

PA 5.1: Scrambled Message

In this activity, you will use functions from the stringr package and regular expressions to decode a message.

[Image: a pile of tiles from the game of Scrabble]

This activity will require knowledge of:

  • functions from dplyr
  • stringr functions for previewing string contents
  • regular expressions for locating patterns
  • stringr functions for removing whitespace
  • stringr functions for truncating strings
  • stringr functions for replacing patterns
  • stringr functions for combining multiple strings

None of us have all these abilities. Each of us has some of these abilities.

Pair Programming Expectations

External Resources

During the Practice Activity, you are not permitted to use Google or ChatGPT for help.


You are permitted to use:

  • the handout,
  • the stringr cheatsheet,
  • the course textbook, and
  • the course slides.

Submission

Submit the name of the movie the quote is from.

  • Each person will input the full name of the movie the scrambled message is from into the PA5 quiz.
  • The person who last occupied the role of Typer will submit the link for your Colab notebook (don’t forget to use the share link!)
    • Only one submission per group!

5-minute break

Team Assignments - 9am

The person who has the most pets starts as the Typer (listening to explanations from the Talker)!

Team Assignments - 12pm

The person who has the most pets starts as the Typer (listening to explanations from the Talker)!

Exit Ticket