Working with Strings Using stringr

This week is all about special data types in R. Similar to the tools you learned last week for working with factors, this week you are going to learn about tools for working with strings and dates. By the end of this week you should be able to:

📽 Watch Videos: 20 minutes

📖 Readings: 60-75 minutes

✅ Preview Activities: 1

1 Working with Strings

Strings are how R stores text, and they give us a powerful way to work with text variables (e.g., day_of_week, job_title, gender). We encountered a few string-related functions in Week 3, when we selected columns from a dataset based on their names (e.g., select(colleges, starts_with("TUITION"))) and when we filtered rows based on a column’s values (e.g., filter(colleges, STABBR == "CO")). Notice that in both cases, we are using the literal names of the variables or values in the data.
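As a quick refresher on matching literal names and values, here is a minimal sketch. The colleges tibble below is a tiny made-up stand-in for the course dataset: the column names other than STABBR and all of the values are assumptions for illustration only.

```r
library(dplyr)

# Hypothetical stand-in for the course's colleges data; all values are invented
colleges <- tibble(
  INSTNM = c("School A", "School B", "School C"),
  STABBR = c("CO", "CA", "CO"),
  TUITIONFEE_IN = c(10000, 12000, 9000),
  TUITIONFEE_OUT = c(20000, 30000, 18000)
)

# Keep only the columns whose names begin with the literal text "TUITION"
tuition_cols <- select(colleges, starts_with("TUITION"))

# Keep only the rows whose STABBR value is exactly "CO"
co_colleges <- filter(colleges, STABBR == "CO")
```

Both calls rely on text written out exactly as it appears in the data, which is the limitation that regular expressions help us get around.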

This week, we are going to learn about regular expressions, a shorthand way to search for patterns within strings.

📖 Required Reading: R4DS – Strings

Important

The stringr cheatsheet is in Week 5 of your coursepack! If you do not have a coursepack, I would strongly recommend printing the stringr cheatsheet.

Note: Common stringr functions and what they do
In the functions below, x is the string (or vector of strings) and pattern is a pattern to be found within the string.

| Task | stringr function | Output |
| --- | --- | --- |
| Find a pattern and replace it | str_replace(x, pattern, replacement), str_replace_all(x, pattern, replacement) | Modified string or character vector |
| Convert a string from uppercase to lowercase or vice versa | str_to_lower(x), str_to_upper(x), str_to_title(x) | Modified string or character vector |
| Strip whitespace from the start / end of a string | str_trim(x), str_squish(x) (str_squish() also collapses repeated whitespace inside the string) | Modified string or character vector |
| Detect if the string contains a pattern | str_detect(x, pattern) | Logical |
| Count how many times a pattern appears in the string | str_count(x, pattern) | Numeric |
| Find the first appearance of the pattern within the string | str_locate(x, pattern) | Integer matrix (start position, end position) |
| Find all appearances of the pattern within the string | str_locate_all(x, pattern) | List of integer matrices (start position, end position), one per string |
| Detect if a string contains a pattern at the start / end | str_starts(x, pattern), str_ends(x, pattern) | Logical |
| Subset a string from index a to b | str_sub(x, a, b) | Modified string or character vector |
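To see a few of the functions from the table in action, here is a short sketch you can run in the console. The vector x is made up for illustration and is not part of the course data.

```r
library(stringr)

# Made-up example vector (not from the course data)
x <- c("  apple pie ", "banana", "Cherry  tart")

str_trim(x)                          # strips whitespace at both ends of each string
str_squish(x)                        # also collapses repeated spaces inside each string
str_to_upper("banana")               # "BANANA"
str_detect(x, "an")                  # FALSE TRUE FALSE
str_count("banana", "a")             # 3
str_replace("banana", "a", "o")      # "bonana" (first match only)
str_replace_all("banana", "a", "o")  # "bonono" (every match)
str_locate("banana", "an")           # start = 2, end = 3
str_starts("banana", "ban")          # TRUE
str_sub("banana", 1, 3)              # "ban"
```

Notice that every function takes the string first and the pattern second, which makes them easy to use inside mutate() or filter().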

1.1 Regular Expressions

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems. - Jamie Zawinski

Regular expressions (or regex) are a concise and powerful language created to describe patterns in strings. You can think of regular expressions as an advanced version of “Find” in a text editor, helping you search for specific patterns in strings. These expressions use symbols to define flexible search patterns. For example, the symbol . is used to match any character (letter, number, punctuation) and the symbol \d is used to match any digit.
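Here is a small sketch of the two symbols just described, using a made-up vector named codes (an assumption for illustration). One detail to watch for: inside an R string the backslash must itself be escaped, so the regex \d is typed as "\\d".

```r
library(stringr)

# Made-up strings for illustration
codes <- c("cat", "cot", "ct", "room 101", "c4t")

# "." matches any single character, so "c.t" needs exactly one
# character between the c and the t
dot_match <- str_detect(codes, "c.t")       # TRUE TRUE FALSE FALSE TRUE

# \d matches any digit; in an R string it is written "\\d"
digit_match <- str_detect(codes, "\\d")     # FALSE FALSE FALSE TRUE TRUE

# Count the digits in a single string
n_digits <- str_count("room 101", "\\d")    # 3
```

The string "ct" does not match "c.t" because there is no character between the c and the t for the . to match.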

We’re going to learn a little bit more about regular expressions and how you can use them to search for specific patterns.

Important

A cheatsheet on regular expressions is included in Week 5 of your coursepack! If you do not have a coursepack, I would strongly recommend printing the regular expression cheatsheet.

Check-in 5.1: Functions from stringr

1 Which of the following are differences between length() and str_length()?

  • length() gives the number of elements in a vector
  • str_length() gives the number of characters in a string
  • str_length() gives the number of strings in a vector
  • length() gives the dimensions of a dataframe

2 Which of the following is true about str_replace()?

  • str_replace() replaces the first instance of the pattern
  • str_replace() replaces the last instance of the pattern
  • str_replace() replaces every instance of the pattern

3 str_trim() allows you to remove whitespace on what sides?

  • left
  • right
  • both

4 Which of the following does str_sub() use to create a substring? Select all that apply!

  • starting position
  • ending position
  • pattern to search for

5 Which of the following does str_subset() use to create a subset of strings? Select all that apply!

  • starting position
  • ending position
  • pattern to search for

6 What does the collapse argument do in str_c()?

  • specifies a string to be used when combining inputs into a single string
  • specifies whether the string should be collapsed