Special Data Types

This week is all about special data types in R. Similar to the tools you learned last week for working with factors, this week you are going to learn about tools for working with strings and dates. By the end of this week you should be able to:

Clean and extract information from character strings using stringr
Work with date and time variables using lubridate

▶️ Watch Videos: 20 minutes

📖 Readings: 60-75 minutes

✅ Preview Activities: 2

1 Part 1: Strings

Nearly always, when multiple variables are stored in a single column, they are stored as character variables. There are many different “levels” of working with strings in programming, from simple find-and-replaced of fixed (constant) strings to regular expressions, which are extremely powerful (and extremely complicated).

📖 Required Reading: R4DS – Strings

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems. - Jamie Zawinski

Alternately, the xkcd version of the above quote

stringr

Download the stringr cheatsheet.

Table of string functions in the R `stringr` package. `x` is the string or vector of strings, `pattern` is a pattern to be found within the string, `a` and `b` are indexes, and `encoding` is a string encoding, such as UTF8 or ASCII.
Task	stringr
Replace `pattern` with `replacement`	`str_replace(x, pattern, replacement)` and `str_replace_all(x, pattern, replacement)`
Convert case	`str_to_lower(x)`, `str_to_upper(x)` , `str_to_title(x)`
Strip whitespace from start/end	`str_trim(x)` , `str_squish(x)`
Pad strings to a specific length	`str_pad(x, …)`
Test if the string contains a pattern	`str_detect(x, pattern)`
Count how many times a pattern appears in the string	`str_count(x, pattern)`
Find the first appearance of the pattern within the string	`str_locate(x, pattern)`
Find all appearances of the pattern within the string	`str_locate_all(x, pattern)`
Detect a match at the start/end of the string	`str_starts(x, pattern)` ,`str_ends(x, pattern)`
Subset a string from index a to b	`str_sub(x, a, b)`
Convert string encoding	`str_conv(x, encoding)`

1.1 Regular Expressions

Matching exact strings is easy - it’s just like using find and replace.

library(stringr)

human_talk <- "blah, blah, blah. Do you want to go for a walk?"
dog_hears <- str_extract(human_talk, "walk")
dog_hears

[1] "walk"

But, if you can master even a small amount of regular expression notation, you’ll have exponentially more power to do good (or evil) when working with strings. You can get by without regular expressions if you’re creative, but often they’re much simpler.

📖 Recommended Reading: R4DS Chapter 15 (Regular Expressions)

Read at least through Section 15.4.1.

Short Regular Expressions Primer

You may find it helpful to follow along with this section using this web app built to test R regular expressions. The subset of regular expression syntax we’re going to cover here is fairly limited, but you can find regular expressions to do just about anything string-related. As with any tool, there are situations where it’s useful, and situations where you should not use a regular expression, no matter how much you want to.

Here are the basics of regular expressions:

[] enclose sets of characters
For example, [abc] will match any single character a, b, c
- specifies a range of characters (A-z matches all upper and lower case letters)
. matches any character (except a newline)
To match special characters, escape them using \ (in most languages) or \\ (in R). So \. or \\. will match a literal ., \$ or \\$ will match a literal $.

num_string <- "phone: 123-456-7890, nuid: 12345678, ssn: 123-45-6789"

str_extract(num_string, "[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]")

[1] "123-45-6789"

Repeating Patterns

Listing out all of those numbers can get repetitive, though. How do we specify repetition?

* means repeat between 0 and inf times
+ means 1 or more times
? means 0 or 1 times – most useful when you’re looking for something optional
{a, b} means repeat between a and b times, where a and b are integers. `
Note that b can be blank. So [abc]{3,} will match abc, aaaa, cbbaa, but not ab, bb, or a.
For a single number of repeated characters, you can use {a}. So {3, } means “3 or more times” and {3} means “exactly 3 times”

num_string <- "phone: 123-456-7890, nuid: 12345678, ssn: 123-45-6789"

# Matches a sequence of *three* numbers, followed by a dash, 
# then a sequence of *two* numbers, followed by a dash, 
# then a sequence of *four* numbers, followed by a dash.
ssn <- str_extract(num_string, "[0-9]{3}-[0-9]{2}-[0-9]{4}")
ssn

[1] "123-45-6789"

# Matches a sequence of *three* numbers, followed by any character, 
# then a sequence of *three* numbers, followed by any character, 
# then a sequence of *four* numbers, followed by any character, 
phone <- str_extract(num_string, "[0-9]{3}.[0-9]{3}.[0-9]{4}")
phone

[1] "123-456-7890"

# Matches a sequence of *eight* numbers 
nuid <- str_extract(num_string, "[0-9]{8}")
nuid

[1] "12345678"

Anchoring

There are also ways to “anchor” a pattern to a part of the string (e.g. the beginning or the end)

^ has multiple meanings:
- if it’s the first character in a pattern, ^ matches the beginning of a string
- if it follows [, e.g. [^abc], ^ means “not” - for instance, “the collection of all characters that aren’t a, b, or c”.
$ means the end of a string

Combined with pre and post-processing, these let you make sense out of semi-structured string data, such as addresses.

address <- "1600 Pennsylvania Ave NW, Washington D.C., 20500"

# Match a sequence of one or more digits at the beginning of the string
house_num <- str_extract(address, "^[0-9]{1,}")
house_num

[1] "1600"

# Match everything alphanumeric up to the comma
street <- str_extract(address, "[A-z0-9 ]{1,}")
# Remove house number from street address
street <- str_remove(street, house_num) |> 
  # Trim any leading or trailing whitespace from remaining string
  str_trim() 
street

[1] "Pennsylvania Ave NW"

# Match one or more characters between the two commas  
city <- str_extract(address, ",.+,") |> 
  # Remove the leading and trailing commas
  str_remove_all(",") |> 
  # Trim any leading or trailing whitespace from remaining string
  str_trim()
city

[1] "Washington D.C."

# Matches both 5 and 9 digit zip codes found at the end of the string
zip <- str_extract(address, "[0-9-]{5,10}$") 
zip

[1] "20500"

Making Groups

() are used to capture information. So ([0-9]{4}) captures any 4-digit number
a|b will select a or b.

If you’ve captured information using (), you can reference that information using back references. In most languages, those look like this: \1 for the first reference, \9 for the ninth. In R, back references are \\1 through \\9, because the \ character is special, so you have to escape it.

phone_num_variants <- c("(123) 456-7980", "123.456.7890", "+1 123-456-7890")

phone_regex <- "\\+?[0-9]{0,3}? ?\\(?([0-9]{3})?\\)?.?([0-9]{3}).?([0-9]{4})"
# \\+?[0-9]{0,3} matches the country code, if specified, 
#    but won't take the first 3 digits from the area code 
#    unless a country code is also specified
# \\( and \\) match literal parentheses if they exist
# ([0-9]{3})? captures the area code, if it exists
# .? matches any character
# ([0-9]{3}) captures the exchange code
# ([0-9]{4}) captures the 4-digit individual code

str_extract(phone_num_variants, phone_regex)

[1] "(123) 456-7980"  "123.456.7890"    "+1 123-456-7890"

# We didn't capture the country code, so it remained in the string

human_talk <- "blah, blah, blah. Do you want to go for a walk? I think I'm going to treat myself to some ice cream for working so hard. "
dog_hears <- str_extract_all(human_talk, "walk|treat")
dog_hears

[[1]]
[1] "walk"  "treat"

Putting it all together, we can test our regular expressions to ensure that they are specific enough to pull out what we want, while not pulling out other similar information:

strings <- c("abcdefghijklmnopqrstuvwxyzABAB",
             "banana orange strawberry apple",
             "ana went to montana to eat a banana",
             "call me at 432-394-2873. Do you want to go for a walk? I'm going to treat myself to some ice cream for working so hard.",
             "phone: (123) 456-7890, nuid: 12345678, bank account balance: $50,000,000.23",
             "1600 Pennsylvania Ave NW, Washington D.C., 20500")

phone_regex <- "\\+?[0-9]{0,3}? ?\\(?([0-9]{3})?\\)?.?([0-9]{3}).([0-9]{4})"
dog_regex <- "(walk|treat)"
addr_regex <- "([0-9]*) ([A-z0-9 ]{3,}), ([A-z\\. ]{3,}), ([0-9]{5})"
# Find patterns where two characters are repeated
abab_regex <- "(..)\\1"

# Create a table for whether each regex was detected in each string
tibble(
  text = strings,
  phone = str_detect(strings, phone_regex),
  dog = str_detect(strings, dog_regex),
  addr = str_detect(strings, addr_regex),
  abab = str_detect(strings, abab_regex)
  )

# A tibble: 6 × 5
  text                                                   phone dog   addr  abab 
  <chr>                                                  <lgl> <lgl> <lgl> <lgl>
1 abcdefghijklmnopqrstuvwxyzABAB                         FALSE FALSE FALSE TRUE 
2 banana orange strawberry apple                         FALSE FALSE FALSE TRUE 
3 ana went to montana to eat a banana                    FALSE FALSE FALSE TRUE 
4 call me at 432-394-2873. Do you want to go for a walk… TRUE  TRUE  FALSE FALSE
5 phone: (123) 456-7890, nuid: 12345678, bank account b… TRUE  FALSE FALSE FALSE
6 1600 Pennsylvania Ave NW, Washington D.C., 20500       FALSE FALSE TRUE  FALSE

✅ Check-in 5.1: Functions from `stringr`

1 Which of the follow are differences between length() and str_length()?

length() gives the number of elements in a vector
str_length() gives the number of characters in a string
str_length() gives the number of strings in a vector
length() gives the dimensions of a dataframe

2 What of the following is true about str_replace()?

str_replace() replaces the first instance of the pattern
str_replace() replaces the last instance of the pattern
str_replace() replaces every instance of the pattern

3 str_trim() allows you to remove whitespace on what sides

left
right
both

4 Which of the following does str_sub() use to create a substring?

starting position
ending position
pattern to search for

5 Which of the following does str_subset() use to create a substring?

starting position
ending position
pattern to search for

6 What does the collapse argument do in str_c()?

specifies a string to be used when combining inputs into a single string
specifies whether the string should be collapsed

2 Part 2: Dates

In order to fill in an important part of our toolbox, we need to learn how to work with date variables. These variables feel like they should be simple and intuitive given we all work with schedules and calendars everyday. However, there are little nuances that we will learn to make working with dates and times easier.

📖 Required Reading: R4DS – Dates and Times

Extra resources

lubridate website
Download the lubridate cheatsheet
A more in-depth discussion of the POSIXlt and POSIXct data classes.
A tutorial on lubridate - scroll down for details on intervals if you have trouble with %within% and %--%

✅ Check-in 5.2: Functions from `lubridate`

Q1 Which of the following is true about the year() function?

year() creates a duration object to be added to a datetime
year() extracts the year of a datetime object

Q3 What tz would you use for San Luis Obispo? Use the exact input you would use in R!

Q3 Which of the following is true about the %within% operator?

it checks if a date is included in an interval
it returns a logical value
it creates an interval with a start and end time

Q4 Which of the following is true about the %--% operator?

it creates an interval with a start and end time
it returns a logical value
it checks if a date is included in an interval

Q5 What day does the make_date() function use as default if no day argument is provided?

▶️ Watch Videos: 20 minutes

📖 Readings: 60-75 minutes

✅ Preview Activities: 2

1 Part 1: Strings

📖 Required Reading: R4DS – Strings

1.1 Regular Expressions

📖 Recommended Reading: R4DS Chapter 15 (Regular Expressions)

✅ Check-in 5.1: Functions from stringr

2 Part 2: Dates

📖 Required Reading: R4DS – Dates and Times

✅ Check-in 5.2: Functions from lubridate

✅ Check-in 5.1: Functions from `stringr`

✅ Check-in 5.2: Functions from `lubridate`