library(stringr)

num_string <- "phone: 123-456-7890, nuid: 12345678, ssn: 123-45-6789"
str_extract(num_string,
            pattern = "[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]")
[1] "123-45-6789"
This week is all about special data types in R. Similar to the tools you learned last week for working with factors, this week you are going to learn about tools for working with strings and dates. By the end of this week you should be able to:
- Describe what type of output a `str_XXX()` function will provide (e.g., character vector, logical vector, matrix, numeric vector)
- Decide which `str_XXX()` function is best suited for a particular problem (e.g., replacing, detecting a pattern, removing whitespace)
- Write regular expressions that use character sets (`[]`), repeated patterns (`{2,}`), anchoring (`^` and `$`), and groups (`()`)

Strings are a powerful way to work with text variables (e.g., `day_of_week`, `job_title`, `gender`). We encountered a few functions related to strings in Week 3, when we selected columns from a dataset based on their names (e.g., `select(colleges, starts_with("TUITION"))`) and when we filtered a column based on its values (e.g., `filter(colleges, STABBR == "CO")`). Notice that in both cases, we are using the literal names of the variables or values in the data.
This week, we are going to learn about regular expressions, a shorthand way to search for patterns in the values of a string.
The stringr cheatsheet is in Week 5 of your coursepack! If you do not have a coursepack, I would strongly recommend printing the stringr cheatsheet.
| Task | stringr | Output |
|---|---|---|
| Find a pattern and replace it | `str_replace(x, pattern, replacement)` and `str_replace_all(x, pattern, replacement)` | Modified string or character vector |
| Convert a string from uppercase to lowercase or vice versa | `str_to_lower(x)`, `str_to_upper(x)`, `str_to_title(x)` | Modified string or character vector |
| Strip whitespace from the start / end of a string | `str_trim(x)`, `str_squish(x)` | Modified string or character vector |
| Detect if the string contains a pattern | `str_detect(x, pattern)` | Logical |
| Count how many times a pattern appears in the string | `str_count(x, pattern)` | Numeric |
| Find the first appearance of the pattern within the string | `str_locate(x, pattern)` | Integer matrix (start position, end position) |
| Find all appearances of the pattern within the string | `str_locate_all(x, pattern)` | List of integer matrices (start position, end position), one per string |
| Detect if a string contains a pattern at the start / end | `str_starts(x, pattern)`, `str_ends(x, pattern)` | Logical |
| Subset a string from index a to b | `str_sub(x, a, b)` | Modified string or character vector |
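To see the output types from the table side by side, here is a minimal sketch that runs each function on two made-up strings. The values `messy` and `word` are hypothetical examples and are not part of the course data.

```r
library(stringr)

# Hypothetical example values, not taken from the course data
messy <- "  Fort   Collins  "
word  <- "banana"

replaced_first <- str_replace(word, "a", "o")      # only the first match changes
replaced_all   <- str_replace_all(word, "a", "o")  # every match changes
shouting       <- str_to_upper(word)               # character vector
trimmed        <- str_trim(messy)                  # strips the start and end only
squished       <- str_squish(messy)                # also collapses repeated inner spaces
has_an         <- str_detect(word, "an")           # logical
n_a            <- str_count(word, "a")             # integer count
first_an       <- str_locate(word, "an")           # integer matrix with start and end columns
every_an       <- str_locate_all(word, "an")       # list holding one matrix per input string
starts_ba      <- str_starts(word, "ba")           # logical
first_three    <- str_sub(word, 1, 3)              # characters 1 through 3
```

Printing any of these objects (or calling `class()` on them) is a quick way to check which kind of output you are working with.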
Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems. - Jamie Zawinski
Regular expressions (or regex) are a concise and powerful language created to describe patterns in strings. You can think of regular expressions as an advanced version of "Find" in a text editor, helping you search for specific patterns in strings. These expressions use symbols to define flexible search patterns. For example, the symbol `.` is used to match any character (letter, number, punctuation) and the symbol `\d` is used to match any digit.
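As a small illustration of those two symbols, the sketch below extracts matches from a made-up string (`room` is a hypothetical value). Note that inside an R string the backslash must be doubled, so `\d` is typed as `"\\d"`.

```r
library(stringr)

# Hypothetical example string
room <- "Room 42B"

# `.` matches any single character, so the first match is the first character
first_any <- str_extract(room, ".")

# `\\d` matches any single digit
first_digit <- str_extract(room, "\\d")

# str_extract_all() returns every match; here, each digit in turn
all_digits <- str_extract_all(room, "\\d")[[1]]
```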
We’re going to learn a little bit more about regular expressions and how you can use them to search for specific patterns.
A cheatsheet on regular expressions is included in Week 5 of your coursepack! If you do not have a coursepack, I would strongly recommend printing the regular expression cheatsheet.
You may find it helpful to follow along with this section using this web app built to test R regular expressions. The subset of regular expression syntax we’re going to cover here is fairly limited, but you can find regular expressions to do just about anything string-related. As with any tool, there are situations where it’s useful, and situations where you should not use a regular expression, no matter how much you want to.
Here are the basics of regular expressions:
- `[]` enclose sets of characters
  - `[abc]` will match any single character `a`, `b`, or `c`
- `-` specifies a range of characters
  - `[A-Za-z]` will match all upper- and lowercase letters (A-Z, and then a-z). Avoid `[A-z]`, which also matches the punctuation characters that sit between Z and a.
- `.` matches any character (except a newline)
- To match special characters, you need to escape them using a `\` (in most languages) or `\\` (in R).
  - `\.` or `\\.` will match a literal `.`, and `\$` or `\\$` will match a literal `$`.

[1] "123-45-6789"
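The basics above can be tried out with a few short calls. This is a minimal sketch on made-up inputs (the vectors `animals`, `prices`, and `numbers` are hypothetical), showing a set, a range, and the difference between an escaped and an unescaped `.`.

```r
library(stringr)

# Hypothetical example vectors
animals <- c("cat", "dog", "Cab")
prices  <- c("$5.00", "5 dollars")
numbers <- c("3.14", "314")

# A set matches one character out of the listed options
has_abc <- str_detect(animals, "[abc]")

# A range inside a set: the first digit and the first letter
first_digit  <- str_extract("Week 5", "[0-9]")
first_letter <- str_extract("Week 5", "[A-Za-z]")

# Escaping: "\\$" looks for a literal dollar sign
has_dollar <- str_detect(prices, "\\$")

# "\\." looks for a literal period, while "." matches any character at all
has_period   <- str_detect(numbers, "\\.")
has_any_char <- str_detect(numbers, ".")
```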
Listing out all of those numbers can get repetitive, though. How do we specify repetition?
- `*` means repeat 0 or more times (with no upper limit)
- `+` means 1 or more times
- `?` means 0 or 1 times; most useful when you're looking for something optional
- `{a,b}` means repeat between `a` and `b` times, where `a` and `b` are integers.
  - `b` can be blank. So `[abc]{3,}` will match `abc`, `aaaa`, `cbbaa`, but not `ab`, `bb`, or `a`.
- `{a}` specifies an exact number of repeated characters.
  - `{3}` means "exactly 3 times" whereas `{3,}` means "3 or more times."

[1] "phone: 123-456-7890, nuid: 12345678, ssn: 123-45-6789"
[1] "123-45-6789"
[1] "123-456-7890"
[1] "12345678"
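One way to produce the three results above is to swap the repeated `[0-9]` pieces for counted repetitions. The sketch below is a reconstruction under that assumption; the exact patterns used to generate the original output are not shown, and the variable names `ssn`, `phone`, and `nuid` are chosen here for illustration.

```r
library(stringr)

num_string <- "phone: 123-456-7890, nuid: 12345678, ssn: 123-45-6789"

# Social security number: 3 digits, 2 digits, 4 digits, separated by hyphens
ssn <- str_extract(num_string, "[0-9]{3}-[0-9]{2}-[0-9]{4}")

# Phone number: 3 digits, 3 digits, 4 digits, separated by hyphens
phone <- str_extract(num_string, "[0-9]{3}-[0-9]{3}-[0-9]{4}")

# NUID: exactly 8 digits in a row
nuid <- str_extract(num_string, "[0-9]{8}")
```

The phone number does not match the SSN pattern because its middle group has three digits, so the pattern fails at the second hyphen and keeps searching.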
There are also ways to “anchor” a pattern to a part of the string (e.g., the beginning or the end):

- `^` has multiple meanings:
  - If `^` is the first character in a pattern (e.g., `^Al`), it matches the beginning of a string.
  - If `^` follows a `[` (e.g., `[^abc]`), then it means “not.” So, `[^abc]` means “the collection of all characters that are not a, b, or c.”
- `$` means the end of a string (e.g., `bold$`)

Combined with pre- and post-processing, these let you make sense out of semi-structured string data, such as addresses.
Grabbing the house number
[1] "1600"
Grabbing the street
[1] "1600 Pennsylvania Ave NW"
[1] "Pennsylvania Ave NW"
Grabbing the city
[1] "Washington D.C."
Grabbing the zip code
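The four steps above can be sketched as follows. The original address string is not shown in the text, so the `address` value below (including the zip code) is an assumption chosen to be consistent with the printed results, and the patterns are one possible approach rather than the only one.

```r
library(stringr)

# Assumed input: the original address string is not shown in the text
address <- "1600 Pennsylvania Ave NW, Washington D.C., 20500"

# House number: one or more digits anchored to the start of the string
house_number <- str_extract(address, "^[0-9]+")

# Street line: everything from the start up to (not including) the first comma
street_line <- str_extract(address, "^[^,]+")

# Street name: drop the leading house number and the space after it
street <- str_remove(street_line, "^[0-9]+ ")

# City: the text between the first and second commas,
# then strip the leading ", " and the trailing ","
city <- str_extract(address, ", [^,]+,")
city <- str_remove_all(city, "^, |,$")

# Zip code: exactly five digits anchored to the end of the string
zip <- str_extract(address, "[0-9]{5}$")
```

Here `^` plays both of its roles: it anchors the house number to the start of the string, and inside `[^,]` it means "any character that is not a comma."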
- `()` are used to capture information
  - `([0-9]{4})` captures any 4-digit number
- `|` means “or”
  - `a|b` will select a or b.

Making a group of characters
[1] "apple" "apricot" "avocado"
[4] "banana" "bell pepper" "bilberry"
[7] "blackberry" "blackcurrant" "blood orange"
[10] "blueberry" "boysenberry" "breadfruit"
[13] "canary melon" "cantaloupe" "cherimoya"
[16] "cherry" "chili pepper" "clementine"
[19] "cloudberry" "coconut" "cranberry"
[22] "cucumber" "currant" "damson"
[25] "date" "dragonfruit" "durian"
[28] "eggplant" "elderberry" "feijoa"
[31] "fig" "goji berry" "gooseberry"
[34] "grape" "grapefruit" "guava"
[37] "honeydew" "huckleberry" "jackfruit"
[40] "jambul" "jujube" "kiwi fruit"
[43] "kumquat" "lemon" "lime"
[46] "loquat" "lychee" "mandarine"
[49] "mango" "mulberry" "nectarine"
[52] "nut" "olive" "orange"
[55] "pamelo" "papaya" "passionfruit"
[58] "peach" "pear" "persimmon"
[61] "physalis" "pineapple" "plum"
[64] "pomegranate" "pomelo" "purple mangosteen"
[67] "quince" "raisin" "rambutan"
[70] "raspberry" "redcurrant" "rock melon"
[73] "salal berry" "satsuma" "star fruit"
[76] "strawberry" "tamarillo" "tangerine"
[79] "ugli fruit" "watermelon"
[1] │ a<pp>le
[5] │ bell pe<pp>er
[17] │ chili pe<pp>er
[62] │ pinea<pp>le
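The code behind this output is not shown, so here is a minimal sketch of one way to find the same four fruits. It assumes the vector printed above is the `fruit` example vector that ships with stringr, and it uses `str_subset()` to return the matching strings rather than the highlighted view.

```r
library(stringr)

# `fruit` is an example character vector included with stringr
# The parentheses group the two characters "pp" into a single unit
double_p <- str_subset(fruit, "(pp)")
```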
Using an “or”
[1] │ <apple>
[13] │ canary <melon>
[20] │ coco<nut>
[52] │ <nut>
[62] │ pine<apple>
[72] │ rock <melon>
[80] │ water<melon>
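A pattern along these lines would pick out the same seven fruits. This is a sketch that assumes the three alternatives were `apple`, `melon`, and `nut`, as suggested by the highlighted matches; the original call is not shown.

```r
library(stringr)

# `fruit` is an example character vector included with stringr
# The group holds three alternatives separated by `|`
apple_melon_nut <- str_subset(fruit, "(apple|melon|nut)")
```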
Referencing groups
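The example for this heading is not shown, so the following is only an illustrative sketch. A captured group can be referred to later in the same pattern with a backreference: `\\1` means "whatever the first set of parentheses matched." The vector `words` is a hypothetical input.

```r
library(stringr)

# Hypothetical example vector
words <- c("banana", "coconut", "apple", "kiwi fruit")

# (..) captures any two characters; \\1 requires those same two characters again
repeated_pair <- str_subset(words, "(..)\\1")
```

Here "banana" matches because "an" is immediately followed by "an", and "coconut" matches because "co" is immediately followed by "co".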
stringr
1. Which of the following are differences between `length()` and `str_length()`?
   - `length()` gives the number of elements in a vector
   - `str_length()` gives the number of characters in a string
   - `str_length()` gives the number of strings in a vector
   - `length()` gives the dimensions of a dataframe

2. Which of the following is true about `str_replace()`?
   - `str_replace()` replaces the first instance of the pattern
   - `str_replace()` replaces the last instance of the pattern
   - `str_replace()` replaces every instance of the pattern

3. `str_trim()` allows you to remove whitespace on what sides?

4. Which of the following does `str_sub()` use to create a substring? Select all that apply!

5. Which of the following does `str_subset()` use to create a substring? Select all that apply!

6. What does the `collapse` argument do in `str_c()`?