Web Scraping
After this lesson, you should be able to:
- Use CSS Selectors and the Selector Gadget tool to locate data of interest within a webpage
- Use the `html_elements()` and `html_text()` functions within the `rvest` package to scrape data from a webpage using CSS selectors
Web scraping
We have talked about how to acquire data from APIs. Whenever an API is available for your project, you should default to getting data from the API. Sometimes an API will not be available, and web scraping is another means of getting data.
Web scraping describes the use of code to extract information displayed on a web page. In R, the rvest package (meant to sound like “harvest”) offers tools for scraping.
Scraping ethics
robots.txt
robots.txt is a file that some websites publish to clarify what can and cannot be scraped, along with other constraints on scraping. When a website publishes this file, we need to comply with the information in it for ethical and legal reasons.
We will look through the information in this tutorial and apply this to the cheese.com robots.txt file.
The cheese.com robots.txt is very simple: it says that anyone is allowed to scrape (`User-agent: *`). The file doesn’t give any additional instructions, such as `Crawl-delay` (how long you need to wait between scraping each page), `Visit-time` (restrictions on the time of day that scraping is allowed), or `Request-rate` (restrictions on simultaneous requests).
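A quick way to inspect a site’s robots.txt from R is to read the file directly. This is a minimal sketch using base R’s `readLines()`; the exact contents you see depend on the live site at the time you run it.

```r
# Fetch and print cheese.com's robots.txt (requires an internet connection)
robots <- readLines("https://www.cheese.com/robots.txt")
cat(robots, sep = "\n")
```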
Further considerations
The article Ethics in Web Scraping describes some good principles to ensure that we are valuing the labor that website owners invested to provide data and creating good from the information we do scrape.
HTML structure
HTML (hypertext markup language) is the formatting language used to create webpages. Let’s look at the core parts of HTML from the rvest vignette.
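As a reference point, here is a small, hypothetical HTML document in the spirit of the vignette’s example. A page is a tree of elements, each marked by paired tags like `<h1>` or `<p>`, and elements can carry attributes such as `id` and `class`:

```html
<html>
<head>
  <title>Page title</title>
</head>
<body>
  <h1 id="first">A heading</h1>
  <p class="important">Some text &amp; <b>some bold text.</b></p>
</body>
</html>
```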
Finding CSS Selectors
In order to gather information from a webpage, we must learn the language used to identify patterns of specific information. For example, on the [alphabetical page of cheeses](https://www.cheese.com/alphabetical/), we can see that the data is represented in a consistent pattern of image + name of cheese.
We will identify data in a web page using a pattern matching language called CSS Selectors that can refer to specific patterns in HTML, the language used to write web pages.
For example:
- Selecting by tag:
  - `"a"` selects all hyperlinks in a webpage (“a” represents “anchor” links in HTML)
  - `"p"` selects all paragraph elements
- Selecting by ID and class:
  - `".description"` selects all elements with `class` equal to “description”
    - The `.` at the beginning is what signifies `class` selection.
    - This is one of the most common CSS selectors for scraping because in HTML, the `class` attribute is extremely commonly used to format webpage elements. (Any number of HTML elements can have the same `class`, which is not true for the `id` attribute.)
  - `"#mainTitle"` selects the SINGLE element with `id` equal to “mainTitle”
    - The `#` at the beginning is what signifies `id` selection.
```html
<p class="title">Title of resource 1</p>
<p class="description">Description of resource 1</p>
<p class="title">Title of resource 2</p>
<p class="description">Description of resource 2</p>
```

Websites change often! So if you are going to scrape a lot of data, it is probably worthwhile to save and date a copy of the website. Otherwise, you may return after some time and your scraping code will include all of the wrong CSS selectors.
Although you can learn how to use CSS Selectors by hand, we will use a shortcut by installing the Selector Gadget tool.
- There is a version available for Chrome: add it to Chrome via the Chrome Web Store.
- Make sure to pin the extension to the menu bar. (Click the 3 dots > Extensions > Manage extensions. Click the “Details” button under SelectorGadget and toggle the “Pin to toolbar” option.)
- There is also a version that can be saved as a bookmark in the browser–see here.
Head over to the alphabetical page of the first 100 cheeses. Click the Selector Gadget extension icon or bookmark button. As you mouse over the webpage, different parts will be highlighted in orange.
- What CSS selector would you use to obtain the image of the cheese?
- What CSS selector would you use to obtain the name of the cheese?
- What CSS selector would you use to obtain the link for each cheese?
Retrieving Data Using rvest and CSS Selectors
Now that we have identified CSS selectors for the information we need, let’s fetch the data using the rvest package.
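The code chunks below assume a `cheeses` object holding the parsed page. A sketch of how to create it with `read_html()` (the `?per_page=100` query parameter matches the URL structure used later in this lesson):

```r
library(rvest)

# Download and parse the first alphabetical page (100 cheeses per page)
cheeses <- read_html("https://www.cheese.com/alphabetical/?per_page=100")
```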
Once the webpage is loaded, we can retrieve data using the CSS selectors we specified earlier. The code below retrieves three pieces of information about the cheeses on this website:
1. name
I’m adding an `h3` to my CSS selector, since the store information is also selected with `.product-item`. If I don’t grab the `h3` element, then the cheeses with store information will show up like this: `"Stores >\n \n \n \n \n Aged Gouda"`
```r
cheeses %>%
  html_elements(".product-item") %>%
  html_element("h3") %>%
  html_text(trim = TRUE) %>%
  head()
```

```
[1] "2 Year Aged Cumin Gouda"           "3-Cheese Italian Blend"           
[3] "30 Month Aged Parmigiano Reggiano" "3yrs Aged Vintage Gouda"          
[5] "Aarewasser"                        "Abbaye de Belloc"                 
```
2. URL
```r
cheeses %>%
  html_elements(".product-item") %>%
  html_element("a") %>%
  html_attr("href") %>%
  url_absolute("https://www.cheese.com") %>%
  head()
```

```
[1] "https://www.cheese.com/2-year-aged-cumin-gouda/"               
[2] "https://www.cheese.com/3-cheese-italian-blend/"                
[3] "https://www.cheese.com/30-month-aged-parmigiano-reggiano-150g/"
[4] "https://www.cheese.com/3yrs-aged-vintage-gouda/"               
[5] "https://www.cheese.com/aarewasser/"                            
[6] "https://www.cheese.com/abbaye-de-belloc/"                      
```
The URLs output by `html_attr("href")` are relative URLs (e.g., `"/2-year-aged-cumin-gouda/"`). To get the full URL, we need to paste the base URL (`https://www.cheese.com`) onto the beginning of the relative URL. The `url_absolute()` function can do this for us.
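A small illustration of `url_absolute()` on its own, outside of a pipeline:

```r
library(rvest)

# Combine a relative URL with the site's base URL
url_absolute("/2-year-aged-cumin-gouda/", "https://www.cheese.com")
#> [1] "https://www.cheese.com/2-year-aged-cumin-gouda/"
```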
3. status of the image (exists or missing)
```r
cheeses %>%
  html_elements(".product-item") %>%
  html_element("a img") %>%
  html_attr("class") %>%
  head()
```

```
[1] "image-exists"  "image-missing" "image-exists"  "image-exists" 
[5] "image-exists"  "image-exists" 
```
This seems like a great place to use str_detect()!
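A minimal sketch of how `str_detect()` from the stringr package turns those class strings into the logical values we want:

```r
library(stringr)

# TRUE when the image exists, FALSE when it is missing
str_detect(c("image-exists", "image-missing", "image-exists"), "image-exists")
#> [1]  TRUE FALSE  TRUE
```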
Iteration
Our goal is to obtain a dataset of characteristics about different cheeses from https://www.cheese.com. Before doing any coding, let’s plan our approach!
- What functions will you write?
- We probably need a function to grab the text from the page.
- We can then use this function to grab different parts of an HTML page and piece them together
- What arguments will they have?
- The function scraping the page should probably only take one input: the page
- The function grabbing text should have two inputs, (1) the page and (2) the CSS selector to use
- How will you use your functions?
- These functions need to be iterated! We would need to call the function to scrape a page once for each page we want to scrape.
Let’s carry out this plan to obtain the following information for all cheeses in the database:
- cheese name
- URL for the cheese’s webpage (e.g., https://www.cheese.com/gouda/)
- whether or not the cheese has a picture (e.g., gouda has a picture, but bianco does not).
To be kind to the website owners, we will add a 1 second pause between page queries. (Note that you can view 100 cheeses at a time.)
- All three of the CSS selectors from Section 2.2 have something in common. Can you spot it? Specifically, what input to `html_elements()` do they all share?
- Now that we’ve found the similarities, let’s spot the differences. Match each CSS selector (`h3`, `a`, `a img`) to where it is (exclusively) used.
  - cheese name
  - cheese URL
  - cheese image
- There are also differences in the HTML elements that are being extracted from the pages. Match each HTML extractor function to where it is used.
- Let’s use all this information to write a general function that can be used for all three cases (name, URL, image). The function should take four arguments: (1) the page to scrape, (2) `css_selector`, (3) `node`, and (4) what element to extract. The following function calls should give you an idea of how the function should operate.
```r
get_info_from_page(cheeses,
                   css_selector = ".product-item",
                   node = "h3 a")

get_info_from_page(cheeses,
                   css_selector = ".product-item",
                   node = "a",
                   extract = "href") |> 
  str_detect(pattern = "(#store-online-tabs)$")

get_info_from_page(cheeses,
                   css_selector = ".product-item",
                   node = "a",
                   extract = "href") |> 
  url_absolute("https://www.cheese.com")
```

Complete the function below.
```r
get_info_from_page <- function(page, css_selector, node, extract = "text") {
  # Grab the CSS and node elements from the page
  elements <- page %>%
    ________(css_selector) %>%
    ________(node)
  
  if (extract == "text") {
    # Extract the text from the element and trim it
    ________(elements, trim = TRUE)
  } else {
    # Extract what is specified
    html_attr(elements, extract)
  }
}
```

- Our next step is to use this function in a larger `scrape_cheese_page()` function that will obtain the necessary information for each page of cheeses. To be kind to the website owners, let’s add a 1 second pause between page queries.
Fill in the code below to create the `scrape_cheese_page()` function.
```r
scrape_cheese_page <- function(url) {
  # 1 second crawl delay
  Sys.sleep(1)
  
  # Read the page
  page <- ________(url)
  
  # Grab cheese name
  cheese_names <- get_info_from_page(page,
                                     css_selector = ".product-item",
                                     node = ________)
  
  # Grab cheese URL
  cheese_url <- get_info_from_page(page,
                                   css_selector = ".product-item",
                                   node = ________,
                                   extract = "href") |> 
    url_absolute("https://www.cheese.com")
  
  # Grab whether cheese has image
  cheese_image <- get_info_from_page(page,
                                     css_selector = ".product-item",
                                     node = ________,
                                     extract = "class") |> 
    # Detects if cheese has image
    str_detect("image-exists")
  
  # Put page elements into a data frame / tibble
  tibble(cheese = cheese_names,
         url = cheese_url,
         image = cheese_image)
}
```

- Finally, let’s iterate over all the pages of cheese! We can figure out how many pages we need to scrape by grabbing the number of pages from the bottom of the first alphabetical page (per page = 100). This method is a bit less error-prone than manually typing 21!
```r
# A less error-prone way to get the total number of pages
total_pages <- cheeses %>%
  html_elements(".page-link:nth-child(10)") %>%
  html_text()
```

Fill in the code below to `map()` over every page of cheese.
```r
base_url <- "https://www.cheese.com/alphabetical/?per_page=100&page="

urls_all_pages <- str_c(base_url, 1:______)

all_pages <- map(.x = ________,
                 .f = ________)
```

- The `map()` function returns a list of data frames. What function should we use to bind each of these data frames together to form one large data frame on cheeses?
  - `bind_cols()`
  - `bind_rows()`
  - `inner_join()`
  - `full_join()`
```
# A tibble: 2,046 × 3
   cheese                            url                                   image
   <chr>                             <chr>                                 <lgl>
 1 2 Year Aged Cumin Gouda           https://www.cheese.com/2-year-aged-c… TRUE 
 2 3-Cheese Italian Blend            https://www.cheese.com/3-cheese-ital… FALSE
 3 30 Month Aged Parmigiano Reggiano https://www.cheese.com/30-month-aged… TRUE 
 4 3yrs Aged Vintage Gouda           https://www.cheese.com/3yrs-aged-vin… TRUE 
 5 Aarewasser                        https://www.cheese.com/aarewasser/    TRUE 
 6 Abbaye de Belloc                  https://www.cheese.com/abbaye-de-bel… TRUE 
 7 Abbaye de Belval                  https://www.cheese.com/abbaye-de-bel… FALSE
 8 Abbaye de Citeaux                 https://www.cheese.com/abbaye-de-cit… TRUE 
 9 Abbaye de Tamié                   https://www.cheese.com/tamie/         TRUE 
10 Abbaye de Timadeuc                https://www.cheese.com/abbaye-de-tim… TRUE 
# ℹ 2,036 more rows
```