Web Scraping

Learning goals

After this lesson, you should be able to:

  • Use CSS Selectors and the Selector Gadget tool to locate data of interest within a webpage
  • Use the html_elements() and html_text() functions within the rvest package to scrape data from a webpage using CSS selectors

Web scraping

We have talked about how to acquire data from APIs. Whenever an API is available for your project, you should default to getting data from it. When an API is not available, web scraping is another means of getting data.

Web scraping describes the use of code to extract information displayed on a web page. In R, the rvest package (meant to sound like “harvest”) offers tools for scraping.
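If you have not used rvest before, install it once from CRAN and then load it in each session:

# Install rvest (only needed once)
install.packages("rvest")

# Load rvest for this session
library(rvest)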


Scraping ethics

robots.txt

robots.txt is a file that some websites publish to clarify what can and cannot be scraped, along with other constraints on scraping. When a website publishes this file, we need to comply with the information in it for moral and legal reasons.

We will look through the information in this tutorial and apply it to the NIH robots.txt file.

From our investigation of the NIH robots.txt, we learn:

  • User-agent: *: Anyone is allowed to scrape
  • Crawl-delay: 2: Need to wait 2 seconds between each page scraped
  • No Visit-time entry: no restrictions on time of day that scraping is allowed
  • No Request-rate entry: no restrictions on simultaneous requests
  • No mention of ?page=, news-events, news-releases, or https://science.education.nih.gov/ in the Disallow sections. (This is what we want to scrape today.)
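
You can inspect a robots.txt file directly from R. A minimal sketch (the robotstxt package is an optional helper, not something we use later in this lesson):

# Print the raw robots.txt file for nih.gov
writeLines(readLines("https://www.nih.gov/robots.txt"))

# The robotstxt package can also check whether a specific path is allowed
library(robotstxt)
paths_allowed("https://www.nih.gov/news-events/news-releases")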

Further considerations

The article Ethics in Web Scraping describes some good principles to ensure that we value the labor that website owners invested in providing data and that we create good from the information we do scrape.

HTML structure

HTML (hypertext markup language) is the formatting language used to create webpages. Let’s look at the core parts of HTML from the rvest vignette.
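
As a minimal sketch of that structure: every page has an <html> element containing a <head> and a <body>, and elements can carry attributes such as id and class. (This is an illustrative page, not one we will scrape.)

<html>
<head>
  <title>Page title</title>
</head>
<body>
  <h1 id="first">A heading</h1>
  <p class="important">Some text &amp; <b>some bold text.</b></p>
  <img src="myimg.png" width="100" height="100">
</body>
</html>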

Finding CSS Selectors

In order to gather information from a webpage, we must learn the language used to identify patterns of specific information. For example, on the NIH News Releases page, we can see that the data is represented in a consistent pattern of image + title + abstract.

We will identify data in a web page using CSS Selectors, a pattern-matching language for referring to specific elements within HTML.

For example:

  • Selecting by tag:
    • "a" selects all hyperlinks in a webpage (“a” represents “anchor” links in HTML)
    • "p" selects all paragraph elements
  • Selecting by ID and class:
    • ".description" selects all elements with class equal to “description”
      • The . at the beginning is what signifies class selection.
      • This is one of the most common CSS selectors for scraping because in HTML, the class attribute is extremely commonly used to format webpage elements. (Any number of HTML elements can have the same class, which is not true for the id attribute.)
    • "#mainTitle" selects the SINGLE element with id equal to “mainTitle”
      • The # at the beginning is what signifies id selection.
<p class="title">Title of resource 1</p>
<p class="description">Description of resource 1</p>

<p class="title">Title of resource 2</p>
<p class="description">Description of resource 2</p>
Warning

Websites change often! So if you are going to scrape a lot of data, it is probably worthwhile to save and date a copy of the website. Otherwise, you may return after some time to find that your CSS selectors no longer match anything because the page structure has changed.

Although you can learn how to use CSS Selectors by hand, we will use a shortcut by installing the Selector Gadget tool.

  • There is a version available for Chrome: add it to Chrome via the Chrome Web Store.
    • Make sure to pin the extension to the menu bar. (Click the 3 dots > Extensions > Manage extensions. Click the “Details” button under SelectorGadget and toggle the “Pin to toolbar” option.)
  • There is also a version that can be saved as a bookmark in the browser; see here.

Required Video: Selector Gadget Tutorial Video

Check-in: Web scraping

Head over to the NIH News Releases page. Click the Selector Gadget extension icon or bookmark button. As you mouse over the webpage, different parts will be highlighted in orange.

  1. What CSS selector would you use to obtain the title of each article?
  2. What CSS selector would you use to obtain the publication date of each article?
  3. What CSS selector would you use to obtain the article abstract paragraph (which will also include the publication date)?

Retrieving Data Using rvest and CSS Selectors

Now that we have identified CSS selectors for the information we need, let’s fetch the data using the rvest package.

library(rvest)

nih <- read_html("https://www.nih.gov/news-events/news-releases")

Once the webpage is loaded, we can retrieve data using the CSS selectors we identified earlier. The following code retrieves the article titles:

# Retrieve titles of articles
article_titles <- nih %>%
    html_elements(".teaser-title") %>%
    html_text()

# Inspect titles
head(article_titles)
[1] "Infant with rare, incurable disease is first to successfully receive personalized gene therapy treatment"                 
[2] "NIH researchers discover tissue biomarker that may indicate higher risk of aggressive breast cancer development and death"
[3] "FDA and NIH announce innovative joint Nutrition Regulatory Science Program"                                               
[4] "Incidence rates of some cancer types have risen in people under age 50"                                                   
[5] "NIH, CMS Partner to Advance Understanding of Autism Through Secure Access to Select Medicare and Medicaid Data"           
[6] "Many genes in male and female placentas expressed differently"                                                            

Iteration

Our goal is to get article titles, publication dates, and abstract text for news releases across several pages of results. Before doing any coding, let’s plan our approach!

  • What functions will you write?
    • We probably need a function to grab the text from the page.
    • We can then use this function to grab different parts of an HTML page and piece them together.
  • What arguments will they have?
    • The function grabbing text should have two inputs: (1) the page and (2) the CSS selector to use.
    • The function scraping the page should probably only take one input: the page.
  • How will you use your functions?
    • These functions need to be iterated! We would need to call the function to scrape a page once for each page we want to scrape.

Check-in: Web scraping

Let’s carry out this plan to get the article title, publication date, and abstract text for the first five pages of news releases in a single data frame. You will need to write at least one function, and you will need iteration: either a for loop or the appropriate functions from purrr’s map() family.

Advice:

  • Mouse over the page buttons at the very bottom of the news home page to see what the URLs look like.
  • The abstract should not include the publication date; use stringr and regular expressions to remove it.
  • Include Sys.sleep(2) in your function to respect the Crawl-delay: 2 in the NIH robots.txt file.
  • Recall that bind_rows() from dplyr takes a list of data frames and stacks them on top of each other.
Solution
# Load the packages used in the solution
library(rvest)
library(stringr)  # str_remove(), str_c()
library(dplyr)    # tibble(), bind_rows()

# Helper function to reduce html_elements() %>% html_text() code duplication
get_text_from_page <- function(page, css_selector) {
    page %>%
        html_elements(css_selector) %>%
        html_text()
}

scrape_page <- function(url) {
    
    # Respect the 2 second crawl delay in the NIH robots.txt file
    Sys.sleep(2)
    
    # Read the page
    page <- read_html(url)
    
    # Grab elements from the page
    article_titles <- get_text_from_page(page, ".teaser-title")
    article_dates <- get_text_from_page(page, ".date-display-single")
    article_abstracts <- get_text_from_page(page, ".teaser-description")
    
    # Clean article abstracts!
    article_abstracts <- str_remove(article_abstracts, "^.+—") %>% 
      trimws()
    
    # Put page elements into a dataframe
    tibble(
        title = article_titles,
        date = article_dates,
        abstract = article_abstracts
    )
}

Using a for-loop:

# Create an empty list to hold the results for the first five pages
pages <- vector("list", length = 5)

base_url <- "https://www.nih.gov/news-events/news-releases"

for (i in 1:5) {
    # The first page has no query string; later pages use ?page=1, ?page=2, ...
    if (i == 1) {
        url <- base_url
    } else {
        url <- str_c(base_url, "?page=", i - 1)
    }
    pages[[i]] <- scrape_page(url)
}

df_articles <- bind_rows(pages)

head(df_articles)

Using purrr::map():

# Create a character vector of URLs for the first 5 pages
base_url <- "https://www.nih.gov/news-events/news-releases"

urls_all_pages <- c(base_url, 
                    str_c(base_url, 
                          "?page=", 
                          1:4)
                    )

pages2 <- purrr::map(urls_all_pages, scrape_page)

df_articles2 <- bind_rows(pages2)

head(df_articles2)
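
As a side note, purrr’s map_dfr() combines the mapping and row-binding steps into a single call:

# Map over the URLs and row-bind the results in one step
df_articles3 <- purrr::map_dfr(urls_all_pages, scrape_page)

head(df_articles3)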