Project Checkpoint 3
Now that you have explored your original dataset and produced data summaries addressing your primary research questions, it is time to integrate new country-level data.
By the end of the week, your updated report should contain:
- two new country-level data sources
  - Stat 431: One of these must be acquired using APIs or webscraping.
  - Stat 541: Both of these must be acquired using APIs or webscraping.
- descriptions of each of the new datasets
- a “meta” dataset containing all the acquired data (joined together)
- at least two data summaries OR visualizations incorporating the additional country-level data
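Once the new sources are acquired, building the “meta” dataset is typically a matter of joining on a shared country key. Here is a minimal sketch with dplyr, using hypothetical datasets with made-up values purely for illustration:

```r
library(dplyr)

# Hypothetical country-level datasets with made-up values --
# replace these with the data you actually acquire
gdp_data    <- tibble(country = c("Canada", "Mexico", "Brazil"),
                      gdp_trillions = c(2.1, 1.8, 2.2))
health_data <- tibble(country = c("Canada", "Mexico", "Brazil"),
                      life_expectancy = c(83, 75, 76))

# Join on the shared country key to build the "meta" dataset
meta_data <- gdp_data %>%
  left_join(health_data, by = "country")
```

Watch out for country-name mismatches across sources (e.g., "United States" vs. "USA"); standardizing names or joining on ISO country codes saves headaches.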
Stat 541 Only
At least one of your datasets (webscraped or API) must require iteration. That is, you must iterate over a set of inputs (e.g., countries, pages) to acquire the dataset.
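As a sketch of what this iteration might look like, assume a hypothetical site that serves one table per country at a predictable URL (the URL pattern and selector below are placeholders to adapt):

```r
library(rvest)
library(purrr)

# Hypothetical URL pattern -- substitute your actual source
countries <- c("canada", "mexico", "brazil")

scrape_country <- function(country) {
  url <- paste0("https://www.example.com/data/", country)
  tbl <- read_html(url) %>%
    html_element("table") %>%
    html_table()
  tbl$country <- country
  Sys.sleep(1)  # be polite: pause between requests
  tbl
}

# Iterate over the inputs and row-bind the results into one dataset
all_data <- map_dfr(countries, scrape_country)
```

The same pattern works for APIs: build a request per input, then combine the results.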
Warning! Most APIs require you to make an account before you can access them, to prevent excessive automated data collection. Some APIs also charge a fee. You should not use any paid APIs for this class!
Helpful Links
Here are some sources for country-level data that you may use if you so choose.
These are just to get you started. You are not obligated to use these!
Tabular Data
Scrapable Data
The key requirement for webscraping is that the information you are scraping must be present in the static HTML. That is, the table or information cannot be rendered dynamically via JavaScript (like this website).
Wikipedia is an excellent source for scrapable data, as all the tables are static HTML tables. Here are some good places to look:
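For instance, a static Wikipedia table can usually be scraped in a few lines. The page below is real, but treat the CSS selector and table position as assumptions to verify against the page's actual HTML:

```r
library(rvest)

url <- "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"

pop_table <- read_html(url) %>%
  html_element("table.wikitable") %>%  # first wikitable on the page
  html_table()

head(pop_table)
```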
When scraping websites with rvest, you may encounter errors such as "HTTP 403 Forbidden" or receive empty results even when the page appears to have data. This often happens because the website is blocking automated requests that don’t identify themselves as a browser.
To fix this, you can set a user agent: a string that tells the server what kind of client is making the request, allowing your script to mimic a real browser. In rvest, you can do this by passing httr’s user_agent() to the session() function:
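```r
library(rvest)
library(httr)

url <- "https://www.example.com/data"  # placeholder URL

# Start a session that identifies itself as a desktop browser
session <- session(url, user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"))

my_table <- session %>%
  read_html() %>%
  html_table()
```

Note that html_table() returns a list of all tables on the page; index into it (e.g., `[[1]]`) to pull out the one you want.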
Note that even with a user agent, some websites actively prohibit scraping in their Terms of Service. Always check a site’s terms before scraping, and be respectful of the site by avoiding rapid repeated requests that could overload their server!
APIs
Much of the data we have relied on for years - such as the NOAA and BEA data listed above - is no longer being collected and made available, due to US Government cuts to these public statistics services. Other data sources resulting from independent academic research at US institutions are rapidly dwindling as well, also due to funding cuts.
Although these data sources may eventually be restored, or replaced with data from alternative agencies, we will never be able to go back and fill the gap for these lost years.
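Many public APIs do remain available, though. As a free, no-registration starting point, the World Bank's API serves country-level indicators as JSON. Below is a minimal sketch with httr and jsonlite; the indicator code (total population) and the parsing details are assumptions you should check against the API documentation:

```r
library(httr)
library(jsonlite)

# World Bank API: total population (indicator SP.POP.TOTL) for all
# countries in 2020, requested as JSON
resp <- GET(
  "https://api.worldbank.org/v2/country/all/indicator/SP.POP.TOTL",
  query = list(format = "json", date = 2020, per_page = 500)
)
stop_for_status(resp)

# The response body is a two-element JSON array: metadata, then records
parsed <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
pop_2020 <- parsed[[2]]
```

The resulting data frame can then be joined with your other sources by country code.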