Beyond CSV Files
Real-world data comes in many forms:
Today: Tools and techniques for handling complex data sources
gapminder and palmerpenguins are sometimes not adequate for students to gain practical experience
Web scraping extracts data from websites programmatically
rvest: Core web scraping package for R
polite: Ethical scraping practices
httr2: HTTP requests and authentication (we’ll get into this with APIs)
Always check robots.txt and respect rate limits!
Web Scraping Ethics: Use a robots.txt checker to review the “language” of this file
Example: Google’s robots.txt
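The slide points to an online checker; as an R-based alternative (an assumption on my part, using the robotstxt package rather than any tool named on the slide), the same file can be inspected programmatically:

```r
library(robotstxt)

# Download and print Google's robots.txt
rt <- get_robotstxt("google.com")
rt

# Check whether a given path may be crawled by any bot
paths_allowed(paths = "/search", domain = "google.com")
```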
polite ensures you scrape ethically and respectfully
What would you expect scraped_data to contain? (see the sketch below)
This workflow automatically: 1) Checks robots.txt, 2) Introduces your scraper, and 3) Respects crawl delays
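The slide's original code chunk is not reproduced here; below is a minimal sketch of a polite + rvest workflow, assuming a hypothetical target page and CSS selector, that produces an object named scraped_data.

```r
library(polite)
library(rvest)

# Introduce your scraper to the host and check robots.txt (hypothetical target URL)
session <- bow("https://www.r-project.org/", user_agent = "teaching-demo")

# scrape() only fetches the page if robots.txt allows it, honouring crawl delays
scraped_data <- scrape(session) %>%
  html_elements("h2") %>%   # hypothetical CSS selector
  html_text2()

scraped_data
```

Because scrape() returns the parsed HTML page, scraped_data here would be a character vector of the selected headings.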
Essential Learning Materials:
rvest reference with examples
APIs (Application Programming Interfaces) provide:
NOTE: Always prefer APIs over web scraping when available! But why??
Student Exercise 💡 Compare the same data via API vs scraping (OpenWeather API vs weather websites)
httr2 features:
httr2 API Workflow
library(httr2)
library(dplyr)

# Create and execute request
response <- request("https://api.nasa.gov/insight_weather/") %>%
  req_url_query(api_key = "DEMO_KEY", feedtype = "json", ver = "1.0") %>%
  req_perform()

# Parse JSON response
data <- response %>%
  resp_body_json()

# Extract the atmospheric temperature ("AT") summary for sol 675 and tidy it
data %>%
  purrr::pluck("675", "AT") %>%
  as_tibble() %>%
  mutate(day = 675)
# A tibble: 1 × 5
     av     ct    mn    mx   day
  <dbl>  <int> <dbl> <dbl> <dbl>
1 -62.3 177556 -96.9 -15.9   675
Student Exercise 💡 NASA has a lot of APIs to play with!
Core Documentation & Tutorials:
Student-Friendly Tutorials:
Traditional R tools struggle with:
Solution: Modern high-performance tools
Performance gains:
data.table: 10-100x faster than base R
arrow + duckdb: Handle larger-than-memory data
data.table
Based on this code alone, can you guess what is happening to large_data? (see the sketch below)
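The original code chunk is not reproduced in these notes; here is a minimal, hypothetical data.table sketch of the kind of operation the question points at, modifying a table in place by reference (object and column names are assumptions):

```r
library(data.table)

# Hypothetical example data
large_data <- data.table(id = 1:5, value = c(10, 20, 30, 40, 50))

# := modifies large_data in place (by reference); no copy is made
large_data[value > 25, flag := TRUE]
large_data[, value_scaled := value / max(value)]

large_data
```

The key point for students: unlike most dplyr verbs, := changes large_data without creating a modified copy.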
data.table
Depending on student needs, data.table can be a very useful R development package:
Thousands of R packages depend on data.table for these reasons!
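To make the syntax concrete, a minimal sketch of the two core data.table idioms, fast file reading and grouped aggregation, assuming a hypothetical CSV with month, dep_delay, and carrier columns:

```r
library(data.table)

# fread() is a much faster drop-in replacement for read.csv() (hypothetical file)
flights <- fread("flights.csv")

# The DT[i, j, by] form: filter, compute, and group in one call
flights[month == 1, .(mean_delay = mean(dep_delay, na.rm = TRUE)), by = carrier]
```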
library(dplyr)

# Open a large parquet file lazily; nothing is read into RAM yet
huge_data <- arrow::open_dataset(here::here("materials", "data", "huge_data.parquet"))

# Query with DuckDB (SQL or dplyr syntax); work happens only at collect()
result <- huge_data %>%
  arrow::to_duckdb() %>%
  filter(year == 2023) %>%
  group_by(species) %>%
  summarise(avg_value = mean(bill_length_mm)) %>%
  collect()
Zero-copy integration between Arrow and DuckDB!
The dplyr verbs are translated and executed by the Arrow/DuckDB engines rather than in R
This means that 33.8 GB of data can be analyzed in 16 GB of RAM
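For instructors who want to show the SQL side of the same workflow, here is a minimal sketch, assuming the huge_data dataset opened above plus the DBI and duckdb packages:

```r
library(DBI)

# In-memory DuckDB connection
con <- dbConnect(duckdb::duckdb())

# Register the Arrow dataset as a virtual table (no copy of the data is made)
duckdb::duckdb_register_arrow(con, "huge_data", huge_data)

# The same query as the dplyr pipeline, expressed in SQL
dbGetQuery(con, "
  SELECT species, AVG(bill_length_mm) AS avg_value
  FROM huge_data
  WHERE year = 2023
  GROUP BY species
")

dbDisconnect(con, shutdown = TRUE)
```

Registering the dataset rather than importing it keeps the zero-copy promise noted above.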
Technical Documentation:
data.table community blog
Teaching-Focused Materials:
Performance Comparisons for Context:
Databases provide:
library(DBI)

# Connect to (or create) a SQLite database file
con <- dbConnect(RSQLite::SQLite(), "mydata.db")

# Write a data frame of orders to the database (orders_data: a data frame assumed to exist in your session)
dbWriteTable(con, "orders", orders_data)

# Query with SQL
result <- dbGetQuery(con, "
  SELECT customer_id, SUM(amount) AS total
  FROM orders
  GROUP BY customer_id
")

# Always disconnect
dbDisconnect(con)
dplyr Syntax
dbplyr translates dplyr code to SQL automatically!
Teaching Moment: Use show_query() to reveal the generated SQL - great for connecting R to SQL concepts
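A minimal sketch of that teaching moment, re-using the con connection and orders table from the DBI example above (before dbDisconnect() is called) and assuming the dplyr and dbplyr packages:

```r
library(dplyr)
library(dbplyr)

# A lazy reference to the database table; no rows are pulled into R yet
orders_tbl <- tbl(con, "orders")

# Write ordinary dplyr...
totals <- orders_tbl %>%
  group_by(customer_id) %>%
  summarise(total = sum(amount, na.rm = TRUE))

# ...and reveal the SQL that dbplyr generates from it
totals %>% show_query()

# collect() runs the query in the database and returns a tibble
totals %>% collect()
```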
Core Textbook Materials:
Teaching-Ready Tutorials:
Official Package Documentation:
| Data Source | Primary Tools | Best For |
|---|---|---|
| Web Pages | rvest, polite | Structured web data |
| APIs | httr2, jsonlite | Real-time, authenticated data |
| Large Files | arrow, duckdb | Bigger-than-memory analysis |
| Databases | DBI, dbplyr | Structured, relational data |
Choose the right tool for your data source and size!
BRAINSTORM: What kinds of projects could you include that:
Course Syllabi & Examples:
Practice Datasets & APIs:
Ethics & Legal Resources:
Things to include:
Will you require any specific research questions to be addressed, or is it open-ended?
Will you require any specific elements, such as “join at least two datasets” or “use a dataset of at least 10,000 rows”?
How technical should the report be - is the audience other data scientists, or the general public?
How will you grade successful but inefficient pipelines?
How will you make sure the pipeline really accomplishes what the report claims?