More Complex Data

Beyond CSV Files

Why Move Beyond CSV?

Real-world data comes in many forms:

  • Web data: HTML tables, JSON APIs, XML feeds
  • Real-time data: APIs that update continuously
  • Large datasets: Files too big for memory
  • Structured databases: Relational data with foreign keys

Today: Tools and techniques for handling complex data sources

Why Integrate Other Data Sources?

  1. Real-world data are rarely nicely formatted
  2. Knowing how to access and format data is crucial
  3. Many of these data sources are large and (continuously) updating
  4. Simple data sets like gapminder and palmerpenguins are sometimes not adequate for students to gain practical experience

Web Scraping

Web Scraping Fundamentals

Web scraping extracts data from websites programmatically

  • rvest: Core web scraping package for R
  • polite: Ethical scraping practices
  • httr2: HTTP requests and authentication (we’ll get into this with APIs)

Always check robots.txt and respect rate limits!

Web Scraping Ethics: Use a robots.txt checker to review the “language” of this file

Example: Google’s robots.txt
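
A minimal sketch of inspecting a robots.txt file from R, assuming the robotstxt package is installed (the domain is just an example):

# pak::pak("robotstxt")
library(robotstxt)

# Download and parse a site's robots.txt
rt <- robotstxt(domain = "www.google.com")
rt$permissions   # Allow/Disallow rules per user agent

# Check whether a specific path may be crawled
paths_allowed(paths = "/search", domain = "www.google.com")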

Essential Web Scraping Packages

# pak::pak(c("rvest", "polite"))
library(rvest)
library(polite)

polite ensures you:

  • Check robots.txt permissions
  • Respect rate limits automatically
  • Cache requests to avoid redundancy

Polite Web Scraping Workflow

# 1. Bow and introduce yourself
session <- bow(
  "https://example.com", 
  user_agent = "Your Name <email@example.com>"
)

# 2. Scrape politely
scraped_data <- scrape(session) %>%
  html_elements("title") %>%
  html_text()

What would you expect scraped_data to contain?

Polite Web Scraping Workflow

scraped_data
[1] "Example Domain"

Polite Web Scraping Workflow

# 1. Bow and introduce yourself
session <- bow(
  "https://example.com", 
  user_agent = "Your Name <email@example.com>"
)

# 2. Scrape politely
scraped_data <- scrape(session) %>%
  html_elements("title") %>%
  html_text()

This workflow automatically: 1) checks robots.txt, 2) introduces your scraper, and 3) respects crawl delays
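
Once the session is established, tabular web data follows the same pattern; a minimal sketch of politely pulling an HTML table into a data frame (the URL is a placeholder - pick any page whose robots.txt allows it):

session <- bow(
  "https://en.wikipedia.org/wiki/List_of_largest_cities",
  user_agent = "Your Name <email@example.com>"
)

city_table <- scrape(session) %>%
  html_element("table") %>%   # first <table> on the page
  html_table()

head(city_table)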

Web Scraping Resources for Teaching

Essential Learning Materials:

APIs & JSON/XML

APIs: A Better Way to Get Data

APIs (Application Programming Interfaces) provide:

  • Structured access to data
  • Rate limiting built-in
  • Authentication systems
  • Documentation and support

NOTE: Always prefer APIs over web scraping when available! But why??

Student Exercise 💡 Compare same data via API vs scraping (OpenWeather API vs weather websites)

Modern R API Tools

# Install if needed
# pak::pak(c("httr2", "jsonlite", "xml2"))

library(httr2)    # Modern HTTP client
library(jsonlite) # JSON parsing
library(xml2)     # XML handling

httr2 features:

  • Pipe-friendly syntax
  • Built-in retry logic
  • Automatic request throttling (see the sketch below)
  • Better error handling than httr
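
A minimal sketch of what the retry and throttling features can look like in a request pipeline (the URL is a placeholder, and the rate argument may differ across httr2 versions):

req <- request("https://api.example.com/v1/data") %>%
  req_user_agent("Your Name <email@example.com>") %>%
  req_retry(max_tries = 3) %>%     # retry transient failures automatically
  req_throttle(rate = 30 / 60)     # at most ~30 requests per minute

# resp <- req_perform(req)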

httr2 API Workflow

library(dplyr)    # provides %>%, as_tibble(), and mutate() used below

# Create and execute request
response <- request("https://api.nasa.gov/insight_weather/") %>%
  req_url_query(api_key = "DEMO_KEY", feedtype = "json", ver = "1.0") %>%
  req_perform()

# Parse JSON response
data <- response %>%
  resp_body_json() 

data %>%
  purrr::pluck("675", "AT") %>% 
  as_tibble() %>% 
  mutate(day = 675)
# A tibble: 1 × 5
     av     ct    mn    mx   day
  <dbl>  <int> <dbl> <dbl> <dbl>
1 -62.3 177556 -96.9 -15.9   675

Student Exercise 💡 NASA has a lot of APIs to play with!

Working with JSON Data

# Convert JSON to data frame
json_data <- '{
  "users": [
    {"name": "Alice", "age": 25},
    {"name": "Bob", "age": 30}
  ]
}'

# Parse with jsonlite
df <- jsonlite::fromJSON(json_data)
df
$users
   name age
1 Alice  25
2   Bob  30
df[["users"]]
   name age
1 Alice  25
2   Bob  30
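
xml2 is loaded above but not yet shown; a minimal sketch of pulling the same records out of an XML feed (the XML string is made up for illustration):

xml_data <- '<users>
  <user><name>Alice</name><age>25</age></user>
  <user><name>Bob</name><age>30</age></user>
</users>'

doc <- xml2::read_xml(xml_data)

# Extract nodes with XPath, get their text, and rebuild a tidy table
tibble::tibble(
  name = xml2::xml_text(xml2::xml_find_all(doc, ".//name")),
  age  = as.integer(xml2::xml_text(xml2::xml_find_all(doc, ".//age")))
)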

API & JSON/XML Teaching Resources

Core Documentation & Tutorials:

Student-Friendly Tutorials:

Large Data Performance

When Data Gets Big

Traditional R tools struggle with:

  • Files larger than RAM
  • Millions of rows
  • Complex joins across tables
  • Repeated analysis on same data

Solution: Modern high-performance tools

The Performance Trio

# Install if needed
# pak::pak(c("data.table", "arrow", "duckdb"))

# High-performance data processing
library(data.table)  # In-memory speed
library(arrow)       # Columnar storage and data transfer
library(duckdb)      # In-process analytics

Performance gains (see the benchmark sketch below):

  • data.table: 10-100x faster than base R
  • arrow + duckdb: Handle larger-than-memory data
  • Combined: Up to 20x faster pipelines
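
These speedups depend on the data and the operation; a minimal sketch of how students could benchmark the claim on their own file, assuming the bench package is installed:

# pak::pak("bench")
csv_path <- here::here("materials", "data", "large_data.csv")

bench::mark(
  base  = read.csv(csv_path),
  fread = data.table::fread(csv_path),
  check = FALSE,    # the two calls return different classes
  iterations = 3
)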

data.table

# Read a large CSV file with fread()
large_data <- data.table::fread(here::here("materials", "data", "large_data.csv"))

# data.table
setDT(large_data)
large_data[year == 2023, .(avg_value = mean(value)), by = species]

Based on this code alone, can you guess what is happening to large_data?

data.table

Depending on student needs, data.table can be a very useful R development package:

  • a careful development cycle
  • concise syntax (sketched below)
  • efficient, fast in-memory data handling

Thousands of R packages depend on data.table for these reasons!
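
A minimal sketch of the DT[i, j, by] form and modify-by-reference update that make the syntax concise and memory-efficient (columns match the earlier example):

# General form: DT[i, j, by]  ->  DT[filter rows, compute columns, group]
large_data[year == 2023,
           .(avg_value = mean(value), n = .N),
           by = species]

# Add a column by reference (no copy of the table is made)
large_data[, value_scaled := value / max(value)]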

Arrow + DuckDB Integration

library(dplyr)    # filter(), group_by(), summarise(), %>%

# Open a large parquet dataset (lazy; nothing is loaded into memory yet)
huge_data <- arrow::open_dataset(here::here("materials", "data", "huge_data.parquet"))

# Query with DuckDB (SQL or dplyr syntax)
result <- huge_data %>%
  arrow::to_duckdb() %>%
  filter(year == 2023) %>%
  group_by(species) %>%
  summarise(avg_value = mean(bill_length_mm)) %>%
  collect()

Zero-copy integration between Arrow and DuckDB!
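
For students who already know SQL, the same Arrow dataset can also be queried directly; a sketch assuming the duckdb package's duckdb_register_arrow() helper:

library(DBI)

con <- dbConnect(duckdb::duckdb())

# Expose the Arrow dataset to DuckDB as a virtual table (no data is copied)
duckdb::duckdb_register_arrow(con, "huge_data", huge_data)

dbGetQuery(con, "
  SELECT species, AVG(bill_length_mm) AS avg_value
  FROM huge_data
  WHERE year = 2023
  GROUP BY species
")

dbDisconnect(con, shutdown = TRUE)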

Why This Approach Works

  • Arrow: Columnar format, efficient I/O
  • DuckDB: In-process OLAP database
  • Zero-copy: No serialization overhead
  • Familiar syntax: Works with dplyr

This means a 33.8 GB dataset can be analyzed on a machine with 16 GB of RAM

Large Data Teaching Resources

Technical Documentation:

Teaching-Focused Materials:

Performance Comparisons for Context:

Databases

Why Use Databases?

Databases provide:

  • Structured storage with relationships
  • ACID compliance for data integrity
  • Concurrent access for teams
  • Scalability beyond single machines
  • SQL ecosystem and tools

R Database Ecosystem

# Install if needed
# pak::pak(c("DBI", "RSQLite", "RPostgreSQL", "RMySQL", "dbplyr"))

library(DBI)         # core database interface
library(RSQLite)     # SQLite (embedded)
library(RPostgreSQL) # PostgreSQL
library(RMySQL)      # MySQL/MariaDB
library(dbplyr)      # Database-aware dplyr

Basic Database Workflow

# Connect to database
con <- dbConnect(RSQLite::SQLite(), "mydata.db")

# Write data to database
dbWriteTable(con, "customers", customer_data)
dbWriteTable(con, "orders", order_data)   # example orders table, queried below

# Query with SQL
result <- dbGetQuery(con, "
  SELECT customer_id, SUM(amount) AS total
  FROM orders 
  GROUP BY customer_id
")

# Always disconnect
dbDisconnect(con)

Database with dplyr Syntax

library(dplyr)   # tbl(), filter(), group_by(), summarise(), collect()

# Connect to a table as a dplyr object
customers_tbl <- tbl(con, "customers")

# Use familiar dplyr syntax
summary <- customers_tbl %>%
  filter(status == "active") %>%
  group_by(region) %>%
  summarise(avg_value = mean(lifetime_value)) %>%
  collect()  # Execute and bring to R

dbplyr translates dplyr code to SQL automatically!

Teaching Moment: Use show_query() to reveal the generated SQL - great for connecting R to SQL concepts
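
A minimal sketch of that teaching moment: swap collect() for show_query() to print the SQL that dbplyr generates (the exact SQL varies by backend):

customers_tbl %>%
  filter(status == "active") %>%
  group_by(region) %>%
  summarise(avg_value = mean(lifetime_value)) %>%
  show_query()

# Prints something like:
# SELECT region, AVG(lifetime_value) AS avg_value
# FROM customers
# WHERE status = 'active'
# GROUP BY region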

Database Teaching Resources

Core Textbook Materials:

Teaching-Ready Tutorials:

Official Package Documentation:

Summary: Your Complex Data Toolkit

Data Source    Primary Tools      Best For
Web Pages      rvest, polite      Structured web data
APIs           httr2, jsonlite    Real-time, authenticated data
Large Files    arrow, duckdb      Bigger-than-memory analysis
Databases      DBI, dbplyr        Structured, relational data

Choose the right tool for your data source and size!

Ideas for Projects

BRAINSTORM: What kinds of projects could you include that:

  1. Give students exposure to these data sources
  2. While keeping the focus on the topic of your course

Additional Teaching Resources

Course Syllabi & Examples:

Practice Datasets & APIs:

Ethics & Legal Resources:

Create your Project A Data Analysis assignment

Things to include:

  • Will you require any specific research questions to be addressed, or is it open-ended?

  • Will you require any specific elements, such as “join at least two datasets” or “use a dataset of at least 10,000 rows”?

  • How technical should the report be - is the audience other data scientists, or the general public?

  • How will you grade successful but inefficient pipelines?

  • How will you make sure the pipeline really accomplishes what the report claims?