More Complex Data

Beyond CSV Files

Why Move Beyond CSV?

Real-world data comes in many forms:

  • Web data: HTML tables, JSON APIs, XML feeds
  • Real-time data: APIs that update continuously
  • Large datasets: Files too big for memory
  • Structured databases: Relational data with foreign keys

Today: Tools and techniques for handling complex data sources

Why Integrate Other Data Sources?

  1. Real-world data are rarely nicely formatted
  2. Knowing how to access and format data is crucial
  3. Many of these data sources are large and (continuously) updating
  4. Simple data sets like gapminder and palmerpenguins are sometimes not adequate for students to gain practical experience

Web Scraping

Web Scraping Fundamentals

Web scraping extracts data from websites programmatically

  • rvest: Core web scraping package for R
  • polite: Ethical scraping practices
  • httr2: HTTP requests and authentication (we’ll get into this with APIs)

Always check robots.txt and respect rate limits!

Web Scraping Ethics: Use a robots.txt checker to review the “language” of this file

Example: Google’s robots.txt
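
A minimal sketch of inspecting a robots.txt file from R, assuming the robotstxt package is installed (the domain is just an example):

# pak::pak("robotstxt")
library(robotstxt)

# Download and parse a site's robots.txt
rt <- robotstxt(domain = "www.google.com")
rt$permissions   # Allow/Disallow rules per user agent

# Check whether a specific path may be crawled
paths_allowed(paths = "/search", domain = "www.google.com")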

Essential Web Scraping Packages

# pak::pak(c("rvest", "polite"))
library(rvest)
library(polite)

polite ensures you:

  • Check robots.txt permissions
  • Respect rate limits automatically
  • Cache requests to avoid redundancy

Polite Web Scraping Workflow

# 1. Bow and introduce yourself
session <- bow(
  "https://example.com", 
  user_agent = "Your Name <email@example.com>"
)

# 2. Scrape politely
scraped_data <- scrape(session) %>%
  html_elements("title") %>%
  html_text()

What would you expect scraped_data to contain?

Polite Web Scraping Workflow

scraped_data
[1] "Example Domain"

Polite Web Scraping Workflow

# 1. Bow and introduce yourself
session <- bow(
  "https://example.com", 
  user_agent = "Your Name <email@example.com>"
)

# 2. Scrape politely
scraped_data <- scrape(session) %>%
  html_elements("title") %>%
  html_text()

This workflow automatically: 1) checks robots.txt, 2) introduces your scraper, and 3) respects crawl delays
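
Once the session is established, tabular web data follows the same pattern; a minimal sketch of politely pulling an HTML table into a data frame (the URL is a placeholder - pick any page whose robots.txt allows it):

session <- bow(
  "https://en.wikipedia.org/wiki/List_of_largest_cities",
  user_agent = "Your Name <email@example.com>"
)

city_table <- scrape(session) %>%
  html_element("table") %>%   # first <table> on the page
  html_table()

head(city_table)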

Web Scraping Resources for Teaching

Essential Learning Materials:

APIs & JSON/XML

APIs: A Better Way to Get Data

APIs (Application Programming Interfaces) provide:

  • Structured access to data
  • Rate limiting built-in
  • Authentication systems
  • Documentation and support

NOTE: Always prefer APIs over web scraping when available! But why??

Student Exercise 💡 Compare same data via API vs scraping (OpenWeather API vs weather websites)

Modern R API Tools

# Install if needed
# pak::pak(c("httr2", "jsonlite", "xml2"))

library(httr2)    # Modern HTTP client
library(jsonlite) # JSON parsing
library(xml2)     # XML handling

httr2 features:

  • Pipe-friendly syntax
  • Built-in retry logic
  • Automatic request throttling (see the sketch below)
  • Better error handling than httr
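
A minimal sketch of what the retry and throttling features can look like in a request pipeline (the URL is a placeholder, and the rate argument may differ across httr2 versions):

req <- request("https://api.example.com/v1/data") %>%
  req_user_agent("Your Name <email@example.com>") %>%
  req_retry(max_tries = 3) %>%     # retry transient failures automatically
  req_throttle(rate = 30 / 60)     # at most ~30 requests per minute

# resp <- req_perform(req)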

httr2 API Workflow

library(dplyr)    # provides %>%, as_tibble(), and mutate() used below

# Create and execute request
response <- request("https://api.nasa.gov/insight_weather/") %>%
  req_url_query(api_key = "DEMO_KEY", feedtype = "json", ver = "1.0") %>%
  req_perform()

# Parse JSON response
data <- response %>%
  resp_body_json() 

data %>%
  purrr::pluck("675", "AT") %>% 
  as_tibble() %>% 
  mutate(day = 675)
# A tibble: 1 × 5
     av     ct    mn    mx   day
  <dbl>  <int> <dbl> <dbl> <dbl>
1 -62.3 177556 -96.9 -15.9   675

Student Exercise 💡 NASA has a lot of APIs to play with!

Working with JSON Data

# Convert JSON to data frame
json_data <- '{
  "users": [
    {"name": "Alice", "age": 25},
    {"name": "Bob", "age": 30}
  ]
}'

# Parse with jsonlite
df <- jsonlite::fromJSON(json_data)
df
$users
   name age
1 Alice  25
2   Bob  30
df[["users"]]
   name age
1 Alice  25
2   Bob  30
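
xml2 is loaded above but not yet shown; a minimal sketch of pulling the same records out of an XML feed (the XML string is made up for illustration):

xml_data <- '<users>
  <user><name>Alice</name><age>25</age></user>
  <user><name>Bob</name><age>30</age></user>
</users>'

doc <- xml2::read_xml(xml_data)

# Extract nodes with XPath, get their text, and rebuild a tidy table
tibble::tibble(
  name = xml2::xml_text(xml2::xml_find_all(doc, ".//name")),
  age  = as.integer(xml2::xml_text(xml2::xml_find_all(doc, ".//age")))
)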

API & JSON/XML Teaching Resources

Core Documentation & Tutorials:

Student-Friendly Tutorials:

Large Data Performance

When Data Gets Big

Traditional R tools struggle with:

  • Files larger than RAM
  • Millions of rows
  • Complex joins across tables
  • Repeated analysis on same data

Solution: Modern high-performance tools

The Performance Trio

# Install if needed
# pak::pak(c("data.table", "arrow", "duckdb"))

# High-performance data processing
library(data.table)  # In-memory speed
library(arrow)       # Columnar storage and data transfer
library(duckdb)      # In-process analytics

Performance gains (see the benchmark sketch below):

  • data.table: 10-100x faster than base R
  • arrow + duckdb: Handle larger-than-memory data
  • Combined: Up to 20x faster pipelines
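
These speedups depend on the data and the operation; a minimal sketch of how students could benchmark the claim on their own file, assuming the bench package is installed:

# pak::pak("bench")
csv_path <- here::here("materials", "data", "large_data.csv")

bench::mark(
  base  = read.csv(csv_path),
  fread = data.table::fread(csv_path),
  check = FALSE,    # the two calls return different classes
  iterations = 3
)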

data.table

# Read a large CSV file with fread()
large_data <- data.table::fread(here::here("materials", "data", "large_data.csv"))

# data.table
setDT(large_data)
large_data[year == 2023, .(avg_value = mean(value)), by = species]

Based on this code alone, can you guess what is happening to large_data?

data.table

Depending on student needs, data.table can be a very useful R development package:

  • a careful development cycle
  • concise syntax (sketched below)
  • efficient, fast in-memory data handling

Thousands of R packages depend on data.table for these reasons!
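
A minimal sketch of the DT[i, j, by] form and modify-by-reference update that make the syntax concise and memory-efficient (columns match the earlier example):

# General form: DT[i, j, by]  ->  DT[filter rows, compute columns, group]
large_data[year == 2023,
           .(avg_value = mean(value), n = .N),
           by = species]

# Add a column by reference (no copy of the table is made)
large_data[, value_scaled := value / max(value)]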

Arrow + DuckDB Integration

library(dplyr)    # filter(), group_by(), summarise(), %>%

# Open a large parquet dataset (lazy; nothing is loaded into memory yet)
huge_data <- arrow::open_dataset(here::here("materials", "data", "huge_data.parquet"))

# Query with DuckDB (SQL or dplyr syntax)
result <- huge_data %>%
  arrow::to_duckdb() %>%
  filter(year == 2023) %>%
  group_by(species) %>%
  summarise(avg_value = mean(bill_length_mm)) %>%
  collect()

Zero-copy integration between Arrow and DuckDB!
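
For students who already know SQL, the same Arrow dataset can also be queried directly; a sketch assuming the duckdb package's duckdb_register_arrow() helper:

library(DBI)

con <- dbConnect(duckdb::duckdb())

# Expose the Arrow dataset to DuckDB as a virtual table (no data is copied)
duckdb::duckdb_register_arrow(con, "huge_data", huge_data)

dbGetQuery(con, "
  SELECT species, AVG(bill_length_mm) AS avg_value
  FROM huge_data
  WHERE year = 2023
  GROUP BY species
")

dbDisconnect(con, shutdown = TRUE)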

Why This Approach Works

  • Arrow: Columnar format, efficient I/O
  • DuckDB: In-process OLAP database
  • Zero-copy: No serialization overhead
  • Familiar syntax: Works with dplyr

This means a 33.8 GB dataset can be analyzed on a machine with 16 GB of RAM

Large Data Teaching Resources

Technical Documentation:

Teaching-Focused Materials:

Performance Comparisons for Context:

Databases

Why Use Databases?

Databases provide:

  • Structured storage with relationships
  • ACID compliance for data integrity
  • Concurrent access for teams
  • Scalability beyond single machines
  • SQL ecosystem and tools

R Database Ecosystem

# Install if needed
# pak::pak(c("DBI", "RSQLite", "RPostgreSQL", "RMySQL", "dbplyr"))

library(DBI)         # core database interface
library(RSQLite)     # SQLite (embedded)
library(RPostgreSQL) # PostgreSQL
library(RMySQL)      # MySQL/MariaDB
library(dbplyr)      # Database-aware dplyr

Basic Database Workflow

# Connect to database
con <- dbConnect(RSQLite::SQLite(), "mydata.db")

# Write data to database
dbWriteTable(con, "customers", customer_data)
dbWriteTable(con, "orders", order_data)   # example orders table, queried below

# Query with SQL
result <- dbGetQuery(con, "
  SELECT customer_id, SUM(amount) AS total
  FROM orders 
  GROUP BY customer_id
")

# Always disconnect
dbDisconnect(con)

Database with dplyr Syntax

library(dplyr)   # tbl(), filter(), group_by(), summarise(), collect()

# Connect to a table as a dplyr object
customers_tbl <- tbl(con, "customers")

# Use familiar dplyr syntax
summary <- customers_tbl %>%
  filter(status == "active") %>%
  group_by(region) %>%
  summarise(avg_value = mean(lifetime_value)) %>%
  collect()  # Execute and bring to R

dbplyr translates dplyr code to SQL automatically!

Teaching Moment: Use show_query() to reveal the generated SQL - great for connecting R to SQL concepts
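
A minimal sketch of that teaching moment: swap collect() for show_query() to print the SQL that dbplyr generates (the exact SQL varies by backend):

customers_tbl %>%
  filter(status == "active") %>%
  group_by(region) %>%
  summarise(avg_value = mean(lifetime_value)) %>%
  show_query()

# Prints something like:
# SELECT region, AVG(lifetime_value) AS avg_value
# FROM customers
# WHERE status = 'active'
# GROUP BY region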

Database Teaching Resources

Core Textbook Materials:

Teaching-Ready Tutorials:

Official Package Documentation:

Summary: Your Complex Data Toolkit

Data Source    Primary Tools      Best For
Web Pages      rvest, polite      Structured web data
APIs           httr2, jsonlite    Real-time, authenticated data
Large Files    arrow, duckdb      Bigger-than-memory analysis
Databases      DBI, dbplyr        Structured, relational data

Choose the right tool for your data source and size!

Ideas for Projects

BRAINSTORM: What kinds of projects could you include that:

  1. Give students exposure to these data sources
  2. While keeping the focus on the topic of your course

Additional Teaching Resources

Course Syllabi & Examples:

Practice Datasets & APIs:

Ethics & Legal Resources:

Create your Project A Data Analysis assignment

Things to include:

  • Will you require any specific research questions to be addressed, or is it open-ended?

  • Will you require any specific elements, such as “join at least two datasets” or “use a dataset of at least 10,000 rows”?

  • How technical should the report be - is the audience other data scientists, or the general public?

  • How will you grade successful but inefficient pipelines?

  • How will you make sure the pipeline really accomplishes what the report claims?