Performance, Efficiency, and Multi-Language Integration
Today: Teaching students to think like professional developers
By the end of this session, you will:
Focus: Resources to teach professional development practices
The Good: Systematic, meaningful optimization
The Bad: Premature optimization without evidence
The Ugly: Micro-optimizing at the expense of readability
Teaching principle: Measure first, optimize second
library(palmerpenguins)
library(microbenchmark)
library(data.table)
library(dplyr)

dt <- as.data.table(penguins)

# Compare different approaches systematically
microbenchmark(
  base_approach = aggregate(body_mass_g ~ species, penguins, mean),
  dplyr_approach = penguins %>%
    group_by(species) %>%
    summarize(mean_mass = mean(body_mass_g, na.rm = TRUE)),
  data.table_approach = dt[, .(mean_mass = mean(body_mass_g, na.rm = TRUE)), by = species],
  times = 100
)
Unit: microseconds
                expr     min       lq     mean   median       uq      max neval
       base_approach 268.668 301.7715 341.7602 316.0005 327.7505 1848.792   100
      dplyr_approach 835.625 872.1045 975.6732 888.4380 920.2300 6758.750   100
 data.table_approach 199.043 220.4585 296.5323 232.3755 247.9175 3203.959   100
Anti-pattern: Optimizing code that runs once per analysis and takes 0.001 seconds
# UGLY: Unreadable "optimized" code (note it also needs the data.table dt, not penguins)
ugly_fast <- dt[!is.na(body_mass_g),
                .(m = .Internal(mean(body_mass_g))),
                keyby = .(s = species)]
# GOOD: Clear, maintainable code that's fast enough
clear_code <- penguins %>%
  filter(!is.na(body_mass_g)) %>%
  group_by(species) %>%
  summarize(mean_mass = mean(body_mass_g))
Teaching Lesson: Readability matters more than micro-optimizations
Profiling with profvis
library(profvis)

# Set up fake large data: ~3.4 million rows with 1000x more species levels
pen2 <- replicate(10000, penguins, simplify = FALSE) |>
  bind_rows() |>
  mutate(species = paste(species, 1:1000))
# Profile a more complex operation
profvis({
  # Simulate some data processing
  results <- pen2 |>
    filter(!is.na(bill_length_mm)) |>
    group_by(species, island) |>
    summarize(
      mean_bill = mean(bill_length_mm),
      sd_bill = sd(bill_length_mm),
      .groups = "drop"
    ) |>
    arrange(desc(mean_bill))
})
Student Activity: Profile their own code to identify real bottlenecks
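Before reaching for profvis, a quick first pass with system.time() tells students whether a step is even slow enough to be worth profiling; a minimal sketch (the pipeline inside the braces is a placeholder for their own code):

# Quick check: is this step slow enough to matter?
# (placeholder pipeline; students substitute their own code here)
system.time({
  pen2 |>
    group_by(species) |>
    summarize(mean_mass = mean(body_mass_g, na.rm = TRUE))
})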
Comprehensive Learning Materials:
Package Documentation:
Teaching Best Practices:
Computational Efficiency:
Cognitive Efficiency:
Key insight: Cognitive efficiency often matters more than computational efficiency
# High readability, adequate performance
readable_analysis <- penguins %>%
  filter(!is.na(bill_length_mm), !is.na(body_mass_g)) %>%
  mutate(bill_to_mass_ratio = bill_length_mm / body_mass_g) %>%
  group_by(species) %>%
  summarize(
    mean_ratio = mean(bill_to_mass_ratio),
    median_ratio = median(bill_to_mass_ratio),
    n_observations = n()
  )
# Versus an optimized but less readable version...
idx <- !is.na(penguins$bill_length_mm) & !is.na(penguins$body_mass_g)
s <- penguins$species[idx]
r <- penguins$bill_length_mm[idx] / penguins$body_mass_g[idx]
sl <- split(r, s)

very_ugly_analysis <- data.frame(
  species = names(sl),
  mean_ratio = vapply(sl, mean, numeric(1)),
  median_ratio = vapply(sl, median, numeric(1)),
  n_observations = vapply(sl, length, integer(1))
)
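To show students that the readable version really is "fast enough", the two can be timed side by side; a sketch (exact timings will vary by machine):

library(microbenchmark)

# Time both versions side by side; "fast enough" usually wins on these data sizes
microbenchmark(
  readable = penguins %>%
    filter(!is.na(bill_length_mm), !is.na(body_mass_g)) %>%
    mutate(bill_to_mass_ratio = bill_length_mm / body_mass_g) %>%
    group_by(species) %>%
    summarize(mean_ratio = mean(bill_to_mass_ratio)),
  base_optimized = {
    idx <- !is.na(penguins$bill_length_mm) & !is.na(penguins$body_mass_g)
    tapply(penguins$bill_length_mm[idx] / penguins$body_mass_g[idx],
           penguins$species[idx], mean)
  },
  times = 100
)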
Teaching Strategy: Start with readable code, optimize only when necessary
Reduce Mental Overhead: name functions descriptively, e.g. calculate_species_averages(), not calc_sp_avg()
Research-backed: Poor code readability increases cognitive load and reduces maintenance efficiency by up to 58%
# GOOD: Clear intent, logical flow
analyze_penguin_measures <- function(data, measurement_var) {
  data %>%
    filter(!is.na({{ measurement_var }})) %>%
    group_by(species, island) %>%
    summarize(
      mean_value = mean({{ measurement_var }}, na.rm = TRUE),
      std_dev = sd({{ measurement_var }}, na.rm = TRUE),
      sample_size = n(),
      .groups = "drop"
    ) %>%
    arrange(species, island)
}
# Usage is self-documenting
bill_analysis <- analyze_penguin_measures(penguins, bill_length_mm)
mass_analysis <- analyze_penguin_measures(penguins, body_mass_g)
# BAD: Unclear purpose, complex logic (and silently depends on purrr)
library(purrr)
f <- function(d, v) {
  vn <- deparse(substitute(v))
  d2 <- d[!is.na(d[[vn]]), ]
  split(d2, paste(d2$species, d2$island)) %>%
    map_dfr(~ data.frame(
      grp = .x$species[1],
      isl = .x$island[1],
      m = mean(.x[[vn]]),
      s = sd(.x[[vn]]),
      n = nrow(.x)
    ))
}

# What does this do? How do I use it?
result <- f(penguins, bill_length_mm)  # Unclear!
# Show students the progression:

# 1. Inefficient but clear
slow_approach <- function(data) {
  results <- data.frame()
  for (species in unique(data$species)) {
    subset_data <- data[data$species == species, ]
    mean_mass <- mean(subset_data$body_mass_g, na.rm = TRUE)
    results <- rbind(results, data.frame(species = species, mean_mass = mean_mass))
  }
  return(results)
}
# 2. Efficient and still clear
fast_approach <- function(data) {
  data %>%
    group_by(species) %>%
    summarize(mean_mass = mean(body_mass_g, na.rm = TRUE))
}
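Where profiling shows the group summary is a genuine bottleneck, a possible third step reuses the data.table approach benchmarked earlier; a sketch:

# 3. Fastest of the three in the earlier benchmark; reach for this only
#    when profiling shows the summary is a real bottleneck
library(data.table)
fastest_approach <- function(data) {
  as.data.table(data)[, .(mean_mass = mean(body_mass_g, na.rm = TRUE)), by = species]
}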
Pedagogical Value: Students see the evolution from working code to efficient code
Readability and Maintainability (2025 Research):
Practical Guidelines:
Teaching Materials:
The modern data science workflow is increasingly multilingual:
Teaching philosophy: “It’s not Python vs R, it’s Python AND R”
The reticulate + Arrow Stack

2025 Update: reticulate 1.41 uses a uv backend for simplified Python environment management
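A minimal sketch of that workflow, assuming reticulate >= 1.41: py_require() declares Python dependencies, which the uv backend resolves into an ephemeral environment on first use.

library(reticulate)

# Declare Python dependencies; reticulate's uv backend resolves them
# into an ephemeral environment (reticulate >= 1.41)
py_require(c("pandas", "scikit-learn"))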
# Python: Machine learning on same data
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Use the R data directly; drop NAs jointly so X and y stay aligned
df = r.penguins[['bill_length_mm', 'bill_depth_mm', 'body_mass_g']].dropna()
X = df[['bill_length_mm', 'bill_depth_mm']]
y = df['body_mass_g']

# Train model
rf_model = RandomForestRegressor()
rf_model.fit(X, y)
# R: Data preparation and exploration
library(palmerpenguins)
library(ggplot2)
library(tidyr)       # for drop_na()
library(reticulate)  # for the py object

# Exploratory analysis in R
penguins %>%
  ggplot(aes(bill_length_mm, body_mass_g, color = species)) +
  geom_point() +
  labs(title = "Penguin Measurements by Species")

# Pass cleaned data to Python for ML
py$clean_penguins <- penguins %>%
  drop_na() %>%
  select(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, species)
# Python: Machine learning pipeline
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Use R's cleaned data
df = r.clean_penguins
X = df[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']]
y = df['species']

# Train classifier
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
clf = RandomForestClassifier()
clf.fit(X_train, y_train)

# Evaluate, then return predictions to R
predictions = clf.predict(X_test)
print(classification_report(y_test, predictions))
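Back in R, the Python objects are available through reticulate's py object; a minimal sketch, assuming the chunks above have run and relying on reticulate's automatic pandas-to-R conversion:

# R: Pull the Python results back and inspect them
library(reticulate)
head(py$predictions)
table(predicted = py$predictions, actual = py$y_test)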
Some multilingual packages are already available:

polars (same for both R and Python)
arrow and pyarrow
duckdb (same for both R and Python)

These have very similar syntax and the same underlying behavior.
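As one illustration, a minimal duckdb sketch in R; Python's duckdb package runs the same SQL with near-identical code:

library(DBI)
library(duckdb)

# Register a data frame as a DuckDB table and query it with SQL;
# the Python duckdb package accepts the same SQL against a pandas DataFrame
con <- dbConnect(duckdb())
duckdb_register(con, "penguins", palmerpenguins::penguins)
dbGetQuery(con, "
  SELECT species, AVG(body_mass_g) AS mean_mass
  FROM penguins
  GROUP BY species
")
dbDisconnect(con, shutdown = TRUE)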
Official Documentation & Guides:
Teaching-Focused Resources:
2025 Updates:
Performance Optimization:
Code Quality Teaching:
Multilingual Data Science:
Performance Portfolio:
Code Quality Review:
Multilingual Projects:
Real-World Applications: