Performance, Efficiency, and Multi-Language Integration
Today: Teaching students to think like professional developers
By the end of this session, you will:
Focus: Resources to teach professional development practices
The Good: Systematic, meaningful optimization
The Bad: Premature optimization without evidence
The Ugly: Micro-optimizing at the expense of readability
Teaching principle: Measure first, optimize second
library(palmerpenguins)
library(microbenchmark)
library(data.table)
library(dplyr)

dt <- as.data.table(penguins)

# Compare different approaches systematically
microbenchmark(
  base_approach = aggregate(body_mass_g ~ species, penguins, mean),
  dplyr_approach = penguins %>%
    group_by(species) %>%
    summarize(mean_mass = mean(body_mass_g, na.rm = TRUE)),
  data.table_approach = dt[, .(mean_mass = mean(body_mass_g, na.rm = TRUE)), by = species],
  times = 100
)
Unit: microseconds
                expr     min       lq     mean   median       uq      max neval
       base_approach 268.668 301.7715 341.7602 316.0005 327.7505 1848.792   100
      dplyr_approach 835.625 872.1045 975.6732 888.4380 920.2300 6758.750   100
 data.table_approach 199.043 220.4585 296.5323 232.3755 247.9175 3203.959   100
Anti-pattern: Optimizing code that runs once per analysis and takes 0.001 seconds
# UGLY: Unreadable "optimized" code (note it also needs the data.table dt, not penguins)
ugly_fast <- dt[!is.na(body_mass_g),
                .(m = .Internal(mean(body_mass_g))),
                keyby = .(s = species)]
# GOOD: Clear, maintainable code that's fast enough
clear_code <- penguins %>%
  filter(!is.na(body_mass_g)) %>%
  group_by(species) %>%
  summarize(mean_mass = mean(body_mass_g))
Teaching Lesson: Readability matters more than micro-optimizations
Profiling with profvis
library(profvis)

# Set up fake large data: ~3.4 million rows with 1000x more species levels
pen2 <- replicate(10000, penguins, simplify = FALSE) |>
  bind_rows() |>
  mutate(species = paste(species, 1:1000))
# Profile a more complex operation
profvis({
  # Simulate some data processing
  results <- pen2 |>
    filter(!is.na(bill_length_mm)) |>
    group_by(species, island) |>
    summarize(
      mean_bill = mean(bill_length_mm),
      sd_bill = sd(bill_length_mm),
      .groups = "drop"
    ) |>
    arrange(desc(mean_bill))
})
Student Activity: Profile their own code to identify real bottlenecks
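Before reaching for profvis, a quick first pass with system.time() tells students whether a step is even slow enough to be worth profiling; a minimal sketch (the pipeline inside the braces is a placeholder for their own code):

# Quick check: is this step slow enough to matter?
# (placeholder pipeline; students substitute their own code here)
system.time({
  pen2 |>
    group_by(species) |>
    summarize(mean_mass = mean(body_mass_g, na.rm = TRUE))
})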
Comprehensive Learning Materials:
Package Documentation:
Teaching Best Practices:
Computational Efficiency:
Cognitive Efficiency:
Key insight: Cognitive efficiency often matters more than computational efficiency
# High readability, adequate performance
readable_analysis <- penguins %>%
  filter(!is.na(bill_length_mm), !is.na(body_mass_g)) %>%
  mutate(bill_to_mass_ratio = bill_length_mm / body_mass_g) %>%
  group_by(species) %>%
  summarize(
    mean_ratio = mean(bill_to_mass_ratio),
    median_ratio = median(bill_to_mass_ratio),
    n_observations = n()
  )
# Versus an optimized but less readable version...
idx <- !is.na(penguins$bill_length_mm) & !is.na(penguins$body_mass_g)
s <- penguins$species[idx]
r <- penguins$bill_length_mm[idx] / penguins$body_mass_g[idx]
sl <- split(r, s)

very_ugly_analysis <- data.frame(
  species = names(sl),
  mean_ratio = vapply(sl, mean, numeric(1)),
  median_ratio = vapply(sl, median, numeric(1)),
  n_observations = vapply(sl, length, integer(1))
)
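To show students that the readable version really is "fast enough", the two can be timed side by side; a sketch (exact timings will vary by machine):

library(microbenchmark)

# Time both versions side by side; "fast enough" usually wins on these data sizes
microbenchmark(
  readable = penguins %>%
    filter(!is.na(bill_length_mm), !is.na(body_mass_g)) %>%
    mutate(bill_to_mass_ratio = bill_length_mm / body_mass_g) %>%
    group_by(species) %>%
    summarize(mean_ratio = mean(bill_to_mass_ratio)),
  base_optimized = {
    idx <- !is.na(penguins$bill_length_mm) & !is.na(penguins$body_mass_g)
    tapply(penguins$bill_length_mm[idx] / penguins$body_mass_g[idx],
           penguins$species[idx], mean)
  },
  times = 100
)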
Teaching Strategy: Start with readable code, optimize only when necessary
Reduce Mental Overhead: name functions descriptively, e.g. calculate_species_averages(), not calc_sp_avg()
Research-backed: Poor code readability increases cognitive load and reduces maintenance efficiency by up to 58%
# GOOD: Clear intent, logical flow
analyze_penguin_measures <- function(data, measurement_var) {
  data %>%
    filter(!is.na({{ measurement_var }})) %>%
    group_by(species, island) %>%
    summarize(
      mean_value = mean({{ measurement_var }}, na.rm = TRUE),
      std_dev = sd({{ measurement_var }}, na.rm = TRUE),
      sample_size = n(),
      .groups = "drop"
    ) %>%
    arrange(species, island)
}
# Usage is self-documenting
bill_analysis <- analyze_penguin_measures(penguins, bill_length_mm)
mass_analysis <- analyze_penguin_measures(penguins, body_mass_g)
# BAD: Unclear purpose, complex logic (and silently depends on purrr)
library(purrr)
f <- function(d, v) {
  vn <- deparse(substitute(v))
  d2 <- d[!is.na(d[[vn]]), ]
  split(d2, paste(d2$species, d2$island)) %>%
    map_dfr(~ data.frame(
      grp = .x$species[1],
      isl = .x$island[1],
      m = mean(.x[[vn]]),
      s = sd(.x[[vn]]),
      n = nrow(.x)
    ))
}

# What does this do? How do I use it?
result <- f(penguins, bill_length_mm)  # Unclear!
# Show students the progression:

# 1. Inefficient but clear
slow_approach <- function(data) {
  results <- data.frame()
  for (species in unique(data$species)) {
    subset_data <- data[data$species == species, ]
    mean_mass <- mean(subset_data$body_mass_g, na.rm = TRUE)
    results <- rbind(results, data.frame(species = species, mean_mass = mean_mass))
  }
  return(results)
}
# 2. Efficient and still clear
fast_approach <- function(data) {
  data %>%
    group_by(species) %>%
    summarize(mean_mass = mean(body_mass_g, na.rm = TRUE))
}
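Where profiling shows the group summary is a genuine bottleneck, a possible third step reuses the data.table approach benchmarked earlier; a sketch:

# 3. Fastest of the three in the earlier benchmark; reach for this only
#    when profiling shows the summary is a real bottleneck
library(data.table)
fastest_approach <- function(data) {
  as.data.table(data)[, .(mean_mass = mean(body_mass_g, na.rm = TRUE)), by = species]
}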
Pedagogical Value: Students see the evolution from working code to efficient code
Readability and Maintainability (2025 Research):
Practical Guidelines:
Teaching Materials:
The modern data science workflow is increasingly multilingual:
Teaching philosophy: “It’s not Python vs R, it’s Python AND R”
The reticulate + Arrow Stack

2025 Update: reticulate 1.41 uses a uv backend for simplified Python environment management
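A minimal sketch of that workflow, assuming reticulate >= 1.41: py_require() declares Python dependencies, which the uv backend resolves into an ephemeral environment on first use.

library(reticulate)

# Declare Python dependencies; reticulate's uv backend resolves them
# into an ephemeral environment (reticulate >= 1.41)
py_require(c("pandas", "scikit-learn"))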
# Python: Machine learning on same data
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Use the R data directly; drop NAs jointly so X and y stay aligned
df = r.penguins[['bill_length_mm', 'bill_depth_mm', 'body_mass_g']].dropna()
X = df[['bill_length_mm', 'bill_depth_mm']]
y = df['body_mass_g']

# Train model
rf_model = RandomForestRegressor()
rf_model.fit(X, y)
# R: Data preparation and exploration
library(palmerpenguins)
library(ggplot2)
library(tidyr)       # for drop_na()
library(reticulate)  # for the py object

# Exploratory analysis in R
penguins %>%
  ggplot(aes(bill_length_mm, body_mass_g, color = species)) +
  geom_point() +
  labs(title = "Penguin Measurements by Species")

# Pass cleaned data to Python for ML
py$clean_penguins <- penguins %>%
  drop_na() %>%
  select(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, species)
# Python: Machine learning pipeline
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Use R's cleaned data
df = r.clean_penguins
X = df[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']]
y = df['species']

# Train classifier
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
clf = RandomForestClassifier()
clf.fit(X_train, y_train)

# Evaluate, then return predictions to R
predictions = clf.predict(X_test)
print(classification_report(y_test, predictions))
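Back in R, the Python objects are available through reticulate's py object; a minimal sketch, assuming the chunks above have run and relying on reticulate's automatic pandas-to-R conversion:

# R: Pull the Python results back and inspect them
library(reticulate)
head(py$predictions)
table(predicted = py$predictions, actual = py$y_test)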
Some multilingual packages are already available:

polars (same for both R and Python)
arrow and pyarrow
duckdb (same for both R and Python)

These have very similar syntax and the same underlying behavior.
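As one illustration, a minimal duckdb sketch in R; Python's duckdb package runs the same SQL with near-identical code:

library(DBI)
library(duckdb)

# Register a data frame as a DuckDB table and query it with SQL;
# the Python duckdb package accepts the same SQL against a pandas DataFrame
con <- dbConnect(duckdb())
duckdb_register(con, "penguins", palmerpenguins::penguins)
dbGetQuery(con, "
  SELECT species, AVG(body_mass_g) AS mean_mass
  FROM penguins
  GROUP BY species
")
dbDisconnect(con, shutdown = TRUE)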
Official Documentation & Guides:
Teaching-Focused Resources:
2025 Updates:
Performance Optimization:
Code Quality Teaching:
Multilingual Data Science:
Performance Portfolio:
Code Quality Review:
Multilingual Projects:
Real-World Applications: