Thinking Like a Developer

Performance, Efficiency, and Multi-Language Integration

Why Developer Thinking?

  • Performance awareness: Speed testing and optimization
  • Code efficiency: Computational AND cognitive considerations
  • Multi-language fluency: R + Python integration
  • Professional standards: Maintainable, readable code

Today: Teaching students to think like professional developers

Learning Objectives

By the end of this session, you will:

  1. Understand speed testing principles (good, bad, ugly approaches)
  2. Know how to balance computational and cognitive efficiency
  3. Be familiar with R-Python integration using reticulate and arrow

Focus: Resources to teach professional development practices

Speed Testing

The Good, Bad, and Ugly of Performance

The Good: Systematic, meaningful optimization

The Bad: Premature optimization without evidence

The Ugly: Micro-optimizing at the expense of readability

Teaching principle: Measure first, optimize second

Essential Performance Testing Tools

# Install performance tools
# pak::pak(c("microbenchmark", "profvis", "bench", "palmerpenguins"))

# Core performance packages
library(microbenchmark)  # Micro-benchmarking
library(profvis)         # Profiling and visualization
library(bench)           # Modern benchmarking
library(palmerpenguins)  # Example data

The Good: Systematic Benchmarking

library(data.table)
library(dplyr)

dt = as.data.table(penguins)

# Compare different approaches systematically
microbenchmark(
  base_approach = aggregate(body_mass_g ~ species, penguins, mean),
  dplyr_approach = penguins %>% 
    group_by(species) %>% 
    summarize(mean_mass = mean(body_mass_g, na.rm = TRUE)),
  data.table_approach = dt[, .(mean_mass = mean(body_mass_g, na.rm = TRUE)), by = species],
  times = 100
)
Unit: microseconds
                expr     min       lq     mean   median       uq      max neval
       base_approach 268.668 301.7715 341.7602 316.0005 327.7505 1848.792   100
      dplyr_approach 835.625 872.1045 975.6732 888.4380 920.2300 6758.750   100
 data.table_approach 199.043 220.4585 296.5323 232.3755 247.9175 3203.959   100

The Bad: Premature Optimization

# DON'T DO THIS: Optimizing before understanding the problem
# Spending hours optimizing this:
fast_but_unreadable <- function(x) {
  .Call("C_fast_mean", x, PACKAGE = "mypackage")
}

# When this is fast enough and much clearer:
readable_solution <- function(x) {
  mean(x, na.rm = TRUE)
}

Anti-pattern: Optimizing code that runs once per analysis and takes 0.001 seconds

The Ugly: Sacrificing Readability

# UGLY: Unreadable "optimized" code
# (data.table syntax, so it needs dt, the data.table copy of penguins created earlier)
ugly_fast <- dt[!is.na(body_mass_g), 
                .(m=.Internal(mean(body_mass_g))), 
                keyby=.(s=species)]

# GOOD: Clear, maintainable code that's fast enough
clear_code <- penguins %>%
  filter(!is.na(body_mass_g)) %>%
  group_by(species) %>%
  summarize(mean_mass = mean(body_mass_g))

Teaching Lesson: Readability matters more than micro-optimizations

Profiling with profvis

library(profvis)

# set up fake large data
pen2 = replicate(10000, penguins, simplify = FALSE) |> 
  bind_rows() |> 
  mutate(species = paste(species, 1:1000))

# Profile a more complex operation
profvis({
  # Simulate some data processing
  results <- pen2 |> 
    filter(!is.na(bill_length_mm)) |> 
    group_by(species, island) |> 
    summarize(
      mean_bill = mean(bill_length_mm),
      sd_bill = sd(bill_length_mm),
      .groups = "drop"
    ) |> 
    arrange(desc(mean_bill))
})

profvis docs

Student Activity: Profile their own code to identify real bottlenecks
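
For the Python side of a multilingual course, the same profile-first habit can be practiced with the standard library's cProfile and pstats modules. The sketch below is only an illustration: `build_labels`, `count_rows`, and `pipeline` are hypothetical stand-ins, not part of the penguins analysis above.

```python
import cProfile
import io
import pstats


def build_labels(n):
    """Hypothetical heavier step: format and sort many string labels."""
    return sorted(f"penguin-{i % 1000}-{i}" for i in range(n))


def count_rows(n):
    """Hypothetical cheap step: a simple arithmetic pass."""
    return sum(range(n))


def pipeline(n=100_000):
    """Toy pipeline whose steps have different costs."""
    labels = build_labels(n)
    total = count_rows(n)
    return len(labels), total


# Collect timings for one run of the pipeline
profiler = cProfile.Profile()
profiler.enable()
result = pipeline()
profiler.disable()

# Render the ten entries with the largest cumulative time
report = io.StringIO()
stats = pstats.Stats(profiler, stream=report)
stats.sort_stats("cumulative").print_stats(10)
print(report.getvalue())
```

Students can read the report top-down: the rows with the largest cumulative time show where optimization effort would pay off, which is the same question profvis answers for R code.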

Sample Assignment Prompts

  • “Optimize this slow function, but explain why the optimization is worth the complexity”
  • “Find the performance bottleneck in this analysis pipeline”
  • “Compare three approaches and recommend one for a team project”

Performance Testing Teaching Resources

Comprehensive Learning Materials:

Package Documentation:

Teaching Best Practices:

  • Focus on median times, not minimum
  • Use realistic data sizes for benchmarks
  • Always profile before optimizing
  • Set performance targets before starting
  • Remember, benchmarking is always built on assumptions!
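
The practices above are language-agnostic. As a rough Python counterpart to the microbenchmark example, the sketch below uses the standard library's timeit and statistics modules and reports the median alongside the minimum and maximum. The helper `benchmark`, the synthetic `masses` data, and the two mean functions are hypothetical names chosen for illustration.

```python
import statistics
import timeit


def benchmark(func, repeats=15, number=5):
    """Run func `number` times per repeat; return per-call timings in seconds."""
    totals = timeit.repeat(func, repeat=repeats, number=number)
    per_call = [total / number for total in totals]
    return {
        "min": min(per_call),
        "median": statistics.median(per_call),
        "max": max(per_call),
    }


# Hypothetical workload: 100,000 synthetic body masses in grams
masses = [2700 + (i % 3600) for i in range(100_000)]


def mean_with_loop():
    """Explicit accumulation loop."""
    total = 0
    for mass in masses:
        total += mass
    return total / len(masses)


def mean_with_builtin():
    """Same result using the built-in sum()."""
    return sum(masses) / len(masses)


for name, func in [("loop", mean_with_loop), ("builtin", mean_with_builtin)]:
    timing = benchmark(func)
    print(f"{name:>8}: median {timing['median'] * 1e3:.3f} ms "
          f"(min {timing['min'] * 1e3:.3f}, max {timing['max'] * 1e3:.3f})")
```

The data size and repeat counts here are assumptions. Changing them is a quick way to show students how much a benchmark's conclusion depends on its setup.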

Code Efficiency

Two Types of Efficiency

Computational Efficiency:

  • How fast does the code run?
  • How much memory does it use?
  • Does it scale well with larger data?

Cognitive Efficiency:

  • How easy is it to understand?
  • How quickly can someone modify it?
  • How fast can you type it?
  • How likely are bugs to be introduced?

Key insight: Cognitive efficiency often matters more than computational efficiency

The Readability-Performance Trade-off

# High readability, adequate performance
readable_analysis <- penguins %>%
  filter(!is.na(bill_length_mm), !is.na(body_mass_g)) %>%
  mutate(bill_to_mass_ratio = bill_length_mm / body_mass_g) %>%
  group_by(species) %>%
  summarize(
    mean_ratio = mean(bill_to_mass_ratio),
    median_ratio = median(bill_to_mass_ratio),
    n_observations = n()
  )

# Versus optimized but less readable version...
idx <- !is.na(penguins$bill_length_mm) & !is.na(penguins$body_mass_g)
s <- penguins$species[idx]
r <- penguins$bill_length_mm[idx] / penguins$body_mass_g[idx]
sl <- split(r, s)

very_ugly_analysis <- data.frame(
  species = names(sl),
  mean_ratio = vapply(sl, mean, numeric(1)),
  median_ratio = vapply(sl, median, numeric(1)),
  n_observations = vapply(sl, length, integer(1))
)

Teaching Strategy: Start with readable code, optimize only when necessary

Cognitive Load Principles

Reduce Mental Overhead:

  1. Meaningful names: calculate_species_averages() not calc_sp_avg()
  2. Consistent style: Pick one approach and stick to it
  3. Appropriate abstraction: Functions for repeated logic
  4. Clear structure: Logical flow from input to output

Research-backed: Poor code readability increases cognitive load and reduces maintenance efficiency by up to 58%

Good Cognitive Efficiency Example

# GOOD: Clear intent, logical flow
analyze_penguin_measures <- function(data, measurement_var) {
  data %>%
    filter(!is.na({{measurement_var}})) %>%
    group_by(species, island) %>%
    summarize(
      mean_value = mean({{measurement_var}}, na.rm = TRUE),
      std_dev = sd({{measurement_var}}, na.rm = TRUE),
      sample_size = n(),
      .groups = "drop"
    ) %>%
    arrange(species, island)
}

# Usage is self-documenting
bill_analysis <- analyze_penguin_measures(penguins, bill_length_mm)
mass_analysis <- analyze_penguin_measures(penguins, body_mass_g)

Bad Cognitive Efficiency Example

# BAD: Unclear purpose, complex logic
library(purrr)  # map_dfr()

f <- function(d, v) {
  k <- deparse(substitute(v))
  d <- d[!is.na(d[[k]]), ]
  split(d, paste(d$species, d$island)) %>%
    map_dfr(~ data.frame(
      grp = .x$species[1], 
      isl = .x$island[1],
      m = mean(.x[[k]]),
      s = sd(.x[[k]]),
      n = nrow(.x)
    ))
}

# What does this do? How do I use it?
result <- f(penguins, bill_length_mm)  # Unclear!

Teaching Computational Efficiency

# Show students the progression:

# 1. Inefficient but clear
slow_approach <- function(data) {
  results <- data.frame()
  for(species in unique(data$species)) {
    subset_data <- data[data$species == species, ]
    mean_mass <- mean(subset_data$body_mass_g, na.rm = TRUE)
    results <- rbind(results, data.frame(species = species, mean_mass = mean_mass))
  }
  return(results)
}

# 2. Efficient and still clear
fast_approach <- function(data) {
  data %>%
    group_by(species) %>%
    summarize(mean_mass = mean(body_mass_g, na.rm = TRUE))
}

Pedagogical Value: Students see the evolution from working code to efficient code
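
The same progression can be shown to students working in Python. The sketch below mirrors the two R functions using only the standard library. The `records` list is made-up toy data, and the function bodies are one possible translation, not a canonical one.

```python
from collections import defaultdict
from statistics import fmean

# Toy (species, body mass in grams) rows; None marks a missing value
records = [
    ("Adelie", 3750.0),
    ("Adelie", 3800.0),
    ("Gentoo", 5000.0),
    ("Chinstrap", 3500.0),
    ("Gentoo", None),
]


def slow_approach(records):
    """Clear but wasteful: rescans every row per species and regrows the result."""
    results = []
    for species in sorted({s for s, _ in records}):
        masses = [m for s, m in records if s == species and m is not None]
        # List concatenation copies the result each time, much like rbind() in a loop
        results = results + [(species, fmean(masses))]
    return results


def fast_approach(records):
    """Single pass: group the masses by species, then average each group."""
    groups = defaultdict(list)
    for species, mass in records:
        if mass is not None:
            groups[species].append(mass)
    return [(species, fmean(groups[species])) for species in sorted(groups)]


print(slow_approach(records))
print(fast_approach(records))
```

Both functions return the same result on this data, so students can confirm correctness before comparing speed. Note that this sketch assumes every species has at least one non-missing mass.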

Code Efficiency Teaching Resources

Readability and Maintainability (2025 Research):

Practical Guidelines:

Teaching Materials:

  • Focus on meaningful variable names and function design
  • Emphasize “readable first, optimize later” principle
  • Use real examples showing readability impact on debugging
  • Teach systematic profiling before optimization

Multi-Language Integration

Why R + Python Together?

The modern data science workflow is increasingly multilingual:

  • Best tool for the job
  • Team collaboration
  • Ecosystem access
  • Career preparation

Teaching philosophy: “It’s not Python vs R, it’s Python AND R”

The reticulate + Arrow Stack

# Install multilingual tools
# pak::pak(c("reticulate", "arrow", "palmerpenguins"))

# Setup Python environment
library(reticulate)
library(arrow)

# Modern 2025 approach: automatic Python setup
# scikit-learn is included because the Python examples below import sklearn
py_require(c("pandas", "numpy", "pyarrow", "scikit-learn"))

2025 Update: reticulate 1.41 uses uv as its backend for simplified Python environment management

Efficient Data Transfer with Arrow

# R: Prepare data as Arrow table
library(arrow)
penguins_arrow <- arrow_table(penguins)

# Pass to Python (zero-copy!)
py$penguins_data <- penguins_arrow

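# The Python chunk below processes this data

# Python: Work with the data
import pandas as pd

# py$penguins_data was assigned from R, so it is a plain Python variable here
# Convert from Arrow (still efficient)
df = penguins_data.to_pandas()

# Python-specific analysis
from sklearn.cluster import KMeans
# ... machine learning code ...

# Summarize the numeric columns by species; R can read the result as py$processed_data
processed_data = df.groupby('species').mean(numeric_only=True)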

Seamless Language Switching

# R: Statistical modeling
library(broom)

model_results <- penguins %>%
  filter(!is.na(bill_length_mm), !is.na(body_mass_g)) %>%
  nest_by(species) %>%
  mutate(
    model = list(lm(body_mass_g ~ bill_length_mm, data = data)),
    tidy_results = list(tidy(model))
  )

# Python: Machine learning on same data
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Use the R data directly; drop incomplete rows once so X and y stay aligned
complete = r.penguins[['bill_length_mm', 'bill_depth_mm', 'body_mass_g']].dropna()
X = complete[['bill_length_mm', 'bill_depth_mm']]
y = complete['body_mass_g']

# Train model
rf_model = RandomForestRegressor()
rf_model.fit(X, y)

Teaching Multi-Language Workflows

# R: Data preparation and exploration
library(palmerpenguins)
library(ggplot2)
library(dplyr)  # %>%, select()
library(tidyr)  # drop_na()

# Exploratory analysis in R
penguins %>%
  ggplot(aes(bill_length_mm, body_mass_g, color = species)) +
  geom_point() +
  labs(title = "Penguin Measurements by Species")

# Pass cleaned data to Python for ML
py$clean_penguins <- penguins %>%
  drop_na() %>%
  select(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, species)

# Python: Machine learning pipeline
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Use the cleaned data that R assigned into Python via py$clean_penguins
df = clean_penguins
X = df[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']]
y = df['species']

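# Train classifier (fixed seeds keep the split and the forest reproducible)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)

# Predict on the held-out rows; R can read the result as py$predictions
predictions = clf.predict(X_test)
print(classification_report(y_test, predictions))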

Multilingual Packages

Some packages are already available in both languages:

  • polars (same package name in R and Python)
  • arrow (R) and pyarrow (Python)
  • DuckDB (same package name in R and Python)

These offer very similar syntax and the same underlying behavior in either language.

Multi-Language Teaching Resources

Official Documentation & Guides:

Teaching-Focused Resources:

2025 Updates:

  • reticulate 1.41 with uv backend for simplified setup
  • Improved Arrow integration for zero-copy data transfer
  • Growing emphasis on multilingual data science teams

Professional Development for Instructors

Performance Optimization:

Code Quality Teaching:

Multilingual Data Science:

Assessment Ideas for Developer Thinking

Performance Portfolio:

  • Document before/after optimization with benchmarks
  • Explain trade-offs between performance and readability
  • Profile and optimize a provided slow function

Code Quality Review:

  • Peer review exercises with readability rubrics
  • Refactor legacy code for better maintainability
  • Write functions with clear interfaces and documentation

Multilingual Projects:

  • Implement analysis using both R and Python
  • Compare language-specific approaches to same problem
  • Create reproducible multilingual workflow documentation

Real-World Applications:

  • Optimize code for large datasets
  • Debug performance issues in complex pipelines
  • Collaborate on mixed-language team projects