Gallary of Exemplary Student Work

Challenge 2

Here are some of the exemplary assignments from your peers!

Themes

This code is brought to us by Jonathan Flores! I really appreciated the extensive themes they included and how these themes make the plot clearer to the reader.

ggplot(data = surveys,
    mapping = aes(x = species, y = weight)) +
  geom_boxplot(outliers = FALSE) +
  geom_jitter(color = "steel blue", alpha = 0.08) +
  labs(title = "Distribution of Weight by Species Boxplot",
       subtitle = "Vertical Orientation",
       x = "Species Name", y = "Weight (g)") + 
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45),
        axis.ticks.length.x = unit(8, "mm"),
        axis.ticks.x = element_line(color = "white"),
        panel.border = element_rect(color = "black", linewidth = 1),
        plot.title = element_text(face = "bold"),
        plot.subtitle = element_text(color = "steelblue"),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank())

Unique Colors

This code is brought to us by Fisher Fraley! I really appreciated how they found a unique color theme from a context they liked!

# Color Palette: https://color.adobe.com/TRON-color-theme-6970180/
tron_colors <- c("#DF7212", "#6FC3DF", "#E6FFFF", "#FFE64D", "#0C141F")

ggplot(data = surveys, aes(x = weight, y = species, color = sex)) +
  geom_boxplot(outlier.shape = NA, width = 0.6) +
  geom_jitter(width = 0.15, height = 0, alpha = 0.3, size = 1) +
  scale_color_manual(values = tron_colors) + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) +
  labs(
    title = "Rodent Weight by Species",
    subtitle = "Y-Axis = Rodent Species",
    x = "Rodent Weight (g)",
    y = NULL,
    caption = "Source: Portal Project Database"
   )

Removing the Legend (and More)

This code is brought to us by Lucy Thackray! I thought it was special that Lucy stacked all three challenged to make a clean, easy to read plot.

colors <- c("#f07857", "#4fb06d" )

ggplot(data = surveys,
       mapping = aes(x = weight, y = species)) +
  geom_jitter( alpha = 0.5) +
  geom_boxplot(outliers = FALSE, mapping = aes(color = sex)) +
  labs(x = "Weight (g)", y = "Species", color = "Sex",
       subtitle = "<span style = 'color:#f07857;'>Female</span> vs <span style = 'color:#4fb06d;'>Male</span> rodent species weights") +
  theme_bw() +
  theme(panel.grid.minor = element_blank()) +
   theme(
    legend.position = "none", 
    plot.subtitle = element_markdown()) +
   theme(axis.ticks.y = element_blank()) +
scale_color_manual(values = colors)

Lab 3

across() + mutate()

This code is brought to us by a student who wishes to remain anonymous! What I appreciated about this code was the use of across() when converting multiple columns to numeric. I also loved their use of the most current syntax for the across() function!

teacher_evals_clean <- teacher_evals |>
  rename(sex = gender) |>
  mutate(teacher_id = as.character(teacher_id),
         question_no = as.character(question_no),
         across(
           .cols = c(no_participants, 
                     resp_share, 
                     SET_score_avg, 
                     percent_failed_cur, 
                     seniority),
           .fns = ~ as.numeric(.x)
           )
         )|>
  filter(no_participants > 10) |>
  select(course_id,
         teacher_id,
         question_no,
         no_participants,
         resp_share,
         SET_score_avg,
         percent_failed_cur,
         academic_degree,
         seniority,
         sex)

Confirming Unique Row Identification

This code is brought to us by Akshat Khandelwal! I thought it was very clever how Akshat played with different combinations of variables to confirm which combination uniquely identified a row.

teacher_question <- teacher_evals |>
  distinct(teacher_id, question_no)

print(nrow(teacher_question) == nrow(teacher_evals))  # False
[1] FALSE
teacher_question_course <- teacher_evals |>
  distinct(teacher_id, question_no, course_id)

print(nrow(teacher_question_course) == nrow(teacher_evals))  # True
[1] TRUE
course_question <- teacher_evals |>
  distinct(question_no, course_id) 

print(nrow(course_question) == nrow(teacher_evals)) # False
[1] FALSE

Proportions in Demographic Table

This code is brought to us by Arturo Ordaz-Gutierrez! I really appreciated how Art added the proportions to this demographic table so it is easier to see the representation of each group.

teacher_evals_clean |>
  distinct(teacher_id, .keep_all = TRUE) |>
  count(academic_degree, sex) |>
  mutate(percent = round(100 * n / sum(n), 1)) |>
  arrange(desc(percent))
# A tibble: 8 × 4
  academic_degree sex        n percent
  <chr>           <chr>  <int>   <dbl>
1 dr              male      95    32.3
2 dr              female    73    24.8
3 ma              male      39    13.3
4 ma              female    36    12.2
5 no_dgr          female    24     8.2
6 no_dgr          male      19     6.5
7 prof            male       7     2.4
8 prof            female     1     0.3

across() + summarize()

This code is brought to us by Rachel Survilas! I loved how Rachel noticed they were using the same n_distinct() function across multiple columns and decided to use across()! Also, the syntax they are using is on point!

teacher_evals_clean |> 
  summarize(
    across(
      .cols = c(teacher_id, course_id),
      .fns = ~ n_distinct(.x))
    )
# A tibble: 1 × 2
  teacher_id course_id
       <int>     <int>
1        294       921

Demographic Barplots with Annotations

This code is brought to us by a student who wishes to remain anonymous! I thought this code was great because they chose to visualize the teacher demographics! They even added annotations to the top of their barplots!

individuals <- teacher_evals_clean |>
  group_by(teacher_id, sex, academic_degree, seniority) |> 
  summarise(n = n()) |> 
  ungroup()

individuals |> 
  group_by(sex) |> 
  summarise(n = n()) |> 
  ggplot(mapping = aes(x = sex, y = n)) +
  geom_col() +
      geom_text(aes(label = n), vjust = -0.5) +
      labs(x = "Sex", y = "Count")
individuals |> 
  group_by(seniority) |> 
  summarise(n = n()) |> 
  ggplot(mapping = aes(x = seniority, y = n)) +
  geom_col() +
      geom_text(aes(label = n), vjust = -0.5) +
      labs(x = "Seniority in Years", y = "Count")
individuals |> 
  group_by(academic_degree) |> 
  summarise(n = n()) |> 
  ggplot(mapping = aes(x = academic_degree, y = n)) +
  geom_col() +
      geom_text(aes(label = n), vjust = -0.5) +
      labs(x = "Academic Degree Held", y = "Count")

Slicing the Max & Min Together

This code is brought to us by a student who wishes to remain anonymous! I thought this code was great because they got both the maximum values and the minimum values in one pipeline. Very efficient!

teacher_evals_clean |>
  filter(seniority == 1) |>
  group_by(teacher_id) |>
  summarize(
    avg_rating = mean(percent_failed_cur, na.rm = TRUE),
    .groups = "drop"
  ) |>
  filter(
    avg_rating == min(avg_rating)|
    avg_rating == max(avg_rating)
  ) |> 
  arrange(avg_rating)
# A tibble: 7 × 2
  teacher_id avg_rating
  <chr>           <dbl>
1 102379           0   
2 103092           0   
3 106126           0   
4 86222            0   
5 98650            0   
6 98651            0   
7 106692           0.68

Using kable() for Nicer Looking Tables

This code is brought to us by a student who wishes to remain anonymous! I appreciated that they read the instructions to use the kable() function to get nicer looking tables. The code is very simple, but the tables look much nicer!

percent_failed <- teacher_evals_clean |> 
  group_by(teacher_id) |> 
  filter(seniority == 1) |> 
  summarize(avg_percent_failed = mean(percent_failed_cur))

percent_failed |> 
  slice_min(avg_percent_failed) |> 
  kable()
teacher_id avg_percent_failed
102379 0
103092 0
106126 0
86222 0
98650 0
98651 0

Challenge 3

Hypothesis Test Conclusion

These statements are brought to us by Abel Alcala and Sebastien Montgrain! I really appreciated the context they infused into their hypothesis test conclusion!

Based on the results of the Chi-Squared test of Independence, we have a p-value of 0.006075 and a X^2 value of 10.207. Since the p-value is less than α = 0.05, we reject the null hypotheses indicating a statistically significant association between seniority level and SET evaluation results. This suggests that instructor experience across junior, senior, and very senior appears to influence how students perceive the use of engaging activities in class.

Because the p-value of our Chi-squared test of independence is less than 0.05 (0.006075), we know that there is an association between a teachers seniority and their SET scores from question 3 on if the professor used activities to make the class more engaging. Combining this Chi-squared test with the graph, we can see that more experienced teachers do a better job of implementing engaging activities is their classes.

Challenge 4

Amazing Tables

Amazing Plots

This plot is brought to us by Marietta Nikolskaya! I was really excited to see someone create a “lollipop” plot, since they are what we (Statisticians) make to compare how different a summary statistic is between groups. When plotting statistics like the mean, median, or difference, we generally advocate for this type of plot over a barplot since barplots are typically reserved for plots of counts or proportions.

Code
# Here is the data wrangling that happened to make this plot!

ca_childcare_new <- ca_childcare |> 
  filter(study_year == 2018) |>
  select(census_region, mc_infant, mfcc_infant) |> 
  mutate(cost_difference = mc_infant - mfcc_infant) |>
  group_by(census_region) |>
  summarize(
    median_center = median(mc_infant),
    median_family = median(mfcc_infant),
    median_difference = median(cost_difference)) |>
  arrange(desc(median_difference))
ca_childcare_new |>
  ggplot(aes(y = fct_reorder(census_region, median_difference), 
             x = median_difference)) +
  geom_segment(aes(yend = census_region, x = 0, xend = median_difference), 
               color = "gray50") +
  geom_point(size = 4, color = "steelblue") +
  labs(
    y = "",
    x = "Cost Difference: Center - Family ($)",
    title = "Cost Difference Between Center and Family Childcare for Infants (2018)"
  ) +
  theme_minimal()

This plot is brought to us by Devin Hadley! I thought this plot was really creative in their use of a non-standard geometry (geom_ribbon()) to fill the space between the lines. I also appreciated that they moved the legend to the top of the plot to make it easier to see!

Code
# Here is the data wrangling that happened to make this plot!
summary_table <- ca_childcare |>
  group_by(study_year) |>
  summarise(
    median_center_price = median(mc_infant),
    median_family_price = median(mfcc_infant)
  ) |>
  mutate(
    center_minus_family_price = median_center_price - median_family_price
  )

summary_table_long <- summary_table |>
  pivot_longer(
    cols = c("median_center_price", "median_family_price"),
    names_to = "care_type",
    values_to = "median_price"
  ) |>
  mutate(
    care_type = case_when(
      care_type == "median_center_price" ~ "Center-Based",
      care_type == "median_family_price" ~ "Family-Based",
      TRUE ~ care_type
    ))
ggplot(summary_table_long, aes(x = study_year, y = median_price, color = care_type)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 3) +
  geom_ribbon(data = summary_table, aes(x = study_year, ymin = median_family_price, ymax = median_center_price, group = 1),
              fill = "skyblue", alpha = 0.3, inherit.aes = FALSE) +
  labs(
    title = "Median Weekly Infant Childcare Costs in California (2008-2018)",
    subtitle = "Based off the aggregated median price for each care type on an annual basis in various California counties.",
    x = "Year",
    y = "Median Weekly Price (2018 Dollars)",
    color = "Type of Childcare"
  ) +
  scale_x_continuous(breaks = seq(2008, 2018, by = 2)) +
  theme_minimal(base_size = 14) +
  theme(
    legend.position = "top",
    plot.title = element_text(face = "bold"),
    plot.subtitle = element_text(margin = margin(b = 10)),
    axis.title = element_text(face = "bold"),
    plot.background = element_rect(color = "black", linewidth = 1) 
  )

This plot is brought to us by Haley Wong! Haley also featured a non-standard geometry (geom_ribbon()) to fill the space between the lines on this plot! I really appreciated Haley’s attention to detail in making titles and labels on the plot that made the context very clear to the reader—$ on the y-axis labels and a subtitle explaining what the gap represents.

Code
# Here is the data wrangling that happened to make this plot!
table <- ca_childcare |>
  pivot_longer(
    cols = mc_infant:mfcc_preschool,
    names_to = "type_age",
    values_to = "median_price"
  ) |>
  mutate(
    type = case_when(
      str_detect(type_age, "^mc_") ~ "center_based",
      str_detect(type_age, "^mfcc_") ~ "family"
    ),
    age = case_when(
      str_detect(type_age, "infant$") ~ "infant",
      str_detect(type_age, "toddler$") ~ "toddler",
      str_detect(type_age, "preschool$") ~ "preschool"
    )
  ) |>
  group_by(study_year, age, type) |>
  summarise(median_price = median(median_price, na.rm = TRUE), .groups = "drop") |>
  pivot_wider(names_from = type, values_from = median_price) |>
  mutate(difference = center_based - family)

prices_long <- table |>
  pivot_longer(
    cols = c(center_based, family),
    names_to = "setting_type",
    values_to = "median_price"
  ) |>
  mutate(
    setting_label = fct_recode(
      setting_type,
      "Center-based" = "center_based",
      "Family (In-home)" = "family"
    ),
    age = factor(age, levels = c("infant", "toddler", "preschool"))
  )
ggplot(prices_long, aes(x = study_year, y = median_price, color = setting_label)) +
  geom_ribbon(
    data = table |> 
      mutate(age = factor(age, levels = c("infant", "toddler", "preschool"))),
    aes(x = study_year, ymin = family, ymax = center_based, fill = "Price Gap"),
    alpha = 0.15,
    inherit.aes = FALSE
  ) +
  geom_line(size = 1.2) +
  geom_point(size = 2.5, alpha = 0.9) +
  facet_wrap(~ age, nrow = 1) +
  scale_color_brewer(palette = "Set1") +
  scale_fill_manual(values = c("Price Gap" = "gray70")) +
  scale_y_continuous(labels = scales::label_dollar()) +
  scale_x_continuous(breaks = seq(2008, 2018, 4)) +
  labs(
    title = "Center-Based vs Family Childcare Costs Over Time",
    subtitle = "Shaded area shows median price gap by age group (California)",
    x = "Study Year",
    y = "Median Weekly Price (USD)",
    color = "Care Setting",
    fill = ""
  ) +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(face = "bold", size = 16, hjust = 0.5),
    plot.subtitle = element_text(size = 13, hjust = 0.5),
    axis.text.x = element_text(size = 10),
    legend.position = "right"
  )

This plot is brought to us by Isaiah Woodard! I really appreciated how Isaiah took the skills they learned in Challenge 2 and applied them to a new setting!

Code
# Here is the data wrangling that happened to make this plot!

plot_data <- ca_childcare %>% 
  select(census_region, study_year, mfccsa, mcsa) %>% 
  filter(study_year == c("2008":"2018")) %>% 
  group_by(census_region, study_year) %>% 
  summarise(median_mcsa = median(mcsa),
            median_mfccsa = median(mfccsa)) %>% 
  pivot_longer(cols = c(median_mcsa, median_mfccsa),
               names_to = "category",
               values_to = "medians")

In a similar setting, Lucy also took the ideas from Challenge 2 and applied them to make an exciting plot!

Code
# Here is the data wrangling that happened to make this plot!

challenge <- ca_childcare |>
  rename(CenterBased = mcsa,
         Family = mfccsa) |>
  pivot_longer(cols = CenterBased:Family,
               names_to = "type_of_care",
               values_to = "Median_price")
colors <- c("#6C88C4", "#FF828B")

challenge |>
 mutate(year = as.factor(study_year)) |>
  ggplot(mapping = aes(x = year,
                       y = Median_price,
                       color = type_of_care)) +
          
  theme_bw() +
  geom_jitter (alpha = 0.5, aes(color = type_of_care), show.legend = FALSE) +
  geom_boxplot(outliers = FALSE) +
  labs(x = "Study Year",
       y = "",
       color = "Type of Childcare",
       subtitle = "Weekly Median Price ($) for <span style = 'color:#6C88C4;'>Center Based</span> vs <span style = 'color:#FF828B;'>Family</span> Childcare") +
    scale_color_manual(values = colors) +
  theme(legend.position = "none", 
        plot.subtitle = element_markdown(),
        panel.grid.major.x = element_blank()) 

plot_data %>% 
  ggplot(aes(x = study_year,
             y = medians,
             color = category)) +
  facet_wrap(~ census_region) +
  geom_point() +
  scale_x_continuous(breaks = c(2008, 2010, 2012, 2014, 2016, 2018)) +
  labs(title = "Price of Childcare Over Time in California",
       subtitle = "Plots subdivided for <span style='color:#009e73;'>center-based</span> and <span     style='color:#d55e00;'>at-home</span> childcare",
       x = "Study Year",
       y = "Median Weekly Cost of Childcare ($)") +
  scale_color_manual(values = c("#009e73", "#d55e00")) +
  theme_bw() +
  theme(plot.subtitle = ggtext::element_markdown(),
        axis.text = element_text(size = 5),
        legend.position = "")

This plot is brought to us by Rayan Tahir! I thought Rayan’s plot was very clear and easy to understand. Moreover, their use of $ on the axis labels and moving the legend to the top of the plot made it even easier to understand what was being plotted.

Code
# Here is the data wrangling that happened to make this plot!

ca_afford <- ca_childcare |>
  mutate(
    weekly_income = mhi_2018 / 52, 
    center_afford = mc_infant / weekly_income,
    family_afford = mfcc_infant / weekly_income
  )

ca_afford_long <- ca_afford |>
  select(study_year, center_afford,          family_afford) |>
  pivot_longer(
    cols = c(center_afford, family_afford),
    names_to = "Setting",
    values_to = "Affordability"
) |>
mutate(
  Setting = recode(Setting,
                  center_afford = "Center-Based",
                   family_afford = "Family-Based")
)
ggplot(ca_afford_long, aes(x = study_year, y = Affordability, color = Setting)) +
  stat_summary(fun = median, geom = "line", linewidth = 1.2) +
  stat_summary(fun = median, geom = "point", size = 2.2) +
  scale_y_continuous(labels = scales::label_percent(accuracy = 1)) + 
  labs(
  title = "Affordability of Childcare in California (2008-2018)",
  subtitle = "Median Weekly Cost as a Share of County Median Household Income",
  x = "Year",
  y = "Median Share of Weekly Income Spent on Infant Care",
  color = "Childcare Setting"
) +
theme_minimal(base_size = 11) +
theme(
  plot.title = element_text(face = "bold", size = 15, hjust = 0.5),
  plot.subtitle= element_text(size = 12, hjust= 0.5),
  legend.position = "top"
)

This plot is brought to us by a student who wishes to remain anonymous. I thought this plot was both incredibly clear and rather creative. They used a non-standard geom (geom_violin) which shows the density of each group’s distribution. Moreover, they plot a horizontal line at the median and annotate the median value (with dollar signs!).

Code
# Here is the data wrangling that happened to make this plot!

plot_data <- ca_childcare |>
  select(mc_infant, mc_toddler, mc_preschool,
         mfcc_infant, mfcc_toddler, mfcc_preschool) |>
  pivot_longer(
    cols = everything(),
    names_to = "age_group",
    values_to = "weekly_price"
  ) |>
  mutate(
    age_group = recode(
      age_group,
      mc_infant = "Center_Infant",
      mc_toddler = "Center_Toddler",
      mc_preschool = "Center_Preschool",
      mfcc_infant = "Family_Infant",
      mfcc_toddler = "Family_Toddler",
      mfcc_preschool = "Family_Preschool"
    ),
    setting = if_else(str_detect(age_group, "Center"), "Center-Based", "Family (In-Home)"),
    age_group = case_when(
      str_detect(age_group, "Infant") ~ "Infant",
      str_detect(age_group, "Toddler") ~ "Toddler",
      str_detect(age_group, "Preschool") ~ "Preschool"
    ),
    age_group = fct_relevel(age_group, "Infant", "Toddler", "Preschool")
  )
library(RColorBrewer)
custom_colors <- colorRampPalette(brewer.pal(8, "Accent"))(10)

ggplot(plot_data, aes(x = age_group, y = weekly_price, fill = age_group)) +
  geom_violin(trim = TRUE, alpha = 0.8, linewidth = 0.5) +
  stat_summary(fun = median, fun.min = median, fun.max = median,
               geom = "errorbar", width = 0.45, color = "black", linewidth = 0.5) +
  stat_summary(fun = median, geom = "text",
               aes(label = scales::dollar(after_stat(y))),
               vjust = -0.6, size = 3, color = "black") +
  facet_wrap(~ setting, ncol = 2) +
  scale_y_continuous(labels = scales::dollar) +
  scale_fill_manual(values = custom_colors) +   
  scale_color_manual(values = custom_colors) +
  labs(
    title = "Distribution of Full-Time Weekly Childcare Costs in California (2008–2018)",
    subtitle = "Violin shows distribution shape; bold line and label mark the median",
    x = "Age Group of Children",
    y = "Full-Time Weekly Cost"
  ) +
  theme_minimal() +
  theme(
    legend.position = "none",
    panel.border = element_rect(color = "grey60", fill = NA, linewidth = 0.7),
    strip.background = element_rect(fill = "grey95", color = "grey70"),
    plot.title = element_text(face = "bold", size = 15),
    plot.subtitle = element_text(face = "italic", color = "grey30"),
    axis.title = element_text(face = "bold"),
    axis.line = element_line(color = "grey60"),
    panel.grid.minor = element_blank()
  )

This plot is brought to us by Sebastien Montgrain! I really appreciated all the effort Sebastien put into getting the themes for the plot to be as clean as possible!

Code
# Here is the data wrangling that happened to make this plot!

clean_data <- counties |>
  left_join(childcare_costs, join_by(county_fips_code)) |>
  filter(state_name == "California") |>
  group_by(study_year) |>
  summarise(mcsa = median(mcsa, na.rm = TRUE), 
            mfccsa = median(mfccsa, na.rm = TRUE), 
            mhi_2018 = median(mhi_2018, na.rm = TRUE)) |>
  select(study_year, mcsa, mfccsa, mhi_2018) |>
  rename('Weekly Center Based Care Price' = mcsa, 'Weekly Family Based Care Price' = mfccsa, 'Median Income' = mhi_2018) |>
  pivot_longer(cols = -study_year, names_to = "Variable", values_to = "value") |>
  mutate(Variable = fct_relevel(Variable, 
                                "Weekly Center Based Care Price", 
                                "Weekly Family Based Care Price", 
                                "Median Income")
         )
library(scales)

ggplot(clean_data, aes(study_year, value, color = Variable)) +
  geom_line() +
  scale_y_continuous(labels = label_dollar()) +
  labs(x = "Year", y = "", title = "Trends in Childcare Cost and Median Income", subtitle = "Measurments in US Dollars") +
  facet_wrap(~ Variable, scales = "free_y", nrow = 1) +
  theme_linedraw() +
  scale_x_continuous(breaks = unique(clean_data$study_year)) +
  theme(plot.title = element_text(margin = margin(b = 13), face = "bold"), 
        plot.subtitle = element_text(size = 10),
        legend.position = "none",
        aspect.ratio = 0.9,
        axis.text.x = element_text(size = 7, angle = 45, hjust = 1)) +
  scale_color_manual(values = c("orange", "coral1", "steelblue2"))

Lab 5

Multiple Filters

This code is brought to us by a variety of students—Garrett, Jake, Liam, Lucy, and an anonymous student! I appreciated how each student managed to (efficiently) grab both witness interviews in one pipeline!

Garrett’s Version:

person |>
  left_join(y = interview,
            join_by(id == person_id)) |>
  filter((address_street_name == "Northwestern Dr" &
            address_number == max(address_number)) | #find interview for either the person in the last house on Northwestern Dr or Annabel on Franklin Ave
           (address_street_name == "Franklin Ave" &
              str_detect(name, "Annabel"))) |>
  pull(name, transcript)

Jake’s Version:

# in this chunk of code I joined the interview data set with the person data set
# so that I could narrow down the witnesses and extract their interviews to find more clues
witnesses <- full_join(person, interview, by = c("id" = "person_id")) |>  
  filter((str_detect(name, pattern = "Annabel") & address_street_name == "Franklin Ave") |  
        address_street_name == "Northwestern Dr" & address_number == max(address_number)) |>
  select(transcript)

Liam’s Version:

inner_join(x = person,
           y = interview,
           by = join_by(id == person_id)
           ) %>% # Inner join by `id` and `person_id`
  filter((str_detect(name, pattern = "Annabel") & address_street_name == "Franklin Ave") | 
          (address_street_name == "Northwestern Dr" & address_number == max(address_number))
         ) %>% # Filter for witness 1 (Annabel on Franklin Ave) and witness 2 (last house on Northwestern Dr)
  pull(transcript) # Extract transcripts for witnesses

Lucy’s Version:

#join interview data table with person to connect their interviews to their addresses/names. Save to object to be used to interview suspects later
interviewpeople <- interview |>
inner_join(y = person,
          by = join_by(person_id == id)) 
#filter witnesses to find the ones that match the descriptions above
interviewpeople |>
  filter((str_detect(name, pattern = "Annabel") == TRUE &
         address_street_name == "Franklin Ave") |
        (address_street_name == "Northwestern Dr" &
         address_number == max(address_number))) |>
  pull(transcript)

Anonymous Version:

interview |> 
  left_join(person, join_by(person_id == id)) |> 
  filter((str_detect(name, pattern = "Annabel") & address_street_name == "Franklin Ave") | 
           (address_street_name == "Northwestern Dr" & address_number == max(address_number, na.rm = TRUE)))  |> 
  pull(transcript)

Featuring Lubridate Functions

This code comes to us from Karthik Balaji! I appreciated Karthik’s extensive use of functions from lubridate when filtering these data:

get_fit_now_member |>
  left_join(get_fit_now_check_in, by = join_by(id == membership_id)) |>
  mutate(check_in_date = ymd(check_in_date)) |>
  filter(month(check_in_date) == 1,
         day(check_in_date) == 9,
         membership_status == "gold",
         str_detect(id, pattern = "48Z")) |>
  left_join(person, by = join_by(person_id == id)) |>
  left_join(drivers_license, by = join_by(license_id == id)) |>
  filter(str_detect(plate_number, pattern = "H42W")) |>
  left_join(interview, by = "person_id") |>
  select(transcript)

This code is brought to us by Rayan Tahir and May Thu Thu Kyaw! I thought it was so cool that both of them found a way to create an interval of time and filter for dates within that interval!

Rayan’s Version:

concert_mastermind <- facebook_event_checkin |>
  filter(
    event_name == "SQL Symphony Concert", 
    ymd(date) %within% interval(ymd("2017-12-01"), 
                                ymd("2017-12-31"))
    ) |>
  count(person_id, name = "visits") |>
  filter(visits >= 3) |>
  inner_join(person, by = c("person_id" = "id")) |>
  inner_join(drivers_license, by = c("license_id" =   "id")) |>
  filter(hair_color == "red", between(height, 65, 67))

May’s Version:

#Find people who attended SQL Symphony Concert 3 times in Dec 2017 
concert_3x_dec <- facebook_event_checkin |>
 mutate(date = ymd(date)) |>      #convert numeric dates (like 20171210) into proper Date objects            
  filter(
    event_name == "SQL Symphony Concert", 
    date %within% interval(ymd("2017-12-01"), ymd("2017-12-31")) #between these dates 
  ) |>
  count(person_id, name = "n") |>
  filter(n == 3) |>
  select(person_id) 

if_any()

This code is brought to us by Isaiah Woodard! I loved how Isaiah found a new setting to use if_any() in combination with !is.na() to check for missing values.

suspects <- get_fit_now_member %>% 
  filter(membership_status == "gold", # includes only gold members
         str_detect(id, pattern = "^48Z")) %>% # includes only members whose ID starts with 48Z
  left_join(y = get_fit_now_check_in,
            by = join_by(id == membership_id)) %>% 
  left_join(y = interview,
            by = join_by(person_id)) %>%  # joins to access additional data
  filter(if_any(
    .cols = transcript,
    .fns = ~ !is.na(.x) # Checks for individuals with interviews. Removes those without. 
  ))