Here are some of the exemplary assignments from your peers!
Themes
This code is brought to us by Jonathan Flores! I really appreciated the extensive themes they included and how these themes make the plot clearer to the reader.
ggplot(data = surveys,mapping =aes(x = species, y = weight)) +geom_boxplot(outliers =FALSE) +geom_jitter(color ="steel blue", alpha =0.08) +labs(title ="Distribution of Weight by Species Boxplot",subtitle ="Vertical Orientation",x ="Species Name", y ="Weight (g)") +theme_bw() +theme(axis.text.x =element_text(angle =45),axis.ticks.length.x =unit(8, "mm"),axis.ticks.x =element_line(color ="white"),panel.border =element_rect(color ="black", linewidth =1),plot.title =element_text(face ="bold"),plot.subtitle =element_text(color ="steelblue"),panel.grid.major.x =element_blank(),panel.grid.minor.x =element_blank())
Unique Colors
This code is brought to us by Fisher Fraley! I really appreciated how they found a unique color theme from a context they liked!
This code is brought to us by a student who wishes to remain anonymous! What I appreciated about this code was the use of across() when converting multiple columns to numeric. I also loved their use of the most current syntax for the across() function!
This code is brought to us by Akshat Khandelwal! I thought it was very clever how Akshat played with different combinations of variables to confirm which combination uniquely identified a row.
This code is brought to us by Arturo Ordaz-Gutierrez! I really appreciated how Art added the proportions to this demographic table so it is easier to see the representation of each group.
# A tibble: 8 × 4
academic_degree sex n percent
<chr> <chr> <int> <dbl>
1 dr male 95 32.3
2 dr female 73 24.8
3 ma male 39 13.3
4 ma female 36 12.2
5 no_dgr female 24 8.2
6 no_dgr male 19 6.5
7 prof male 7 2.4
8 prof female 1 0.3
across() + summarize()
This code is brought to us by Rachel Survilas! I loved how Rachel noticed they were using the same n_distinct() function across multiple columns and decided to use across()! Also, the syntax they are using is on point!
This code is brought to us by a student who wishes to remain anonymous! I thought this code was great because they chose to visualize the teacher demographics! They even added annotations to the top of their barplots!
This code is brought to us by a student who wishes to remain anonymous! I thought this code was great because they got both the maximum values and the minimum values in one pipeline. Very efficient!
This code is brought to us by a student who wishes to remain anonymous! I appreciated that they read the instructions to use the kable() function to get nicer looking tables. The code is very simple, but the tables look much nicer!
These statements are brought to us by Abel Alcala and Sebastien Montgrain! I really appreciated the context they infused into their hypothesis test conclusion!
Based on the results of the Chi-Squared test of Independence, we have a p-value of 0.006075 and a X^2 value of 10.207. Since the p-value is less than α = 0.05, we reject the null hypotheses indicating a statistically significant association between seniority level and SET evaluation results. This suggests that instructor experience across junior, senior, and very senior appears to influence how students perceive the use of engaging activities in class.
Because the p-value of our Chi-squared test of independence is less than 0.05 (0.006075), we know that there is an association between a teachers seniority and their SET scores from question 3 on if the professor used activities to make the class more engaging. Combining this Chi-squared test with the graph, we can see that more experienced teachers do a better job of implementing engaging activities is their classes.
Challenge 4
Amazing Tables
Amazing Plots
This plot is brought to us by Marietta Nikolskaya! I was really excited to see someone create a “lollipop” plot, since they are what we (Statisticians) make to compare how different a summary statistic is between groups. When plotting statistics like the mean, median, or difference, we generally advocate for this type of plot over a barplot since barplots are typically reserved for plots of counts or proportions.
Code
# Here is the data wrangling that happened to make this plot!ca_childcare_new <- ca_childcare |>filter(study_year ==2018) |>select(census_region, mc_infant, mfcc_infant) |>mutate(cost_difference = mc_infant - mfcc_infant) |>group_by(census_region) |>summarize(median_center =median(mc_infant),median_family =median(mfcc_infant),median_difference =median(cost_difference)) |>arrange(desc(median_difference))
ca_childcare_new |>ggplot(aes(y =fct_reorder(census_region, median_difference), x = median_difference)) +geom_segment(aes(yend = census_region, x =0, xend = median_difference), color ="gray50") +geom_point(size =4, color ="steelblue") +labs(y ="",x ="Cost Difference: Center - Family ($)",title ="Cost Difference Between Center and Family Childcare for Infants (2018)" ) +theme_minimal()
This plot is brought to us by Devin Hadley! I thought this plot was really creative in their use of a non-standard geometry (geom_ribbon()) to fill the space between the lines. I also appreciated that they moved the legend to the top of the plot to make it easier to see!
Code
# Here is the data wrangling that happened to make this plot!summary_table <- ca_childcare |>group_by(study_year) |>summarise(median_center_price =median(mc_infant),median_family_price =median(mfcc_infant) ) |>mutate(center_minus_family_price = median_center_price - median_family_price )summary_table_long <- summary_table |>pivot_longer(cols =c("median_center_price", "median_family_price"),names_to ="care_type",values_to ="median_price" ) |>mutate(care_type =case_when( care_type =="median_center_price"~"Center-Based", care_type =="median_family_price"~"Family-Based",TRUE~ care_type ))
ggplot(summary_table_long, aes(x = study_year, y = median_price, color = care_type)) +geom_line(linewidth =1.2) +geom_point(size =3) +geom_ribbon(data = summary_table, aes(x = study_year, ymin = median_family_price, ymax = median_center_price, group =1),fill ="skyblue", alpha =0.3, inherit.aes =FALSE) +labs(title ="Median Weekly Infant Childcare Costs in California (2008-2018)",subtitle ="Based off the aggregated median price for each care type on an annual basis in various California counties.",x ="Year",y ="Median Weekly Price (2018 Dollars)",color ="Type of Childcare" ) +scale_x_continuous(breaks =seq(2008, 2018, by =2)) +theme_minimal(base_size =14) +theme(legend.position ="top",plot.title =element_text(face ="bold"),plot.subtitle =element_text(margin =margin(b =10)),axis.title =element_text(face ="bold"),plot.background =element_rect(color ="black", linewidth =1) )
This plot is brought to us by Haley Wong! Haley also featured a non-standard geometry (geom_ribbon()) to fill the space between the lines on this plot! I really appreciated Haley’s attention to detail in making titles and labels on the plot that made the context very clear to the reader—$ on the y-axis labels and a subtitle explaining what the gap represents.
Code
# Here is the data wrangling that happened to make this plot!table <- ca_childcare |>pivot_longer(cols = mc_infant:mfcc_preschool,names_to ="type_age",values_to ="median_price" ) |>mutate(type =case_when(str_detect(type_age, "^mc_") ~"center_based",str_detect(type_age, "^mfcc_") ~"family" ),age =case_when(str_detect(type_age, "infant$") ~"infant",str_detect(type_age, "toddler$") ~"toddler",str_detect(type_age, "preschool$") ~"preschool" ) ) |>group_by(study_year, age, type) |>summarise(median_price =median(median_price, na.rm =TRUE), .groups ="drop") |>pivot_wider(names_from = type, values_from = median_price) |>mutate(difference = center_based - family)prices_long <- table |>pivot_longer(cols =c(center_based, family),names_to ="setting_type",values_to ="median_price" ) |>mutate(setting_label =fct_recode( setting_type,"Center-based"="center_based","Family (In-home)"="family" ),age =factor(age, levels =c("infant", "toddler", "preschool")) )
ggplot(prices_long, aes(x = study_year, y = median_price, color = setting_label)) +geom_ribbon(data = table |>mutate(age =factor(age, levels =c("infant", "toddler", "preschool"))),aes(x = study_year, ymin = family, ymax = center_based, fill ="Price Gap"),alpha =0.15,inherit.aes =FALSE ) +geom_line(size =1.2) +geom_point(size =2.5, alpha =0.9) +facet_wrap(~ age, nrow =1) +scale_color_brewer(palette ="Set1") +scale_fill_manual(values =c("Price Gap"="gray70")) +scale_y_continuous(labels = scales::label_dollar()) +scale_x_continuous(breaks =seq(2008, 2018, 4)) +labs(title ="Center-Based vs Family Childcare Costs Over Time",subtitle ="Shaded area shows median price gap by age group (California)",x ="Study Year",y ="Median Weekly Price (USD)",color ="Care Setting",fill ="" ) +theme_minimal(base_size =14) +theme(plot.title =element_text(face ="bold", size =16, hjust =0.5),plot.subtitle =element_text(size =13, hjust =0.5),axis.text.x =element_text(size =10),legend.position ="right" )
This plot is brought to us by Isaiah Woodard! I really appreciated how Isaiah took the skills they learned in Challenge 2 and applied them to a new setting!
Code
# Here is the data wrangling that happened to make this plot!plot_data <- ca_childcare %>%select(census_region, study_year, mfccsa, mcsa) %>%filter(study_year ==c("2008":"2018")) %>%group_by(census_region, study_year) %>%summarise(median_mcsa =median(mcsa),median_mfccsa =median(mfccsa)) %>%pivot_longer(cols =c(median_mcsa, median_mfccsa),names_to ="category",values_to ="medians")
In a similar setting, Lucy also took the ideas from Challenge 2 and applied them to make an exciting plot!
Code
# Here is the data wrangling that happened to make this plot!challenge <- ca_childcare |>rename(CenterBased = mcsa,Family = mfccsa) |>pivot_longer(cols = CenterBased:Family,names_to ="type_of_care",values_to ="Median_price")
plot_data %>%ggplot(aes(x = study_year,y = medians,color = category)) +facet_wrap(~ census_region) +geom_point() +scale_x_continuous(breaks =c(2008, 2010, 2012, 2014, 2016, 2018)) +labs(title ="Price of Childcare Over Time in California",subtitle ="Plots subdivided for <span style='color:#009e73;'>center-based</span> and <span style='color:#d55e00;'>at-home</span> childcare",x ="Study Year",y ="Median Weekly Cost of Childcare ($)") +scale_color_manual(values =c("#009e73", "#d55e00")) +theme_bw() +theme(plot.subtitle = ggtext::element_markdown(),axis.text =element_text(size =5),legend.position ="")
This plot is brought to us by Rayan Tahir! I thought Rayan’s plot was very clear and easy to understand. Moreover, their use of $ on the axis labels and moving the legend to the top of the plot made it even easier to understand what was being plotted.
Code
# Here is the data wrangling that happened to make this plot!ca_afford <- ca_childcare |>mutate(weekly_income = mhi_2018 /52, center_afford = mc_infant / weekly_income,family_afford = mfcc_infant / weekly_income )ca_afford_long <- ca_afford |>select(study_year, center_afford, family_afford) |>pivot_longer(cols =c(center_afford, family_afford),names_to ="Setting",values_to ="Affordability") |>mutate(Setting =recode(Setting,center_afford ="Center-Based",family_afford ="Family-Based"))
ggplot(ca_afford_long, aes(x = study_year, y = Affordability, color = Setting)) +stat_summary(fun = median, geom ="line", linewidth =1.2) +stat_summary(fun = median, geom ="point", size =2.2) +scale_y_continuous(labels = scales::label_percent(accuracy =1)) +labs(title ="Affordability of Childcare in California (2008-2018)",subtitle ="Median Weekly Cost as a Share of County Median Household Income",x ="Year",y ="Median Share of Weekly Income Spent on Infant Care",color ="Childcare Setting") +theme_minimal(base_size =11) +theme(plot.title =element_text(face ="bold", size =15, hjust =0.5),plot.subtitle=element_text(size =12, hjust=0.5),legend.position ="top")
This plot is brought to us by a student who wishes to remain anonymous. I thought this plot was both incredibly clear and rather creative. They used a non-standard geom (geom_violin) which shows the density of each group’s distribution. Moreover, they plot a horizontal line at the median and annotate the median value (with dollar signs!).
Code
# Here is the data wrangling that happened to make this plot!plot_data <- ca_childcare |>select(mc_infant, mc_toddler, mc_preschool, mfcc_infant, mfcc_toddler, mfcc_preschool) |>pivot_longer(cols =everything(),names_to ="age_group",values_to ="weekly_price" ) |>mutate(age_group =recode( age_group,mc_infant ="Center_Infant",mc_toddler ="Center_Toddler",mc_preschool ="Center_Preschool",mfcc_infant ="Family_Infant",mfcc_toddler ="Family_Toddler",mfcc_preschool ="Family_Preschool" ),setting =if_else(str_detect(age_group, "Center"), "Center-Based", "Family (In-Home)"),age_group =case_when(str_detect(age_group, "Infant") ~"Infant",str_detect(age_group, "Toddler") ~"Toddler",str_detect(age_group, "Preschool") ~"Preschool" ),age_group =fct_relevel(age_group, "Infant", "Toddler", "Preschool") )
library(RColorBrewer)custom_colors <-colorRampPalette(brewer.pal(8, "Accent"))(10)ggplot(plot_data, aes(x = age_group, y = weekly_price, fill = age_group)) +geom_violin(trim =TRUE, alpha =0.8, linewidth =0.5) +stat_summary(fun = median, fun.min = median, fun.max = median,geom ="errorbar", width =0.45, color ="black", linewidth =0.5) +stat_summary(fun = median, geom ="text",aes(label = scales::dollar(after_stat(y))),vjust =-0.6, size =3, color ="black") +facet_wrap(~ setting, ncol =2) +scale_y_continuous(labels = scales::dollar) +scale_fill_manual(values = custom_colors) +scale_color_manual(values = custom_colors) +labs(title ="Distribution of Full-Time Weekly Childcare Costs in California (2008–2018)",subtitle ="Violin shows distribution shape; bold line and label mark the median",x ="Age Group of Children",y ="Full-Time Weekly Cost" ) +theme_minimal() +theme(legend.position ="none",panel.border =element_rect(color ="grey60", fill =NA, linewidth =0.7),strip.background =element_rect(fill ="grey95", color ="grey70"),plot.title =element_text(face ="bold", size =15),plot.subtitle =element_text(face ="italic", color ="grey30"),axis.title =element_text(face ="bold"),axis.line =element_line(color ="grey60"),panel.grid.minor =element_blank() )
This plot is brought to us by Sebastien Montgrain! I really appreciated all the effort Sebastien put into getting the themes for the plot to be as clean as possible!
Code
# Here is the data wrangling that happened to make this plot!clean_data <- counties |>left_join(childcare_costs, join_by(county_fips_code)) |>filter(state_name =="California") |>group_by(study_year) |>summarise(mcsa =median(mcsa, na.rm =TRUE), mfccsa =median(mfccsa, na.rm =TRUE), mhi_2018 =median(mhi_2018, na.rm =TRUE)) |>select(study_year, mcsa, mfccsa, mhi_2018) |>rename('Weekly Center Based Care Price'= mcsa, 'Weekly Family Based Care Price'= mfccsa, 'Median Income'= mhi_2018) |>pivot_longer(cols =-study_year, names_to ="Variable", values_to ="value") |>mutate(Variable =fct_relevel(Variable, "Weekly Center Based Care Price", "Weekly Family Based Care Price", "Median Income") )
library(scales)ggplot(clean_data, aes(study_year, value, color = Variable)) +geom_line() +scale_y_continuous(labels =label_dollar()) +labs(x ="Year", y ="", title ="Trends in Childcare Cost and Median Income", subtitle ="Measurments in US Dollars") +facet_wrap(~ Variable, scales ="free_y", nrow =1) +theme_linedraw() +scale_x_continuous(breaks =unique(clean_data$study_year)) +theme(plot.title =element_text(margin =margin(b =13), face ="bold"), plot.subtitle =element_text(size =10),legend.position ="none",aspect.ratio =0.9,axis.text.x =element_text(size =7, angle =45, hjust =1)) +scale_color_manual(values =c("orange", "coral1", "steelblue2"))
Lab 5
Multiple Filters
This code is brought to us by a variety of students—Garrett, Jake, Liam, Lucy, and an anonymous student! I appreciated how each student managed to (efficiently) grab both witness interviews in one pipeline!
Garrett’s Version:
person |>left_join(y = interview,join_by(id == person_id)) |>filter((address_street_name =="Northwestern Dr"& address_number ==max(address_number)) |#find interview for either the person in the last house on Northwestern Dr or Annabel on Franklin Ave (address_street_name =="Franklin Ave"&str_detect(name, "Annabel"))) |>pull(name, transcript)
Jake’s Version:
# in this chunk of code I joined the interview data set with the person data set# so that I could narrow down the witnesses and extract their interviews to find more clueswitnesses <-full_join(person, interview, by =c("id"="person_id")) |>filter((str_detect(name, pattern ="Annabel") & address_street_name =="Franklin Ave") | address_street_name =="Northwestern Dr"& address_number ==max(address_number)) |>select(transcript)
Liam’s Version:
inner_join(x = person,y = interview,by =join_by(id == person_id) ) %>%# Inner join by `id` and `person_id`filter((str_detect(name, pattern ="Annabel") & address_street_name =="Franklin Ave") | (address_street_name =="Northwestern Dr"& address_number ==max(address_number)) ) %>%# Filter for witness 1 (Annabel on Franklin Ave) and witness 2 (last house on Northwestern Dr)pull(transcript) # Extract transcripts for witnesses
Lucy’s Version:
#join interview data table with person to connect their interviews to their addresses/names. Save to object to be used to interview suspects laterinterviewpeople <- interview |>inner_join(y = person,by =join_by(person_id == id)) #filter witnesses to find the ones that match the descriptions aboveinterviewpeople |>filter((str_detect(name, pattern ="Annabel") ==TRUE& address_street_name =="Franklin Ave") | (address_street_name =="Northwestern Dr"& address_number ==max(address_number))) |>pull(transcript)
This code comes to us from Karthik Balaji! I appreciated Karthik’s extensive use of functions from lubridate when filtering these data:
get_fit_now_member |>left_join(get_fit_now_check_in, by =join_by(id == membership_id)) |>mutate(check_in_date =ymd(check_in_date)) |>filter(month(check_in_date) ==1,day(check_in_date) ==9, membership_status =="gold",str_detect(id, pattern ="48Z")) |>left_join(person, by =join_by(person_id == id)) |>left_join(drivers_license, by =join_by(license_id == id)) |>filter(str_detect(plate_number, pattern ="H42W")) |>left_join(interview, by ="person_id") |>select(transcript)
This code is brought to us by Rayan Tahir and May Thu Thu Kyaw! I thought it was so cool that both of them found a way to create an interval of time and filter for dates within that interval!
Rayan’s Version:
concert_mastermind <- facebook_event_checkin |>filter( event_name =="SQL Symphony Concert", ymd(date) %within%interval(ymd("2017-12-01"), ymd("2017-12-31")) ) |>count(person_id, name ="visits") |>filter(visits >=3) |>inner_join(person, by =c("person_id"="id")) |>inner_join(drivers_license, by =c("license_id"="id")) |>filter(hair_color =="red", between(height, 65, 67))
May’s Version:
#Find people who attended SQL Symphony Concert 3 times in Dec 2017 concert_3x_dec <- facebook_event_checkin |>mutate(date =ymd(date)) |>#convert numeric dates (like 20171210) into proper Date objects filter( event_name =="SQL Symphony Concert", date %within%interval(ymd("2017-12-01"), ymd("2017-12-31")) #between these dates ) |>count(person_id, name ="n") |>filter(n ==3) |>select(person_id)
if_any()
This code is brought to us by Isaiah Woodard! I loved how Isaiah found a new setting to use if_any() in combination with !is.na() to check for missing values.
suspects <- get_fit_now_member %>%filter(membership_status =="gold", # includes only gold membersstr_detect(id, pattern ="^48Z")) %>%# includes only members whose ID starts with 48Zleft_join(y = get_fit_now_check_in,by =join_by(id == membership_id)) %>%left_join(y = interview,by =join_by(person_id)) %>%# joins to access additional datafilter(if_any(.cols = transcript,.fns =~!is.na(.x) # Checks for individuals with interviews. Removes those without. ))