Example A-level Portfolio
My Grade: I believe my grade equivalent to course work evidenced below to be an A.
Learning Objective Evidence: In the code chunks below, provide code from Lab or Challenge assignments where you believe you have demonstrated proficiency with the specified learning target. Be sure to specify where the code came from (e.g., Lab 4 Question 2).
Working with Data
WD-1: I can import data from a variety of formats (e.g., csv, xlsx, txt, etc.).
csv
# Challenge 3
teacher_data <- read_csv(here::here("data", "teacher_evals.csv"))
xlsx
# PA 4
military <- read_xlsx(here::here("data",
                                 "gov_spending_per_capita.xlsx"),
                      sheet = "Share of Govt. spending",
                      skip = 7,
                      n_max = 191)
txt
# Check in 2.3
ages_tab <-
  read_table(file = here::here("Week 2", "Check-ins", "Ages_Data", "ages_tab.txt"))
WD-2: I can select necessary columns from a dataset.
# Lab 3, Question 5
# Success comment: I'd encourage you to be more consistent with your function
# syntax. The first syntax for the across() function is spot on, but the second
# one lacks the details you included in the first.
teacher_evals_clean = teacher_data |>
  rename(sex = gender) |>
  filter(no_participants >= 10) |>
  select(course_id, teacher_id, question_no, no_participants,
         resp_share, SET_score_avg, percent_failed_cur,
         academic_degree, seniority, sex) |>
  mutate(
    across(.cols = course_id:teacher_id, .fns = ~ as.character(.x)),
    across(.cols = c(academic_degree, sex), .fns = ~ as.factor(.x)))
WD-3: I can filter rows from a dataframe for a variety of data types (e.g., numeric, integer, character, factor, date).
- numeric
# Lab 4, Question 4
# Success comments: I would recommend using the %in% operator instead of the or!
# Nice work learning to drop the groups!
# Nice column names! You could even be more specific and say "Median Income"!
ca_childcare_clean |>
  filter(study_year %in% c(2008, 2018)) |>
  group_by(region, study_year) |>
  summarise(median_income = median(mhi_2018, na.rm = TRUE), .groups = 'drop') |>
  pivot_wider(id_cols = region,
              names_from = study_year,
              values_from = median_income,
              names_prefix = "Median Income ") |>
  arrange(`Median Income 2018`)
- character – specifically a string (example must use functions from stringr)
# Lab 5
person |>
  filter(
    (address_street_name == "Northwestern Dr" &
       address_number == max(address_number)) |
      (str_detect(name, "Annabel") &
         address_street_name == "Franklin Ave")) |>
  left_join(interview, by = c("id" = "person_id")) |>
  select(transcript)
- factor
# Lab 5
person |>
  filter(
    (address_street_name == "Northwestern Dr" &
       address_number == max(address_number)) |
      (str_detect(name, "Annabel") &
         address_street_name == "Franklin Ave")) |>
  left_join(interview, by = c("id" = "person_id")) |>
  select(transcript)
- date (example must use functions from lubridate)
# Lab 5
# Chose an example where I extract the month and year using lubridate!
# I changed my code so I use month(), year(), and ymd() instead of
# filter(str_starts(as.character(date), "2017"))
drivers_license |>
  rename(license_id = id) |>
  filter(
    gender == "female",
    hair_color == "red",
    height >= 65 & height <= 67,
    car_make == "Tesla",
    car_model == "Model S") |>
  left_join(person, by = "license_id") |>
  inner_join(facebook_event_checkin, by = c("id" = "person_id")) |>
  mutate(date = ymd(date)) |>
  filter(year(date) == 2017, month(date) == 12,
         event_name == "SQL Symphony Concert") |>
  group_by(id) |>
  summarise(event_count = n(), .groups = "drop") |>
  filter(event_count == 3) |>
  inner_join(person, by = "id") |>
  left_join(interview, by = c("id" = "person_id")) |>
  select(name, transcript)
WD-4: I can modify existing variables and create new variables in a dataframe for a variety of data types (e.g., numeric, integer, character, factor, date).
- numeric (using as.numeric() is not sufficient)
# Challenge 3, Question 1
# I used to compare question_no to 903 directly,
# but here I modified the code so I refer to the question numbers without
# the leading 90. I also converted question_no back to a double,
# since I had converted it to a string to remove the 90.
teacher_evals_compare = teacher_data |>
  mutate(
    question_no = as.numeric(str_remove(as.character(question_no), "^90")),
    SET_level = if_else(SET_score_avg >= 4, "excellent", "standard"),
    sen_level = if_else(seniority <= 4, "junior", "senior")) |>
  filter(question_no == 3) |>
  select(course_id, SET_level, sen_level)
- character – specifically a string (example must use functions from stringr)
# Lab 5
# Clues:
# The membership number on the bag started with "48Z"
# I was working out last week on January the 9th
# I mutated the check_in_date to include dashes so I could compare the date
# by simply using "2018-01-09"
membership_ids <- get_fit_now_check_in |>
  mutate(
    check_in_date = str_replace_all(check_in_date,
                                    "(\\d{4})(\\d{2})(\\d{2})", "\\1-\\2-\\3")) |>
  filter(
    str_detect(membership_id, "^48Z"),
    check_in_date == "2018-01-09") |>
  select(membership_id)
- factor (example must use functions from forcats)
# Lab 4, Question 6
# Growing comment: Nice work pivoting and modifying the age variable!
# The recode() function is superseded, in favor of case_when() and functions
# in the forcats package.
# What I did: Previously I renamed mc_infant, mc_toddler, and mc_preschool
# to Infant, Toddler, and Preschool using recode(), which is
# superseded! I changed it to case_when() so that
# I am using the current, recommended function.
ca_childcare_long <- ca_childcare_clean |>
  select(study_year, region, mc_infant, mc_toddler, mc_preschool) |>
  # Transform wide to long format
  pivot_longer(cols = starts_with("mc_"),
               # Create a new column "age" from the column names
               names_to = "age",
               # The corresponding values will go in the "price" column
               values_to = "price") |>
  mutate(age = fct_relevel(case_when(
    age == "mc_infant" ~ "Infant",
    age == "mc_toddler" ~ "Toddler",
    age == "mc_preschool" ~ "Preschool"
  ),
  "Infant", "Toddler", "Preschool"))
- date (example must use functions from lubridate)
# Lab 5
# I changed my code so I use month(), year(), and ymd() instead of
# filter(str_starts(as.character(date), "2017"))
# Here I mutated date, so I could then use year() and month() to filter
drivers_license |>
  rename(license_id = id) |>
  filter(
    gender == "female",
    hair_color == "red",
    height >= 65 & height <= 67,
    car_make == "Tesla",
    car_model == "Model S") |>
  left_join(person, by = "license_id") |>
  inner_join(facebook_event_checkin, by = c("id" = "person_id")) |>
  mutate(date = ymd(date)) |>
  filter(year(date) == 2017, month(date) == 12,
         event_name == "SQL Symphony Concert") |>
  group_by(id) |>
  summarise(event_count = n(), .groups = "drop") |>
  filter(event_count == 3) |>
  inner_join(person, by = "id") |>
  left_join(interview, by = c("id" = "person_id")) |>
  select(name, transcript)
WD-5: I can use mutating joins to combine multiple dataframes.
left_join()
# Lab 5
# returns all the rows from the left table and the matching
# rows from the right table
# Get only rows for people we want (include person and interview)
person |>
  filter(
    (address_street_name == "Northwestern Dr" &
       address_number == max(address_number)) |
      (str_detect(name, "Annabel") &
         address_street_name == "Franklin Ave")) |>
  left_join(interview, by = c("id" = "person_id")) |>
  select(transcript)
right_join()
# An example for portfolio:
# returns all the rows from the right table and the matching rows
# from the left table
# To see all interviews, but only have person info for those we want
person |>
  filter(
    (address_street_name == "Northwestern Dr" &
       address_number == max(address_number)) |
      (str_detect(name, "Annabel") &
         address_street_name == "Franklin Ave")) |>
  right_join(interview, by = c("id" = "person_id"))
inner_join()
# lab 5
# returns only the rows where there is a match in both tables
# Growing comment: You need to interview every suspect!
# Added: finished the rest of the lab!
drivers_license |>
  rename(license_id = id) |>
  filter(
    gender == "female",
    hair_color == "red",
    height >= 65 & height <= 67,
    car_make == "Tesla",
    car_model == "Model S"
  ) |>
  left_join(person, by = "license_id") |>
  inner_join(facebook_event_checkin, by = c("id" = "person_id")) |>
  # since date is a double, change to character first
  filter(str_starts(as.character(date), "2017"),
         event_name == "SQL Symphony Concert") |>
  group_by(id) |>
  summarise(event_count = n(), .groups = "drop") |>
  filter(event_count == 3) |>
  inner_join(person, by = "id") |>
  left_join(interview, by = c("id" = "person_id")) |>
  # Confirm new suspect = shouldn't have an interview
  select(name, transcript)
WD-6: I can use filtering joins to filter rows from a dataframe.
semi_join()
# returns rows from one table where there is a match in another table,
# but it does not return any columns from the second table
# includes person info for people we want, no interviews
person |>
  filter(
    (address_street_name == "Northwestern Dr" &
       address_number == max(address_number)) |
      (str_detect(name, "Annabel") &
         address_street_name == "Franklin Ave")) |>
  semi_join(interview, by = c("id" = "person_id"))
anti_join()
# anti join is a type of join that returns rows from one table where there
# are no matches in another table
# All people who don't have an interview
person |>
  anti_join(interview, by = c("id" = "person_id"))
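The difference between the filtering joins above and a mutating join can be sketched on toy data. This is only an illustration: the `people` and `interviews` tables below are hypothetical stand-ins, not the Lab 5 data.

```r
library(dplyr)

# Hypothetical toy tables (assumed for illustration, not the Lab 5 data)
people <- data.frame(id = c(1, 2, 3), name = c("Ana", "Ben", "Cy"))
interviews <- data.frame(person_id = c(1, 1, 3),
                         transcript = c("a", "b", "c"))

# semi_join(): keeps people with at least one interview;
# rows are not duplicated and no interview columns are added
with_interview <- semi_join(people, interviews, by = c("id" = "person_id"))

# anti_join(): keeps people with no matching interview
without_interview <- anti_join(people, interviews, by = c("id" = "person_id"))

# inner_join(): a mutating join, so each match becomes a row
# and the transcript column is added
joined <- inner_join(people, interviews, by = c("id" = "person_id"))
```

Here person 1 has two interviews, so the mutating join repeats that person, while the filtering joins only keep or drop rows of `people`.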
WD-7: I can pivot dataframes from long to wide and vice versa
pivot_longer()
# Lab 4, Question 6
# Growing comment: Nice work pivoting and modifying the age variable!
# The recode() function is superseded, in favor of case_when() and functions
# in the forcats package.
# What I did: Previously I renamed mc_infant, mc_toddler, and mc_preschool
# to Infant, Toddler, and Preschool using recode(), which is
# superseded! I changed it to case_when() so that
# I am using the current, recommended function.
ca_childcare_long <- ca_childcare_clean |>
  select(study_year, region, mc_infant, mc_toddler, mc_preschool) |>
  # Transform wide to long format
  pivot_longer(cols = starts_with("mc_"),
               # Create a new column "age" from the column names
               names_to = "age",
               # The corresponding values will go in the "price" column
               values_to = "price") |>
  mutate(age = fct_relevel(case_when(
    age == "mc_infant" ~ "Infant",
    age == "mc_toddler" ~ "Toddler",
    age == "mc_preschool" ~ "Preschool"),
    "Infant", "Toddler", "Preschool"))
pivot_wider()
# Lab 4, Question 4
# Success comments: I would recommend using the %in% operator instead of the or!
# Nice work learning to drop the groups!
# Nice column names! You could even be more specific and say "Median Income"!
ca_childcare_clean |>
  filter(study_year %in% c(2008, 2018)) |>
  group_by(region, study_year) |>
  summarise(median_income = median(mhi_2018, na.rm = TRUE), .groups = 'drop') |>
  pivot_wider(id_cols = region,
              names_from = study_year,
              values_from = median_income,
              names_prefix = "Median Income ") |>
  arrange(`Median Income 2018`)
Reproducibility
R-1: I can create professional looking, reproducible analyses using RStudio projects, Quarto documents, and the here package.
I’ve done this in the following provided assignments: Lab 2, Lab 3, Challenge 3, Lab 4, Lab 5
R-2: I can write well documented and tidy code.
- Example of ggplot2 plotting
# Lab 4, Question 6
# Growing comments: Nice work changing the size of the x-axis and y-axis text!
# Can you make this change to other aspects of the plot?
# The legend is really large! Great job reordering the legend to go in the same
# order as the lines! The final step is to match the colors and theme I used.
# Personally, I like theme_bw() and the “Accent” palette from the RColorBrewer package.
# I changed the size of my legend to be smaller, as it was massive before.
# This makes the graph more pleasing to look at. Lastly, I added the theme
# theme_bw() and modified the colors. Rather than using the defaults,
# it is important to explore different colors and themes that can make a
# graph more appealing!
#| label: recreate-plot
#| echo: true
#| warning: false
#| message: false
ggplot(data = ca_childcare_long, aes(x = study_year, y = price,
       color = fct_reorder2(.f = region, .x = study_year, .y = price))) +
  geom_smooth(method = "loess", linewidth = 0.5) +
  geom_point(size = 0.8, alpha = 0.5) +
  # creates separate graphs for age_groups
  # each has its own x-axis
  # in one row
  facet_wrap(~ age, scales = "free_x", nrow = 1) +
  labs(title = "Weekly Median Price for Center-Based Childcare ($)",
       x = "Study Year",
       y = "",
       color = "California Region") +
  # adjust axis
  scale_y_continuous(limits = c(100, 500), breaks = seq(100, 500, by = 100)) +
  scale_x_continuous(breaks = seq(2008, 2018, by = 2)) +
  theme_bw() +
  theme(
    # spaces the graphs apart, lines is a unit
    panel.spacing = unit(1, "lines"),
    # change the aspect ratio to make it less tall
    aspect.ratio = 1,
    axis.text.x = element_text(size = 7),
    axis.text.y = element_text(size = 7),
    legend.title = element_text(size = 10),
    legend.text = element_text(size = 8),
    legend.key.size = unit(0.8, "lines")
  ) +
  scale_color_manual(values = colorRampPalette(brewer.pal(8, "Accent"))(10))
- Example of dplyr pipeline
# Lab 4, Question 6
# Growing comment: Nice work pivoting and modifying the age variable!
# The recode() function is superseded, in favor of case_when() and functions
# in the forcats package.
# What I did: Previously I renamed mc_infant, mc_toddler, and mc_preschool
# to Infant, Toddler, and Preschool using recode(), which is
# superseded! I changed it to case_when() so that
# I am using the current, recommended function.
ca_childcare_long <- ca_childcare_clean |>
  select(study_year, region, mc_infant, mc_toddler, mc_preschool) |>
  # Transform wide to long format
  pivot_longer(cols = starts_with("mc_"),
               # Create a new column "age" from the column names
               names_to = "age",
               # The corresponding values will go in the "price" column
               values_to = "price") |>
  mutate(age = fct_relevel(case_when(
    age == "mc_infant" ~ "Infant",
    age == "mc_toddler" ~ "Toddler",
    age == "mc_preschool" ~ "Preschool"
  ),
  "Infant", "Toddler", "Preschool"))
- Example of function formatting
R-3: I can write robust programs that are resistant to changes in inputs.
- Example – any context
# Lab 4, Question 4
# Success comments: I would recommend using the %in% operator instead of the or!
# Nice work learning to drop the groups!
# Nice column names! You could even be more specific and say "Median Income"!
ca_childcare_clean |>
  filter(study_year %in% c(2008, 2018)) |>
  group_by(region, study_year) |>
  summarise(median_income = median(mhi_2018, na.rm = TRUE), .groups = 'drop') |>
  pivot_wider(id_cols = region,
              names_from = study_year,
              values_from = median_income,
              names_prefix = "Median Income ") |>
  arrange(`Median Income 2018`)
- Example of function stops
Data Visualization & Summarization
DVS-1: I can create visualizations for a variety of variable types (e.g., numeric, character, factor, date)
- at least two numeric variables
# Lab 4, Question 7
# Success comments: Nice work removing your y-axis label so people
# don't tilt their head! I would recommend looking into the scales package,
# which provides an easy method for getting $ signs on the plot labels,
# with the label_dollar() function!
ggplot(data = ca_childcare, aes(x = mhi_2018, y = mc_infant)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", color = "steelblue") +
  labs(
    title = "Correlation Between Household Income\nand Center-Based Childcare Costs in California",
    y = "",
    x = "2018 Dollars",
    subtitle = "Median Weekly Price for Infants"
  ) +
  scale_x_continuous(labels = label_dollar()) +
  theme_minimal()
- at least one numeric variable and one categorical variable
# Lab 2, Question 16
ggplot(data = surveys,
       mapping = aes(y = species, x = weight)) +
  geom_boxplot(outliers = FALSE) +
  geom_jitter(color = "steelblue", alpha = 0.2) +
  labs(
    y = "",
    subtitle = "Species",
    x = "Weight (grams)",
    title = "Analyzing Weight Distributions Across Various Rodents")
- at least two categorical variables
# Lab 4, Question 6
# Growing comments: Nice work changing the size of the x-axis and y-axis text!
# Can you make this change to other aspects of the plot?
# The legend is really large! Great job reordering the legend to go in the same
# order as the lines! The final step is to match the colors and theme I used.
# Personally, I like theme_bw() and the “Accent” palette from the RColorBrewer package.
# I changed the size of my legend to be smaller, as it was massive before.
# This makes the graph more pleasing to look at. Lastly, I added the theme
# theme_bw() and modified the colors. Rather than using the defaults,
# it is important to explore different colors and themes that can make a
# graph more appealing!
ggplot(data = ca_childcare_long, aes(x = study_year, y = price,
       color = fct_reorder2(.f = region, .x = study_year, .y = price))) +
  geom_smooth(method = "loess", linewidth = 0.5) +
  geom_point(size = 0.8, alpha = 0.5) +
  # creates separate graphs for age_groups
  # each has its own x-axis
  # in one row
  facet_wrap(~ age, scales = "free_x", nrow = 1) +
  labs(title = "Weekly Median Price for Center-Based Childcare ($)",
       x = "Study Year",
       y = "",
       color = "California Region") +
  # adjust axis
  scale_y_continuous(limits = c(100, 500), breaks = seq(100, 500, by = 100)) +
  scale_x_continuous(breaks = seq(2008, 2018, by = 2)) +
  theme_bw() +
  theme(
    # spaces the graphs apart, lines is a unit
    panel.spacing = unit(1, "lines"),
    # change the aspect ratio to make it less tall
    aspect.ratio = 1,
    axis.text.x = element_text(size = 7),
    axis.text.y = element_text(size = 7),
    legend.title = element_text(size = 10),
    legend.text = element_text(size = 8),
    legend.key.size = unit(0.8, "lines")
  ) +
  scale_color_manual(values = colorRampPalette(brewer.pal(8, "Accent"))(10))
- dates (timeseries plot)
# Lab 4, Question 6
# Growing comments: Nice work changing the size of the x-axis and y-axis text!
# Can you make this change to other aspects of the plot?
# The legend is really large! Great job reordering the legend to go in the same
# order as the lines! The final step is to match the colors and theme I used.
# Personally, I like theme_bw() and the “Accent” palette from the RColorBrewer package.
# I changed the size of my legend to be smaller, as it was massive before.
# This makes the graph more pleasing to look at. Lastly, I added the theme
# theme_bw() and modified the colors. Rather than using the defaults,
# it is important to explore different colors and themes that can make a
# graph more appealing!
ggplot(data = ca_childcare_long, aes(x = study_year, y = price,
       color = fct_reorder2(.f = region, .x = study_year, .y = price))) +
  geom_smooth(method = "loess", linewidth = 0.5) +
  geom_point(size = 0.8, alpha = 0.5) +
  # creates separate graphs for age_groups
  # each has its own x-axis
  # in one row
  facet_wrap(~ age, scales = "free_x", nrow = 1) +
  labs(title = "Weekly Median Price for Center-Based Childcare ($)",
       x = "Study Year",
       y = "",
       color = "California Region") +
  # adjust axis
  scale_y_continuous(limits = c(100, 500), breaks = seq(100, 500, by = 100)) +
  scale_x_continuous(breaks = seq(2008, 2018, by = 2)) +
  theme_bw() +
  theme(
    # spaces the graphs apart, lines is a unit
    panel.spacing = unit(1, "lines"),
    # change the aspect ratio to make it less tall
    aspect.ratio = 1,
    axis.text.x = element_text(size = 7),
    axis.text.y = element_text(size = 7),
    legend.title = element_text(size = 10),
    legend.text = element_text(size = 8),
    legend.key.size = unit(0.8, "lines")
  ) +
  scale_color_manual(values = colorRampPalette(brewer.pal(8, "Accent"))(10))
DVS-2: I use plot modifications to make my visualization clear to the reader.
- I can ensure people don’t tilt their head
# Lab 2, Question 16
# Don't need to tilt head!
ggplot(data = surveys,
       mapping = aes(y = species, x = weight)) +
  geom_boxplot(outliers = FALSE) +
  geom_jitter(color = "steelblue", alpha = 0.2) +
  labs(
    y = "",
    subtitle = "Species",
    x = "Weight (grams)",
    title = "Analyzing Weight Distributions Across Various Rodents")
- I can modify the text in my plot to be more readable
# Lab 4, Question 6
# Growing comments: Nice work changing the size of the x-axis and y-axis text!
# Can you make this change to other aspects of the plot?
# The legend is really large! Great job reordering the legend to go in the same
# order as the lines! The final step is to match the colors and theme I used.
# Personally, I like theme_bw() and the “Accent” palette from the RColorBrewer package.
# I changed the size of my legend to be smaller, as it was massive before.
# This makes the graph more pleasing to look at. Lastly, I added the theme
# theme_bw() and modified the colors. Rather than using the defaults,
# it is important to explore different colors and themes that can make a
# graph more appealing!
ggplot(data = ca_childcare_long, aes(x = study_year, y = price,
       color = fct_reorder2(.f = region, .x = study_year, .y = price))) +
  geom_smooth(method = "loess", linewidth = 0.5) +
  geom_point(size = 0.8, alpha = 0.5) +
  # creates separate graphs for age_groups
  # each has its own x-axis
  # in one row
  facet_wrap(~ age, scales = "free_x", nrow = 1) +
  labs(title = "Weekly Median Price for Center-Based Childcare ($)",
       x = "Study Year",
       y = "",
       color = "California Region") +
  # adjust axis
  scale_y_continuous(limits = c(100, 500), breaks = seq(100, 500, by = 100)) +
  scale_x_continuous(breaks = seq(2008, 2018, by = 2)) +
  theme_bw() +
  theme(
    # spaces the graphs apart, lines is a unit
    panel.spacing = unit(1, "lines"),
    # change the aspect ratio to make it less tall
    aspect.ratio = 1,
    axis.text.x = element_text(size = 7),
    axis.text.y = element_text(size = 7),
    legend.title = element_text(size = 10),
    legend.text = element_text(size = 8),
    legend.key.size = unit(0.8, "lines")
  ) +
  scale_color_manual(values = colorRampPalette(brewer.pal(8, "Accent"))(10))
- I can reorder my legend to align with the colors in my plot
# Lab 4, Question 6
# Growing comments: Nice work changing the size of the x-axis and y-axis text!
# Can you make this change to other aspects of the plot?
# The legend is really large! Great job reordering the legend to go in the same
# order as the lines! The final step is to match the colors and theme I used.
# Personally, I like theme_bw() and the “Accent” palette from the RColorBrewer package.
# I changed the size of my legend to be smaller, as it was massive before.
# This makes the graph more pleasing to look at. Lastly, I added the theme
# theme_bw() and modified the colors. Rather than using the defaults,
# it is important to explore different colors and themes that can make a
# graph more appealing!
ggplot(data = ca_childcare_long, aes(x = study_year, y = price,
       color = fct_reorder2(.f = region, .x = study_year, .y = price))) +
  geom_smooth(method = "loess", linewidth = 0.5) +
  geom_point(size = 0.8, alpha = 0.5) +
  # creates separate graphs for age_groups
  # each has its own x-axis
  # in one row
  facet_wrap(~ age, scales = "free_x", nrow = 1) +
  labs(title = "Weekly Median Price for Center-Based Childcare ($)",
       x = "Study Year",
       y = "",
       color = "California Region") +
  # adjust axis
  scale_y_continuous(limits = c(100, 500), breaks = seq(100, 500, by = 100)) +
  scale_x_continuous(breaks = seq(2008, 2018, by = 2)) +
  theme_bw() +
  theme(
    # spaces the graphs apart, lines is a unit
    panel.spacing = unit(1, "lines"),
    # change the aspect ratio to make it less tall
    aspect.ratio = 1,
    axis.text.x = element_text(size = 7),
    axis.text.y = element_text(size = 7),
    legend.title = element_text(size = 10),
    legend.text = element_text(size = 8),
    legend.key.size = unit(0.8, "lines")
  ) +
  scale_color_manual(values = colorRampPalette(brewer.pal(8, "Accent"))(10))
DVS-3: I show creativity in my visualizations
- I can use non-standard colors
# Lab 4, Question 6
# Growing comments: Nice work changing the size of the x-axis and y-axis text!
# Can you make this change to other aspects of the plot?
# The legend is really large! Great job reordering the legend to go in the same
# order as the lines! The final step is to match the colors and theme I used.
# Personally, I like theme_bw() and the “Accent” palette from the RColorBrewer package.
# I changed the size of my legend to be smaller, as it was massive before.
# This makes the graph more pleasing to look at. Lastly, I added the theme
# theme_bw() and modified the colors. Rather than using the defaults,
# it is important to explore different colors and themes that can make a
# graph more appealing!
ggplot(data = ca_childcare_long, aes(x = study_year, y = price,
       color = fct_reorder2(.f = region, .x = study_year, .y = price))) +
  geom_smooth(method = "loess", linewidth = 0.5) +
  geom_point(size = 0.8, alpha = 0.5) +
  # creates separate graphs for age_groups
  # each has its own x-axis
  # in one row
  facet_wrap(~ age, scales = "free_x", nrow = 1) +
  labs(title = "Weekly Median Price for Center-Based Childcare ($)",
       x = "Study Year",
       y = "",
       color = "California Region") +
  # adjust axis
  scale_y_continuous(limits = c(100, 500), breaks = seq(100, 500, by = 100)) +
  scale_x_continuous(breaks = seq(2008, 2018, by = 2)) +
  theme_bw() +
  theme(
    # spaces the graphs apart, lines is a unit
    panel.spacing = unit(1, "lines"),
    # change the aspect ratio to make it less tall
    aspect.ratio = 1,
    axis.text.x = element_text(size = 7),
    axis.text.y = element_text(size = 7),
    legend.title = element_text(size = 10),
    legend.text = element_text(size = 8),
    legend.key.size = unit(0.8, "lines")
  ) +
  scale_color_manual(values = colorRampPalette(brewer.pal(8, "Accent"))(10))
- I can use annotations
# Challenge 2, Hot
# Originally I didn't do this challenge, I added it now for the portfolio
# Fixed from first midterm check in
# Made it more efficient using map2 instead of repeating annotate()!
# https://chatgpt.com/share/674eb08f-bd5c-8006-8b97-c61b5ae788cd
labels <- c("Neotoma", "Chaetodipus", "Peromyscus", "Perognathus",
            "Reithrodontomys", "Sigmodon", "Onychomys", "Peromyscus",
            "Reithrodontomys", "Dipodomys", "Dipodomys", "Chaetodipus",
            "Dipodomys", "Onychomys")
x_positions <- 1:14
ggplot(data = surveys,
       mapping = aes(x = species, y = weight, color = genus)) +
  geom_boxplot() +
  labs(
    x = "",
    subtitle = "Species",
    y = "Weight (grams)",
    title = "Analyzing Weight Distributions Across Various Rodents") +
  coord_flip() +
  theme(legend.position = "none") +
  scale_color_manual(values = cdPalette_grey) +
  map2(x_positions, labels, ~ annotate("text", x = .x, y = 250, label = .y))
- I can be creative…
DVS-4: I can calculate numerical summaries of variables.
- Example using summarize()
# lab 3, Question 10
# Success comment: This suggests there is only *one* max and one min.
# Is that the case? Are there any ties?
# Growing/Reflect: Before, I used slice() and ordered the averages so I could
# take the first and last rows (min/max). However, I did not consider
# ties. Now, I compute the average SET_score_avg per professor and return
# all professors whose average equals the overall min or max.
# This is an improvement because, rather than seeing just one professor with
# the min and one with the max average score, I can see them all (accounting for ties).
teacher_evals_clean |>
  group_by(teacher_id) |>
  filter(question_no == 901) |>
  summarize(avg = mean(SET_score_avg, na.rm = TRUE)) |>
  filter(avg == min(avg) | avg == max(avg))
- Example using across()
# Used data from lab 3 to demonstrate across!
teacher_evals_clean |>
  filter(question_no == 901) |>
  group_by(teacher_id) |>
  summarize(
    across(.cols = c(no_participants, resp_share, SET_score_avg),
           .fns = mean,
           .names = "{.col}_avg"))
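The tie-handling point in the summarize() example can be sketched on toy data. This is a minimal illustration, assuming a hypothetical `scores` table rather than the Lab 3 evaluations: slicing the first and last rows returns one row per extreme, while filtering on the min and max keeps every tied row.

```r
library(dplyr)

# Hypothetical toy scores (assumed for illustration, not the Lab 3 data)
scores <- data.frame(
  teacher_id = c("t1", "t1", "t2", "t3", "t4"),
  score = c(2, 4, 3, 5, 5)
)

# One average per teacher
averages <- scores |>
  group_by(teacher_id) |>
  summarize(avg = mean(score))

# Slice-based approach: exactly one row for the min and one for the max,
# even when several teachers are tied
extremes_sliced <- averages |>
  arrange(avg) |>
  slice(c(1, n()))

# Filter-based approach: every teacher tied at the min or the max
extremes_all <- averages |>
  filter(avg == min(avg) | avg == max(avg))
```

In this toy table two teachers share the lowest average and two share the highest, so the two approaches return different numbers of rows.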
DVS-5: I can find summaries of variables across multiple groups.
- Example 1
# Lab 5
# Growing comment: You need to interview every suspect!
# Added the rest of the lab!
drivers_license |>
  rename(license_id = id) |>
  filter(
    gender == "female",
    hair_color == "red",
    height >= 65 & height <= 67,
    car_make == "Tesla",
    car_model == "Model S"
  ) |>
  left_join(person, by = "license_id") |>
  inner_join(facebook_event_checkin, by = c("id" = "person_id")) |>
  # since date is a double, change to character first
  filter(str_starts(as.character(date), "2017"),
         event_name == "SQL Symphony Concert") |>
  group_by(id) |>
  summarise(event_count = n(), .groups = "drop") |>
  filter(event_count == 3) |>
  inner_join(person, by = "id") |>
  left_join(interview, by = c("id" = "person_id")) |>
  # Confirm new suspect = shouldn't have an interview
  select(name, transcript)
- Example 2
# Lab 3, Question 9
# Success comment: I strongly recommend against nested functions, as they are
# difficult for people to understand what your code is doing. Having two
# lines is not less efficient and is more readable.
teacher_evals_clean |>
  group_by(teacher_id, course_id) |>
  summarize(num_questions = n_distinct(question_no)) |>
  filter(num_questions == 9)
DVS-6: I can create tables which make my summaries clear to the reader.
- Example 1
# Lab 4, Question 4
# Success comments: I would recommend using the %in% operator instead of the or!
# Nice work learning to drop the groups!
# Nice column names! You could even be more specific and say "Median Income"!
ca_childcare_clean |>
  filter(study_year %in% c(2008, 2018)) |>
  group_by(region, study_year) |>
  summarise(median_income = median(mhi_2018, na.rm = TRUE), .groups = 'drop') |>
  pivot_wider(id_cols = region,
              names_from = study_year,
              values_from = median_income,
              names_prefix = "Median Income ") |>
  arrange(`Median Income 2018`)
- Example 2
# Lab 3, Question 12
# Success comments: This suggests there is only *one* max and one min.
# Is that the case? Are there any ties? If you want both conditions to be
# satisfied in a filter() you can use a comma to separate them!
# I would recommend using the %in% operator instead of the or!
# I also added better names!
teacher_evals_clean |>
  group_by(teacher_id) |>
  filter(sex == "female", academic_degree %in% c("dr", "prof")) |>
  summarize(avg = mean(resp_share, na.rm = TRUE)) |>
  filter(avg == min(avg) | avg == max(avg)) |>
  rename(`Female Teacher` = teacher_id,
         `Average Response Rate (Min/ Max)` = avg)
DVS-7: I show creativity in my tables.
- Example 1
- Example 2
Program Efficiency
PE-1: I can write concise code which does not repeat itself.
- using a single function call with multiple inputs (rather than multiple function calls)
# Lab 5
# Growing comment: You need to interview every suspect!
# Added the rest of the lab!
drivers_license |>
  rename(license_id = id) |>
  filter(
    gender == "female",
    hair_color == "red",
    height >= 65 & height <= 67,
    car_make == "Tesla",
    car_model == "Model S"
  ) |>
  left_join(person, by = "license_id") |>
  inner_join(facebook_event_checkin, by = c("id" = "person_id")) |>
  # since date is a double, change to character first
  filter(str_starts(as.character(date), "2017"),
         event_name == "SQL Symphony Concert") |>
  group_by(id) |>
  summarise(event_count = n(), .groups = "drop") |>
  filter(event_count == 3) |>
  inner_join(person, by = "id") |>
  left_join(interview, by = c("id" = "person_id")) |>
  # Confirm new suspect = shouldn't have an interview
  select(name, transcript)
across()
# Lab 3, Question 5
# Success comment: I'd encourage you to be more consistent with your function
# syntax. The first syntax for the across() function is spot on, but the second
# one lacks the details you included in the first.
teacher_evals_clean = teacher_data |>
  rename(sex = gender) |>
  filter(no_participants >= 10) |>
  select(course_id, teacher_id, question_no, no_participants,
         resp_share, SET_score_avg, percent_failed_cur,
         academic_degree, seniority, sex) |>
  mutate(
    across(.cols = course_id:teacher_id, .fns = ~ as.character(.x)),
    across(.cols = c(academic_degree, sex), .fns = ~ as.factor(.x))
  )
map() functions
PE-2: I can write functions to reduce repetition in my code.
- Function that operates on vectors
- Function that operates on data frames
PE-3: I can use iteration to reduce repetition in my code.
across()
map() function with one input (e.g., map(), map_chr(), map_dbl(), etc.)
map() function with more than one input (e.g., map2() or pmap())
PE-4: I can use modern tools when carrying out my analysis.
- I can use functions which are not superseded or deprecated
# Lab 4, Question 6
# Growing comment: Nice work pivoting and modifying the age variable!
# The recode() function is superseded, in favor of case_when() and functions
# in the forcats package.
# What I did: Before, I changed the names of mc_infant, mc_toddler, mc_preschool
# to Infant, Toddler, Preschool using recode(), which is
# superseded! I changed it to case_when() so that
# I am using the updated, appropriate function.
ca_childcare_long <- ca_childcare_clean |>
  select(study_year, region, mc_infant, mc_toddler, mc_preschool) |>
  # Transform wide to long format
  pivot_longer(cols = starts_with("mc_"),
               # Create a new column "age" from the column names
               names_to = "age",
               # The corresponding values will go in the "price" column
               values_to = "price") |>
  mutate(age = case_when(
           age == "mc_infant" ~ "Infant",
           age == "mc_toddler" ~ "Toddler",
           age == "mc_preschool" ~ "Preschool"),
         age = fct_relevel(age, "Infant", "Toddler", "Preschool")
  )
ca_childcare_long
- I can connect a data wrangling pipeline into a ggplot()
# Lab 4, Question 6
# Growing comments: Nice work changing the size of the x-axis and y-axis text!
# Can you make this change to other aspects of the plot?
# The legend is really large! Great job reordering the legend to go in the same
# order as the lines! The final step is to match the colors and theme I used.
# Personally, I like theme_bw() and the “Accent” palette from the RColorBrewer package.
# I changed the size of my legend to be smaller, as it was massive before.
# This makes the graph more pleasing to look at. Lastly, I added the theme
# theme_bw() and modified the colors. Rather than using the defaults,
# it is important to explore different colors and themes that can make a
# graph more appealing!
# I combined the pipeline with the ggplot!
ca_childcare_clean |>
  select(study_year, region, mc_infant, mc_toddler, mc_preschool) |>
  # Transform wide to long format
  pivot_longer(cols = starts_with("mc_"),
               # Create a new column "age" from the column names
               names_to = "age",
               # The corresponding values will go in the "price" column
               values_to = "price") |>
  mutate(age = fct_relevel(case_when(
    age == "mc_infant" ~ "Infant",
    age == "mc_toddler" ~ "Toddler",
    age == "mc_preschool" ~ "Preschool"
  ), "Infant", "Toddler", "Preschool")) |>
  ggplot(aes(x = study_year, y = price,
             color = fct_reorder2(.f = region, .x = study_year, .y = price))) +
  geom_smooth(method = "loess", linewidth = 0.5) +
  geom_point(size = 0.8, alpha = 0.5) +
  # creates separate graphs for age_groups
  # each has its own x-axis
  # in one row
  facet_wrap(~ age, scales = "free_x", nrow = 1) +
  labs(title = "Weekly Median Price for Center-Based Childcare ($)",
       x = "Study Year",
       y = "",
       color = "California Region") +
  # adjust axis
  scale_y_continuous(limits = c(100, 500), breaks = seq(100, 500, by = 100)) +
  scale_x_continuous(breaks = seq(2008, 2018, by = 2)) +
  theme_bw() +
  theme(
    # spaces the graphs apart, lines is a unit
    panel.spacing = unit(1, "lines"),
    # change the aspect ratio to make it less tall
    aspect.ratio = 1,
    axis.text.x = element_text(size = 7),
    axis.text.y = element_text(size = 7),
    legend.title = element_text(size = 10),
    legend.text = element_text(size = 8),
    legend.key.size = unit(0.8, "lines")
  ) +
  scale_color_manual(values = colorRampPalette(brewer.pal(8, "Accent"))(10))
Data Simulation & Statistical Models
DSSM-1: I can simulate data from a variety of probability models.
- Example 1
- Example 2
DSSM-2: I can conduct common statistical analyses in R.
- Example 1
# Lab 4, Question 8
# Linear Regression
reg_mod1 <- lm(mc_infant ~ mhi_2018, data = ca_childcare)
summary(reg_mod1)
- Example 2
# Challenge 3, Question 3
chisq.test(teacher_evals_compare$SET_level,
           teacher_evals_compare$sen_level)
Revising My Thinking
Throughout the course, I revised all of my growing areas and worked to incorporate the success comments into newer assignments. In every chunk where I revised code, I wrote comments stating what I changed and reflecting on how the new code is better, making sure to focus on the “bigger picture.” This was sometimes challenging because most of my growing areas were small fixes, but I still made sure to reflect. Overall, fixing and reflecting on my growing areas was very valuable because it helped me remember to carry those changes into the following labs. In my portfolio, I revised code chunks to reflect the changes that the growing and success comments suggested.
Extending My Thinking
To extend my thinking, I try to think about which functions would be most helpful and efficient for each particular situation. Essentially, I gather what I learned in class and think critically as I apply it to my assignments. Also, if I was curious or wanted to learn more, I would Google to discover new information (e.g., finding new color palettes and themes). In the code examples in my portfolio, I made sure to consider all the growing and success comments on my assignments, improving my code to be tidy and efficient. I also make a big effort to do my best on every lab and challenge to produce quality documents.
Peer Support & Collaboration
Here is my peer review from Lab 3. I have done a peer review every week, but I particularly like this one. In this review, I discussed specific questions and gave advice on how the code could be improved. I made sure to comment on what was good about the code too!
“Hey! Good job on adding a table of contents and code folding options. This makes your overall lab have a cleaner look and easier to follow.
In question 7, you add some extra spacing and indenting that is unnecessary and makes the code look a little messy. Also, you can be more specific when using if_any(). For example, try adding named arguments and correct function syntax. if_any(.cols = everything(), .fns = ~ is.na(.x)).
For questions 10-12, you have some extra spacing and indentation that made it hard to read / follow the code. Also, you repeat the main chunk of the code twice, I think you could have combined these to get the min / max from one code chunk to be more efficient. I also couldn’t see the output of your code in most of these.
Overall, your results seem great! Make sure to not repeat big chunks of code, always display the output, and remove extra spacing/ indents.”
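The if_any() suggestion in the review above can be sketched as follows, assuming dplyr is loaded. The toy data frame and its column names are hypothetical, made up for illustration, and do not come from the lab.

```r
library(dplyr)

# Hypothetical toy data (not from the lab)
toy_evals <- data.frame(
  id    = 1:3,
  score = c(4.5, NA, 3.9),
  dept  = c("math", "stat", NA)
)

# Keep every row that has a missing value in at least one column,
# using named arguments and the full lambda syntax
rows_with_na <- toy_evals |>
  filter(if_any(.cols = everything(), .fns = ~ is.na(.x)))

rows_with_na
```

Naming .cols and .fns makes it explicit which columns are checked and which function is applied to each one.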
During pair programming, I made sure to follow the programmer vs. coder routines. When it was my partner’s turn to type, I would tell them what to write. When it was my turn to type, I listened to my partner and made sure not to just type what I wanted. If we were stuck and neither of us knew what to do, we would collaborate and problem solve together. I also raised my hand to ask for help when my partner and I couldn’t figure out what to do.