library(tidyverse)
library(moderndive)
library(openintro)
<- evals |>
evals mutate(large_class = if_else(cls_students > 100,
"large class",
"regular class"),
eval_completion = cls_did_eval / cls_students
|>
) select(-cls_did_eval,
-cls_students,
-prof_id,
-course_id,
-bty_f1lower,
-bty_f1upper,
-bty_f2upper,
-bty_m1lower,
-bty_m1upper,
-bty_m2upper)
Lab 6: Predicting Professor Evaluation Scores
Your Challenge
This week you have learned about model selection. During class you worked on performing a backward selection process to determine the “best” model for penguin body mass.
Today, you are going to use forward selection to determine the “best” model for professor’s evaluation score. This task will require you to fit tons of linear regressions. You must be able to show me exactly how you got to your top model. Meaning, I need to see a record of every model you fit and compared along the way.
Forward Selection
The forward selection process starts with a model with no predictor variables. That means, this model predicts the same mean evaluation score for every professor. I’ve fit this model for you below!
<- lm(score ~ 1, data = evals) one_mean
You can pull out the adjusted \(R^2\) for this model using the get_regression_summaries()
function.
get_regression_summaries(one_mean)
Based on this output, we are starting with a really low adjusted \(R^2\). So, things can only get better from here! 🙃
Step 1
Rules: You can only add a variable to the model if it improves the adjusted \(R^2\) by at least 2% (0.02).
Alright, so now we get cooking. The next step is to fit every model with one explanatory variable. I’ve provided a list of every explanatory variable you are allowed to consider!
rank
– rank of professorethnicity
– ethnicity of the professorgender
– gender of the professorlanguage
– language of school where professor received educationage
– age of the professorcls_perc_eval
– the percentage of students who completed the evaluationcls_level
– class levelcls_profs
– number of professors teaching sections in course: single, multiplecls_credits
– credits of class: one credit (lab, PE, etc.), multi creditbty_avg
– average beauty rating of the professorpic_outfit
– outfit of professor in picturepic_color
– color of professor’s picturelarge_class
– whether the class had over 100 studentseval_completion
– proportion of students who completed the evaluation
Woof, that’s 14 different variables. That means, for this first round, you will need to compare the adjusted \(R^2\) for 12 different models to decide what variable should be added.
Every model you fit will have the same format:
name_of_model <- lm(score ~ <variable>, data = evals)
But, the name of the model will need to change. I’ve started the process for you, using the naming style of one_
followed by the variable name (e.g., one_id
, one_bty
, etc.).
<- lm(score ~ rank, data = evals)
one_rank <- lm(score ~ ethnicity, data = evals)
one_ethnicity <- lm(score ~ gender, data = evals)
one_gender <- lm(score ~ language, data = evals)
one_language <- lm(score ~ age, data = evals)
one_age <- lm(score ~ cls_perc_eval, data = evals)
one_perc_eval <- lm(score ~ cls_level, data = evals)
one_level
## Now, you need to fit the other seven models!
Alright, now that you’ve fit the models, you need to inspect the adjusted \(R^2\) values to see which of these 14 models is the “top” model – the model with the highest adjusted \(R^2\)! Similar to before, I’ve provided you with some code to get you started, but you need to write the remaining code.
get_regression_summaries(one_rank)
get_regression_summaries(one_ethnicity)
get_regression_summaries(one_gender)
get_regression_summaries(one_language)
get_regression_summaries(one_age)
get_regression_summaries(one_perc_eval)
get_regression_summaries(one_level)
## Now, you need to compare the other seven models!
1. What model was your top model? Specifically, which variable was selected to be included?
Step 2 - Adding a Second Variable
Alright, you’ve added one variable, the next step is to decide if you should add a second variable. This process looks nearly identical to the previous step, with one major change: every model you fit needs to contain the variable you decided to add. So, if you decided to add the bty_avg
variable, every model you fit would look like this:
name_of_model <- lm(score ~ bty_avg + <variable>, data = evals)
Again, the name of the model will need to change. This round, you are on your own – I’ve provided you with no code. Here are my recommendations:
- name each model
two_
followed by the names of both variables included in the model (e.g.,two_bty_id
) - go through each variable step-by-step just like you did before
## Code to fit all 13 models that add a second variable to your top model goes here!
Alright, now you should have 13 more models to compare! Like before, you need to inspect the adjusted \(R^2\) values to see which of these 13 models is the “top” model.
Rules: You can only add a variable to the model if it improves adjusted \(R^2\) by at least 2% (0.02) from the model you chose in Question 1.
## Code to compare all 13 models you fit goes here!
2. What model was your top model? State which variables are included in the model you chose!
Step 3 - Adding a Third Variable
As you might have expected, in this step we add a third variable to our top model from the previous step. This process should be getting familiar at this point!
This process of fitting 12-14 models at a time is getting rather tedious! So, I’ve written some code that will carry out this process for us in one pipeline! The only thing you need to change is:
add in the names of the variables you selected in Steps 1 & 2 in the
~lm(score ~ .x + <VARIABLES SELECTED>, data = evals)
stepadd in the names of the variables you selected in Steps 1 & 2 in the
select(-ID, -score, -<VARIABLE 1 SELECTED>, -<VARIABLE 2 SELECTED>)
step
For example, if you chose gender
and age
in Steps 1 and 2, your code on the first line would look like map(.f = ~lm(score ~ .x + gender + age, data = evals)) %>%
and your code on the fourth line would look like select(-score, -gender, -age) %>%
.
Your turn!
## Change the <VARIABLES SELECTED> in line 2 to the names of the variables you selected in Steps 1 & 2
## Change the <VARIABLE 1 SELECTED> and <VARIABLE 2 SELECTED> in line 4 to the names of the variables you selected in Steps 1 & 2
%>%
evals map(.f = ~lm(score ~ .x + <VARIABLES SELECTED>, data = evals)) %>%
map_df(.f = ~get_regression_summaries(.x)$adj_r_squared) %>%
select(-score,
-<VARIABLE 1 SELECTED>,
-<VARIABLE 2 SELECTED>
%>%
) pivot_longer(cols = everything(),
names_to = "variable",
values_to = "adj_r_sq") %>%
slice_max(adj_r_sq)
The output of this code is the variable that has the highest adjusted \(R^2\). Compare this value to the value of your “top” model from Step 2 and see if it improved adjusted \(R^2\) by at least 2% (0.02). If so, this variable should be added. If not, then your model from Step 2 is the “best” model!
3. What model was your top model? State which variables are included in the model you chose!
Step 4 - Adding a Fourth Variable
If you decided to add a variable in Step 3, then you keep going! If you didn’t add a variable in Step 3, then you stop!
## Change the <VARIABLES SELECTED> in line 2 to the names of the variables you selected in Steps 1, 2, & 3
## Change the <VARIABLE 1 SELECTED> and <VARIABLE 2 SELECTED> in line 4 to the names of the variables you selected in Steps 1, 2 & 3
%>%
evals map(.f = ~lm(score ~ .x + <VARIABLES SELECTED>, data = evals)) %>%
map_df(.f = ~get_regression_summaries(.x)$adj_r_squared) %>%
select(-score,
-<VARIABLE 1 SELECTED>,
-<VARIABLE 2 SELECTED>,
-<VARIABLE 3 SELECTED>
%>%
) pivot_longer(cols = everything(),
names_to = "variable",
values_to = "adj_r_sq") %>%
slice_max(adj_r_sq)
The output of this code is the variable that has the highest adjusted \(R^2\). Compare this value to the value of your “top” model from Step 3 and see if it improved adjusted \(R^2\) by at least 2% (0.02). If so, this variable should be added. If not, then your model from Step 3 is the “best” model!
4. What model was your top model? You must state which variables are included in the model you chose!
Step 5 - Adding a Fifth Variable
If you decided to add a variable in Step 4, then you keep going! If you didn’t add a variable in Step 4, then you stop!
## Change the <VARIABLES SELECTED> in line 2 to the names of the variables you selected in Steps 1, 2, 3 & 4
## Change the <VARIABLE 1 SELECTED> and <VARIABLE 2 SELECTED> in line 4 to the names of the variables you selected in Steps 1, 2, 3 & 4
%>%
evals map(.f = ~lm(score ~ .x + <VARIABLES SELECTED>, data = evals)) %>%
map_df(.f = ~get_regression_summaries(.x)$adj_r_squared) %>%
select(-score,
-<VARIABLE 1 SELECTED>,
-<VARIABLE 2 SELECTED>,
-<VARIABLE 3 SELECTED>,
-<VARIABLE 4 SELECTED>
%>%
) pivot_longer(cols = everything(),
names_to = "variable",
values_to = "adj_r_sq") %>%
slice_max(adj_r_sq)
The output of this code is the variable that has the highest adjusted \(R^2\). Compare this value to the value of your “top” model from Step 4 and see if it improved adjusted \(R^2\) by at least 2% (0.02). If so, this variable should be added. If not, then your model from Step 4 is the “best” model!
5. What model was your top model? You must state which variables are included in the model you chose!
Comparing with the step()
Function
Let’s check the forward selection model you found with what model the step()
function decides is best. Run the code chunk below to obtain the “best” model chosen by this function.
<- lm(score ~ ., data = evals)
full_model step(full_model, direction = "forward")
- Did the
step()
function choose the same model as you? If your “best” models do not not agree, why do you think this might have happened? Hint: The output afterCall:
tells you what final model was chosen