Lab 6: Predicting Professor Evaluation Scores

Author

Your group’s names here!

library(tidyverse)
library(moderndive)
library(openintro)

evals <- evals |> 
  # flag classes with more than 100 students and compute the
  # proportion of enrolled students who completed the evaluation
  mutate(large_class = if_else(cls_students > 100, 
                               "large class", 
                               "regular class"), 
         eval_completion = cls_did_eval / cls_students 
         ) |> 
  # drop the raw counts, the ID columns, and the individual
  # beauty ratings (we keep only bty_avg)
  select(-cls_did_eval, 
         -cls_students, 
         -prof_id,
         -course_id, 
         -bty_f1lower, 
         -bty_f1upper, 
         -bty_f2upper, 
         -bty_m1lower, 
         -bty_m1upper, 
         -bty_m2upper)
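
If you want a quick look at the variables that remain after this cleaning, glimpse() (loaded with the tidyverse) prints one line per column:

## peek at the prepared data: one line per remaining variable
glimpse(evals)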

Your Challenge

This week you have learned about model selection. During class you worked on performing a backward selection process to determine the “best” model for penguin body mass.

Today, you are going to use forward selection to determine the “best” model for a professor’s evaluation score. This task will require you to fit lots of linear regressions. You must be able to show me exactly how you got to your top model – meaning, I need to see a record of every model you fit and compared along the way.

Forward Selection

The forward selection process starts with a model with no predictor variables. That means this model predicts the same mean evaluation score for every professor. I’ve fit this model for you below!

one_mean <- lm(score ~ 1, data = evals)

You can pull out the adjusted \(R^2\) for this model using the get_regression_summaries() function.

get_regression_summaries(one_mean)

Based on this output, we are starting with a really low adjusted \(R^2\). So, things can only get better from here! 🙃
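
If you want just the adjusted \(R^2\) by itself – handy once you start comparing many models – you can pull that single value out of the summary data frame:

## extract only the adjusted R-squared from the summary data frame
get_regression_summaries(one_mean)$adj_r_squared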

Step 1

Rules: You can only add a variable to the model if it improves the adjusted \(R^2\) by at least 2% (0.02).

Alright, so now we get cooking. The next step is to fit every model with one explanatory variable. I’ve provided a list of every explanatory variable you are allowed to consider!

  • rank – rank of professor
  • ethnicity – ethnicity of the professor
  • gender – gender of the professor
  • language – language of school where professor received education
  • age – age of the professor
  • cls_perc_eval – the percentage of students who completed the evaluation
  • cls_level – class level
  • cls_profs – number of professors teaching sections in course: single, multiple
  • cls_credits – credits of class: one credit (lab, PE, etc.), multi credit
  • bty_avg – average beauty rating of the professor
  • pic_outfit – outfit of professor in picture
  • pic_color – color of professor’s picture
  • large_class – whether the class had over 100 students
  • eval_completion – proportion of students who completed the evaluation

Woof, that’s 14 different variables. That means, for this first round, you will need to compare the adjusted \(R^2\) for 14 different models to decide what variable should be added.

Every model you fit will have the same format:

name_of_model <- lm(score ~ <variable>, data = evals)

But, the name of the model will need to change. I’ve started the process for you, using the naming style of one_ followed by the variable name (e.g., one_rank, one_bty).

one_rank <- lm(score ~ rank, data = evals)
one_ethnicity <- lm(score ~ ethnicity, data = evals)
one_gender <- lm(score ~ gender, data = evals)
one_language <- lm(score ~ language, data = evals)
one_age <- lm(score ~ age, data = evals)
one_perc_eval <- lm(score ~ cls_perc_eval, data = evals)
one_level <- lm(score ~ cls_level, data = evals)

## Now, you need to fit the other seven models! 

Alright, now that you’ve fit the models, you need to inspect the adjusted \(R^2\) values to see which of these 14 models is the “top” model – the model with the highest adjusted \(R^2\)! Similar to before, I’ve provided you with some code to get you started, but you need to write the remaining code.

get_regression_summaries(one_rank)
get_regression_summaries(one_ethnicity)
get_regression_summaries(one_gender)
get_regression_summaries(one_language)
get_regression_summaries(one_age)
get_regression_summaries(one_perc_eval)
get_regression_summaries(one_level)

## Now, you need to compare the other seven models! 
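
If typing out fourteen get_regression_summaries() calls feels tedious, here is an optional sketch that gathers every adjusted \(R^2\) into one sorted tibble. It assumes the seven models above plus the seven you fit yourself; map_dbl() and enframe() are loaded with the tidyverse.

## optional: collect the adjusted R-squared of each one-variable model
## into a single tibble, sorted from best to worst
list(rank = one_rank, 
     ethnicity = one_ethnicity, 
     gender = one_gender, 
     language = one_language, 
     age = one_age, 
     perc_eval = one_perc_eval, 
     level = one_level) |>    # add your other seven models here!
  map_dbl(~ get_regression_summaries(.x)$adj_r_squared) |> 
  enframe(name = "variable", value = "adj_r_sq") |> 
  arrange(desc(adj_r_sq))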

1. What model was your top model? Specifically, which variable was selected to be included?

Step 2 - Adding a Second Variable

Alright, you’ve added one variable; the next step is to decide if you should add a second variable. This process looks nearly identical to the previous step, with one major change: every model you fit needs to contain the variable you decided to add. So, if you decided to add the bty_avg variable, every model you fit would look like this:

name_of_model <- lm(score ~ bty_avg + <variable>, data = evals)

Again, the name of the model will need to change. This round, you are on your own – I’ve provided no completed code, though there is a short illustrative sketch after the chunk below. Here are my recommendations:

  • name each model two_ followed by the names of both variables included in the model (e.g., two_bty_rank)
  • go through each variable step-by-step just like you did before
## Code to fit all 13 models that add a second variable to your top model goes here!
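
For illustration, here is what the first few of these models might look like, assuming (hypothetically) that bty_avg was the variable you selected in Step 1:

## hypothetical sketch: assumes bty_avg was your Step 1 variable
two_bty_rank <- lm(score ~ bty_avg + rank, data = evals)
two_bty_ethnicity <- lm(score ~ bty_avg + ethnicity, data = evals)
two_bty_gender <- lm(score ~ bty_avg + gender, data = evals)
## ...and so on for the remaining ten variables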

Alright, now you should have 13 more models to compare! Like before, you need to inspect the adjusted \(R^2\) values to see which of these 13 models is the “top” model.

Rules: You can only add a variable to the model if it improves adjusted \(R^2\) by at least 2% (0.02) from the model you chose in Question 1.

## Code to compare all 13 models you fit goes here!
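
Continuing the hypothetical bty_avg example, the comparisons follow the same pattern as Step 1 (the two_bty_ names are placeholders for whatever you actually named your models):

## hypothetical comparison: swap in your own model names
get_regression_summaries(two_bty_rank)
get_regression_summaries(two_bty_ethnicity)
get_regression_summaries(two_bty_gender)
## ...and so on for the rest of your 13 models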

2. What model was your top model? State which variables are included in the model you chose!

Step 3 - Adding a Third Variable

As you might have expected, in this step we add a third variable to our top model from the previous step. This process should be getting familiar at this point!

This process of fitting 12-14 models at a time is getting rather tedious! So, I’ve written some code that will carry out this process for us in one pipeline! The only things you need to change are:

  • add in the names of the variables you selected in Steps 1 & 2 in the ~lm(score ~ .x + <VARIABLES SELECTED>, data = evals) step

  • add in the names of the variables you selected in Steps 1 & 2 in the select(-score, -<VARIABLE 1 SELECTED>, -<VARIABLE 2 SELECTED>) step

For example, if you chose gender and age in Steps 1 and 2, your map() step would look like map(.f = ~lm(score ~ .x + gender + age, data = evals)) %>% and your select() step would look like select(-score, -gender, -age) %>%.
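
Concretely, under that hypothetical gender-and-age scenario, the filled-in pipeline would read:

## hypothetical filled-in pipeline: assumes gender and age were
## selected in Steps 1 & 2
evals %>% 
  map(.f = ~lm(score ~ .x + gender + age, data = evals)) %>% 
  map_df(.f = ~get_regression_summaries(.x)$adj_r_squared) %>% 
  select(-score, 
         -gender, 
         -age) %>% 
  pivot_longer(cols = everything(), 
               names_to = "variable", 
               values_to = "adj_r_sq") %>% 
  slice_max(adj_r_sq)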

Your turn!

## Change the <VARIABLES SELECTED> in the map() step to the names of the variables you selected in Steps 1 & 2
## Change the <VARIABLE 1 SELECTED> and <VARIABLE 2 SELECTED> in the select() step to the names of the variables you selected in Steps 1 & 2

evals %>% 
  # fit one model per column of evals, adding that column to the
  # variables you have already selected
  map(.f = ~lm(score ~ .x + <VARIABLES SELECTED>, data = evals)) %>% 
  # pull out the adjusted R-squared of every model
  map_df(.f = ~get_regression_summaries(.x)$adj_r_squared) %>% 
  # drop score and the variables already in the model
  select(-score, 
         -<VARIABLE 1 SELECTED>, 
         -<VARIABLE 2 SELECTED>
         ) %>% 
  # reshape to one row per candidate variable, then keep the
  # variable with the highest adjusted R-squared
  pivot_longer(cols = everything(), 
               names_to = "variable", 
               values_to = "adj_r_sq") %>% 
  slice_max(adj_r_sq)

The output of this code is the variable that has the highest adjusted \(R^2\). Compare this value to the value of your “top” model from Step 2 and see if it improved adjusted \(R^2\) by at least 2% (0.02). If so, this variable should be added. If not, then your model from Step 2 is the “best” model!
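
To make that comparison concrete, you can pull the baseline value straight from your Step 2 model (the model name below is a placeholder for your own):

## hypothetical baseline: adjusted R-squared of your Step 2 top model
get_regression_summaries(two_bty_gender)$adj_r_squared
## add the new variable only if the pipeline's adj_r_sq beats
## this baseline by at least 0.02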

3. What model was your top model? State which variables are included in the model you chose!

Step 4 - Adding a Fourth Variable

If you decided to add a variable in Step 3, then you keep going! If you didn’t add a variable in Step 3, then you stop!

## Change the <VARIABLES SELECTED> in the map() step to the names of the variables you selected in Steps 1, 2, & 3
## Change the <VARIABLE 1 SELECTED>, <VARIABLE 2 SELECTED>, and <VARIABLE 3 SELECTED> in the select() step to the names of the variables you selected in Steps 1, 2, & 3

evals %>% 
  map(.f = ~lm(score ~ .x + <VARIABLES SELECTED>, data = evals)) %>% 
  map_df(.f = ~get_regression_summaries(.x)$adj_r_squared) %>% 
  select(-score, 
         -<VARIABLE 1 SELECTED>, 
         -<VARIABLE 2 SELECTED>, 
         -<VARIABLE 3 SELECTED>
         ) %>% 
  pivot_longer(cols = everything(), 
               names_to = "variable", 
               values_to = "adj_r_sq") %>% 
  slice_max(adj_r_sq)

The output of this code is the variable that has the highest adjusted \(R^2\). Compare this value to the value of your “top” model from Step 3 and see if it improved adjusted \(R^2\) by at least 2% (0.02). If so, this variable should be added. If not, then your model from Step 3 is the “best” model!

4. What model was your top model? You must state which variables are included in the model you chose!

Step 5 - Adding a Fifth Variable

If you decided to add a variable in Step 4, then you keep going! If you didn’t add a variable in Step 4, then you stop!

## Change the <VARIABLES SELECTED> in the map() step to the names of the variables you selected in Steps 1, 2, 3, & 4
## Change the <VARIABLE 1 SELECTED> through <VARIABLE 4 SELECTED> in the select() step to the names of the variables you selected in Steps 1, 2, 3, & 4

evals %>% 
  map(.f = ~lm(score ~ .x + <VARIABLES SELECTED>, data = evals)) %>% 
  map_df(.f = ~get_regression_summaries(.x)$adj_r_squared) %>% 
  select(-score, 
         -<VARIABLE 1 SELECTED>, 
         -<VARIABLE 2 SELECTED>, 
         -<VARIABLE 3 SELECTED>,
         -<VARIABLE 4 SELECTED>
          ) %>%  
  pivot_longer(cols = everything(), 
               names_to = "variable", 
               values_to = "adj_r_sq") %>% 
  slice_max(adj_r_sq)

The output of this code is the variable that has the highest adjusted \(R^2\). Compare this value to the value of your “top” model from Step 4 and see if it improved adjusted \(R^2\) by at least 2% (0.02). If so, this variable should be added. If not, then your model from Step 4 is the “best” model!

5. What model was your top model? You must state which variables are included in the model you chose!

Comparing with the step() Function

Let’s compare the forward selection model you found with the model the step() function decides is best. Run the code chunk below to obtain the “best” model chosen by this function. Keep in mind that step() selects variables using AIC rather than adjusted \(R^2\), so it will not necessarily agree with you.

null_model <- lm(score ~ 1, data = evals)
full_model <- lm(score ~ ., data = evals)
step(null_model, direction = "forward", scope = formula(full_model))
6. Did the step() function choose the same model as you? If your “best” models do not agree, why do you think this might have happened? Hint: The output after Call: tells you what final model was chosen