Midterm Project Feedback & Model Selection

Deadline Reminders

  • Lab 3 revisions are due tonight
  • Statistical Critique 1 revisions are due tonight
  • The final version of your Midterm Project is due Sunday at 11:59pm

Deadline Extensions

You cannot request deadline extensions for the final version of your Midterm Project. The assignment portal closes at 11:59pm on Sunday. Do not ride the line. Submissions made after 11:59pm will not be accepted.

Midterm Project Review

Data Descriptions

Your Introduction should at a minimum address the following questions:

  1. Who collected the data?
  2. How were the data collected?
  3. When were the data collected?
  4. Why were the data collected?
  5. What question do these data address?

Are your data associated with a publication?

If so, you should have a reference to this publication in your Introduction!

Research Questions

  • Your research question should be a question

  • Your question should be able to be addressed with a multiple linear regression

    • Is the relationship between stem length and stem dry mass different for watersheds treated with calcium versus without?
    • Is there a relationship between the elevation at which a pika lives and its stress levels? Does this relationship differ for male and female pika?

Variables

Describe the response and explanatory variables, how they were measured and their associated units.

  • How were the variables measured?
    • How do you know?
    • If the researchers do not explicitly state how a variable was measured, don’t guess! Be transparent about information you do not know!

Data Visualizations & Coefficient Interpretations

  • Descriptions of your visualizations should address:

    • form, direction, strength, and unusual points
  • Descriptions of your visualizations should go immediately below the visualization, before the “Statistical Methods” subsection.

Study Limitations – Scope of Inference

Based on how the study was designed, to what population can you generalize these results?

  • Every penguin?
  • Penguins in Antarctica?
  • Penguins on the Biscoe, Dream, and Torgersen islands?
  • Penguins in similar areas to those that were sampled?
  • This sample of penguins?

Justify the population to which you believe your results can be generalized.

  • The sample of [possums / professors / crabs]?
  • Some larger population of [possums / professors / crabs]?

Your justification needs to connect with how the researchers collected their data!

Study Limitations – Causal Inference

Based on how the study was designed, what can you say about the relationships between the variables?


  • Can you say that your explanatory variable(s) causes changes in your response variable?
  • Why or why not?
  • What can you say about these relationships?

Stating that the study was “observational” doesn’t tell me that you understand what would be required to use cause-and-effect language!

  • What specifically would the researchers have needed to do in order to use causal language?

Writing Conclusions

  • Circle back to your research question
  • What did you learn in your visualizations?
  • What did you learn in your regression model?
  • What conclusions would you reach about your research question?

No “significance” & no p-values

Model Selection

Model selection is the task of selecting a model from among various candidates on the basis of performance criterion to choose the best one.

In the context of statistical analyses, this may be the selection of a statistical model from a set of candidate models, given data.

Wikipedia

How does model selection work?

  1. Choose a set of variables you are interested in including in your model.

  2. Choose a metric to compare your models (e.g., adjusted R-squared, AIC, p-values).

  3. Choose a threshold that you will use to say one model is discernibly “better” than another model (e.g., higher adjusted R-squared).

  4. Choose how you want to progress through the different model options (e.g., forward selection, backward selection, fit all possible models).

Why would you want to use model selection?

Your data have LOTS of variables


By “lots” I mean LOTS, like 100+.

In this setting, model selection can help you find the “signal” through the noise—which variables actually matter?

You’re interested in prediction


You mostly care about finding a model that will get you the best predictions, and are not interested in interpreting the coefficients from the model.

Let’s give it a try!

Predicting a Baby’s Weight

What variables do we have to choose from?

Variables
fage
mage
mature
weeks
premie
visits
gained
weight
lowbirthweight
sex
habit
marital
whitemom

Using backward selection with AIC, the “best model” includes:

Chosen Variables
mage
mature
weeks
premie
gained
lowbirthweight
sex
habit
whitemom
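
For reference, here is a minimal sketch of what backward selection with AIC can look like in base R. It is not the code used to produce the list above: it assumes the built-in mtcars data as a stand-in for the births data, and the names full_model, backward_model, and chosen_variables are only illustrative.

```r
# Illustrative only: mtcars ships with base R and stands in for the births data.
# Start from the model that contains every candidate variable.
full_model <- lm(mpg ~ ., data = mtcars)

# step() removes one variable at a time and keeps a removal only if it lowers AIC.
# trace = 0 hides the step-by-step printout.
backward_model <- step(full_model, direction = "backward", trace = 0)

# Variables that survived the selection
chosen_variables <- attr(terms(backward_model), "term.labels")
print(chosen_variables)

# Compare AIC before and after selection (lower is better)
print(c(full = AIC(full_model), selected = AIC(backward_model)))
```

Note that step() uses AIC by default, so no metric needs to be specified here.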

Would you expect to get the same “best” model with a different dataset?

A Different Sample

Using a different sample of 1,000 births, the “best model” includes:

Chosen Variables
fage
weeks
marital
gained
lowbirthweight
sex
habit
whitemom

Did we get the same “best” model?
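
One way to explore this question yourself is to run the same selection procedure on two different samples and compare what gets chosen. The sketch below is only an illustration: it assumes the built-in mtcars data and a small, arbitrary set of candidate variables, and select_backward is a hypothetical helper, not a function from any package.

```r
set.seed(313)  # arbitrary seed so the random split is reproducible

# Randomly split the rows of mtcars into two samples of equal size
rows_a <- sample(nrow(mtcars), size = nrow(mtcars) / 2)
sample_a <- mtcars[rows_a, ]
sample_b <- mtcars[-rows_a, ]

# Hypothetical helper: backward selection with AIC, returning the chosen variables
select_backward <- function(data) {
  full_model <- lm(mpg ~ wt + hp + qsec + disp + drat, data = data)
  best_model <- step(full_model, direction = "backward", trace = 0)
  attr(terms(best_model), "term.labels")
}

vars_a <- select_backward(sample_a)
vars_b <- select_backward(sample_b)

print(vars_a)
print(vars_b)

# Variables chosen in one sample but not the other
print(union(setdiff(vars_a, vars_b), setdiff(vars_b, vars_a)))
```

Try changing the seed and see whether the two lists agree.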

Should you always use model selection?

In fact, many statisticians discourage the use of stepwise regression alone for model selection and advocate, instead, for a more thoughtful approach that carefully considers the research focus and features of the data.

Introduction to Modern Statistics

Lab 6

Forward Selection (by Hand)

  1. Start with the most basic model (one mean)

  2. Decide which one variable to add (based on adjusted \(R^2\))

  3. Decide if you should add another variable

\(\vdots\)

  4. Stop adding variables when adjusted \(R^2\) stops increasing
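
The steps above can be sketched as a loop in base R. This is a rough illustration rather than the Lab 6 solution: it assumes the built-in mtcars data with mpg as the response, and forward_select and its arguments are hypothetical names made up for this example.

```r
# Hypothetical helper: forward selection using adjusted R-squared
forward_select <- function(data, response, candidates) {
  selected <- character(0)
  best_adj_r_sq <- 0  # the one-mean (intercept-only) model has adjusted R-squared 0

  repeat {
    remaining <- setdiff(candidates, selected)
    if (length(remaining) == 0) break

    # Fit one model per remaining variable, each added to the variables chosen so far
    adj_r_sq <- sapply(remaining, function(variable) {
      model_formula <- reformulate(c(selected, variable), response = response)
      summary(lm(model_formula, data = data))$adj.r.squared
    })

    # Stop when no addition increases adjusted R-squared
    if (max(adj_r_sq) <= best_adj_r_sq) break

    selected <- c(selected, names(which.max(adj_r_sq)))
    best_adj_r_sq <- max(adj_r_sq)
  }

  list(variables = selected, adj_r_squared = best_adj_r_sq)
}

forward_result <- forward_select(
  data = mtcars,
  response = "mpg",
  candidates = c("wt", "hp", "qsec", "disp", "drat")
)
print(forward_result)
```

The order of forward_result$variables is the order in which the variables were added.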

Choosing What Variable to Add

In each step, you will choose which one variable to add based on the adjusted R-squared value.


get_regression_summaries(backward_model)
# A tibble: 1 × 9
  r_squared adj_r_squared   mse  rmse sigma statistic p_value    df  nobs
      <dbl>         <dbl> <dbl> <dbl> <dbl>     <dbl>   <dbl> <dbl> <dbl>
1     0.605         0.601 0.819 0.905  0.91      151.       0     8   800

A More Automated Option

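# Replace each <...> placeholder with the variables you have already selected.
evals_train %>% 
  # For every column, fit a model that adds that column to the selected variables
  map(.f = ~lm(score ~ .x + <VARIABLES SELECTED>, data = evals_train)) %>% 
  # Pull out the adjusted R-squared of each model (one column per variable)
  map_df(.f = ~get_regression_summaries(.x)$adj_r_squared) %>% 
  # Drop columns that are not candidates: the ID, the response,
  # and the variables already in the model
  select(-ID, 
         -score,
         -<VARIABLE 1 SELECTED>,
         -<VARIABLE 2 SELECTED>
         ) %>% 
  # Reshape to one row per candidate variable
  pivot_longer(cols = everything(), 
               names_to = "variable", 
               values_to = "adj_r_sq") %>% 
  # Keep the candidate with the highest adjusted R-squared
  slice_max(adj_r_sq)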

Accessing Lab 6

Roles

The person who was the Recorder last week is the Resource Manager this week! The person who was the Resource Manager last week is the Recorder this week!

Step 2: Both members open the Lab 6 assignment in your group workspace!

Step 3: Follow the final instructions to activate collaborative editing in the document.