Deadline Extensions
You cannot request deadline extensions for the final version of your Midterm Project. The assignment portal closes at 11:59pm on Sunday. Do not ride the line. Submissions made after 11:59pm will not be accepted.
Your Introduction should at a minimum address the following questions:
Are your data associated with a publication?
If so, you should have a reference to this publication in your Introduction!
Your research question should be a question
Your question should be able to be addressed with a multiple linear regression
Describe the response and explanatory variables, how they were measured and their associated units.
Descriptions of your visualizations should address:
Descriptions of your visualizations should go immediately below the visualization, before the “Statistical Methods” subsection.
Based on how the study was designed, what population can you infer these results onto?
Justify what population you believe your analysis can be inferred onto.
Your justification needs to connect with how the researchers collected their data!
Based on how the study was designed, what can you say about the relationships between the variables?
Stating that the study was “observational” doesn’t tell me that you understand what would be required to use cause-and-effect language!
No “significance” & no p-values
Model selection is the task of selecting a model from among various candidates on the basis of performance criterion to choose the best one.
In the context of statistical analyses, this may be the selection of a statistical model from a set of candidate models, given data.
Wikipedia
Choose a set of variables you are interested in including in your model.
Choose a metric to compare your models (e.g., adjusted R-squared, AIC, p-values).
Choose a threshold that you will use to say one model is discernibly “better” than another model (e.g., higher adjusted R-squared).
Choose how you want to progress through the different model options (e.g., forward selection, backward selection, fit all possible models).
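The choices above can be made concrete with a small comparison. This is a minimal sketch, not the course analysis: it assumes the built-in `mtcars` data as a stand-in, adjusted R-squared as the metric, and "any increase at all" as the threshold (the name `threshold` and its value are hypothetical). With only two candidate models there is nothing to step through, so the fourth choice does not come into play here.

```r
# Stand-in data: the built-in mtcars data set (an assumption, not the course data).
# Choice 1: the candidate variables are wt and hp, with mpg as the response.
small_model <- lm(mpg ~ wt, data = mtcars)
large_model <- lm(mpg ~ wt + hp, data = mtcars)

# Choice 2: the metric is adjusted R-squared.
adj_r_sq <- c(
  small = summary(small_model)$adj.r.squared,
  large = summary(large_model)$adj.r.squared
)

# Choice 3: the threshold is "any increase at all" (a hypothetical choice).
threshold <- 0
prefer_large <- (adj_r_sq[["large"]] - adj_r_sq[["small"]]) > threshold

adj_r_sq      # adjusted R-squared for each candidate model
prefer_large  # TRUE if the larger model clears the threshold
```

A stricter threshold (for example, requiring an increase of at least 0.01) would make the comparison more conservative; what counts as "discernibly better" is a judgment call you should state up front.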
Your data have LOTS of variables
By “lots” I mean LOTS, like 100+.
In this setting, model selection can help you find the “signal” through the noise—which variables actually matter?
You’re interested in prediction
You mostly care about finding a model that will get you the best predictions, and are not interested in interpreting the coefficients from the model.
What variables do we have to choose from?
Variables |
---|
fage |
mage |
mature |
weeks |
premie |
visits |
gained |
weight |
lowbirthweight |
sex |
habit |
marital |
whitemom |
Using backward selection with AIC, the “best model” includes:
Chosen Variables |
---|
mage |
mature |
weeks |
premie |
gained |
lowbirthweight |
sex |
habit |
whitemom |
Using a different sample of 1,000 births, the “best model” includes:
Chosen Variables |
---|
fage |
weeks |
marital |
gained |
lowbirthweight |
sex |
habit |
whitemom |
Did we get the same “best” model?
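One way to see why the answer can change is to rerun the same procedure on different samples. The sketch below is an illustration under stated assumptions: it uses simulated data rather than the births data, a hypothetical helper named `select_backward()`, and base R's `step()` for backward selection with AIC. The variables `x3`, `x4`, and `x5` are pure noise, so whether they survive selection depends on the sample drawn.

```r
set.seed(1)  # hypothetical seed, only so the sketch is repeatable

# Simulated stand-in data (not the births data): y depends strongly on x1,
# weakly on x2, and not at all on x3, x4, or x5.
n <- 2000
population <- data.frame(
  x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n), x5 = rnorm(n)
)
population$y <- 2 * population$x1 + 0.05 * population$x2 + rnorm(n)

# Hypothetical helper: draw one random sample of rows, then run
# backward selection with AIC and return the variables that remain.
select_backward <- function(data, size) {
  sampled <- data[sample(nrow(data), size), ]
  full_model <- lm(y ~ ., data = sampled)
  chosen <- step(full_model, direction = "backward", trace = 0)
  sort(attr(terms(chosen), "term.labels"))
}

sample_a <- select_backward(population, 200)
sample_b <- select_backward(population, 200)

sample_a  # variables kept in the first sample
sample_b  # variables kept in the second sample

# Variables chosen in one sample but not the other (may be empty).
setdiff(union(sample_a, sample_b), intersect(sample_a, sample_b))
```

Any variable listed by the last line was "important" in one sample and dropped in the other, even though both samples came from the same population.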
In fact, many statisticians discourage the use of stepwise regression alone for model selection and advocate, instead, for a more thoughtful approach that carefully considers the research focus and features of the data.
Introduction to Modern Statistics
Start with the most basic model (one mean)
Decide which one variable to add (based on adjusted \(R^2\))
Decide if you should add another variable
\(\vdots\)
In each step, you will choose which one variable to add based on the adjusted R-squared value.
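library(tidyverse)   # map(), map_df(), select(), pivot_longer(), slice_max()
library(moderndive)  # get_regression_summaries()
# Fit one model per column of evals_train, each adding that column (.x)
# to the variables already selected.
evals_train %>%
  map(.f = ~lm(score ~ .x + <VARIABLES SELECTED>, data = evals_train)) %>%
  # Pull the adjusted R-squared from each fit (one column per variable).
  map_df(.f = ~get_regression_summaries(.x)$adj_r_squared) %>%
  # Drop columns that are not candidates: the ID, the response, and
  # the variables already in the model.
  select(-ID,
         -score,
         -<VARIABLE 1 SELECTED>,
         -<VARIABLE 2 SELECTED>
         ) %>%
  pivot_longer(cols = everything(),
               names_to = "variable",
               values_to = "adj_r_sq") %>%
  # Keep the variable with the largest adjusted R-squared.
  slice_max(adj_r_sq)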
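The full forward-selection loop described in the steps above can also be sketched from start to finish. This is not the lab solution: it assumes the built-in `mtcars` data as a stand-in for `evals_train`, uses base R rather than the tidyverse pipeline, and stops as soon as no remaining variable increases adjusted R-squared (an assumed stopping rule). The helper name `adj_r_sq()` is hypothetical.

```r
# Stand-in data: built-in mtcars (an assumption; the lab uses evals_train).
response <- "mpg"
candidates <- setdiff(names(mtcars), response)
selected <- character(0)

# Hypothetical helper: adjusted R-squared for a model with the given variables.
# With no variables it fits the intercept-only ("one mean") model.
adj_r_sq <- function(vars) {
  terms_to_use <- if (length(vars) > 0) vars else "1"
  model_formula <- reformulate(terms_to_use, response = response)
  summary(lm(model_formula, data = mtcars))$adj.r.squared
}

# Start with the most basic model (one mean).
best_so_far <- adj_r_sq(selected)

repeat {
  remaining <- setdiff(candidates, selected)
  if (length(remaining) == 0) break

  # Try adding each remaining variable, one at a time.
  scores <- sapply(remaining, function(v) adj_r_sq(c(selected, v)))

  # Stop when no addition increases adjusted R-squared.
  if (max(scores) <= best_so_far) break

  selected <- c(selected, names(which.max(scores)))
  best_so_far <- max(scores)
}

selected     # variables in the order they were added
best_so_far  # adjusted R-squared of the final model
```

Each pass through the loop corresponds to one "decide if you should add another variable" step; the order of `selected` records which variable won each round.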
Roles
The person who was the Recorder last week is the Resource Manager this week! The person who was the Resource Manager last week is the Recorder this week!
Step 1: The Recorder follows these instructions to copy the Lab 6 project into your group’s workspace
Step 2: Both members open the Lab 6 assignment in your group workspace!
Step 3: Follow the final instructions to activate collaborative editing in the document.