Variable Selection in Multiple Regression

Statistical Critique

Feedback & Revisions

  • You are expected to directly state what variable(s) are associated with each aesthetic (e.g., the x-axis is associated with time).
  • Aesthetics are associated with variables
    • If there is only one color / shape / linetype being used in the plot, it is not associated with a variable!
    • If there are multiple plots, then there is a variable associated with how the plots are separated!

Revision Deadline

Revisions for Statistical Critiques are due on Wednesday (May 8)

Midterm Project

Research Questions

  • Your research question should be a question

  • Your question should be able to be addressed with a multiple linear regression

    • Is the relationship between stem length and stem dry mass different for watersheds treated with calcium versus without?
    • What is the relationship between the size of a fiddler crab and the latitude and water temperature in which it lives?

Data Visualizations & Coefficient Interpretations

  • Descriptions of your visualizations should address:

    • form, direction, strength, and unusual points
  • Descriptions of your visualizations should go immediately below the visualization, before the “Proposed Statistical Model” section

  • You should interpret every coefficient from your regression equation(s)!

    • Every y-intercept and every slope

Study Limitations – Scope of Inference

Based on how the study was designed, what population can you infer these results onto?

  • Every sugar maple?
  • Sugar maples in the Northeast?
  • Sugar maples in New Hampshire?
  • Sugar maples in the Hubbard Brook Experimental Forest?
  • Sugar maples in similar areas of the Hubbard Brook Experimental Forest to those that were sampled?
  • This sample of sugar maples?

Why?

Study Limitations – Causal Inference

Based on how the study was designed, what can you say about the relationships between the variables?

  • Can you say that your explanatory variable(s) causes changes in your response variable?
  • Why or why not?
  • What can you say about these relationships?

Writing Conclusions

  • Circle back to your research question
  • What did you learn in your visualizations?
  • What did you learn in your regression model?
  • What conclusions would you reach about your research question?

No “significance” & no p-values

Model Selection

What is model selection?



Why use model selection?

  1. Lots of available predictor variables

evals:

ID prof_ID score age bty_avg gender ethnicity language rank pic_outfit pic_color cls_did_eval cls_students cls_level
51 10 4.3 47 5.500 male not minority english teaching not formal color 12 16 upper
94 18 4.0 48 4.333 male not minority english teaching not formal color 100 135 lower
171 32 4.1 47 2.667 male not minority english tenured not formal color 10 14 upper

  1. Interested in prediction not explanation

You want to predict an outcome variable \(y\) based on the information contained in a set of predictor variables \(x\). You don’t care so much about understanding how all the variables relate and interact with one another, but rather only whether you can make good predictions about \(y\) using the information in \(x\).

ModernDive

How do you use model selection?

  • Stepwise Selection
    • Forward Selection
    • Backward Selection
  • Resampling Methods
    • Cross Validation
    • Testing / Training Datasets

With any of these methods, you get to choose how you decide if one model is better than another model.

Model Comparison Measures

\(R^2\) – Coefficient of Determination

Wright, Sewall (1921). Correlation and causation. Journal of Agricultural Research 20: 557-585.

In statistics, the coefficient of determination, denoted \(R^2\) or \(r^2\) and pronounced “R squared,” is the proportion of the variation in the dependent variable that is predictable from the independent variable(s).

Wikipedia

\(R^2 = 1 - \frac{\text{var}(\text{residuals})}{\text{var}(y)}\)

  • \(\text{var}(\text{residuals})\) is the variance of the residuals “leftover” from the regression model

  • \(\text{var}(y)\) is the inherent variability of the response variable

Suppose we have a simple linear regression with an \(R^2\) of 0.85. How would you interpret this quantity?

But! \(R^2\) always increases as you increase the number of explanatory variables.


The variance of the residuals will always decrease when you include additional explanatory variables.

Simple Linear Regression

\(0.85 = 1 - \frac{0.75}{5}\)

One Additional Variable

\(0.86 = 1 - \frac{0.7}{5}\)

Adjusted \(R^2\)

Mordecai Ezekiel (1930). Methods Of Correlation Analysis, Wiley, pp. 208-211.

The use of an adjusted \(R^2\) is an attempt to account for the phenomenon of the \(R^2\) automatically increasing when extra explanatory variables are added to the model.

Wikipedia

\(R^2_{adj} = 1 - R^2 \times \frac{(n - 1)}{(n - k - 1)}\)

  • \(n\) is the sample size

  • \(k\) is the number of coefficients needed to be calculated

Suppose you have a categorical variable with 4 levels included in your parallel slopes multiple linear regression.

What value will you use for \(k\) in the calculation of \(n - k - 1\)?

p-values

Fisher R. A. (1950). Statistical methods for research workers.

In null-hypothesis significance testing, the p-value is the probability of obtaining test results at least as extreme as the result actually observed, under the assumption that the null hypothesis is correct. A very small p-value means that such an extreme observed outcome would be very unlikely under the null hypothesis.

Wikipedia

AIC

Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle.

The Akaike information criterion (AIC) is an estimator of prediction error and thereby relative quality of statistical models for a given set of data. Given a collection of models for the data, AIC estimates the quality of each model, relative to each of the other models.

Wikipedia

How do you use AIC to choose a “best” model?


model AIC Delta AIC
Full Model 4724.970 0.000000
All Variables Except Year 4727.242 2.272501
All Variables Except Flipper Length 4757.214 32.244605
All Variables Except Species 4793.681 68.710933


If you’ve ever assessed whether \(\Delta\) AIC \(> 2\) you have done something that is mathematically close to \(p < 0.05\).

Model Selection Activity!

Backward Selection by Hand

  • Start with “full” model (every explanatory variable is included)
    • Use adjusted \(R^2\) to summarize the “fit” of this model
  • Decide which one variable to remove
    • Highest adjusted \(R^2\)
  • Decide what one variable to remove next
    • Highest adjusted \(R^2\)
  • Keep removing variables until adjusted \(R^2\) doesn’t increase

What’s your best model?

Adding a Constraint

Repeat the same process, but now for a variable to be removed the adjusted \(R^2\) must increase by at least 2% (0.02).

What’s your best model?

If you’re not interested in prediction, what should you use instead?

In fact, many statisticians discourage the use of stepwise regression alone for model selection and advocate, instead, for a more thoughtful approach that carefully considers the research focus and features of the data.

Introduction to Modern Statistics

For Wednesday

Peer Review

Please print your Midterm Project and bring it to class!