Introduction to Multiple Linear Regression

Midterm Project Week

Plan for Today

  1. Review different types of multiple linear regression models

  2. Complete an activity on sample selection

  3. Start Midterm Project write-up

Plan for Wednesday

No lab – focus on getting all the coding accomplished for the Midterm Project

Draft Due by Sunday

So that everyone gets feedback on their drafts in a timely manner, first drafts are due by Sunday.

Deadline Extension

A deadline extension is permitted for the first draft. Deadline extensions are not permitted for the final version (due next week).

Reminders About Deadlines

  • Midterm Project Proposal is due today by 5pm
  • Lab 3 revisions are due Wednesday (May 1)
  • Statistical Critique revisions are due next Wednesday (May 8)
  • The second round of Lab 2 revisions is due by Wednesday (May 1)

Multiple Linear Regression

Before…

Now…



How?

Offsets!

smoke_lm <- lm(weight ~ weeks * habit, data = ncbirths)

get_regression_table(smoke_lm)
# A tibble: 4 × 3
  term              estimate std_error
  <chr>                <dbl>     <dbl>
1 intercept           -5.94      0.484
2 weeks                0.341     0.013
3 habit: smoker       -1.86      1.63 
4 weeks:habitsmoker    0.039     0.042

Interaction Model

The * means the variables are interacting!
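As a quick check on what `*` does in a model formula, the sketch below uses base R's `terms()` to list the terms each formula expands to. No data is needed for this; the additive formula `weight ~ weeks + habit` is included only for comparison and is not a model fitted on these slides.

```r
# Expand each formula into its individual model terms (base R only, no data needed).
interaction_terms <- attr(terms(weight ~ weeks * habit), "term.labels")
additive_terms    <- attr(terms(weight ~ weeks + habit), "term.labels")

print(interaction_terms)  # "weeks" "habit" "weeks:habit"
print(additive_terms)     # "weeks" "habit"
```

The extra `weeks:habit` term is what lets each group have its own slope, and it matches the `weeks:habitsmoker` row in the regression table.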

Estimated Regression Equations

# A tibble: 4 × 3
  term              estimate std_error
  <chr>                <dbl>     <dbl>
1 intercept           -5.94      0.484
2 weeks                0.341     0.013
3 habit: smoker       -1.86      1.63 
4 weeks:habitsmoker    0.039     0.042

What is the regression equation for non-smoker mothers?

What is the regression equation for smoker mothers?
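One way to check your answers is to combine the estimates by hand. The sketch below copies the four estimates from the table into hypothetical variable names (`b_intercept`, `b_weeks`, and so on are not from the slides) and assumes non-smokers are the baseline group, since the table only lists an offset for smokers.

```r
# Estimates copied from the regression table above.
b_intercept    <- -5.94   # intercept for the baseline group (non-smokers)
b_weeks        <-  0.341  # slope for weeks in the baseline group
b_smoker       <- -1.86   # intercept offset for smokers
b_weeks_smoker <-  0.039  # slope offset for smokers

# Non-smokers use the baseline intercept and slope unchanged.
nonsmoker <- c(intercept = b_intercept, slope = b_weeks)

# Smokers add both offsets to the baseline values.
smoker <- c(intercept = b_intercept + b_smoker,
            slope     = b_weeks + b_weeks_smoker)

print(nonsmoker)
print(smoker)
```

Under these assumptions the smoker line has an intercept of about -7.80 and a slope of about 0.380, while the non-smoker line keeps -5.94 and 0.341.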

What if we have a second numerical explanatory variable?

Multiple slopes

age_lm <- lm(weight ~ weeks + mage, data = ncbirths)

get_regression_table(age_lm)
# A tibble: 3 × 3
  term      estimate std_error
  <chr>        <dbl>     <dbl>
1 intercept   -6.68      0.492
2 weeks        0.346     0.012
3 mage         0.02      0.006

How do you interpret the value of 0.346?

How do you interpret the value of 0.02?
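A rough way to see what each slope means is to change one variable by one unit while holding the other fixed. The sketch below plugs the table estimates into a hypothetical helper, `predict_weight()`; the specific values of 39 or 40 weeks and a mother's age of 30 or 31 are arbitrary illustrations, and it assumes `mage` is the mother's age in years.

```r
# Estimates copied from the regression table above.
b_intercept <- -6.68
b_weeks     <-  0.346
b_mage      <-  0.02

# Predicted birth weight from the estimated regression equation.
predict_weight <- function(weeks, mage) {
  b_intercept + b_weeks * weeks + b_mage * mage
}

# One more week of gestation, with the mother's age held fixed.
diff_weeks <- predict_weight(weeks = 40, mage = 30) -
  predict_weight(weeks = 39, mage = 30)

# One more year of mother's age, with weeks held fixed.
diff_mage <- predict_weight(weeks = 39, mage = 31) -
  predict_weight(weeks = 39, mage = 30)

print(round(diff_weeks, 3))
print(round(diff_mage, 3))
```

Each difference equals the corresponding slope, which is why the interpretation of each coefficient includes the phrase "holding the other variable constant."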

But how do we decide if the interaction model is “best” without a p-value?

When investigating if a relationship differs…

Always start with the “interaction” / different slopes model.

If the slopes look different, you’re done!

If the slopes look similar, then fit the “additive” / parallel slopes model.

Different Enough?

What if they’re not very different?

Parallel Slopes

lm(average_sat_math ~ perc_disadvan + size, 
   data = MA_schools)


# A tibble: 4 × 3
  term          estimate std_error
  <chr>            <dbl>     <dbl>
1 intercept       588.       7.61 
2 perc_disadvan    -2.78     0.106
3 size: medium    -11.9      7.54 
4 size: large      -6.36     6.92 

Group equations – Baseline

# A tibble: 4 × 3
  term          estimate std_error
  <chr>            <dbl>     <dbl>
1 intercept       588.       7.61 
2 perc_disadvan    -2.78     0.106
3 size: medium    -11.9      7.54 
4 size: large      -6.36     6.92 

\[\widehat{SAT}_{small} = 588 - 2.78 \times \text{percent disadvantaged}\]

Group equations – Offsets

# A tibble: 4 × 3
  term          estimate std_error
  <chr>            <dbl>     <dbl>
1 intercept       588.       7.61 
2 perc_disadvan    -2.78     0.106
3 size: medium    -11.9      7.54 
4 size: large      -6.36     6.92 

\[\widehat{SAT}_{medium} = (588 - 11.9) - 2.78 \times \text{percent disadvantaged}\]

\[\widehat{SAT}_{medium} = 576.1 - 2.78 \times \text{percent disadvantaged}\]

\[\widehat{SAT}_{large} = (588 - 6.36) - 2.78 \times \text{percent disadvantaged}\]

\[\widehat{SAT}_{large} = 581.64 - 2.78 \times \text{percent disadvantaged}\]
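The offset arithmetic above can be reproduced in a few lines of R. This is a minimal sketch that copies the estimates from the table into hypothetical variable names; the value of 20 percent disadvantaged is an arbitrary illustration, not a number from the data.

```r
# Estimates copied from the regression table above.
b_intercept <- 588    # intercept for the baseline group (small schools)
b_slope     <- -2.78  # common slope for perc_disadvan
offsets     <- c(small = 0, medium = -11.9, large = -6.36)

# Each group's intercept is the baseline intercept plus that group's offset.
group_intercepts <- b_intercept + offsets
print(group_intercepts)

# Predicted average SAT math score when 20% of students are disadvantaged.
predicted <- group_intercepts + b_slope * 20
print(predicted)
```

Because the slope is shared, the gap between any two groups' predictions stays the same at every value of percent disadvantaged, which is what "parallel slopes" means.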

Sample Selection Activity

Find Your Data Group!

Once you have found other students working on the same dataset, complete the sample selection activity.

  1. What are the observations / rows in this dataset?

  2. From what population was the sample drawn?

  3. For an observation to be included in the dataset, what inclusion criteria needed to be met?

  4. How were the observations that satisfied the inclusion criteria sampled from the population?

  5. Based on the inclusion criteria and sampling methods, to what population can the findings of the study be generalized?

Midterm Project Work Time

Steps Before Wednesday

  1. Insert the description of your dataset and variables (from the Midterm Proposal) into the “Introduction” of your project

  2. Pose a research question about your selected variables, which can be addressed with multiple linear regression

  3. Insert the code to create the required two (or three) visualizations

  4. Write a description of what you see in the visualizations

  5. Decide which model you believe is “best”