Introduction to Multiple Linear Regression

Lab 3

A Grading Reminder


“Complete” = Satisfactory


Your group obtained a “Success” on every question

“Incomplete” = Growing


Your group received a “Growing” on at least one question

Common Mistakes

  • Categorical variables in R (Q2)
    • What data types does R use to store categorical variables? Integers? Characters? Doubles? Factors? Dates?
    • The output of glimpse() can help!
  • Comparing distributions between groups (Q9)
    • Were trout observed in every channel type in both sections of forest?
  • Calculating group means (Q10)
    • group_by() forms groups from a categorical variable (a column), not from the dataset itself
    • Use group_by(species), not group_by(trout)
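
As a minimal sketch of the first and last points, the code below builds a small made-up table (the `animals` tibble and its values are hypothetical, not the Lab 3 data), checks how R stores each column with glimpse(), and then computes group means by passing a column name to group_by().

```r
library(dplyr)

# Hypothetical toy data for illustration only (not the Lab 3 dataset)
animals <- tibble(
  species   = c("trout", "trout", "salamander", "salamander"),
  section   = factor(c("clear cut", "old growth", "clear cut", "old growth")),
  length_mm = c(60, 70, 40, 50)
)

# glimpse() prints each column's type: species is stored as character (<chr>),
# section as a factor (<fct>), and length_mm as a double (<dbl>)
glimpse(animals)

# group_by() takes a column of the dataset (species), not the dataset's name
species_means <- animals |>
  group_by(species) |>
  summarize(mean_length = mean(length_mm))

species_means
```

Both character and factor columns can be used as grouping variables; the column passed to group_by() just needs to exist in the dataset.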

Copying the Lab – Last Week’s Recorder

The person who typed your lab needs to make their project “public”

  1. Open Posit Cloud
  2. Go to the STAT 313 workspace
  3. Click on “Your Content”
  4. Open the settings for your Lab 3 project

  5. Change the access for your project to “Space Members”

Copying the Lab – Everyone Else

  1. Find your group member’s lab (you can use the search bar to search for their name)

  2. Open their Lab 3 project
  3. Select “Save a Permanent Copy”

Completing Revisions

Lab 3 revisions are due by Wednesday, May 1.

  1. Read comments on Canvas
  2. Copy your group’s lab assignment
  3. Complete your revisions
  4. Render your revised Lab 3
  5. Download your revised HTML
  6. Submit your revisions to the original Lab 3 assignment portal

Reflections

Revisions must be accompanied by reflections on what you learned while completing them. You can write these in your Lab 3 file (next to the problems you revised), in a Word document, or in the comment box on Canvas.

Midterm Project Week

Plan for Today

  1. Review different types of multiple linear regression models

  2. Complete an activity on sample selection

  3. Start Midterm Project write-up

Plan for Wednesday

No lab – focus on finishing all of the coding for the Midterm Project

Draft Due by Sunday

So that everyone receives feedback on their draft in a timely manner, first drafts are due by Sunday.

Deadline Extension

A deadline extension is permitted for the first draft. Deadline extensions are not permitted for the final version (due next week).

Reminders About Deadlines

  • Midterm Project Proposal is due today by 5pm
  • Lab 3 revisions are due Wednesday (May 1)
  • Statistical Critique revisions are due next Wednesday (May 8)

Multiple Linear Regression

Before…

Now…



How?

Offsets!

smoke_lm <- lm(weight ~ weeks * habit, data = ncbirths)

get_regression_table(smoke_lm)
# A tibble: 4 × 3
  term              estimate std_error
  <chr>                <dbl>     <dbl>
1 intercept           -5.94      0.484
2 weeks                0.341     0.013
3 habit: smoker       -1.86      1.63 
4 weeks:habitsmoker    0.039     0.042

Interaction Model

The * means the variables are interacting!
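
One way to see what the * does, sketched here with base R only: expanding the formula lists the terms that lm() will estimate. The formula is the one used for smoke_lm; nothing is fit, so no data are needed.

```r
# Expand the model formula to see which terms the * operator generates
interaction_terms <- attr(terms(weight ~ weeks * habit), "term.labels")
print(interaction_terms)

# The same model written out by hand: two main effects plus their interaction
explicit_terms <- attr(terms(weight ~ weeks + habit + weeks:habit), "term.labels")
print(identical(interaction_terms, explicit_terms))
```

The weeks:habit term is what allows the slope for weeks to differ between the two habit groups.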

Estimated Regression Equations

# A tibble: 4 × 3
  term              estimate std_error
  <chr>                <dbl>     <dbl>
1 intercept           -5.94      0.484
2 weeks                0.341     0.013
3 habit: smoker       -1.86      1.63 
4 weeks:habitsmoker    0.039     0.042

What is the regression equation for non-smoker mothers?

What is the regression equation for smoker mothers?
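
A worked version, assuming non-smoker is the baseline level of habit (the table only lists offsets for smoker): the baseline group uses the intercept and the weeks slope directly, while the smoker group adds the two smoker offsets.

\[\widehat{weight}_{nonsmoker} = -5.94 + 0.341 \times \text{weeks}\]

\[\widehat{weight}_{smoker} = (-5.94 - 1.86) + (0.341 + 0.039) \times \text{weeks}\]

\[\widehat{weight}_{smoker} = -7.80 + 0.380 \times \text{weeks}\]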

What if we have a second numerical explanatory variable?

Multiple slopes

age_lm <- lm(weight ~ weeks + mage, data = ncbirths)

get_regression_table(age_lm)
# A tibble: 3 × 3
  term      estimate std_error
  <chr>        <dbl>     <dbl>
1 intercept   -6.68      0.492
2 weeks        0.346     0.012
3 mage         0.02      0.006

How do you interpret the value of 0.346?

How do you interpret the value of 0.02?
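
For reference, the table corresponds to the estimated equation below. With two numerical explanatory variables, each slope describes the change in the estimated mean birth weight for a one-unit increase in that variable while the other variable is held constant.

\[\widehat{weight} = -6.68 + 0.346 \times \text{weeks} + 0.02 \times \text{mage}\]

For example, among births with the same value of mage, one additional week of gestation is associated with an estimated increase of 0.346 in mean birth weight (in the units weight is recorded in).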

But how do we decide if the interaction model is “best” without a p-value?

When investigating if a relationship differs…

Always start with the “interaction” / different slopes model.

If the slopes look different, you’re done!

If the slopes look similar, then fit the “additive” / parallel slopes model.
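
A rough sketch of this workflow, assuming ncbirths comes from the openintro package and that ggplot2 and moderndive are installed. The slides do not show how the plots were made, so the plotting choices here are illustrative: draw the different slopes model first, then the parallel slopes model for comparison.

```r
library(ggplot2)
library(moderndive)  # provides geom_parallel_slopes()
library(openintro)   # assumed source of the ncbirths data

# Keep only births with all three variables recorded
births <- ncbirths[complete.cases(ncbirths[, c("weeks", "weight", "habit")]), ]

# Step 1: interaction / different slopes model, one fitted line per group
different_slopes <- ggplot(births, aes(x = weeks, y = weight, color = habit)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE)

# Step 2 (only if the slopes look similar): additive / parallel slopes model
parallel_slopes <- ggplot(births, aes(x = weeks, y = weight, color = habit)) +
  geom_point(alpha = 0.3) +
  geom_parallel_slopes(se = FALSE)

different_slopes
parallel_slopes
```

Comparing the two plots side by side makes it easier to judge whether the extra flexibility of the different slopes model is visibly needed.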

Different Enough?

What if they’re not very different?

Parallel Slopes

lm(average_sat_math ~ perc_disadvan + size, 
   data = MA_schools)


# A tibble: 4 × 3
  term          estimate std_error
  <chr>            <dbl>     <dbl>
1 intercept       588.       7.61 
2 perc_disadvan    -2.78     0.106
3 size: medium    -11.9      7.54 
4 size: large      -6.36     6.92 

Group equations – Baseline

# A tibble: 4 × 3
  term          estimate std_error
  <chr>            <dbl>     <dbl>
1 intercept       588.       7.61 
2 perc_disadvan    -2.78     0.106
3 size: medium    -11.9      7.54 
4 size: large      -6.36     6.92 

\[\widehat{SAT}_{small} = 588 - 2.78 \times \text{percent disadvantaged}\]

Group equations – Offsets

# A tibble: 4 × 3
  term          estimate std_error
  <chr>            <dbl>     <dbl>
1 intercept       588.       7.61 
2 perc_disadvan    -2.78     0.106
3 size: medium    -11.9      7.54 
4 size: large      -6.36     6.92 

\[\widehat{SAT}_{medium} = (588 - 11.9) - 2.78 \times \text{percent disadvantaged}\]

\[\widehat{SAT}_{medium} = 576.1 - 2.78 \times \text{percent disadvantaged}\]

\[\widehat{SAT}_{large} = (588 - 6.36) - 2.78 \times \text{percent disadvantaged}\]

\[\widehat{SAT}_{large} = 581.64 - 2.78 \times \text{percent disadvantaged}\]
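
The offset arithmetic above can be checked with a few lines of R. This is only a sketch: the estimates are typed in from the regression table rather than pulled from a fitted model, and predict_sat() is a hypothetical helper written for this illustration.

```r
# Estimates copied from the regression table (small schools are the baseline)
intercept <- 588
slope <- -2.78
offsets <- c(small = 0, medium = -11.9, large = -6.36)

# Each group's intercept is the baseline intercept plus that group's offset
group_intercepts <- intercept + offsets
print(group_intercepts)

# Hypothetical helper: estimated mean SAT math score for a given school size
# and percent of disadvantaged students (same slope for every size)
predict_sat <- function(size, perc_disadvan) {
  unname(group_intercepts[size]) + slope * perc_disadvan
}

predict_sat("medium", 20)
```

Because the slope is shared, the gap between any two school sizes stays the same at every value of percent disadvantaged, which is what makes the lines parallel.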

Sample Selection Activity

Find Your Data Group!

Once you have found other students working on the same dataset, complete the sample selection activity.

  1. What are the observations / rows in this dataset?

  2. From what population was the sample drawn?

  3. For an observation to be included in the dataset, what inclusion criteria needed to be met?

  4. How were the observations that satisfied the inclusion criteria sampled from the population?

  5. Based on the inclusion criteria and sampling methods, to what population can the findings of the study be generalized?

Midterm Project Work Time

Steps Before Wednesday

  1. Insert the description of your dataset and variables (from the Midterm Proposal) into the “Introduction” of your project

  2. Pose a research question about your selected variables, which can be addressed with multiple linear regression

  3. Insert the code to create the required two (or three) visualizations

  4. Write a description of what you see in the visualizations

  5. Decide which model you believe is “best”