Introduction to Linear Regression

Lab 2

A Grading Reminder


“Complete” = Satisfactory


Your group obtained a “Success” on every question

“Incomplete” = Growing


Your group received a “Growing” on at least one question

Common Mistakes

  • Units in axis labels (Q2 & Q9)
    • What unit were the departure / arrival delays measured in?
  • Justifying why I should expect to be early / late (Q8 & Q10)
    • Why is the mean / median a reasonable estimate of the “typical” delay?
    • What aspect(s) of the distribution did you use to decide what a “typical” delay is?

Making a Duplicate of Your Group’s Lab 2 Project

  1. Log in to Posit Cloud
  2. Open your Weeks 2-3 Group Workspace
  3. Find Lab 2

[Screenshot: a group's Lab 2 project in the group workspace, with a purple box around the 'Make a copy' icon (two overlapping boxes with a plus symbol in the middle).]

  4. Make a personal copy of your group’s Lab 2

[Screenshot: the pop-up that appears when you click the 'Make a copy' icon next to the Lab 2 project.]

Every member must have their own copy of the lab! No one works in the original document.

Completing Revisions

Lab 2 revisions are due by Friday, April 24.

  1. Read comments on Canvas
  2. Copy your group’s lab assignment
  3. Complete your revisions
  4. Render your revised Lab 2
  5. Download your revised HTML
  6. Submit your revisions to the original Lab 2 assignment

Reflections

Revisions must be accompanied by reflections on what you learned while completing them. Write these reflections in your Lab 2 Quarto file, next to the problems you revised.

A Word About Reflections

When I initially answered question 6, I believed “observations” were the variables used to classify each column in the data frame. Therefore, I believed the answer to be manufacturer, model, year, etc., or the column headers. However, after analyzing the data again and looking over the reading, I learned that “observations” or “cases” actually refer to the subjects that the data is collected on. In this instance, the data set presents information on cars, and that is why the observations/cases would actually be cars. It is important to be able to distinguish between cases and variables because that is how we can correctly analyze our data and prevent errors in our interpretations.

(Simple) Linear Regression

Relationships Between Variables

  • In a statistical model, we generally have one variable that is the output and one or more variables that are the inputs.
  • Response variable
    • a.k.a. \(y\), dependent
    • The quantity you want to understand
    • In this class – always numerical
  • Explanatory variable
    • a.k.a. \(x\), independent, predictor
    • Something you think might be related to the response
    • In this class – either numerical or categorical

Visualizing Linear Regression


  • The scatterplot has been called the most “generally useful invention in the history of statistical graphics.”
  • It is a simple two-dimensional plot in which the two coordinates of each dot represent the values of two variables measured on a single observation.

Characterizing Relationships


  • Form (e.g. linear, quadratic, non-linear)

  • Direction (e.g. positive, negative)

  • Strength (how much scatter/noise?)

  • Unusual observations (do points not fit the overall pattern?)

Data for Today

The ncbirths dataset is a random sample of 1,000 cases taken from a larger dataset collected in North Carolina in 2004.

Each case describes the birth of a single child born in North Carolina, along with various characteristics of the child (e.g. birth weight, length of gestation, etc.), the child’s mother (e.g. age, weight gained during pregnancy, smoking habits, etc.) and the child’s father (e.g. age).
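
If you want to follow along in R, the ncbirths data is available in the openintro package (an assumption about where the data comes from; adjust if your course provides it differently). A minimal sketch for loading and taking a first look:

library(tidyverse)
library(openintro)   # assumed source of the ncbirths dataset

# Quick look at the variables and their types
glimpse(ncbirths)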

Your Turn!

How would you characterize this relationship? (A code sketch for recreating the scatterplot follows the list below.)

  • form
  • direction
  • strength
  • unusual observations
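
The scatterplot shown on the slide is not reproduced here; a minimal sketch of how such a plot could be made, assuming birth weight (in pounds) is plotted against weeks of gestation:

# Scatterplot of birth weight versus length of gestation
ggplot(ncbirths, aes(x = weeks, y = weight)) +
  geom_point(alpha = 0.3) +
  labs(x = "Length of gestation (weeks)",
       y = "Birth weight (pounds)")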

Cleaning & Filtering the Data

Pregnancies lasting 28 weeks or fewer appear to have a non-linear relationship with the baby’s birth weight, so we will filter these observations out of our dataset.


births_post28 <- ncbirths %>% 
  drop_na(weight, weeks) %>%   # remove births missing weight or gestation length
  filter(weeks > 28)           # keep pregnancies lasting more than 28 weeks
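
As a quick sanity check (not part of the original code), you can compare row counts before and after cleaning:

# How many observations did the cleaning step remove?
nrow(ncbirths) - nrow(births_post28)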

Change in scope of inference

Removing these observations narrows the population of births we are able to make inferences about! In this case, what population could we generalize our findings to?

Summarizing a Linear Relationship


Correlation:

strength and direction of a linear relationship between two quantitative variables

  • Correlation coefficient between -1 and 1
  • Sign of the correlation shows direction
  • Magnitude of the correlation shows strength

Anscombe Correlations

Four datasets, very different graphical presentations

  • same mean and standard deviation in both \(x\) and \(y\)
  • same correlation
  • same regression line

For which of these relationships is correlation a reasonable summary measure?
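
You can verify this yourself: the anscombe data frame ships with base R, and each x/y pair has (nearly) the same correlation despite very different scatterplots.

# Correlations for the four Anscombe pairs -- all approximately 0.82
cor(anscombe$x1, anscombe$y1)
cor(anscombe$x2, anscombe$y2)
cor(anscombe$x3, anscombe$y3)
cor(anscombe$x4, anscombe$y4)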

Calculating Correlation in R

library(moderndive)   # get_correlation() comes from the moderndive package

get_correlation(births_post28, 
                weeks ~ weight)
# A tibble: 1 × 1
    cor
  <dbl>
1 0.557


What if I ran get_correlation(births_post28, weight ~ weeks) instead? Would I get the same value?


Linear regression:

we assume the relationship between our response variable (\(y\)) and explanatory variable (\(x\)) can be modeled with a linear function, plus some random noise

\(response = intercept + slope \cdot explanatory + noise\)

Writing the Regression Equation

Population Model

\(y = \beta_0 + \beta_1 \cdot x + \epsilon\)


\(y\) = response

\(\beta_0\) = population intercept

\(\beta_1\) = population slope

\(\epsilon\) = errors / residuals

Sample Model

\(\widehat{y} = b_0 + b_1 \cdot x\)


\(b_0\) = sample intercept

\(b_1\) = sample slope

Why does this equation have a hat on \(y\)?

What does the hat represent?

Linear Regression with One Numerical Explanatory Variable

Step 1: Fit a linear regression

weeks_lm <- lm(weight ~ weeks, 
               data = births_post28)

Step 2: Obtain coefficient table

get_regression_table(weeks_lm)
term        estimate  std_error  statistic  p_value  lower_ci  upper_ci
intercept     -5.003      0.582     -8.603        0    -6.144    -3.862
weeks          0.316      0.015     21.010        0     0.287     0.346


get_regression_table()

This function lives in the moderndive package, so we will need to load in this package (e.g., library(moderndive)) if we want to use the get_regression_table() function.

Our focus (for now…)

Estimated regression equation

\[\widehat{y} = b_0 + b_1 \cdot x\]

weeks_lm <- lm(weight ~ weeks, 
               data = births_post28)
term        estimate  std_error  statistic  p_value  lower_ci  upper_ci
intercept     -5.003      0.582     -8.603        0    -6.144    -3.862
weeks          0.316      0.015     21.010        0     0.287     0.346


Write out the estimated regression equation!

How do you interpret the intercept value of -5.003?

How do you interpret the slope value of 0.316?

Obtaining Residuals


\(\widehat{weight} = -5.003+0.316 \cdot weeks\)


What would the residual be for a pregnancy that lasted 39 weeks with a baby that weighed 7.63 pounds?
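
Rather than computing fitted values and residuals by hand for every birth, the moderndive package provides get_regression_points(); a minimal sketch:

library(moderndive)

# Observed weight, fitted value (weight_hat), and residual for every birth
get_regression_points(weeks_lm)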

Linear Regression with One Categorical Explanatory Variable

Step 1: Finding distinct levels

distinct(births_post28, habit)
# A tibble: 2 × 1
  habit    
  <fct>    
1 nonsmoker
2 smoker   

Step 2: Fit a linear regression

habit_lm <- lm(weight ~ habit,
               data = births_post28)


Step 3: Obtain coefficient table

get_regression_table(habit_lm)
term           estimate  std_error  statistic  p_value  lower_ci  upper_ci
intercept         7.246      0.046    158.369    0.000     7.157     7.336
habit: smoker    -0.418      0.128     -3.270    0.001    -0.668    -0.167

🤔

Step 4: Write Estimated Regression Equation

term           estimate  std_error  statistic  p_value  lower_ci  upper_ci
intercept         7.246      0.046    158.369    0.000     7.157     7.336
habit: smoker    -0.418      0.128     -3.270    0.001    -0.668    -0.167


\[\widehat{weight} = 7.246 - 0.418 \cdot Smoker\]


But what does \(Smoker\) represent???

Categorical Explanatory Variables

\[ \widehat{y} = b_0 + b_1 \cdot x \]

\(x\) is a categorical variable with levels:

  • "nonsmoker"
  • "smoker"

We need to convert to:

  • a “baseline” group
  • “offsets” / adjustments to the baseline

Choosing a Baseline Group

term           estimate  std_error  statistic  p_value  lower_ci  upper_ci
intercept         7.246      0.046    158.369    0.000     7.157     7.336
habit: smoker    -0.418      0.128     -3.270    0.001    -0.668    -0.167


Based on the regression table, what habit group was chosen to be the baseline group?
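
You can also check this directly in R: by default, lm() uses the first level of the factor as the baseline group. A minimal sketch, including one way to change the baseline (fct_relevel() comes from forcats, which loads with the tidyverse):

# The first level listed is the baseline group
levels(births_post28$habit)

# To make "smoker" the baseline instead, move it to the front of the levels
births_smoker_base <- births_post28 %>% 
  mutate(habit = fct_relevel(habit, "smoker"))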

Rewriting in Terms of Indicator Variables

\[\widehat{weight} = 7.246 - 0.418 \cdot 1_{smoker}(x)\]

where

\(1_{smoker}(x) = 1\) if the mother was a "smoker"

\(1_{smoker}(x) = 0\) if the mother was a "nonsmoker"
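
Under the hood, R builds this indicator column for you; you can inspect it with base R's model.matrix() (a sketch):

# Each row gets a 1 in the habitsmoker column if the mother smoked, 0 otherwise
head(model.matrix(habit_lm))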

Obtaining Group Means

\[\widehat{weight} = 7.246 - 0.418 \cdot 1_{smoker}(x)\]

Given the equation, what is the estimated mean birth weight for nonsmoking mothers?


For smoking mothers?

Causal Inference

We just concluded that babies born to a "smoker" weigh, on average, about 0.42 pounds less than babies born to a "nonsmoker".



Can we conclude that smoking caused these babies to weigh less? Why or why not?

Midterm Project Preparation

Project Proposal

  1. Choose a dataset

  2. Choose one numerical response variable

  3. Choose one numerical explanatory variable

  4. Choose a second explanatory variable; it must be categorical

Checking values of your numerical variable(s)

Your numerical variable cannot have only a small number of distinct values (e.g., 2 or 3). You can use the distinct() function to determine the unique values of your variable. For example, running distinct(hbr_maples, year) reveals that year only has two values (2003 and 2004), meaning year is not eligible to be a numerical response or explanatory variable. It could, however, be a categorical explanatory variable!
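
As a concrete sketch (assuming the hbr_maples data referenced above comes from the lterdatasampler package):

library(lterdatasampler)

# year takes only two values, so treat it as categorical, not numerical
distinct(hbr_maples, year)

# n_distinct() counts the unique values directly
n_distinct(hbr_maples$year)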

  5. Write your Introduction

Reading a Full Research Paper

We are going to read through excerpts from the Abstract, Introduction, Methods, and Results sections of the Dengue Fever paper by Tuan et al. (2015).

Within these sections, you are asked to answer ten (10) questions about the design of the study, the statistical methods used, and the findings of the analyses.


This will guide you in thinking about the structure of your Midterm Project report!