Relationships Between Variables
Visualizing Linear Regression
Characterizing Relationships
Form (e.g. linear, quadratic, non-linear)
Direction (e.g. positive, negative)
Strength (how much scatter/noise?)
Unusual observations (do points not fit the overall pattern?)
Data for Today
The ncbirths dataset is a random sample of 1,000 cases taken from a larger dataset collected in North Carolina in 2004.
Each case describes the birth of a single child born in North Carolina, along with various characteristics of the child (e.g. birth weight, length of gestation, etc.), the child’s mother (e.g. age, weight gained during pregnancy, smoking habits, etc.) and the child’s father (e.g. age).
Your Turn!
How would you characterize this relationship?
It seems like pregnancies with a gestation less than 28 weeks have a non-linear relationship with a baby’s birth weight, so we will filter these observations out of our dataset.
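A minimal sketch of this filtering step, assuming the ncbirths data are loaded from the openintro package and that the filtered data are stored as births_post28 (the name used later in these slides):

```r
library(dplyr)
library(openintro)  # assumption: ncbirths is loaded from this package

# Keep only pregnancies that lasted at least 28 weeks.
# filter() also drops rows where weeks is missing (NA).
births_post28 <- ncbirths %>%
  filter(weeks >= 28)

# Compare the number of births before and after filtering
nrow(ncbirths)
nrow(births_post28)
```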
Change in scope of inference
Removing these observations narrows the population of births we are able to make inferences about! In this case, what population could we generalize our findings to?
Correlation:
strength and direction of a linear relationship between two quantitative variables
Anscombe Correlations
Four datasets, very different graphical presentations
For which of these relationships is correlation a reasonable summary measure?
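To see why the plots matter, the correlations can be computed directly. This sketch uses the anscombe data frame that ships with base R (columns x1 to x4 and y1 to y4); the name correlations is only illustrative.

```r
# anscombe is part of base R (datasets package), so no packages are needed
correlations <- c(
  set1 = cor(anscombe$x1, anscombe$y1),
  set2 = cor(anscombe$x2, anscombe$y2),
  set3 = cor(anscombe$x3, anscombe$y3),
  set4 = cor(anscombe$x4, anscombe$y4)
)

# All four datasets have a correlation of about 0.82,
# even though their scatterplots look very different
round(correlations, 2)
```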
What if I ran get_correlation(births_post28, weight ~ weeks) instead? Would I get the same value?
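One way to check is to run both versions and compare. The sketch below assumes ncbirths comes from the openintro package and rebuilds births_post28 so that it runs on its own; r_weight_weeks and r_weeks_weight are illustrative names.

```r
library(dplyr)
library(moderndive)
library(openintro)  # assumption: source of ncbirths

births_post28 <- ncbirths %>%
  filter(weeks >= 28)

# Correlation with weight on the left-hand side of the formula
r_weight_weeks <- get_correlation(births_post28, weight ~ weeks)

# Correlation with the two variables swapped
r_weeks_weight <- get_correlation(births_post28, weeks ~ weight)

# Correlation does not distinguish response from explanatory,
# so the two values should match
all.equal(r_weight_weeks$cor, r_weeks_weight$cor)
```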
Linear regression:
we assume the relationship between our response variable (\(y\)) and explanatory variable (\(x\)) can be modeled with a linear function, plus some random noise
\(response = intercept + slope \cdot explanatory + noise\)
Population Model
\(y = \beta_0 + \beta_1 \cdot x + \epsilon\)
\(y\) = response
\(\beta_0\) = population intercept
\(\beta_1\) = population slope
\(\epsilon\) = errors (random noise)
Sample Model
\(\widehat{y} = b_0 + b_1 \cdot x\)
\(b_0\) = sample intercept
\(b_1\) = sample slope
Why does this equation have a hat on \(y\)?
term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
---|---|---|---|---|---|---|
intercept | -5.003 | 0.582 | -8.603 | 0 | -6.144 | -3.862 |
weeks | 0.316 | 0.015 | 21.010 | 0 | 0.287 | 0.346 |
get_regression_table()
This function lives in the moderndive package, so we will need to load this package (e.g., library(moderndive)) if we want to use the get_regression_table() function.
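A minimal sketch of the full workflow, assuming the model is fit with lm() on the filtered births data (ncbirths from the openintro package, restricted to 28 or more weeks). The object name births_lm is illustrative.

```r
library(dplyr)
library(moderndive)
library(openintro)  # assumption: source of ncbirths

births_post28 <- ncbirths %>%
  filter(weeks >= 28)

# Fit a simple linear regression: birth weight explained by weeks
births_lm <- lm(weight ~ weeks, data = births_post28)

# Tidy table of coefficient estimates, standard errors, and intervals
get_regression_table(births_lm)
```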
Our focus (for now…)
Estimated regression equation
\[\widehat{y} = b_0 + b_1 \cdot x\]
Write out the estimated regression equation!
How do you interpret the intercept value of -5.003?
How do you interpret the slope value of 0.316?
Obtaining Residuals
\(\widehat{weight} = -5.003+0.316 \cdot weeks\)
What would the residual be for a pregnancy that lasted 39 weeks and whose baby weighed 7.63 pounds?
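As a worked sketch, taking the residual to be the observed value minus the predicted value, first plug 39 weeks into the estimated equation and then subtract the prediction from the observed weight:

\[\widehat{weight} = -5.003 + 0.316 \cdot 39 = 7.321\]

\[residual = weight - \widehat{weight} = 7.63 - 7.321 = 0.309\]

The residual is positive, so this baby weighed about 0.31 pounds more than the regression line predicts.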
Step 1: Find distinct levels
Step 2: Fit a linear regression
Step 3: Obtain coefficient table
term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
---|---|---|---|---|---|---|
intercept | 7.246 | 0.046 | 158.369 | 0.000 | 7.157 | 7.336 |
habit: smoker | -0.418 | 0.128 | -3.270 | 0.001 | -0.668 | -0.167 |
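The steps above can be sketched as follows, assuming the same filtered births data (ncbirths from the openintro package, 28 or more weeks). The object name habit_lm is illustrative, and the slides do not state which data frame produced the table above.

```r
library(dplyr)
library(moderndive)
library(openintro)  # assumption: source of ncbirths

births_post28 <- ncbirths %>%
  filter(weeks >= 28)

# Find the distinct levels of the categorical variable
distinct(births_post28, habit)

# Fit a linear regression with a categorical explanatory variable
habit_lm <- lm(weight ~ habit, data = births_post28)

# Obtain the coefficient table
get_regression_table(habit_lm)
```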
🤔
\[\widehat{weight} = 7.246 - 0.418 \cdot Smoker\]
But what does \(Smoker\) represent???
\(x\) is a categorical variable with levels:
"nonsmoker"
"smoker"
We need to convert to:
term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
---|---|---|---|---|---|---|
intercept | 7.246 | 0.046 | 158.369 | 0.000 | 7.157 | 7.336 |
habit: smoker | -0.418 | 0.128 | -3.270 | 0.001 | -0.668 | -0.167 |
Based on the regression table, what habit group was chosen to be the baseline group?
\[\widehat{weight} = 7.246 - 0.418 \cdot 1_{smoker}(x)\]
where
\(1_{smoker}(x) = 1\) if the mother was a "smoker"
\(1_{smoker}(x) = 0\) if the mother was a "nonsmoker"
\[\widehat{weight} = 7.246 - 0.418 \cdot 1_{smoker}(x)\]
Given the equation, what is the estimated mean birth weight for nonsmoking mothers?
For smoking mothers?
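As a worked sketch using the unrounded estimates from the regression table, set the indicator to 0 for a "nonsmoker" and to 1 for a "smoker":

\[\widehat{weight}_{nonsmoker} = 7.246 - 0.418 \cdot 0 = 7.246\]

\[\widehat{weight}_{smoker} = 7.246 - 0.418 \cdot 1 = 6.828\]

The intercept is the estimated mean for the baseline group, and the slope is the difference between the two group means.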
We just concluded that babies born to a "smoker" weigh, on average, about 0.4 pounds less than babies born to a "nonsmoker".
Can we conclude that smoking caused these babies to weigh less? Why or why not?
Choose a dataset
Choose one numerical response variable
Choose one numerical explanatory variable
Choose a second explanatory variable (it can be either numerical or categorical)
Checking values of your numerical variable(s)
Your numerical variable cannot have a small number of values (e.g., 2 or 3). You can use the distinct() function to determine the unique values of your variable. For example, by running distinct(hbr_maples, year) I would discover that year only has two values (2003 and 2004), meaning year is not eligible to be a numerical response or explanatory variable. It could, however, be a categorical explanatory variable!
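A short sketch of this check, assuming hbr_maples is loaded from the lterdatasampler package (the slides do not say where the dataset comes from). The name year_values is illustrative.

```r
library(dplyr)
library(lterdatasampler)  # assumption: source of hbr_maples

# List the unique values of year
year_values <- distinct(hbr_maples, year)
year_values

# Count how many unique values there are; a very small count
# suggests treating the variable as categorical instead
nrow(year_values)
```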