library(tidyverse)
library(lterdatasampler)
# New package from reading -- to get regression table
library(moderndive)
Lab 4: Simple Linear Regression
Data for Today
Today we’ll be working with data on lake ice duration for two lakes surrounding Madison, Wisconsin. This dataset contains information on the number of days of ice (ice duration) on each lake for years between 1851 and 2019. These data are stored in the ntl_icecover
dataset, which lives in the lterdatsampler package.
According to the EPA, lake ice duration can be an indicator of climate change. This is because lake ice is dependent on several environmental factors, so changes in these factors will influence the formation of ice on top of lakes. As a result, the study and analysis of lake ice formation can inform scientists about how quickly the climate is changing, and are critical to minimizing disruptions to lake ecosystems.
Inspecting the Data
Question 1 – How large is the ntl_icecover
dataset? (i.e. How many rows and columns does it have?)
# Code to answer question 1 goes here!
Visualize a Simple Linear Regression
Let’s start with tools to visualize and summarize linear regression.
Tools
- Visualize the relationship between x & y –
geom_point()
- Visualize the linear regression line –
geom_smooth()
We will be investigating the relationship between the ice_duration
of each lake and the year
.
Step 1
Question 2 – Make a scatterplot of the relationship between the ice_duration
(response) and the year
(explanatory). Be sure to make the axis labels look nice, including any necessary units!
# Code to answer question 2 goes here!
Question 3 – Describe the relationship you see in the scatterplot. Be sure to address the four aspects we discussed in class: form, direction, strength, and unusual points. Hint: You need to explicitly state where the unusual observations are!
Step 2
To add a regression line on top of a scatterplot, you add (+
) a geom_smooth()
layer to your plot. However, if you add a “plain” geom_smooth()
to the plot, it uses a wiggly line. You need to tell geom_smooth()
what type of smoother line you want for it to use! We can get a straight line by including method = "lm"
inside of geom_smooth()
.
Question 4 – Add a linear regression line to the scatterplot you made in Question 3. No code goes here, you need to modify your scatterplot from Question 3!
Fit a Simple Linear Regression Model
Next, we are going to summarize the relationship between ice_duration
and year
with a linear regression equation.
Tools
- Calculate the correlation between x & y –
get_correlation()
- Model the relationship between x & y –
lm()
- Explore coefficient estimates –
get_regression_table()
Step 1
Question 5 – Calculate the correlation between these variables, using the get_correlation()
function.
# Code to answer question 5 goes here!
Step 2
Next, we will “fit” a linear regression with the lm()
function. Remember, the “formula” for lm()
is response_variable ~ explanatory_variable
. Also recall that you need to tell lm()
where the data live using data =
argument!
Question 6 – Fit a linear regression modeling the relationship between between ice_duration
and year
. The only part you need to remove is the ...
! Keep the ice_lm <-
!
# Code to answer question 6 goes here!
<- ... ice_lm
Step 3
Finally, to get the regression equation, we need grab the coefficients out of the linear model object you made in Step 2. The get_regression_table()
function is a handy tool to do just that!
Question 7 – Use the get_regression_table()
function to obtain the coefficient estimates for the ice_lm
regression you fit in Question 6.
# Code to answer question 7 goes here!
Question 8 – Using the coefficient estimates above, write out the estimated regression equation. Your equation needs to be in the context of the variables, not in generic \(x\) and \(y\) statements!
Question 9 – Interpret the value of the slope coefficient. Your interpretation needs to be in the context of the variables!
Question 10 – Sometimes interpreting a 1-unit increase in the explanatory variable is not a meaningful change, so we instead use larger increases. Based on your slope interpretation from Q9, what do you expect to happen to the duration of ice cover for an increase of 100 years?
A preview of what’s to come
In our analysis above, we only looked at the relationship between ice duration and year, not accounting for which lake the measurements came from. That is another (categorical) explanatory variable we could include in our regression model!
Question 11 – Using the code you wrote for Question 4 (with the regression line added), add a color
for the name of the lake (lakeid
).
# Code to answer question 11 goes here!