Lab 7: Confidence Intervals for Water Temperature and Latitude

Author

The names of your group members here!

Data

Today we will explore the pie_crab dataset contained in the lterdatasampler R package. The data is from a study by Johnson et al. at the Plum Island Ecosystem Long Term Ecological Research site, studying the relationship between the size (carapace width) of a Fiddler Crab and the geographical location of its habitat. These data can be used to investigate if Bergmann’s Rule applies to Fiddler Crabs, or specifically that the size of a crab increases as the distance from the equator increases.

Motivation

The students who investigated this relationship for their Midterm Project found that when both latitude and water temperature are included as explanatory variables in the multiple regression model, the coefficient associated with water temperature doesn’t make sense. Namely, the model suggests warmer water temperatures are associated with larger crab sizes. However, we know that the water is warmer near the equator, which is where the crab sizes should be smaller. Rather perplexing!

The moral of the story is that water temperature and latitude are high correlated with each other, so including them both as explanatory variables leads to multicollinearity – something we do not want in our multiple linear regression.

Our Investigation

The focus of this lab is on quantifying the relationship between water temperature (response) and latitude (explanatory) for marshes (sites) along the Atlantic coast.

Cleaning the Data

The data contains information on at total of 392 Fiddler Crabs caught at 13 marshes on the Atlantic coast of the United States in summer 2016. However, at each marsh, there is only one recorded water temperature. Meaning, we need to collapse our dataset to have only one observation per marsh.

1. Fill in the code below to create a new dataset called marsh_info which has 13 observations – one per marsh.

marsh_info <- pie_crab %>% 
  group_by(____) %>% 
  slice_sample(n = 1) %>% 
  ungroup()

From this point forward, you should use the marsh_info dataset for EVERY problem. Keep in mind that you are no longer analyzing data on crabs! The dataset you have is on marshes along the Atlantic coast!

Visualizing Relationships

2. Create a scatterplot modeling the relationship between latitude (explanatory) and water temperature (response) for these 13 marshes. Don’t forget to add descriptive axis labels!

3. Describe the relationship you see in the scatterplot. Be sure to address the four aspects we discussed in class: form, direction, strength, and unusual points! Keep in mind that you are no longer analyzing data on crabs! The dataset you have is on marshes along the Atlantic coast!

Summarizing the Relationship

Now that you’ve visualized the relationship, let’s summarize this relationship with a statistic. Specifically, we are interested in the slope statistic, as it captures the relationship between latitude and water temperature.

4. Fill in the code below to calculate the observed slope for the relationship between the water temperature (response) and latitude (explanatory).

Note: Nothing will be output when you run this code!

obs_slope <- marsh_info %>% 
  specify(response = ____, 
          explanatory = ____) %>% 
  calculate(stat = ____)

Bootstrap Distribution

Now that we have the observed slope statistic, let’s see what variability we might get in the slope statistic for other samples (marshes) we might have gotten from the population (the Atlantic coast of the US).

As a refresher, when we use resampling to obtain our bootstrap distribution, our steps look like the following:

Step 1: specify() the response and explanatory variables

Step 2: generate() lots of bootstrap resamples

Step 3: calculate() for each of the generated() samples, calculate the statistic you are interested in

Let’s give this a try!

5. Fill in the code to generate 500 bootstrap slope statistics (from 500 bootstrap resamples).

bootstrap <- marsh_info %>% 
  specify(response = ____, 
          explanatory = ____) %>% 
  generate(reps = ____, 
           type = ____) %>% 
  calculate(stat = ____)

Alright, now that we have the bootstrap slope statistics, let’s see how it looks! Let’s use the visualize() function (not ggplot()!) to make a quick visualization of the statistics you calculated above.

6. Use the visualize() function to create a simple histogram of your 500 bootstrap statistics. It would be nice to change the x-axis label to describe what statistic is being plotted!

# Code to visualize bootstrap statistics

Confidence Interval

The next step to obtain our confidence interval! First we need to determine what percentage of statistics we want to keep in the confidence interval. 90%? 95%? 99%? 80%?

This seems like a study where we care a bit less about our interval capturing the true value, at least compared to something like a medical study. So, I think this could be a great instance to use a 90% confidence interval.

7. Use the get_confidence_interval() function to find the 90% confidence interval from your bootstrap distribution, using the percentile method!

# Code to obtain a 90% PERCENTILE based confidence interval

8. Interpret the confidence interval you obtained in #7. Make sure to include the context of the data and the population of interest!

Just for fun, let’s compare the confidence we obtained using a percentile method with an interval found using the SE method.

9. Use the get_confidence_interval() function to find the 90% confidence interval from your bootstrap distribution, using the SE method! Remember – with the SE method, you need to specify the point estimate!

# Code to obtain a 90% SE based confidence interval

10. How do your confidence intervals compare? Based on the shape of the bootstrap distribution, would you expect for these methods to yield similar results?

Hint: Think about the conditions for using the SE method to obtain a confidence interval!

Bootstrap Assumptions

A bootstrap distribution aims to simulate the variability we’d get from other samples from our population. However, the accuracy of these samples relies on the quality of our original sample.

11. Based on the information given, how do you feel about the assumption a bootstrap distribution makes about the original sample? What issues do you believe might prevent this assumption being appropriate?