Lab 3: Incorporating Categorical Variables

Author

Your group’s names here!

Getting started

R Resources

The Resource Manager should pull up the R resources from Canvas. Specifically, the “cheat sheets” from Weeks 2 & 3 will be very helpful while completing this assignment.

Load Packages

In this lab, we will explore and visualize the data using the tidyverse suite of packages, along with lterdatasampler (which stores the data) and ggridges (which draws ridge plots).

## Package for ggplot and dplyr tools
library(tidyverse)

## Package for ecological data
library(lterdatasampler)

## Package for density ridge plots
library(ggridges)

The data

In this lab we will work with data from the H.J. Andrews Experimental Forest. The following is a description of the data:

Populations of West Slope cutthroat trout (Oncorhynchus clarki clarki) in two standard reaches of Mack Creek in the H.J. Andrews Experimental Forest have been monitored since 1987. Monitoring of Pacific Giant Salamanders, Dicamptodon tenebrosus began in 1993. The two standard reaches are in a section of clearcut forest (ca. 1963) and an upstream 500-year-old coniferous forest. Sub-reaches are sampled with 2-pass electrofishing, and all captured vertebrates are measured and weighed. Additionally, a set of channel measurements are taken with each sampling. This study constitutes one of the longest continuous records of salmonid populations on record.

First, we’ll view the and_vertebrates dataframe where these data are stored.

and_vertebrates

Exploring the Dataset

The codebook (description of the variables) can be accessed by pulling up the help file by typing a ? before the name of the dataset:

?and_vertebrates

Question 1 – How large is the and_vertebrates dataset? (i.e. How many rows and columns does the dataset have?)

Question 2 – Are there categorical variables in the dataset? If so, what are their names? Hint: What data types are categorical variables stored as?
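If you are not sure how to check the size and column types of a data frame, here is a minimal sketch. The pets data frame is made up for illustration (it is not part of the lab data), so you will need to adapt the idea to and_vertebrates.

```r
library(tidyverse)

# Hypothetical toy data, invented for this example only
pets <- tibble(
  name   = c("Rex", "Tom", "Bubbles"),
  kind   = factor(c("dog", "cat", "fish")),
  weight = c(30.5, 4.2, 0.1)
)

# Number of rows, followed by number of columns
dim(pets)

# One line per column, showing its data type (<chr>, <fct>, <dbl>, ...)
glimpse(pets)
```

In this toy example, name is stored as a character, kind as a factor, and weight as a double.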

Accessing the Levels of a Variable

The species variable refers to the species of the animal which was captured. You can use the distinct() function to access the distinct values of a categorical variable (e.g., distinct(nycflights, carrier)). Notice the first input is the name of the dataset and the second input is the name of the categorical variable!
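Here is a small sketch of distinct() in action. It uses a made-up pets data frame (an assumption for illustration, not the lab data), so the same pattern should carry over once you swap in your own dataset and variable.

```r
library(tidyverse)

# Hypothetical toy data, invented for this example only
pets <- tibble(
  name = c("Rex", "Tom", "Bubbles", "Fido"),
  kind = c("dog", "cat", "fish", "dog")
)

# First input: the dataset. Second input: the categorical variable.
# Each value of kind appears once in the result, even though "dog" occurs twice.
pet_kinds <- distinct(pets, kind)
pet_kinds
```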

Question 3 – Use the distinct() function to discover the levels / values of the species variable.

Data Wrangling

Alright, you should have found that there is more than one species included in these data. For our analysis, we are only interested in Cutthroat trout.

This study used electrofishing to capture the trout. Electrofishing is a technique that passes direct-current electricity between a submerged cathode and anode to introduce an electric current into the water. This current stuns fish in a (hopefully) non-lethal manner so they can be captured for marking and measuring. Smaller fish are less affected by the current, so there is presumably a size of fish that is “uncatchable.” For our analysis, we are going to filter out fish that are less than 4 inches long, as fish this small are nearly impossible to capture.
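As a reminder of how filter() handles more than one condition, here is a minimal sketch on a made-up pets data frame (the data, the column names, and the 250 mm cutoff are all assumptions for illustration). It also shows the inch-to-millimeter conversion: 1 inch is 25.4 mm, so 4 inches is about 101.6 mm.

```r
library(tidyverse)

# Hypothetical toy data, invented for this example only
pets <- tibble(
  kind      = c("dog", "cat", "dog", "fish"),
  length_in = c(30, 18, 8, 2)
)

# Convert inches to millimeters (1 inch = 25.4 mm)
pets <- mutate(pets, length_mm = length_in * 25.4)

# Conditions separated by commas must ALL be true for a row to be kept:
# here, the pet must be a dog AND be at least 250 mm long
big_dogs <- filter(pets, kind == "dog", length_mm >= 250)
big_dogs
```

Note that == tests for equality, while >= means "greater than or equal to."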

Question 4 – Use the filter() function to include only observations on Cutthroat trout whose length_1_mm is at least 4 inches (approximately 101 mm). Replace only the ... with your code, and keep the trout <- assignment!

trout <- ...

Data Visualizations

Alright, now that we’ve gotten our data ready for analysis, let’s start with some visualizations.

Question 5 – Using ggplot(), create a visualization of the distribution of the lengths of the Cutthroat trout (from the trout dataset you filtered above). Hint: The smallest value on your plot should be 101 mm if you completed Question 4 correctly.

Question 6 – Name three possible sources of variation for the length of a Cutthroat trout.

Adding Categorical Variables

When we are interested in comparing the distribution of a numerical variable across groups of a categorical variable, we “typically” see people use faceted histograms or side-by-side boxplots. I believe an unsung hero of these types of comparisons is the ridge plot.

As introduced in Introduction to Modern Statistics, a ridge plot essentially has multiple density plots stacked in the same plotting window. A key feature of ridge plots is that a categorical variable is always on the y-axis, with a numeric variable on the x-axis.

In R, we use the geom_density_ridges() function from the ggridges package to create a ridge plot. Yes, this is new, but don’t worry! The function has the same layout as things you’ve seen before.
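To see the layout on data you are not being graded on, here is a minimal sketch using R's built-in iris dataset (chosen only for illustration). The axis label text is my own wording, with sepal length measured in centimeters.

```r
library(tidyverse)
library(ggridges)

# Numerical variable on the x-axis, categorical variable on the y-axis
ridge_plot <- ggplot(data = iris,
                     mapping = aes(x = Sepal.Length,
                                   y = Species)
                     ) +
  geom_density_ridges() +
  # labs() replaces the default axis labels with descriptive ones
  labs(x = "Sepal length (cm)",
       y = "Iris species")

ridge_plot
```

Each species gets its own density curve, stacked vertically so the centers and shapes are easy to compare.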

Question 7 – Fill in the code below to create a ridge plot comparing the lengths of Cutthroat trout between the different types of channels (unittype). Be sure to add nice axis labels to your plot, which describe the variables being plotted (and their units)!

ggplot(data = trout, 
       mapping = aes(x = <NUMERICAL VARIABLE>, 
                     y = <CATEGORICAL VARIABLE>)
       ) +
  geom_density_ridges()

Question 8 – Modify your plot from Question 7 to incorporate the section of the forest into your plot, using either color or facets. Hint: The fill aesthetic will fill the ridge plots with color (rather than just the outline).

Question 9 – Based on your plot, how different are the lengths of the Cutthroat trout between the different channel types and forest sections? Be sure to address how the centers and shapes of ALL these distributions compare!

Data Summaries

Paired with visualizations, summary statistics can provide a clearer picture for the comparisons we are interested in. To obtain summary statistics for different groups of a categorical variable, we need to use our friend the group_by() function.
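As a refresher, here is a minimal sketch of group_by() paired with summarise(), first with one grouping variable and then with two. The pets data frame and its column names are made up for illustration and are not part of the lab data.

```r
library(tidyverse)

# Hypothetical toy data, invented for this example only
pets <- tibble(
  kind      = c("dog", "dog", "cat", "cat"),
  home      = c("city", "farm", "city", "city"),
  weight_kg = c(30, 20, 4, 6)
)

# One grouping variable: one mean per kind of pet
by_kind <- pets %>%
  group_by(kind) %>%
  summarise(mean_weight = mean(weight_kg))

# Two grouping variables: one mean per kind-and-home combination
by_kind_home <- pets %>%
  group_by(kind, home) %>%
  summarise(mean_weight = mean(weight_kg), .groups = "drop")

by_kind
by_kind_home
```

Only combinations that actually appear in the data get a row, so by_kind_home has no row for cats living on a farm.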

Question 10 – Find the average length of Cutthroat trout from the different channel types (unittype). Be sure to use the trout dataset from Question 4!

Question 11 – Find the average length of Cutthroat trout from the different channel types (unittype) and forest section. Be sure to use the trout dataset from Question 4!

Question 12 – How do these averages compare with what you saw in your visualization in Question 8? Why do you believe they are similar or different from what you saw in the visualizations? Hint: Your response should directly compare the statistics you got in Question 11 with the density ridges you saw in Question 8.