## Package for ggplot and dplyr tools
library(tidyverse)
## Package for ecological data
library(lterdatasampler)
## Package for density ridge plots
library(ggridges)
Lab 3: Incorporating Categorical Variables
Getting started
R Resources
You should have at least one member of your lab group pull up the R resources from Canvas. Specifically, the “cheat sheets” from Weeks 2 & 3 will be very helpful while completing this assignment.
Load packages
In this lab, we will explore and visualize the data using packages housed in the tidyverse suite of packages.
The data
In this lab we will work with data from the H.J. Andrews Experimental Forest. The following is a description of the data:
Populations of West Slope cutthroat trout (Onchorhyncus clarki clarki) in two standard reaches of Mack Creek in the H.J. Andrews Experimental Forest have been monitored since 1987. Monitoring of Pacific Giant Salamanders, Dicamptodon tenebrosus began in 1993. The two standard reaches are in a section of clearcut forest (ca. 1963) and an upstream 500 year old coniferous forest. Sub-reaches are sampled with 2-pass electrofishing, and all captured vertebrates are measured and weighed. Additionally, a set of channel measurements are taken with each sampling. This study constitutes one of the longest continuous records of salmonid populations on record.
First, we’ll view the and_vertebrates
dataframe where these data are stored.
View(and_vertebrates)
Exploring the Dataset
The codebook (description of the variables) can be accessed by pulling up the help file by typing a ?
before the name of the dataset:
?and_vertebrates
Question 1 – How large is the and_vertebrates
dataset? (i.e. How many rows and columns does the dataset have?)
## Your code for question 1 (and 2) goes here!
Question 2 – Are there categorical variables in the dataset? If so, what are their names? Hint: What data types are categorical variables stored as?
Accessing the Levels of a Variable
The species
variable refers to the species of the animal which was captured. You can use the distinct()
function to access the distinct values of a categorical variable (e.g., distinct(nycflights, carrier)
). Notice the first input is the name of the dataset and the second input is the name of the categorical variable!
Question 3 – Use the distinct()
function to discover the levels / values of the species
variable.
## Your code for question 3 goes here!
Data Wrangling
Alright, you should have found that there is more than one species included in these data. For our analysis, we are only interested in Cutthroat trout.
This study used electrofishing to capture the trout. Electrofishing is a technique that uses direct current electricity flowing between a submerged cathode and anode, to insert an electric current into the water. This current stuns fish in a (hopefully) non-lethal manner, in order to capture them for marking and measuring. Technically, smaller fish are less affected by the current, so there presumably is a size of fish that is “uncatchable”. For our analysis, we are going to filter out fish that are less than 4 inches long, as this size of fish is nearly impossible to capture.
Question 4 – Use the filter()
function to include only observations on Cutthroat trout whose length_1_mm
is greater than 4 inches (or 101 mm). The only part you need to remove is the ...
! Keep the trout <-
!
<- ... trout
Data Visualizations
Alright, now that we’ve gotten our data ready for analysis, let’s start with some visualizations
Question 5 – Using ggplot()
create a visualization of the distribution of the lengths of the Cutthroat trout (from the trout
dataset you filter
ed above). Your plot should have axis labels that describe the variable being plotted, and its associated units!
Keep in mind the smallest value on your plot should be 101mm if you completed #4 correctly.
## Your code for question 6 goes here!
Question 6 – Name three possible sources of variation for the length of a Cutthroat trout.
Adding Categorical Variables
When we are interested in comparing the distribution of a numerical variable across groups of a categorical variable, we “typically” see people use stacked histograms or side-by-side boxplots. I believe an unsung hero of these types of comparisons is the ridge plot.
As introduced in Introduction to Modern Statistics, a ridge plot essentially has multiple density plots stacked in the same plotting window. A key feature of ridge plots is a categorical variable is always on the y-axis, with a numeric variable on the x-axis.
In R, we use the geom_density_ridges()
function from the ggridges package to create a ridge plot. Yes, this is new, but don’t worry! The function has the same layout as things you’ve seen before.
Question 7 – Fill in the code below to create a ridge plot comparing the lengths of Cutthroat trout between the different types of channels (unittype
). Be sure to add nice axis labels to your plot, which describe the variables being plotted (and their units)!
ggplot(data = trout,
mapping = aes(x = <NUMERICAL VARIABLE>,
y = <CATEGORICAL VARIABLE>)
+
) geom_density_ridges()
Question 8 – Modify your plot from #7 to incorporate the section
of the forest into your plot, using either color
or facet
s. Hint: The fill
aesthetic will fill the ridge plots with color.
Question 9 – Based on your plot, how different are the lengths of the Cutthroat trout between the different channel types and forest sections? Be sure to address how the centers and shapes of ALL these distributions compare!
Data Summaries
Paired with visualizations, summary statistics can provide a clearer picture for the comparisons we are interested in. To obtain summary statistics for different groups of a categorical variable, we need to use our friend the group_by()
function.
Question 10 – Find the average length of Cutthroat trout from the different channel types (unittype
). Be sure to use the trout
dataset from Question 4!
## Your code for question 11 goes here!
Question 11 – Find the average length of Cutthroat trout from the different channel types (unittype
) and forest section
. Be sure to use the trout
dataset from Question 4!
## Your code for question 12 goes here!
Question 12 – How do the differences in these averages compare with what you saw in your visualization in Question 9? Why do you believe they are similar or different from what you saw in the visualizations? Hint: Your response should directly compare the statistics you got in Question 11 with the density ridges you saw in Question 8.