Incorporating Categorical Variables

Lab 1

Revisions

“Complete” = Satisfactory

  • You provided responses to every question
  • You correctly identified the observational unit for the mpg dataset

“Incomplete” = Growing

  • You did not provide responses to every question
  • You did not correctly identify the observational unit for the mpg dataset

Completing Revisions

Lab 1 revisions are due by Friday, April 18 (at midnight).

  1. Read comments on Canvas.
  2. Go back into your Lab 1 on Posit Cloud and complete your revisions.
  3. Render your revised Lab 1.
  4. Download your revised HTML.

Reflections

Revisions are required to be accompanied with reflections on what you learned while completing your revisions. Reflections must be written in your Lab 1 Quarto file (next to the problems you revised).

Categorical Variables

How would you describe a categorical variable?

In R

categorical variables can have either character or factor data types


factor – structured & fixed number of levels / options

  • can be ordered or unordered


character – unstructured & variable number of levels

  • is inherently unordered

Fill in the associated data types (e.g. character, factor, integer, double) with each type of variable.

Variable Data Type in R
Categorical variable
Continuous numerical variable
Discrete numerical variable

dplyr – a tool bag for data wrangling

filter()

select()

mutate()

summarize()

arrange()

group_by()

Brainstorm definitions for each verb

filter()

select()

mutate()

group_by()

summarize()

arrange()

The Pipe A hex of the pipe operator from the magrittr package. The pipe operator is a % followed by a > followed by another %, which is used to chain data wrangling operations together.

A cartoon-style diagram of data processing using pipe-shaped illustrations. The diagram begins with 'data' flowing into pipes labeled with various data transformation functions from the dplyr package: 'filter,' 'mutate,' 'group_by,' and 'summarize.' Each function name is positioned next to a differently shaped or colored pipe, representing how data is modified at each step. The background is teal, and the function names are written in white monospaced font with arrows indicating the flow direction.

The Pipe in Action!

If you wanted means for each level of a categorical variable, what would you do?

Trout Size

The HJ Andrews Experimental Forest houses one of the larges long-term ecological research stations, specifically researching cutthroat trout and salamanders in clear cut or old growth sections of Mack Creek.

trout %>% 
  group_by(section) %>% 
  summarize(
    mean_length = mean(length_1_mm, 
                       na.rm = TRUE)
            )
# A tibble: 2 × 2
  section                               mean_length
  <chr>                                       <dbl>
1 clear cut forest                             85.3
2 upstream old growth coniferous forest        81.4


Why na.rm = TRUE?

Classifying Channel Types

The channels of the Mack Creek which were sampled were classified into the following groups:

"C"

"I"

"IP"

"P"

"R"

"S"

"SC"

NA

cascade

riffle

isolated pool

pool

rapid

step (small falls)

side channel

not sampled by unit

filter()-ing Specific Channel Types

The majority of the Cutthroat trout were captured in cascades (C), pools (P), and side channels (SC). Suppose we want to only retain these levels of the unittype variable.

What would you do?


trout %>% 
  filter(
    unittype %in% c("C", "P", "SC")
         )


When we have more than one value we want to keep, we need to use %in% not ==!

Incorporating Categorical Variables into Data Visualizations

  • As a variable on the x- or y-axis

  • As a color / fill

  • As a facet

Salamander Size

ggplot(data = salamander, 
       mapping = aes(x = length_2_mm)) + 
  geom_histogram(binwidth = 14, color = "white") + 
  labs(x = "Snout to Tail Length of Salamander (mm)")

How would this histogram look if there was no variation in salamander length?


What are possible causes for the variation in salamander length?

Faceted Histograms

ggplot(data = salamander, 
       mapping = aes(x = length_2_mm)) + 
  geom_histogram(binwidth = 14, color = "white") + 
  facet_wrap(~ section, scales = "free") +
  labs(x = "Snout to Tail Length of Salamander (mm)")

Side-by-Side Boxplots

ggplot(data = salamander, 
       mapping = aes(x = length_1_mm, 
                     y = species)
         ) + 
  geom_boxplot() + 
  labs(x = "Snout to Tail Length (mm)", 
       y = "") 

ggplot(data = salamander, 
       mapping = aes(y = length_1_mm, 
                     x = species)
         ) + 
  geom_boxplot() + 
  labs(y = "Snout to Tail Length (mm)", 
       x = "")

Colors in Boxplots

ggplot(data = salamander, 
       mapping = aes(x = length_1_mm, 
                       y = species, 
                       color = unittype)
         ) + 
  geom_boxplot() + 
  labs(x = "Snout to Tail Length (mm)", 
       y = "", 
       color = "Channel Type")

Facets & Colors in Boxplots

ggplot(data = salamander, 
       mapping = aes(x = length_1_mm, 
                       y = species, 
                       color = section)
         ) + 
  geom_boxplot() + 
  facet_wrap(~ unittype) + 
  labs(x = "Snout to Tail Length (mm)", 
       y = "", 
       color = "Section in Mack Creek")

Facets & Colors in Boxplots

Facets & Color in Scatterplots

ggplot(data = salamander, 
       mapping = aes(x = length_1_mm, 
                       y = weight_g, 
                       color = section)
         ) + 
  geom_point(alpha = 0.25) + 
  facet_wrap(~species) +
  labs(y = "Snout to Tail Length (mm)", 
       x = "Weight (g)", 
       color = "Section in Mack Creek")

Facets & Color in Scatterplots

Your Turn – 90 seconds

What are the aesthetics included in this plot?

What is one aspect of this plot that you believe is well done? What is one aspect of the plot that could be improved?

Statistical Critique

Your Tasks

  1. Find your visualizations
  • Visualization from the New York Times
  • Visualization from your chosen scientific article
  1. Discuss the aesthetics included in the plots
  2. Discuss what you believe the plot does well
  3. Discuss what improvements could be made to make the plot more clear

Reading a Research Paper

Your Tasks

Read the Background and Methods in the summary at the beginning of the Tuan et al. paper. Then answer the following questions:

  1. What data is involved in this research study?

  2. What are the research questions / research goals?