Tidy Data, Importing Data & More Advanced Graphics

Thursday, October 1

Today we will…

Debrief PA 1
Debrief Lab 1
- Content Related to Lab 2
New Material
- Tidy Data
- Load External Data
- Graphical Perception
- Colors in ggplot
Lab 2: Exploring Rodents with ggplot2
- Using External Resources

PA 1: Using Data Visualization to Find the Penguins

Code

ggplot(data = penguins) +
  geom_point(mapping = aes(x = bill_length_mm, y = bill_depth_mm,
                           color = species, shape = island)) +
  labs(title = "Relationship Between Bill Length and Bill Depth",
       x = "Bill Length (mm)",
       y = "Bill Depth (mm)")

Code

ggplot(data = penguins,
       mapping = aes(x = bill_length_mm, 
                     y = bill_depth_mm, 
                     color = species)) +
  geom_point() + 
  facet_wrap(~island) +
  labs(x = "Bill Length",
       y = "", 
       title = "Changes in Bill Depth versus Length")

Some things I also noticed…

Code chunk options need a space after the #| (#| label: not #|label:).
Make sure your code is visible (echo: true)!
Make sure you remove the messages and warnings from your final document!
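
As a minimal sketch of these three points, the contents of an R chunk might look like the lines below. The label name and the palmerpenguins package are assumptions for illustration, and setting message and warning to false is one common way to keep that output out of the rendered document.

```r
#| label: bill-scatterplot
#| echo: true
#| message: false
#| warning: false

# Every option line above has a space between `#|` and the option name.
library(ggplot2)
library(palmerpenguins) # assumed source of the penguins data

# Rows with missing bill measurements trigger a warning, which the
# `warning: false` option keeps out of the final document.
ggplot(data = penguins,
       mapping = aes(x = bill_length_mm,
                     y = bill_depth_mm)) +
  geom_point()
```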

Did you notice that your figures were not included in the HTML your group submitted to Canvas?

By default, Quarto does not embed plots in the HTML document. Instead, it creates a “PA-2_files” folder which stores all your plots.

So, when you submit your HTML file, your plots are not included! How do we fix this????

Add an embed-resources: true line to your YAML (at the beginning of your document)!

---
title: "PA 2: Using Data Visualization to Find the Penguins"
author: "Dr. T!"
format: html
editor: source
embed-resources: true
---

Lab 1

Grading / Feedback

Each question will earn a score of “Success” or “Growing”.
- Questions marked “Growing” will receive feedback on how to improve your solution.
- These questions can be resubmitted for additional feedback.

Earning a “Success” doesn’t necessarily mean your solution is without error.
- You may still receive feedback on how to improve your solution.
- These questions cannot be resubmitted for additional feedback.

Growing Points

Q2: All your code should be visible!
Q9: Captions should include more information than what is already present in the plot!
Q11: There’s no need to save intermediate objects if they are never used later!

How this translates into Lab 2…

Every lab and challenge is expected to use code-folding.
There should be no messages / warnings output in your final rendered HTML.
You should reduce the amount of “intermediate object junk” in your workspace.
- Ask yourself, do I need to use this later?
- If the answer is no, then you should not save that object.

Tidy Data

Artwork by Allison Horst

Same Data, Different Formats

Different formats of the data are tidy in different ways.

Option 1

Team	Points	Assists	Rebounds
A	88	12	22
B	91	17	28
C	99	24	30
D	94	28	31

Option 2

Team	Statistic	Value
A	Points	88
A	Assists	12
A	Rebounds	22
B	Points	91
B	Assists	17
B	Rebounds	28
C	Points	99
C	Assists	24
C	Rebounds	30
D	Points	94
D	Assists	28
D	Rebounds	31
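
The two formats hold the same information, so one can be computed from the other. As a sketch, assuming the tidyr package (not named on this slide) and borrowing the object names bb_wide and bb_long from the plotting code, pivot_longer() stacks the three statistic columns into name / value pairs:

```r
library(tidyr)

# Option 1: one row per team, one column per statistic
bb_wide <- data.frame(
  Team     = c("A", "B", "C", "D"),
  Points   = c(88, 91, 99, 94),
  Assists  = c(12, 17, 24, 28),
  Rebounds = c(22, 28, 30, 31)
)

# Option 2: one row per team-statistic pair
bb_long <- pivot_longer(
  bb_wide,
  cols      = c(Points, Assists, Rebounds),
  names_to  = "Statistic",
  values_to = "Value"
)

bb_long
```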

Connection to ggplot

Let’s make a plot of each team’s statistics!

Option 1 - Wide Data

Code

ggplot(data = bb_wide, 
       mapping = aes(x = Team)
       ) +
  geom_point(mapping = aes(y = Points, 
                           color = "Points"), 
             size = 4) +
  geom_point(mapping = aes(y = Assists, 
                           color = "Assists"), 
             size = 4) +
  geom_point(mapping = aes(y = Rebounds, 
                           color = "Rebounds"), 
             size = 4) + 
  scale_colour_manual(
    values = c("darkred", "steelblue", "forestgreen")
  ) +
  labs(color = "Statistic")

Option 2 - Long Data

Code

ggplot(data = bb_long, 
       mapping = aes(x = Team, 
                     y = Value, 
                     color = Statistic)
       ) +
  geom_point(size = 4) + 
  scale_colour_manual(
    values = c("darkred", "steelblue", "forestgreen")
  ) +
  labs(color = "Statistic")
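
Both versions pass an unnamed vector to scale_colour_manual(), so ggplot2 hands out the colors in the order of the scale's levels, which for text values is usually alphabetical (Assists, Points, Rebounds). If you want to decide which statistic gets which color, a named vector makes the pairing explicit. The sketch below rebuilds bb_long by hand so it runs on its own; the particular color assignment is only an illustration.

```r
library(ggplot2)

# Long-format data from the Option 2 table
bb_long <- data.frame(
  Team      = rep(c("A", "B", "C", "D"), each = 3),
  Statistic = rep(c("Points", "Assists", "Rebounds"), times = 4),
  Value     = c(88, 12, 22, 91, 17, 28, 99, 24, 30, 94, 28, 31)
)

# Names tie each color to a specific statistic (illustrative choice)
stat_colors <- c(Points   = "darkred",
                 Assists  = "steelblue",
                 Rebounds = "forestgreen")

ggplot(data = bb_long,
       mapping = aes(x = Team,
                     y = Value,
                     color = Statistic)) +
  geom_point(size = 4) +
  scale_colour_manual(values = stat_colors) +
  labs(color = "Statistic")
```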

Tidy Data

An illustration featuring a cute, cartoonish scene with three characters sitting on a bench. In the center, there is a smiling blue rectangular character resembling a tidy data table, holding an ice cream cone. On either side of the table are two round, fluffy creatures: one pink on the left and one green on the right, both also holding ice cream cones. Above the characters, the text reads 'make friends with tidy data.' The overall tone of the image is friendly and inviting, encouraging positive feelings toward tidy data.

Artwork by Allison Horst

Working with External Data

Common Types of Data Files

Look at the file extension for the type of data file.

.csv : “comma-separated values”

Name, Age
Bob, 49
Joe, 40

.xls, .xlsx: Microsoft Excel spreadsheet

Common approach: save as .csv
Nicer approach: use the readxl package

.txt: plain text

Could have any sort of delimiter…
Need to let R know what to look for!

Loading External Data

Using base R functions:

read.csv() is for reading in .csv files.
read.table() and read.delim() are for any data with “columns” (you specify the separator).
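
A small sketch of the base R functions, using temporary files with made-up contents modeled on the Name / Age example. The pipe character is just one possible separator, chosen here for illustration.

```r
# Write two tiny example files to a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Name,Age", "Bob,49", "Joe,40"), csv_path)

txt_path <- tempfile(fileext = ".txt")
writeLines(c("Name|Age", "Bob|49", "Joe|40"), txt_path)

# read.csv() assumes commas and a header row
people_csv <- read.csv(csv_path)

# read.table() has to be told about the header row and the separator
people_txt <- read.table(txt_path, header = TRUE, sep = "|")

people_csv
people_txt
```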

Loading External Data

The tidyverse has some cleaned-up versions in the readr and readxl packages:

read_csv() is for comma-separated data.
read_tsv() is for tab-separated data.
read_table() is for white-space-separated data.
read_delim() is for any data with “columns” (you specify the separator). The above are special cases.
read_xls() and read_xlsx() are specifically for dealing with Excel files.

Remember to load the readr and readxl packages first! (library(tidyverse) loads readr, but readxl has to be loaded on its own.)
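
The same idea with the tidyverse readers, again as a sketch on made-up temporary files. The Excel step uses an example workbook that ships with readxl purely as a stand-in for a real file, and show_col_types = FALSE is only there to quiet the column-type message.

```r
library(readr)
library(readxl)

# Tiny example files with made-up contents
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Name,Age", "Bob,49", "Joe,40"), csv_path)

pipe_path <- tempfile(fileext = ".txt")
writeLines(c("Name|Age", "Bob|49", "Joe|40"), pipe_path)

# read_csv() is the comma-separated special case
people_csv <- read_csv(csv_path, show_col_types = FALSE)

# read_delim() is the general case: you name the separator
people_pipe <- read_delim(pipe_path, delim = "|", show_col_types = FALSE)

# Stand-in Excel file bundled with readxl (not course data)
xlsx_path <- readxl_example("datasets.xlsx")
excel_data <- read_xlsx(xlsx_path)
```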

What’s the difference?

Graphics

Structure: boxplot, scatterplot, etc.
Aesthetics: features such as color, shape, and size that map other variables to structural features.

Both the structure and aesthetics should help viewers interpret the information.

Pre-attentive Features

The next slide will have one point that is not like the others.

Raise your hand when you notice it.

Pre-attentive Features

A white background scattered with various green triangles and a single green circle. The triangles are evenly spaced but randomly oriented and positioned, with no apparent pattern or alignment. The single circle stands out among the triangles due to its different shape.

Pre-attentive Features

A white background scattered with various red circles and a single green circle. The circles are evenly spaced but randomly oriented and positioned, with no apparent pattern or alignment. The single circle stands out among the other circles due to its different color.

Pre-attentive Features

Pre-attentive features are features that we see and perceive before we even think about them.

They will jump out at us in less than 250 ms.
E.g., color, form, movement, spatial location.

There is a hierarchy of features:

Color is stronger than shape.
Combinations of pre-attentive features may not be pre-attentive due to interference.

Gestalt Principles

Gestalt Hierarchy	Graphical Feature
1. Enclosure	Facets
2. Connection	Lines
3. Proximity	White Space
4. Similarity	Color/Shape

Implications for practice:

Know that we perceive some groups before others.
Design to facilitate and emphasize the most important comparisons.

Double Encoding

No Double Encoding

Color

Color, hue, and intensity are pre-attentive features, and bigger contrasts lead to faster detection.
- Hue: main color family (red, orange, yellow…)
- Intensity: amount of color

This image illustrates the concept of color intensity. It shows a yellow square labeled 'YELLOW (HUE)' on the left, a gray square labeled 'GRAY' in the middle, and a square of desaturated yellow labeled 'LESS INTENSE' on the right. The image conveys that adding gray to a pure hue, such as yellow, results in a color that is less intense or more muted. The mathematical symbols '+' and '=' are used to show the combination of the yellow hue with gray, leading to a less intense color.

Color Guidelines

Do not use rainbow color gradients!
Be conscious of what certain colors “mean”.
- Good idea to use red for “good” and green for “bad”?

This image is a color-coded map of Texas, showing the percentage of people in each county who identify as white. The map uses a rainbow gradient scale to represent different percentages: red and orange for lower percentages (0% to 25%), transitioning through yellow and green (around 50%), to blue and purple for higher percentages (75% to 100%). Each county in Texas is colored according to where it falls on this scale, indicating the variation in racial identification across the state. The legend at the bottom of the map clarifies the percentage range associated with each color.

This image is a bar chart comparing the energy sources of six countries: Mexico, Brazil, Turkey, Russia, Indonesia, and China. Each bar represents a country and is divided into two color-coded segments: red for 'Low-carbon sources' and green for 'Fossil fuels sources.' The chart shows the proportion of energy generated from each source in each country. The purpose of this image is not to demonstrate what fuels each country is using but to highlight how the graph uses red and green hues to separate the two types of fuel. Not only are these colors difficult for some people's eyes to differentiate, but the plot has swapped what we would ordinarily think of as 'good' and 'bad' colors with fuels that are worse and better for the planet.

Color Guidelines

For categorical data, try not to use more than 7 colors:

This image is a rectangular strip divided into seven equal vertical sections, each filled with a different solid color. The colors from left to right are: Red, Blue, Green, Purple, Orange, Yellow, Brown. Each section is distinctly separated by thin black lines, with no gradients or transitions between the colors. The image represents a simple color spectrum or palette.

If you need to, you can use colorRampPalette() (from base R's grDevices package) together with a palette from the RColorBrewer package to produce larger palettes:

This image is a rectangular strip divided into 19 equal vertical sections, each filled with a different solid color. The colors from left to right are red, maroon, dark blue, teal, green, light green, gray, purple, dark maroon, orange, light orange, yellow, light yellow, gold, brown, pink, light pink, mauve, and light gray. Each section is distinctly separated by thin black lines, with no gradients or transitions between the colors. The image represents an extended color spectrum or palette with a nuanced range of colors.
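
A sketch of how a larger palette like this can be built. colorRampPalette() is available in a standard R installation (it lives in grDevices) and interpolates between whatever colors you hand it. The "Set1" palette is an assumption for illustration; 19 matches the number of swatches in the strip described above.

```r
library(RColorBrewer)

# Start from a small qualitative palette (Set1 has at most 9 colors)
base_colors <- brewer.pal(n = 9, name = "Set1")

# colorRampPalette() returns a function that interpolates between them
make_palette <- colorRampPalette(base_colors)

# Ask that function for as many colors as you need
big_palette <- make_palette(19)
big_palette
```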

Color Guidelines

For quantitative data, use mappings from data to color that are numerically and perceptually uniform.
- Relative discriminability of two colors should be proportional to the difference between the corresponding data values.
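
One option, sketched below: the viridis scales built into ggplot2 are designed to be perceptually uniform. The palmerpenguins package and the choice of body_mass_g as the colored variable are assumptions for illustration.

```r
library(ggplot2)
library(palmerpenguins) # assumed source of the penguins data

uniform_plot <- ggplot(data = penguins,
                       mapping = aes(x = bill_length_mm,
                                     y = bill_depth_mm,
                                     color = body_mass_g)) +
  geom_point() +
  scale_colour_viridis_c() + # continuous viridis color scale
  labs(x = "Bill Length (mm)",
       y = "Bill Depth (mm)",
       color = "Body Mass (g)")

uniform_plot
```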

Color Guidelines

To make your graphic color deficiency friendly…

use double encoding - when you use color, also use another aesthetic (line type, shape, facet, etc.).

This image features a pattern of small shapes scattered across a white background. The shapes are orange triangles and green circles, distributed randomly throughout the image. The triangles and circles do not overlap and are spaced unevenly, creating a dispersed, non-repetitive pattern. The image appears to represent a simple, abstract design with two distinct shapes and colors.
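
In ggplot2, double encoding can be as simple as mapping the same variable to two aesthetics. A minimal sketch, assuming the palmerpenguins data:

```r
library(ggplot2)
library(palmerpenguins) # assumed source of the penguins data

double_encoded <- ggplot(data = penguins,
                         mapping = aes(x = bill_length_mm,
                                       y = bill_depth_mm,
                                       color = species,
                                       shape = species)) +
  geom_point() +
  # Giving both legends the same title lets ggplot2 merge them into one
  labs(x = "Bill Length (mm)",
       y = "Bill Depth (mm)",
       color = "Species",
       shape = "Species")

double_encoded
```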

Color Guidelines

To make your graphic color deficiency friendly…

with a unidirectional scale (e.g., all + values), use a monochromatic color gradient.

This image features a horizontal gradient transitioning smoothly from light to dark blue. On the left side, the gradient starts with a very light, almost white blue and gradually deepens in color, moving through medium blue tones in the center, and finally ending with a deep, dark blue on the right side. The gradient creates a seamless transition across the spectrum of blue hues, visually representing the full range of the color from light to dark.

with a bidirectional scale (e.g., + and - values), use a purple-white-orange color gradient. Transition through white!

This image features a horizontal gradient transitioning between two different color families. On the left side, the gradient starts with a deep purple and gradually lightens to a pale lavender near the center. The gradient then transitions to white in the middle, which serves as a dividing point between the two color families. From the center to the right, the gradient shifts into warm tones, moving from a light peach to a deeper orange, and finally ending with a rich brown on the far right. This gradient smoothly blends the cool purples with the warm oranges, creating a balanced and visually appealing transition between the two color spectrums.
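
Both kinds of gradient can be sketched with ggplot2's built-in scales. The data below are made up, and the specific color names are one reasonable choice rather than the exact colors in the images.

```r
library(ggplot2)

# Made-up values, used only to show the two kinds of scale
change_data <- data.frame(
  county = LETTERS[1:6],
  growth = c(2, 5, 9, 14, 20, 27),  # all positive: unidirectional
  change = c(-12, -6, -1, 3, 8, 15) # positive and negative: bidirectional
)

# Unidirectional: a single hue running from light to dark
one_direction <- ggplot(data = change_data,
                        mapping = aes(x = county, y = growth, fill = growth)) +
  geom_col() +
  scale_fill_gradient(low = "lightblue", high = "darkblue")

# Bidirectional: two hues that meet at white, centered on zero
two_directions <- ggplot(data = change_data,
                         mapping = aes(x = county, y = change, fill = change)) +
  geom_col() +
  scale_fill_gradient2(low = "purple4", mid = "white",
                       high = "darkorange3", midpoint = 0)
```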

Color Guidelines

To make your graphic color deficiency friendly…

print your chart out in black and white – if you can still read it, it will be safe for all users.

This image is a bar chart titled 'How Often Teens Say They Use Each Platform,' depicting the frequency of usage of various social media platforms by teenagers. The platforms listed from top to bottom are YouTube, TikTok, Snapchat, Instagram, and Facebook. Each platform's usage is broken down into five categories, represented as different segments in the bar: Almost constantly (dark blue), Several times a day (medium blue), About once a day (light blue), Less often (lightest blue), Don’t use (gray).

This image is a copy of the previous image, but the blue tones have been replaced with grey tones.

Color in ggplot2

There are several packages with color scheme options:

RColorBrewer
ggsci
viridis
wesanderson

These packages have color palettes that are aesthetically pleasing and, in many cases, color deficiency friendly.
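
As a sketch of how one of these might be used: RColorBrewer records which of its palettes are color deficiency friendly, and ggplot2's scale_colour_brewer() can apply any of them by name. The "Dark2" palette and the palmerpenguins data are assumptions for illustration.

```r
library(ggplot2)
library(RColorBrewer)
library(palmerpenguins) # assumed source of the penguins data

# Palettes that RColorBrewer flags as color deficiency friendly
friendly <- brewer.pal.info[brewer.pal.info$colorblind, ]
rownames(friendly)

# Apply one of them to a categorical variable
brewer_plot <- ggplot(data = penguins,
                      mapping = aes(x = bill_length_mm,
                                    y = bill_depth_mm,
                                    color = species)) +
  geom_point() +
  scale_colour_brewer(palette = "Dark2")

brewer_plot
```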

You can also take a look at other ways to find nice color palettes.

Lab 2: Exploring Rodents with ggplot2 & Challenge 2: Spicing things up with ggplot2

Peer Code Review

Starting with Lab 2, your labs will have an appearance / code format portion.

Review the code formatting guidelines before you submit your lab!
Each week, you will be assigned one of your peers’ labs to review their code formatting.

On the Use of External Resources…

Part of learning to program is learning from a variety of resources. Thus, I expect you will use resources that you find on the internet.

In this class, the assumed knowledge is the course materials, including the course textbook, coursework pages, and course slides. Any functions / code from outside these materials require direct references.

If you used Google:
- paste the link to the resource in a code comment next to where you used that resource
If you used ChatGPT:
- indicate somewhere in the problem that you used ChatGPT
- paste the link to your chat (using the Share button from ChatGPT)

Things You Should Know About ChatGPT…

GPT uses machine learning to predict what words to give you
- This method is entirely probabilistic, meaning the same question may produce different answers for different people.

The answers GPT gives rely a lot on the context you provide (or don’t provide)
- It is good to give lots of background information (e.g., what R package you are using)

ChatGPT is a pretty decent tutor.
- Did you use GPT to help you with some code?
- Do you not understand what the code is doing?
- Ask GPT to explain code to you!

To do…

Lab 2: Exploring rodents with ggplot2
- due Sunday, October 6 at 11:59pm
Challenge 2: Spicing things up with ggplot2
- due Sunday, October 6 at 11:59pm
Complete Week 3 Coursework: Data Wrangling with dplyr
- Check-ins 3.1 and 3.2 due Tuesday, October 8 at 12pm