Tidy Data, Importing Data & More Advanced Graphics

Thursday, October 1

Today we will…

  • Debrief PA 1 (10 minutes)
  • Debrief Lab 1 (10 minutes)
    • Content Related to Lab 2
  • Tidy Data & Loading External Data (5 minutes)
  • More Advanced Visualization Customizations (25 minutes)
  • 10-minute Break
  • Introduction to Lab 2 (10 minutes)
    • Using External Resources

PA 1: Using Data Visualization to Find the Penguins

Multiple Categorical Variables

Code
library(ggplot2)
library(palmerpenguins)  # provides the penguins data

ggplot(data = penguins) +
  geom_point(mapping = aes(x = bill_length_mm,
                           y = bill_depth_mm,
                           color = species,
                           shape = island)) +
  labs(title = "Relationship Between Bill Length and Bill Depth",
       x = "Bill Length (mm)",
       y = "Bill Depth (mm)")

Code
ggplot(data = penguins,
       mapping = aes(x = bill_length_mm, 
                     y = bill_depth_mm, 
                     color = species)) +
  geom_point() + 
  facet_wrap(~island) +
  labs(x = "Bill Length",
       y = "", 
       title = "Changes in Bill Depth versus Length")

Faceting

Extracts subsets of data and places them in side-by-side plots.

Code
ggplot(data = penguins,
       mapping = aes(x = bill_length_mm, 
                     y = bill_depth_mm, 
                     color = species)) +
  geom_point() + 
  facet_wrap(~island) +
  labs(x = "Bill Length",
       y = "", 
       title = "Changes in Bill Depth versus Length")

Code
ggplot(data = penguins,
       mapping = aes(x = bill_length_mm, 
                     y = bill_depth_mm, 
                     color = species)) +
  geom_point() + 
  facet_grid(sex ~ island) +
  labs(x = "Bill Length",
       y = "", 
       title = "Changes in Bill Depth versus Length")

Changing the Facet Scales

Use the scales argument of facet_wrap() to let axis limits vary across facets.

"free_x" – only x-axis limits adjust to individual facets.

Code
ggplot(data = penguins,
       mapping = aes(x = bill_length_mm, 
                     y = bill_depth_mm, 
                     color = species)) +
  geom_point() + 
  facet_wrap(~island, scales = "free_x") +
  labs(x = "Bill Length",
       y = "", 
       title = "Changes in Bill Depth versus Length")

"free_y" – only y-axis limits adjust to individual facets.

Code
ggplot(data = penguins,
       mapping = aes(x = bill_length_mm, 
                     y = bill_depth_mm, 
                     color = species)) +
  geom_point() + 
  facet_wrap(~island, scales = "free_y") +
  labs(x = "Bill Length",
       y = "", 
       title = "Changes in Bill Depth versus Length")

"free" – both x- and y-axis limits adjust to individual facets.

Code
ggplot(data = penguins,
       mapping = aes(x = bill_length_mm, 
                     y = bill_depth_mm, 
                     color = species)) +
  geom_point() + 
  facet_wrap(~island, scales = "free") +
  labs(x = "Bill Length",
       y = "", 
       title = "Changes in Bill Depth versus Length")

Lab 1

Grading / Feedback

  • Each question will earn a score of “Success” or “Growing”.
    • Questions marked “Growing” will receive feedback on how to improve your solution.
    • These questions can be resubmitted for additional feedback.
  • Earning a “Success” doesn’t necessarily mean your solution is without error.
    • You may still receive feedback on how to improve your solution.
    • These questions cannot be resubmitted for additional feedback.

Growing Points

  • Q5: Section headers need to have a space between the # and the header (e.g., # Guinea Pig Teeth).
  • Q9: Captions should include more information than what is already present in the plot! This also goes for plot titles!
  • Q11: There’s no need to save intermediate objects if they are never used later!

How this translates into Lab 2…

  • Every lab and challenge is expected to use code-folding and code-tools
  • There should never be any messages or warnings output in your final rendered HTML.
  • You should reduce the amount of “intermediate object junk” in your workspace.
    • Ask yourself, do I need to use this later?
    • If the answer is no, then you should not save that object.
    • Just output the plot!
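
As a rough illustration of the bullets above, the sketch below contrasts saving a plot that is never reused with simply outputting it. It uses R's built-in ToothGrowth data as a stand-in for the lab data, and the object name tooth_plot is hypothetical.

```r
library(ggplot2)

# Unnecessary: tooth_plot is saved, printed once, and never used again.
tooth_plot <- ggplot(data = ToothGrowth,
                     mapping = aes(x = supp, y = len)) +
  geom_boxplot()
tooth_plot

# Preferred: build the plot and output it directly, leaving nothing behind.
ggplot(data = ToothGrowth,
       mapping = aes(x = supp, y = len)) +
  geom_boxplot()
```

Both versions draw the same boxplot; the second just keeps the workspace clean.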

Why is there an embed-resources: true line in the YAML?


By default, Quarto does not embed plots in the HTML document. Instead, it creates a “Lab-1_files” folder, which stores all your plots.

This means that by default your plots are not included in your HTML file!


---
title: "Lab 1"
author: "Dr. T!"
format: html
editor: source
embed-resources: true
---

The embed-resources: true line in the YAML forces the visualizations to be embedded in your HTML!

Tidy Data

An educational graphic explaining 'Tidy Data' with text and a simple table. The main text at the top reads, 'TIDY DATA is a standard way of mapping the meaning of a dataset to its structure,' followed by the attribution to Hadley Wickham. Below, it explains the concept of tidy data: 'In tidy data: each variable forms a column, each observation forms a row, each cell is a single measurement.' To the right, there is a small table with three columns labeled 'id,' 'name,' and 'color,' demonstrating how each column is a variable and each row is an observation. The table contains entries such as 'floof' (gray), 'max' (black), and 'panda' (calico). The image ends with a citation for Hadley Wickham's 2014 paper on Tidy Data.

Artwork by Allison Horst
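
To make the three rules concrete, here is a small sketch that rebuilds the table from the graphic as a data frame. The names and colors come from the graphic; the id values and the object name tidy_example are made up for illustration.

```r
# Each column is a variable, each row is one observation,
# and each cell holds a single measurement.
tidy_example <- data.frame(
  id    = c(1, 2, 3),                      # hypothetical id values
  name  = c("floof", "max", "panda"),
  color = c("gray", "black", "calico")
)

tidy_example
```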

Working with External Data

Common Types of Data Files

Look at the file extension for the type of data file.

.csv : “comma-separated values”

Name, Age
Bob, 49
Joe, 40

.xls, .xlsx: Microsoft Excel spreadsheet

  • Common approach: save as .csv
  • Nicer approach: use the readxl package

.txt: plain text

  • Could have any sort of delimiter…
  • Need to let R know what to look for!

Common Types of Data Files

A screenshot of a raw data file. The top row is the name of the columns in quotations (LatD, LatM, LatS, NS, LonG, LonS, EW, City, State). The subsequent rows are the observations for the data with values for each of the variables. The entries of the data file are separated with commas. What is the delimiter for this file?

A screenshot of a raw data file. The top row is the name of the columns (FID, AFFGEOID, TRACTCE ST, STATE, ST_ABBR, STCNTY, FIPS, AREA_SQMI, E_TOTPOP). The subsequent rows are the observations for the data with values for each of the variables. The entries of the data file are separated by tabs. What is the delimiter for this file?

A screenshot of a raw data file. There is no top row indicating column names. Rather, the data file starts with the first observation. The entries of the data file are separated with bars (|). What is the delimiter for this file?

Loading External Data

Using base R functions:

  • read.csv() is for reading in .csv files.

  • read.table() and read.delim() are for any data with “columns” (you specify the separator).
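
A minimal sketch of both base R functions, reading from literal text so the example runs without any files. The comma-separated text mirrors the small .csv example shown earlier; the bar-separated rows and the object names are made up for illustration.

```r
# Comma-separated text with a header row.
people <- read.csv(text = "Name,Age\nBob,49\nJoe,40")

# Bar-separated text with no header row, so the separator and the
# missing header both have to be stated explicitly.
cities <- read.table(text = "1|Ames|IA\n2|Reno|NV",
                     sep = "|",
                     header = FALSE)

people
cities  # columns are named V1, V2, V3 because there was no header
```

With a real file, the text argument would be replaced by the path to the file.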

Loading External Data

The tidyverse has some cleaned-up versions of the base R functions in the readr and readxl packages:

  • read_csv() is for comma-separated data.

  • read_tsv() is for tab-separated data.

  • read_table() is for white-space-separated data.

  • read_delim() is for any data with “columns” (you specify the separator). The functions above are special cases of it.

  • read_xls() and read_xlsx() are specifically for dealing with Excel files.

Remember to load the readr and readxl packages first!
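
The sketch below shows what this might look like for a tab-separated and a bar-separated file. It writes two tiny temporary files first so that it is self-contained; the file contents, column names, and object names are all made up for illustration.

```r
library(readr)

# Create two small example files in a temporary location.
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c("FID\tSTATE", "1\tCalifornia", "2\tOregon"), tsv_path)

bar_path <- tempfile(fileext = ".txt")
writeLines(c("1|Ames|IA", "2|Reno|NV"), bar_path)

# read_tsv() assumes tabs, so no separator needs to be given.
# col_types = "ic" means: integer column, then character column.
tracts <- read_tsv(tsv_path, col_types = "ic")

# read_delim() uses whatever separator is given in delim.
# col_names supplies names because this file has no header row.
cities <- read_delim(bar_path,
                     delim = "|",
                     col_names = c("id", "city", "state"),
                     col_types = "icc")
```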

What’s the difference?

A screenshot of the function arguments for the read.csv() function. The function has seven arguments: file, header, sep, quote, dec, fill, and comment.char.

A screenshot of the function arguments for the read_csv() function. The function has twenty arguments. The first two arguments are the same as read.csv() (file, column names), there is no sep argument as it assumes the file is separated with a comma (since it is a csv). The additional arguments allow the user to control aspects of how the data are read in, such as what data type should be used for each column, if certain rows should be skipped, or if only certain columns should be selected.
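
As a hedged illustration of those extra arguments, this sketch uses skip and col_types on a small temporary file. The file contents (including the note on the first line and the Zip column) are hypothetical.

```r
library(readr)

# A tiny .csv whose first line is a note rather than data.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Exported from a hypothetical survey tool",
             "Name,Age,Zip",
             "Bob,49,03801",
             "Joe,40,93407"),
           csv_path)

people <- read_csv(csv_path,
                   skip = 1,                  # skip the note on line 1
                   col_types = cols(
                     Name = col_character(),
                     Age  = col_integer(),
                     Zip  = col_character()   # keeps the leading zero
                   ))
```

Stating the column types up front avoids relying on the guessed types, which would otherwise read the Zip column as a number.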

More Advanced Graphics

  • Structure: boxplot, scatterplot, etc.

  • Aesthetics: features such as color, shape, and size that map other variables to structural features.

Both the structure and aesthetics should help viewers interpret the information.

Pre-attentive Features


The next slide will have one point that is not like the others.


Raise your hand when you notice it.

Pre-attentive Features

A white background scattered with various green triangles and a single green circle. The triangles are evenly spaced but randomly oriented and positioned, with no apparent pattern or alignment. The single circle stands out among the triangles due to its different shape.

Pre-attentive Features

A white background scattered with various red circles and a single green circle. The circles are evenly spaced but randomly oriented and positioned, with no apparent pattern or alignment. The single circle stands out among the other circles due to its different color.

Pre-attentive Features

Pre-attentive features are features that we see and perceive before we even think about them.

  • They will jump out at us in less than 250 ms.

  • E.g., color, form, movement, spatial location.

There is a hierarchy of features (e.g., color is stronger than shape).

Gestalt Principles

Gestalt Hierarchy    Graphical Feature
1. Enclosure         Facets
2. Connection        Lines
3. Proximity         White Space
4. Similarity        Color/Shape


Implications for practice:

  • Know that we perceive some groups before others.
  • Design to facilitate and emphasize the most important comparisons.

Double Encoding

No Double Encoding

Color

  • Color, hue, and intensity are pre-attentive features, and bigger contrasts lead to faster detection.
    • Hue: main color family (red, orange, yellow…)
    • Intensity: amount of color

This image is a color wheel that displays the concept of 'Hue.' The wheel is divided into 12 segments, each representing a different color. The colors transition smoothly around the wheel, moving from yellow to yellow-orange, orange, red-orange, red, red-violet, violet, blue-violet, blue, blue-green, green, and yellow-green. The word 'HUE' is written in the center of the wheel, indicating that the image is meant to illustrate the variety of hues that make up the color spectrum.

This image illustrates the concept of color intensity. It shows a yellow square labeled 'YELLOW (HUE)' on the left, a gray square labeled 'GRAY' in the middle, and a square of desaturated yellow labeled 'LESS INTENSE' on the right. The image conveys that adding gray to a pure hue, such as yellow, results in a color that is less intense or more muted. The mathematical symbols '+' and '=' are used to show the combination of the yellow hue with gray, leading to a less intense color.

Color Guidelines

  • Do not use rainbow color gradients!

  • Be conscious of what certain colors “mean”.

    • Good idea to use red for “good” and green for “bad”?

This image is a color-coded map of Texas, showing the percentage of people in each county who identify as white. The map uses a rainbow gradient scale to represent different percentages: red and orange for lower percentages (0% to 25%), transitioning through yellow and green (around 50%), to blue and purple for higher percentages (75% to 100%). Each county in Texas is colored according to where it falls on this scale, indicating the variation in racial identification across the state. The legend at the bottom of the map clarifies the percentage range associated with each color.

This image is a bar chart comparing the energy sources of six countries: Mexico, Brazil, Turkey, Russia, Indonesia, and China. Each bar represents a country and is divided into two color-coded segments: red for 'Low-carbon sources' and green for 'Fossil fuels sources.' The chart shows the proportion of energy generated from each source in each country. The purpose of this image is not to demonstrate what fuels each country is using but to highlight how the graph uses red and green hues to separate the two types of fuel. Not only are these colors difficult for some people's eyes to differentiate, but the plot has also swapped what we would ordinarily think of as 'good' and 'bad' colors with fuels that are worse and better for the planet.

Color Guidelines

For categorical data, try not to use more than 7 colors:

This image is a rectangular strip divided into seven equal vertical sections, each filled with a different solid color. The colors from left to right are: Red, Blue, Green, Purple, Orange, Yellow, Brown. Each section is distinctly separated by thin black lines, with no gradients or transitions between the colors. The image represents a simple color spectrum or palette.

If you need to, you can use colorRampPalette() (from base R's grDevices package) with a palette from the RColorBrewer package to produce larger palettes:

This image is a rectangular strip divided into 19 equal vertical sections, each filled with a different solid color. The colors from left to right are red, maroon, dark blue, teal, green, light green, gray, purple, dark maroon, orange, light orange, yellow, light yellow, gold, brown, pink, light pink, mauve, and light gray. Each section is distinctly separated by thin black lines, with no gradients or transitions between the colors. The image represents an extended color spectrum or palette with a nuanced range of colors.
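
A minimal sketch of how a larger palette like this could be produced. It assumes the seven-color strip corresponds to RColorBrewer's "Set1" palette, which is a guess based on the colors described. Note that colorRampPalette() ships with base R (grDevices), while brewer.pal() comes from RColorBrewer.

```r
library(RColorBrewer)

# Start from seven colors of the "Set1" qualitative palette.
seven_colors <- brewer.pal(n = 7, name = "Set1")

# colorRampPalette() returns a function that interpolates between the
# colors it was given; calling that function with 19 returns 19 hex codes.
make_palette <- colorRampPalette(seven_colors)
nineteen_colors <- make_palette(19)

nineteen_colors
```

Interpolated neighbors can be hard to tell apart, which is one reason to stay at seven or fewer categories when possible.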

Color Guidelines

To make your graphic friendly for people with different color vision deficiencies…

  • use double encoding - when you use color, also use another aesthetic (line type, shape, facet, etc.).

This image features a pattern of small shapes scattered across a white background. The shapes are orange triangles and green circles, distributed randomly throughout the image. The triangles and circles do not overlap and are spaced unevenly, creating a dispersed, non-repetitive pattern. The image appears to represent a simple, abstract design with two distinct shapes and colors.
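
In ggplot2, double encoding can be sketched by mapping the same variable to two aesthetics. The example below assumes the penguins data from the palmerpenguins package, as in the earlier plots.

```r
library(ggplot2)
library(palmerpenguins)

# species is mapped to both color and shape, so the groups remain
# distinguishable even if the colors are hard to tell apart.
ggplot(data = penguins,
       mapping = aes(x = bill_length_mm,
                     y = bill_depth_mm,
                     color = species,
                     shape = species)) +
  geom_point()
```

Because both aesthetics use the same variable, ggplot2 merges them into a single legend.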

Color Guidelines

To make your graphic friendly for people with different color vision deficiencies…

  • with a unidirectional scale (e.g., all + values), use a monochromatic color gradient.

This image features a horizontal gradient transitioning smoothly from light to dark blue. On the left side, the gradient starts with a very light, almost white blue and gradually deepens in color, moving through medium blue tones in the center, and finally ending with a deep, dark blue on the right side. The gradient creates a seamless transition across the spectrum of blue hues, visually representing the full range of the color from light to dark.

  • with a bidirectional scale (e.g., + and - values), use a purple-white-orange color gradient. Transition through white!

This image features a horizontal gradient transitioning between two different color families. On the left side, the gradient starts with a deep purple and gradually lightens to a pale lavender near the center. The gradient then transitions to white in the middle, which serves as a dividing point between the two color families. From the center to the right, the gradient shifts into warm tones, moving from a light peach to a deeper orange, and finally ending with a rich brown on the far right. This gradient smoothly blends the cool purples with the warm oranges, creating a balanced and visually appealing transition between the two color spectrums.
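
One way these two gradients might be set up in ggplot2 is sketched below, using scale_color_gradient() for the unidirectional case and scale_color_gradient2() for the bidirectional case. The choice of body mass as the colored variable and the specific color names are assumptions made for illustration.

```r
library(ggplot2)
library(palmerpenguins)

# Unidirectional: body mass is always positive, so a light-to-dark
# single-hue gradient is used.
ggplot(data = penguins,
       mapping = aes(x = bill_length_mm,
                     y = bill_depth_mm,
                     color = body_mass_g)) +
  geom_point() +
  scale_color_gradient(low = "lightblue", high = "darkblue")

# Bidirectional: the difference from the average body mass can be negative
# or positive, so the gradient passes through white at zero.
ggplot(data = penguins,
       mapping = aes(x = bill_length_mm,
                     y = bill_depth_mm,
                     color = body_mass_g - mean(body_mass_g, na.rm = TRUE))) +
  geom_point() +
  scale_color_gradient2(low = "purple", mid = "white", high = "orange",
                        midpoint = 0) +
  labs(color = "Difference from\nmean body mass (g)")
```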

Color Guidelines

To make your graphic friendly for people with different color vision deficiencies…

  • print your chart out in black and white – if you can still read it, it will be safe for all users.

This image is a bar chart titled 'How Often Teens Say They Use Each Platform,' depicting the frequency of usage of various social media platforms by teenagers. The platforms listed from top to bottom are YouTube, TikTok, Snapchat, Instagram, and Facebook. Each platform's usage is broken down into five categories, represented as different segments in the bar: Almost constantly (dark blue), Several times a day (medium blue), About once a day (light blue), Less often (lightest blue), Don’t use (gray).

This image is a copy of the previous image, but the blue tones have been replaced with grey tones.

Color in ggplot2

There are several packages with color scheme options:

  • RColorBrewer
  • ggsci
  • viridis
  • wesanderson

These packages have color palettes that are aesthetically pleasing and, in many cases, color deficiency friendly.

You can also take a look at other ways to find nice color palettes.
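
Two of these palette families can be used through scale functions that come with ggplot2 itself, so a sketch like the following needs no extra packages beyond the data. The palette choices here ("Dark2" and the default viridis option) are illustrative, not a recommendation from the course.

```r
library(ggplot2)
library(palmerpenguins)

# viridis palette for a categorical (discrete) variable.
ggplot(data = penguins,
       mapping = aes(x = bill_length_mm,
                     y = bill_depth_mm,
                     color = species)) +
  geom_point() +
  scale_color_viridis_d()

# ColorBrewer palette, chosen by name.
ggplot(data = penguins,
       mapping = aes(x = bill_length_mm,
                     y = bill_depth_mm,
                     color = species)) +
  geom_point() +
  scale_color_brewer(palette = "Dark2")
```

For a numeric variable, scale_color_viridis_c() and scale_color_distiller() are the continuous counterparts.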

Further Customizations

Code
ggplot(data = penguins,
       mapping = aes(x = bill_length_mm, 
                     y = bill_depth_mm, 
                     color = species)) +
  geom_point() + 
  facet_wrap(~island) +
  labs(x = "Bill Length",
       y = "", 
       color = "Species of Penguin",
       title = "Do penguins with shorter bills have deeper bills?")

Code
custom_labels <- c(
  Dream = "Dream Island",
  Torgersen = "Torgersen Island",
  Biscoe = "Biscoe Island"
)

ggplot(data = penguins,
       mapping = aes(x = bill_length_mm, 
                     y = bill_depth_mm, 
                     color = species)) +
  geom_point() + 
  facet_wrap(~island, labeller = as_labeller(custom_labels)) +
  labs(x = "Bill Length",
       y = "", 
       color = "Species of Penguin",
       title = "Do penguins with shorter bills have deeper bills?")

Code
ggplot(data = penguins,
       mapping = aes(x = bill_length_mm, 
                     y = bill_depth_mm, 
                     color = species)) +
  geom_point() + 
  facet_wrap(~island) +
  labs(x = "Bill Length",
       y = "", 
       color = "Species of Penguin",
       title = "Do penguins with shorter bills have deeper bills?") +
  theme_bw() +
  theme(legend.position = "top")

Code
ggplot(data = penguins,
       mapping = aes(x = bill_length_mm, 
                     y = bill_depth_mm, 
                     color = species)) +
  geom_point() + 
  facet_wrap(~island) +
  labs(x = "Bill Length",
       y = "", 
       color = "Species of Penguin",
       title = "Do penguins with shorter bills have deeper bills?") +
  scale_y_continuous(limits = c(10, 30),
                     breaks = seq(from = 10, to = 30, by = 5))

Lab 2 & Challenge 2: Exploring Rodents with ggplot2

Peer Code Review

Starting with Lab 2, your labs will have an appearance / code format portion.

  • Review the code formatting guidelines before you submit your lab!

  • Each week, you will be assigned one of your peers’ labs and asked to review its code formatting.

ggplot Code

It is good practice to put each geom and aes on a new line.

  • This makes code easier to read!
  • Generally: no line of code should be over 80 characters long.

Hard to read:

Code
ggplot(data = mpg, mapping = aes(x = cty, y = hwy, color = class)) + geom_point() + theme_bw() + labs(x = "City (mpg)", y = "Highway (mpg)")

Easier to read:

Code
ggplot(data = mpg,
       mapping = aes(x = cty, y = hwy, color = class)) +
  geom_point() +
  theme_bw() +
  labs(x = "City (mpg)", y = "Highway (mpg)")

On the Use of External Resources…

Part of learning to program is learning from a variety of resources. Thus, I expect you will use resources that you find on the internet.

In this class, the assumed knowledge is the course materials, including the course textbook, coursework pages, and course slides. Any functions or code from outside these materials require direct references.

  • If you used Google:
    • paste the link to the resource in a code comment next to where you used that resource
  • If you used ChatGPT:
    • indicate somewhere in the problem that you used ChatGPT
    • paste the link to your chat (using the Share button from ChatGPT)

Things You Should Know About ChatGPT…

  • GPT uses machine learning to predict what words to give you
    • This method is entirely probabilistic, meaning the same question may produce different answers for different people.
  • The answers GPT gives rely a lot on the context you provide (or don’t provide)
    • It is good to give lots of background information (e.g., what R package you are using)
  • ChatGPT is a pretty decent tutor.
    • Did you use GPT to help you with some code?
    • Do you not understand what the code is doing?
    • Ask GPT to explain code to you!

Additional Challenge Opportunity

A research team from The University of Illinois is studying students’ ability to decipher what data visualizations in the media do and do not reveal. They have developed an assessment to measure students’ visual data literacy, and our class will be among the first to test it out and offer feedback.

The assessment form is linked in the “ADDITIONAL CHALLENGE OPPORTUNITIES” module on Canvas.

Completing the survey can count as an additional demonstration of you “extending” your thinking.

As with all research, it is up to you whether you give consent for your data to be used for research purposes! You will be asked about this at the end of the survey. That choice is completely up to you!

To do…

  • Lab 2: Exploring rodents with ggplot2
    • due Monday, September 29 at 5pm
  • Challenge 2: Spicing things up with ggplot2
    • due Monday, September 29 at 5pm
  • Complete Week 3 Coursework: Data Wrangling with dplyr
    • Check-ins 3.1 and 3.2 due Tuesday, September 30 at 8am