Lab 2: Visualizing and Summarizing Numerical Data

Author

Your group’s names here!

Getting started

Load packages

Let’s load the following packages:

  • The tidyverse “umbrella” package which houses a suite of many different R packages for data wrangling and data visualization

  • The openintro R package: houses the dataset we will be working with

# Package for functions 
library(tidyverse)

# Package for data
library(openintro)

The data

The Bureau of Transportation Statistics (BTS) is a statistical agency that is a part of the Research and Innovative Technology Administration (RITA). As its name implies, BTS collects and makes transportation data available, such as the flights data we will be working with in this lab.

First, we’ll view the nycflights data frame. Run the following code to load in the data:

data(nycflights)

The codebook (description of the variables) can be accessed by pulling up the help file by typing a ? before the name of the dataset:

?nycflights

Remember that you can use glimpse() to take a quick peek at your data to understand its contents better.

Question 1

(a) How large is the nycflights dataset? (i.e. How many rows and columns does it have?)

(b) What are the names of the variables R believes are numerical? Hint: What data types are numerical variables stored as?

# You code for exercise 1 goes here! Yes, your answer should use code!

(c) Of the numerical variables you named in (b), which variables behave more like a categorical variable?

Departure Delays

Let’s start by examining the distribution of departure delays (dep_delay) of all flights with a histogram.

Question 2 – Create a histogram of the dep_delay variable from the nycflights data. Don’t forget to give your visualization informative axis labels that include the units the variable was measured in!

# Your code for exercise 2 goes here! 

Histograms are generally a very good way to see the shape of a single distribution of numerical data, but that shape can change depending on how the data into different bins.

You can easily define the binwidth you want to use, by specifying the binwidth argument inside of geom_histogram(), like so:

geom_histogram(binwidth = 15)

Question 3

(a) Make two other histograms, one with a binwidth of 15 and one with a binwidth of 150. Feel free to copy-and-paste the code you used for Question 2 and modify the binwidth.

# Your code for exercise 3 goes here! 

(b) How do these three histograms compare? Are features revealed in one that are obscured in another?

SFO Destinations

One of the variables refers to the destination (i.e. airport) of the flight, which have three letter abbreviations. For example, flights into Los Angeles have a dest of "LAX", flights into San Francisco have a dest of "SFO", and flights into Chicago (O’Hare) have a dest of "ORD".

If you want to visualize only on delays of flights headed to Los Angeles, you need to first filter() the data for flights with that destination (e.g., filter(dest == "LAX")) and then make a histogram of the departure delays of only those flights.

Logical operators: Filtering for certain observations (e.g. flights from a particular airport) is often of interest in data frames where we might want to examine observations with certain characteristics separately from the rest of the data. To do so, you can use the filter() function and a series of logical operators. The most commonly used logical operators for data analysis are as follows:

  • == means “equal to”
  • != means “not equal to”
  • > or < means “greater than” or “less than”
  • >= or <= means “greater than or equal to” or “less than or equal to”

Question 4 – Fill in the code to create a new dataframe named sfo_flights that is the result of filter()ing only the observations whose destination was San Francisco.

sfo_flights <- filter(nycflights, 
                      dest == )

Multiple Data Filters

You can filter based on multiple criteria! Within the filter() function, each criteria is separated using commas. For example, suppose you are interested in flights leaving from LaGuardia (LGA) in February:

filter(nycflights, 
       origin == "LGA", 
       month == 2)

## Remember months are coded as numbers (February = 2)!

Note that you can separate the conditions using commas if you want flights that are both leaving from LGA and flights in February. If you are interested in either flights leaving from LGA or flights that happened in February, you can use the | instead of the comma.

Question 5 – Fill in the code below to find the number of flights flying into SFO in July that arrived early.

filter(sfo_flights, 
       month == __, 
       arr_delay > __) %>% 
  glimpse()

Question 6 – When you ran the code above output a preview of the filtered dataset. What does this output tell you about the number of flights that met your criteria (SFO, July, arrived early)?

Data Summaries

You can also obtain numerical summaries for the flights headed to SFO, using the summarise() function:

summarise(sfo_flights, 
          mean_dd   = mean(dep_delay), 
          median_dd = median(dep_delay), 
          n         = n())

Note that in the summarise() function I’ve created a list of three different numerical summaries that I’m interested in.

The names of these elements are user defined, like mean_dd, median_dd, n, and you can customize these names as you like (just don’t use spaces in your names!).

Calculating these summary statistics also requires that you know the summary functions you would like to use.

Summary statistics: Some useful function calls for summary statistics for a single numerical variable are as follows:

  • mean(): calculates the average
  • median(): calculates the median
  • sd(): calculates the standard deviation
  • var(): calculates the variances
  • IQR(): calculates the inner quartile range (Q3 - Q1)
  • min(): finds the minimum
  • max(): finds the maximum
  • n(): reports the sample size

Note that each of these functions takes a single variable as an input and returns a single value as an output.

Summaries vs. Visualizations

If I’m flying from New York to San Francisco, should I expect that my flights will typically arrive on time?

Let’s think about how you could answer this question. One option is to summarize the data and inspect the output. Another option is to plot the delays and inspect the plots. Let’s try both!

Question 7 – Calculate the following statistics for the arrival delays in the sfo_flights dataset:

  • mean
  • median
  • max
  • min
## Code for exercise 7 goes here! 

Question 8 – Using the above summary statistics, what is your answer be to my question? What should I expect if I am flying from New York to San Francisco?

Question 9 – Now, rather than calculating summary statistics, plot the distribution of arrival delays for the sfo_flights dataset.

Choose the type of plot you believe is appropriate for visualizing the distribution of arrival delays. Don’t forget to give your visualization informative axis labels that include the units of measurement!

## Code for exercise 9 goes here! 

Question 10 – Using the plot above, what is your answer be to my question? What should I expect if I am flying from New York to San Francisco?

Question 11 – How did your answer change when using the plot versus using the summary statistics? i.e. What were you able to see in the plot that could could not “see” with the summary statistics?