# Package for functions
library(tidyverse)
# Package for data
library(openintro)
Lab 2: Visualizing and Summarizing Numerical Data
Getting started
Load packages
Let’s load the following packages:
The tidyverse “umbrella” package which houses a suite of many different
R
packages for data wrangling and data visualizationThe openintro
R
package: houses the dataset we will be working with
The data
The Bureau of Transportation Statistics (BTS) is a statistical agency that is a part of the Research and Innovative Technology Administration (RITA). As its name implies, BTS collects and makes transportation data available, such as the flights data we will be working with in this lab.
First, we’ll view the nycflights
data frame. Run the following code to load in the data:
data(nycflights)
The codebook (description of the variables) can be accessed by pulling up the help file by typing a ?
before the name of the dataset:
?nycflights
Remember that you can use glimpse()
to take a quick peek at your data to understand its contents better.
Question 1
(a) How large is the nycflights
dataset? (i.e. How many rows and columns does it have?)
(b) What are the names of the variables R believes are numerical? Hint: What data types are numerical variables stored as?
# You code for exercise 1 goes here! Yes, your answer should use code!
(c) Of the numerical variables you named in (b), which variables behave more like a categorical variable?
Departure Delays
Let’s start by examining the distribution of departure delays (dep_delay
) of all flights with a histogram.
Question 2 – Create a histogram of the dep_delay
variable from the nycflights
data. Don’t forget to give your visualization informative axis labels that include the units the variable was measured in!
# Your code for exercise 2 goes here!
Histograms are generally a very good way to see the shape of a single distribution of numerical data, but that shape can change depending on how the data into different bins.
You can easily define the binwidth you want to use, by specifying the binwidth
argument inside of geom_histogram()
, like so:
geom_histogram(binwidth = 15)
Question 3
(a) Make two other histograms, one with a binwidth
of 15 and one with a binwidth
of 150. Feel free to copy-and-paste the code you used for Question 2 and modify the binwidth
.
# Your code for exercise 3 goes here!
(b) How do these three histograms compare? Are features revealed in one that are obscured in another?
SFO Destinations
One of the variables refers to the destination (i.e. airport) of the flight, which have three letter abbreviations. For example, flights into Los Angeles have a dest
of "LAX"
, flights into San Francisco have a dest
of "SFO"
, and flights into Chicago (O’Hare) have a dest
of "ORD"
.
If you want to visualize only on delays of flights headed to Los Angeles, you need to first filter()
the data for flights with that destination (e.g., filter(dest == "LAX")
) and then make a histogram of the departure delays of only those flights.
Logical operators: Filtering for certain observations (e.g. flights from a particular airport) is often of interest in data frames where we might want to examine observations with certain characteristics separately from the rest of the data. To do so, you can use the filter()
function and a series of logical operators. The most commonly used logical operators for data analysis are as follows:
==
means “equal to”!=
means “not equal to”>
or<
means “greater than” or “less than”>=
or<=
means “greater than or equal to” or “less than or equal to”
Question 4 – Fill in the code to create a new dataframe named sfo_flights
that is the result of filter()
ing only the observations whose destination was San Francisco.
<- filter(nycflights,
sfo_flights == ) dest
Multiple Data Filters
You can filter based on multiple criteria! Within the filter()
function, each criteria is separated using commas. For example, suppose you are interested in flights leaving from LaGuardia (LGA) in February:
filter(nycflights,
== "LGA",
origin == 2)
month
## Remember months are coded as numbers (February = 2)!
Note that you can separate the conditions using commas if you want flights that are both leaving from LGA and flights in February. If you are interested in either flights leaving from LGA or flights that happened in February, you can use the |
instead of the comma.
Question 5 – Fill in the code below to find the number of flights flying into SFO in July that arrived early.
filter(sfo_flights,
== __,
month > __) %>%
arr_delay glimpse()
Question 6 – When you ran the code above output a preview of the filtered dataset. What does this output tell you about the number of flights that met your criteria (SFO, July, arrived early)?
Data Summaries
You can also obtain numerical summaries for the flights headed to SFO, using the summarise()
function:
summarise(sfo_flights,
mean_dd = mean(dep_delay),
median_dd = median(dep_delay),
n = n())
Note that in the summarise()
function I’ve created a list of three different numerical summaries that I’m interested in.
The names of these elements are user defined, like mean_dd
, median_dd
, n
, and you can customize these names as you like (just don’t use spaces in your names!).
Calculating these summary statistics also requires that you know the summary functions you would like to use.
Summary statistics: Some useful function calls for summary statistics for a single numerical variable are as follows:
mean()
: calculates the averagemedian()
: calculates the mediansd()
: calculates the standard deviationvar()
: calculates the variancesIQR()
: calculates the inner quartile range (Q3 - Q1)min()
: finds the minimummax()
: finds the maximumn()
: reports the sample size
Note that each of these functions takes a single variable as an input and returns a single value as an output.
Summaries vs. Visualizations
If I’m flying from New York to San Francisco, should I expect that my flights will typically arrive on time?
Let’s think about how you could answer this question. One option is to summarize the data and inspect the output. Another option is to plot the delays and inspect the plots. Let’s try both!
Question 7 – Calculate the following statistics for the arrival delays in the sfo_flights
dataset:
- mean
- median
- max
- min
## Code for exercise 7 goes here!
Question 8 – Using the above summary statistics, what is your answer be to my question? What should I expect if I am flying from New York to San Francisco?
Question 9 – Now, rather than calculating summary statistics, plot the distribution of arrival delays for the sfo_flights
dataset.
Choose the type of plot you believe is appropriate for visualizing the distribution of arrival delays. Don’t forget to give your visualization informative axis labels that include the units of measurement!
## Code for exercise 9 goes here!
Question 10 – Using the plot above, what is your answer be to my question? What should I expect if I am flying from New York to San Francisco?
Question 11 – How did your answer change when using the plot versus using the summary statistics? i.e. What were you able to see in the plot that could could not “see” with the summary statistics?