ggplot(data = nycflights, mapping = aes(x = dep_delay)) +
geom_histogram() +
labs(x = "Departure Delays (minutes)")
Lab 2 - Grading Guide
1 (a)
How large is the nycflights
dataset? (i.e. How many rows and columns does it have?)
Success:
- Uses
glimpse()
to obtain the size of the dataset - Size of 32,735 rows and 16 columns
Growing:
- If no code is present
- If they flip the rows and columns
- If they do not provide size of data
Feedback for no code: For Question 1(a), you needed to write code to explore the size of the data and the types of variables.
Feedback for not using glimpse()
: For this course, we prefer you use the tools being taught in the textbook and course materials. What function have you learned in this class which give a preview of a dataset?
Feedback for no / incorrect dimensions: For Question 1(a), look again at the output of the glimpse()
function, how many rows and columns are included in the nycflights dataset?
1 (b)
Are there numerical variables in the dataset? If so, what are their names?
Success:
- Lists all variables with an
int
ordbl
data type:- year
- month (many may miss)
- day
- dep_time
- dep_delay
- arr_time
- arr_delay
- flight (many may miss)
- air_time
- distance
- hour
- minute
- Permitted to miss 1 variable
Growing:
- If misses 2 or more variables
- If includes categorical variables (labeled
chr
)
Feedback: For Question 1(b), you need to be careful to include ALL of the variables that R believes are numerical, that includes variables which may behave more like a categorical variable.
2
Create a histogram of the dep_delay
variable from the nycflights
data
Success: Code should look like
Note: May choose their own binwidth, but that is not required!
Growing:
- If maps
dep_delay
to y-axis instead of x-axis
Feedback: For Question 2, our histograms are always made with our quantitative variable on the x-axis!
- If they don’t label the axis with the units
Feedback: Every plot we make should have descriptive axis labels which include the units the variable was measured in. Were departure delays measured in hours? Minutes? Seconds? Days?
3 (a)
Make two other histograms, one with a binwidth
of 15 and one with a binwidth
of 150.
Success: Code should look like:
ggplot(data = nycflights,
mapping = aes(x = dep_delay)) +
geom_histogram(binwidth = 15) +
labs(x = "Departure Delays (minutes)")
ggplot(data = nycflights,
mapping = aes(x = dep_delay)) +
geom_histogram(binwidth = 150) +
labs(x = "Departure Delays (minutes)")
If they don’t have nice axis labels for this question, we don’t need to mark them down. But we can give this feedback:
Careful! Look at the feedback for Question 2 about your axis labels.
Growing: If code does not use correct binwidth
s
Feedback: For Question 3(a), pay attention to what binwidth you need to use within your geom_histogram()
!
3 (b)
How do these three histograms compare? Are features revealed in one that are obscured in another?
Success:
- Discusses how wider bins make it hard to see where there are small differences and / or how smaller bins make it easier to see differences (i.e., smaller bins make the distribution more specific)
Growing:
- If description only references how the bins are larger or smaller, but doesn’t talk about shape the distribution
- Only talks about how the y-axis scale changes
Feedback: For Question 3(b), your discussion should address how the binwidth affects the shape of the distribution (e.g., skew, peak) and what you are able to see.
4
Fill in the code to create a new dataframe named sfo_flights that is the result of filter()
ing only the observations whose destination was San Francisco.
Success: Code should look like:
<- filter(nycflights,
sfo_flights == "SFO") dest
Growing: If they use the wrong destination in their filter()
Feedback: For Question 4, you need to look at the description for how the airport names were coded!
5
Fill in the code below to find the number of flights flying into SFO in July that arrived early. What does the result tell you?
Success: Code should look like:
filter(sfo_flights,
== 7,
month < 0) %>%
arr_delay dim()
Growing:
- If they put
7
in quotations ("7"
)
Feedback: We use quotations for variables that are categorical. Does R think that month is a categorical variable? (look back at what you said in Question 1b)
- If they use a
>
instead of a<
Feedback: Careful! If my flight arrived 5 minutes early, how would my arrival delay be recorded in the dataset? How does this translate to the inequality we want to use in our data filter?
6
When you ran the code above it output a preview of the filtered dataset. What does this output tell you about the number of flights that met your criteria (flying into SFO, in July, and arrived early)?
They should state that 45 flights (rows) met the criteria (arrived to SFO early in July).
Growing:
- If they don’t interpret 45 as the number of flights – only as the number of rows
Feedback: What are the observations included in the rows in the context of this dataset? Airplanes? Airports? Flights?
- If they interpret 45 and 16 as flight numbers
Feedback: The 45 and the 16 refer to the size of the filtered dataset. What does that tell you in terms of the number of flights that satisfied your criteria out of the sfo_flights
dataset?
7
Calculate the following statistics for the arrival delays in the sfo_flights dataset:
- mean
- median
- max
- min
Success: Code should look like:
summarise(sfo_flights,
mean_ad = mean(arr_delay),
median_ad = median(arr_delay),
max_ad = max(arr_delay),
min_ad = min(arr_delay)
)
- If they don’t name their summary statistics
Feedback: The output of your summary statistics looks a lot nicer if you give them names! (like the example that was given)
- I’d write a comment if they used
_dd
in their names, since it should be_ad
for the arrival delays.
Feedback: It would be more clear to use _ad
at the end of your names, since you are summarizing the arrival delays (not the departure delays).
Growing:
- If they miss some of the statistics
Feedback: For Question 6, you were asked to provide four (4) statistics!
- If they use the wrong dataset (nycflights instead of sfo_flights)
Feedback: For Question 6, what dataset should you be using to calculate your summary statistics?
8
Using the above summary statistics, what is your answer be to my question? What should I expect if I am flying from New York to San Francisco?
Success
- Makes a statement about what I should expect (flight to be early / late)
- Justifies statement based on summary statistics
Growing
- Does not justify their statement
Feedback: In Statistics it is critical to back your claim with data! How did you decide what I should expect if I am flying from NYC to SFO?
- Says I will arrive early / late based on the median / mean
Feedback: Why is the median / mean a good measure of the “typical” delay?
- Only discuss the statistics but don’t address the question
Feedback: You point out the statistics and what they mean in terms of an arrival delay, but how do these statistics connect with my question about what I should expect?
9
Now, rather than calculating summary statistics, plot the distribution of arrival delays for the sfo_flights
dataset. Choose the type of plot you believe is appropriate for visualizing the distribution of arrival delays.
Success: Code should look like:
ggplot(data = sfo_flights,
mapping = aes(x = arr_delay)) +
geom_histogram() +
labs(x = "Arrival Delays (min)")
Acceptable
geom
s:geom_histogram()
geom_dotplot()
(though not great)geom_density()
It’s okay for them not to change the
binwidth
Growing:
- Uses
geom_boxplot()
Feedback: For Question 9, we need to use a geom which allows for us to see the shape of the distribution. Boxplots hide distributions with multiple modes, so what type of plot would be better?
- Doesn’t change x-axis label and / or doesn’t include the units (minutes) in their label
Feedback: You were asked to give your visualization nice axis labels. It is also important for the axis label to contain the units of the variable! What units were the arrival delays measured in?
10
Using the plot above, what is your answer be to my question? What should I expect if I am flying from New York to San Francisco?
Success
- Makes a statement about what I should expect (flight to be early / late)
- Justifies statement based on the plot
Growing
- Does not justify their statement
Feedback: In Statistics it is critical to back your claim with data! How did you use this distribution to decide what I should expect if I am flying from NYC to SFO?
11
How did your answer change when using the plot versus using the summary statistics? i.e. What were you able to see in the plot that could could not “see” with the summary statistics?
Success:
- States if / how their answer did / did not change
- Discusses what could be see in the visualizations that could not be seen in the statistics
- skew
- mode / peak
Growing:
- Response does not discuss what could be seen in the visualizations that could not be seen in the statistics
Feedback: You should address what you were able to SEE in the visualization (e.g., shape, skew, number of peaks) that you could not “see” in the summary statistics.
- Response doesn’t discuss if / how the answer to the question changed
Feedback: Great job discussing how the shape of the distribution informs you choice of statistic! You did not, however, discuss if / how your answer to my question changed between looking at the plot versus looking at the statistics.