Lab 2 - Grading Guide

1 (a)

How large is the nycflights dataset? (i.e. How many rows and columns does it have?)

Success:

Uses glimpse() to obtain the size of the dataset
Size of 32,735 rows and 16 columns

Growing:

If no code is present
If they flip the rows and columns
If they do not provide size of data

Feedback for no code: For Question 1(a), you needed to write code to explore the size of the data and the types of variables.

Feedback for not using glimpse(): For this course, we prefer you use the tools being taught in the textbook and course materials. What function have you learned in this class that gives a preview of a dataset?

Feedback for no / incorrect dimensions: For Question 1(a), look again at the output of the glimpse() function. How many rows and columns are included in the nycflights dataset?
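For graders who want to confirm which number is rows and which is columns, here is a minimal sketch on a made-up tibble (toy_flights is hypothetical, not the nycflights data). glimpse() reports the number of rows and columns at the top of its output, and dim() returns the same two numbers in the order rows, then columns.

```r
library(dplyr)

# Hypothetical toy data standing in for nycflights (3 rows, 2 columns)
toy_flights <- tibble(
  dest      = c("SFO", "LAX", "SFO"),
  arr_delay = c(-5, 12, 30)
)

# glimpse() reports the row and column counts, then previews each variable
glimpse(toy_flights)

# dim() returns the same information as c(rows, columns)
toy_dims <- dim(toy_flights)
toy_dims
```

A student who reports the two numbers in the opposite order has flipped rows and columns.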

1 (b)

Are there numerical variables in the dataset? If so, what are their names?

Success:

Lists all 12 variables with an int or dbl data type:
- year
- month (many may miss)
- day
- dep_time
- dep_delay
- arr_time
- arr_delay
- flight (many may miss)
- air_time
- distance
- hour
- minute
Permitted to miss 1 variable

Growing:

If misses 2 or more variables
If includes categorical variables (labeled chr)

Feedback: For Question 1(b), you need to be careful to include ALL of the variables that R believes are numerical. That includes variables that may behave more like categorical variables.
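One way a grader could double-check the list of numerical variables is to ask R directly which columns it stores as numbers. This is only a sketch on a made-up tibble (toy_flights is hypothetical); students are not expected to write this code.

```r
library(dplyr)

# Hypothetical toy data: two numerical columns (int and dbl) and one chr column
toy_flights <- tibble(
  month     = c(7L, 7L, 8L),
  carrier   = c("UA", "AA", "UA"),
  arr_delay = c(-5, 12, 30)
)

# Keep only the columns R treats as numeric (int or dbl), then list their names
numeric_vars <- toy_flights %>%
  select(where(is.numeric)) %>%
  names()
numeric_vars
```

Here month is returned even though it behaves like a categorical variable, which is the distinction Question 1(c) asks about.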

1 (c)

Of the numerical variables you named in (b), which variables behave more like a categorical variable?

Success:

Lists all of the following variables:
- year
- month (many may miss)
- flight (many may miss)
If additional variables are listed they are variables that could reasonably be argued to be categorical.

Growing:

If misses any of the three variables listed above
If includes variables that do not behave like categorical variables (e.g., arr_time)

If they missed any of the variables listed above:

Take another look and see if any of the variables listed as integers behave more like categorical variables (variables that could be used to group observations).

If they included variables that do not behave like categorical variables:

Categorical variables group observations! We talked this week about how categorical variables have a finite (small) number of values. The variable you chose does not necessarily reflect this. What variables do?
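The "finite (small) number of values" idea can be checked by counting distinct values per numerical column. The sketch below uses a made-up tibble (toy_flights and its values are hypothetical), so the counts are illustrative only.

```r
library(dplyr)

# Hypothetical toy data: year and month repeat, arr_delay does not
toy_flights <- tibble(
  year      = rep(2013L, 6),
  month     = c(1L, 1L, 2L, 2L, 3L, 3L),
  arr_delay = c(-12, -3, 0, 8, 25, 140)
)

# Count the distinct values in every numerical column
distinct_counts <- toy_flights %>%
  summarise(across(where(is.numeric), n_distinct))
distinct_counts
```

Columns with few distinct values relative to the number of rows are candidates for "behaves like a categorical variable"; a column where nearly every row has its own value is not.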

2

Create a histogram of the dep_delay variable from the nycflights data

Success: Code should look like

ggplot(data = nycflights, mapping = aes(x = dep_delay)) + 
  geom_histogram() + 
  labs(x = "Departure Delays (minutes)")

Note: May choose their own binwidth, but that is not required!

Growing:

If maps dep_delay to y-axis instead of x-axis

Feedback: For Question 2, our histograms are always made with our quantitative variable on the x-axis!

If they don’t label the axis with the units

Feedback: Every plot we make should have descriptive axis labels which include the units the variable was measured in. Were departure delays measured in hours? Minutes? Seconds? Days?

3 (a)

Make two other histograms, one with a binwidth of 15 and one with a binwidth of 150.

Success: Code should look like:

ggplot(data = nycflights, 
       mapping = aes(x = dep_delay)) + 
  geom_histogram(binwidth = 15) + 
  labs(x = "Departure Delays (minutes)")


ggplot(data = nycflights, 
       mapping = aes(x = dep_delay)) + 
  geom_histogram(binwidth = 150) +
  labs(x = "Departure Delays (minutes)")

Okay if their axis labels are not adjusted

If they don’t have nice axis labels for this question, we don’t need to mark them down. But we can give this feedback:

Careful! Look at the feedback for Question 2 about your axis labels.

Growing: If code does not use correct binwidths

Feedback: For Question 3(a), pay attention to what binwidth you need to use within your geom_histogram()!

3 (b)

How do these three histograms compare? Are features revealed in one that are obscured in another?

Success:

Discusses how wider bins make small differences hard to see and / or how narrower bins make differences easier to see (i.e., narrower bins show the shape of the distribution in more detail)

Growing:

If description only references how the bins are larger or smaller, but doesn’t talk about the shape of the distribution
Only talks about how the y-axis scale changes

Feedback: For Question 3(b), your discussion should address how the binwidth affects the shape of the distribution (e.g., skew, peak) and what you are able to see.
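A rough sketch of why the binwidth matters, using a made-up set of delays (toy_delays is hypothetical) and ggplot2's layer_data() to read back the bars a histogram would draw. It is meant as a grader aid, not as expected student code.

```r
library(ggplot2)

# Hypothetical departure delays in minutes: mostly small, a few very large
toy_delays <- data.frame(dep_delay = c(-5, 0, 3, 10, 20, 45, 130, 290))

narrow_plot <- ggplot(toy_delays, aes(x = dep_delay)) +
  geom_histogram(binwidth = 15)
wide_plot <- ggplot(toy_delays, aes(x = dep_delay)) +
  geom_histogram(binwidth = 150)

# layer_data() returns one row per bar, including the count in each bin
narrow_bins <- layer_data(narrow_plot)
wide_bins   <- layer_data(wide_plot)

nrow(narrow_bins)  # many narrow bars: short and long delays are separated
nrow(wide_bins)    # far fewer bars: most flights collapse into one wide bar
```

Both plots account for the same flights; only the amount of visible detail about the shape changes.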

4

Fill in the code to create a new dataframe named sfo_flights that is the result of filter()ing only the observations whose destination was San Francisco.

Success: Code should look like:

sfo_flights <- filter(nycflights, 
                      dest == "SFO")

Growing: If they use the wrong destination in their filter()

Feedback: For Question 4, you need to look at the description for how the airport names were coded!

5

Fill in the code below to find the number of flights flying into SFO in July that arrived early. What does the result tell you?

Success: Code should look like:

filter(sfo_flights, 
       month == 7, 
       arr_delay < 0) %>% 
  dim()

Growing:

If they put 7 in quotations ("7")

Feedback: We use quotation marks for the values of categorical variables. Does R think that month is a categorical variable? (Look back at what you said in Question 1(b).)

If they use a > instead of a <

Feedback: Careful! If my flight arrived 5 minutes early, how would my arrival delay be recorded in the dataset? How does this translate to the inequality we want to use in our data filter?
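The sketch below, on a made-up tibble (toy_sfo is hypothetical), shows the sign convention for early arrivals and how dim() reports the result. It also illustrates a point worth knowing when grading: base R coerces the number to text when comparing 7 with "7", so a submission that quotes the month will likely return the same rows, and the mistake only shows up in the code, not the output.

```r
library(dplyr)

# Hypothetical toy data: negative arr_delay means the flight arrived early
toy_sfo <- tibble(
  month     = c(7L, 7L, 7L, 8L),
  arr_delay = c(-5, 12, -20, -3)
)

# July flights that arrived early
early_july <- filter(toy_sfo, month == 7, arr_delay < 0)
dim(early_july)              # c(rows, columns); rows = number of flights
n_early <- nrow(early_july)

# Same filter with the month in quotation marks: R coerces 7 to "7" here
quoted_july <- filter(toy_sfo, month == "7", arr_delay < 0)
same_rows <- identical(early_july, quoted_july)
```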

6

When you ran the code above it output a preview of the filtered dataset. What does this output tell you about the number of flights that met your criteria (flying into SFO, in July, and arrived early)?

They should state that 45 flights (rows) met the criteria (arrived at SFO early in July).

Growing:

If they don’t interpret 45 as the number of flights – only as the number of rows

Feedback: What are the observations included in the rows in the context of this dataset? Airplanes? Airports? Flights?

If they interpret 45 and 16 as flight numbers

Feedback: The 45 and the 16 refer to the size of the filtered dataset. What does that tell you in terms of the number of flights that satisfied your criteria out of the sfo_flights dataset?

7

Calculate the following statistics for the arrival delays in the sfo_flights dataset:

mean
median
max
min

Success: Code should look like:

summarise(sfo_flights, 
          mean_ad = mean(arr_delay), 
          median_ad = median(arr_delay), 
          max_ad = max(arr_delay),
          min_ad = min(arr_delay)
          )

If they don’t name their summary statistics

Feedback: The output of your summary statistics looks a lot nicer if you give them names! (like the example that was given)

I’d write a comment if they used _dd in their names, since it should be _ad for the arrival delays.

Feedback: It would be clearer to use _ad at the end of your names, since you are summarizing the arrival delays (not the departure delays).

Growing:

If they miss some of the statistics

Feedback: For Question 7, you were asked to provide four (4) statistics!

If they use the wrong dataset (nycflights instead of sfo_flights)

Feedback: For Question 7, what dataset should you be using to calculate your summary statistics?
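To see what named and unnamed summaries look like side by side, here is a small sketch on made-up arrival delays (toy_sfo and its values are hypothetical). The single large delay also pulls the mean well above the median, which is useful context when reading students' answers about what to expect.

```r
library(dplyr)

# Hypothetical arrival delays in minutes, with one very late flight
toy_sfo <- tibble(arr_delay = c(-20, -5, 0, 12, 113))

# Named statistics: column names are the ones we chose
named_stats <- summarise(toy_sfo,
                         mean_ad   = mean(arr_delay),
                         median_ad = median(arr_delay),
                         max_ad    = max(arr_delay),
                         min_ad    = min(arr_delay))
named_stats

# Unnamed statistics: column names become the code itself
unnamed_stats <- summarise(toy_sfo, mean(arr_delay), median(arr_delay))
names(unnamed_stats)
```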

8

Using the above summary statistics, what would your answer be to my question? What should I expect if I am flying from New York to San Francisco?

Success

Makes a statement about what I should expect (flight to be early / late)
Justifies statement based on summary statistics

Growing

Does not justify their statement

Feedback: In Statistics it is critical to back your claim with data! How did you decide what I should expect if I am flying from NYC to SFO?

Says I will arrive early / late based on the median / mean

Feedback: Why is the median / mean a good measure of the “typical” delay?

Only discuss the statistics but don’t address the question

Feedback: You point out the statistics and what they mean in terms of an arrival delay, but how do these statistics connect with my question about what I should expect?

9

Now, rather than calculating summary statistics, plot the distribution of arrival delays for the sfo_flights dataset. Choose the type of plot you believe is appropriate for visualizing the distribution of arrival delays.

Success: Code should look like:

ggplot(data = sfo_flights, 
       mapping = aes(x = arr_delay)) + 
  geom_histogram() + 
  labs(x = "Arrival Delays (min)")

Acceptable geoms:
- geom_histogram()
- geom_dotplot() (though not great)
- geom_density()
It’s okay for them not to change the binwidth

Growing:

Uses geom_boxplot()

Feedback: For Question 9, we need to use a geom that allows us to see the shape of the distribution. Boxplots hide distributions with multiple modes, so what type of plot would be better?
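The claim that boxplots hide multiple modes can be illustrated with two made-up sets of delays (one_peak and two_peaks are hypothetical) that share the same five-number summary but have different shapes. This is a base R sketch for graders, not expected student code.

```r
# Hypothetical delays (minutes): one peak at 0
one_peak  <- c(-20, -10, -10, 0, 0, 0, 10, 10, 20)

# Hypothetical delays (minutes): peaks at -10 and 10, with a dip at 0
two_peaks <- c(-20, -10, -10, -10, 0, 10, 10, 10, 20)

# The five-number summaries (min, lower hinge, median, upper hinge, max) match,
# so boxplots drawn from them would look alike
same_summary <- identical(fivenum(one_peak), fivenum(two_peaks))
same_summary

# Counting values shows the different shapes a histogram would reveal
table(one_peak)
table(two_peaks)
```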

Doesn’t change x-axis label and / or doesn’t include the units (minutes) in their label

Feedback: You were asked to give your visualization nice axis labels. It is also important for the axis label to contain the units of the variable! What units were the arrival delays measured in?

10

Using the plot above, what would your answer be to my question? What should I expect if I am flying from New York to San Francisco?

Success

Makes a statement about what I should expect (flight to be early / late)
Justifies statement based on the plot

Growing

Does not justify their statement

Feedback: In Statistics it is critical to back your claim with data! How did you use this distribution to decide what I should expect if I am flying from NYC to SFO?

11

What information did the plot give you that was missing from the summary statistics?

Success:

Discusses what could be seen in the visualizations that could not be seen in the statistics
- skew
- mode / peak

Growing:

Response does not discuss what could be seen in the visualizations that could not be seen in the statistics

Feedback: You should address what you were able to SEE in the visualization (e.g., shape, skew, number of peaks) that you could not “see” in the summary statistics.

Response doesn’t discuss if / how the answer to the question changed

Feedback: Great job discussing how the shape of the distribution informs your choice of statistic! You did not, however, discuss if / how your answer to my question changed between looking at the plot versus looking at the statistics.