Multivariate Summaries

Lab 1A Debrief

The Importance of Axis Labels

A plot of the proportion of times the first digit of the volume of stocks sold by a stock market company was each digit (1-9). The x-axis label gives the context of the digits ('First Digit of Volume'), and the y-axis label gives the context of the proportions ('Proportion of Occurrences'). The plot title states 'Distribution of First Digit of Volume'.

The same plot as above, but with the contextual information stripped from the axis labels. The x-axis says only 'digit', leaving the reader wondering what digit is meant, and the y-axis says only 'proportion', leaving the reader wondering what these are proportions of.

Does your plot communicate the context of the data you are plotting?

The story so far…

Last Week

  • Reading in data and cleaning / prepping it.

  • Summarizing one categorical variable with a distribution.

  • Summarizing two categorical variables with joint and conditional distributions.

  • Using plotnine and the grammar of graphics to make bar plots and column plots.

Quantitative Variables So Far

  • Visualizing by converting to categorical.

  • Visualizing with histograms or densities.

  • Estimating probabilities from histograms and densities.

  • Describing the skew.

  • Calculating and explaining the mean and the median.

  • Calculating and explaining the standard deviation and variance.

Comparing Quantities Across Categories

New dataset: Airplane Flights

Which airline carriers are most likely to be delayed?

Let’s look at a data set of all domestic flights that departed from one of New York City’s airports (JFK, LaGuardia, and Newark) in 2013.

import pandas as pd
import numpy as np

df = pd.read_csv("data/flights.csv")
df
        year  month  day  ...  hour  minute             time_hour
0       2013      1    1  ...     5      15  2013-01-01T10:00:00Z
1       2013      1    1  ...     5      29  2013-01-01T10:00:00Z
2       2013      1    1  ...     5      40  2013-01-01T10:00:00Z
3       2013      1    1  ...     5      45  2013-01-01T10:00:00Z
4       2013      1    1  ...     6       0  2013-01-01T11:00:00Z
...      ...    ...  ...  ...   ...     ...                   ...
336771  2013      9   30  ...    14      55  2013-09-30T18:00:00Z
336772  2013      9   30  ...    22       0  2013-10-01T02:00:00Z
336773  2013      9   30  ...    12      10  2013-09-30T16:00:00Z
336774  2013      9   30  ...    11      59  2013-09-30T15:00:00Z
336775  2013      9   30  ...     8      40  2013-09-30T12:00:00Z

[336776 rows x 19 columns]

Delays

We already know how to summarize the flight delays:

Check-in 2.2: Interpret these numbers!

df['dep_delay'].median()
np.float64(-2.0)
df['dep_delay'].mean()
np.float64(12.639070257304708)
df['dep_delay'].std()
np.float64(40.21006089212995)

Delays

We already know how to visualize the flight delays:

Check-in 2.2: How would you describe this distribution?

Delays by Origin

Do the three origin airports (JFK, LGA, EWR) have different delay patterns?

Check-in 2.2: What could you change in this code to include the origin variable?

from plotnine import *

(
  ggplot(df, aes(x = 'dep_delay')) + 
  geom_histogram() + 
  theme_bw()
)

Delays by Origin

Overlapping histograms can be really hard to read…

Code
(
  ggplot(df, aes(x = 'dep_delay', fill = 'origin')) + 
  geom_histogram() + 
  theme_bw()
)

Delays by Origin

… but overlapping densities often look nicer…

Code
(
  ggplot(df, aes(x = 'dep_delay', fill = 'origin')) + 
  geom_density() + 
  theme_bw()
)

Delays by Origin

… especially if we make them a little see-through!

Code
(
  ggplot(df, aes(x = 'dep_delay', fill = 'origin')) + 
  geom_density(alpha = 0.5) + 
  theme_bw()
)

Variable Transformations

  • That last plot was okay, but it was hard to see the details because the distribution is so strongly right-skewed.

  • Sometimes, for easier visualization, it is worth transforming a variable.

  • For skewed data, we often use a log transformation.

Log Transformation

Example: Salaries of $10,000, $100,000, and $10,000,000:

dat = pd.DataFrame({"salary": [10000, 100000, 10000000]})
dat["log_salary"] = np.log(dat["salary"])
Code
(
  ggplot(data = dat, mapping = aes(x = "salary")) + 
  geom_histogram(bins = 100) + 
  theme_bw()
)

Code
(
  ggplot(data = dat, mapping = aes(x = "log_salary")) + 
  geom_histogram(bins = 100) + 
  theme_bw()
)

Log Transformations

  • Usually, we use the natural log, just for convenience.

Pros:

Skewed data looks less skewed, so it is easier to see patterns.

Cons:

The variable is now on a different scale, so it is not as interpretable.

Remember, log transformations need positive numbers!
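Why positive? A minimal numpy sketch of what happens otherwise:

import numpy as np

x = np.array([100.0, 1.0, 0.0, -5.0])

# log(0) is -inf and log(negative) is nan, so zero or negative values
# would break the transformation; errstate just silences the warnings
with np.errstate(divide = 'ignore', invalid = 'ignore'):
    print(np.log(x))  # roughly [4.605  0.  -inf  nan]

# np.exp undoes the transformation for positive inputs
print(np.exp(np.log(100.0)))  # approximately 100, up to floating-point rounding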

Delays by Origin - Transformed

Code
# Shift delays to be above zero
df['delay_shifted'] = df['dep_delay'] - df['dep_delay'].min() + 1

# Log transform
df['log_delay'] = np.log(df['delay_shifted'])

(
  ggplot(df, aes(x = 'log_delay', fill = 'origin')) + 
  geom_density(alpha = 0.5) + 
  theme_bw()
)

Boxplots

Another option: Boxplots

Code
(
  ggplot(df, mapping = aes(y = 'log_delay', x = 'origin')) + 
  geom_boxplot() + 
  labs(x = "", 
       y = "Log Delay (minutes)", 
       title = "Comparing Departure Delays for NYC Airports") +
  theme_bw()
)

Code
(
  ggplot(df, mapping = aes(y = 'log_delay', x = 'origin')) + 
  geom_boxplot() +
  labs(x = "", 
       y = "Log Delay (minutes)", 
       title = "Comparing Departure Delays for NYC Airports") +
  coord_flip() +
  theme_bw()
)

Facetting

Facetting

This plot was still a little hard to read.

What if we just made separate plots for each origin?

Separate Plots for Each Origin

One option would be to create separate data frames for each origin.

is_jfk = (df['origin'] == "JFK")
df_jfk = df[is_jfk]
df_jfk
        year  month  day  ...             time_hour  delay_shifted  log_delay
2       2013      1    1  ...  2013-01-01T10:00:00Z           46.0   3.828641
3       2013      1    1  ...  2013-01-01T10:00:00Z           43.0   3.761200
8       2013      1    1  ...  2013-01-01T11:00:00Z           41.0   3.713572
10      2013      1    1  ...  2013-01-01T11:00:00Z           42.0   3.737670
11      2013      1    1  ...  2013-01-01T11:00:00Z           42.0   3.737670
...      ...    ...  ...  ...                   ...            ...        ...
336766  2013      9   30  ...  2013-10-01T02:00:00Z           34.0   3.526361
336767  2013      9   30  ...  2013-10-01T02:00:00Z           39.0   3.663562
336768  2013      9   30  ...  2013-10-01T02:00:00Z           56.0   4.025352
336769  2013      9   30  ...  2013-10-01T03:00:00Z           34.0   3.526361
336771  2013      9   30  ...  2013-09-30T18:00:00Z            NaN        NaN

[111279 rows x 21 columns]

This seems kind of annoying…

FYI: Boolean Masking

How did we filter the previous df to only include "JFK" origins?

Step 1

is_jfk = (df['origin'] == "JFK")
is_jfk
0         False
1         False
2          True
3          True
4         False
          ...  
336771     True
336772    False
336773    False
336774    False
336775    False
Name: origin, Length: 336776, dtype: bool

Step 2

df_jfk = df[is_jfk]
df_jfk["origin"]
2         JFK
3         JFK
8         JFK
10        JFK
11        JFK
         ... 
336766    JFK
336767    JFK
336768    JFK
336769    JFK
336771    JFK
Name: origin, Length: 111279, dtype: object

Facetting

Fortunately, plotnine (and other plotting packages) has a trick for you!

(
  ggplot(df, aes(x = 'dep_delay')) + 
  geom_density() + 
  facet_wrap('origin')
)

Freeing the Scales

Code
(
  ggplot(df, aes(x = 'dep_delay')) + 
  geom_density() + 
  facet_wrap('origin', scales = "free_y") +
  labs(x = "Departure Delay (minutes)")
)

Summaries by Group

Split-apply-combine

  • Our visualizations told us some of the story, but can we use numeric summaries as well?

  • To do this, we want to calculate the mean or median delay time for each origin airport.

  • We call this split-apply-combine:

    • split the dataset up by a categorical variable origin
    • apply a calculation like mean
    • combine the results back into one dataset
  • In pandas, we use the groupby() function to take care of the split and combine steps (see the sketch below)!
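To make those three steps concrete, here is a sketch that does the split-apply-combine by hand with a loop. This is not how you would normally write it, but it shows what groupby() automates:

# split: one subset per origin; apply: mean of each subset
results = {}
for airport in df['origin'].unique():
    subset = df[df['origin'] == airport]
    results[airport] = subset['dep_delay'].mean()

# combine: gather the results back into one Series
pd.Series(results, name = 'dep_delay')
# same values as groupby() gives below: EWR 15.11, JFK 12.11, LGA 10.35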

Group-by

(
  df
  .groupby(by = "origin")["dep_delay"]
  .mean()
)
origin
EWR    15.107954
JFK    12.112159
LGA    10.346876
Name: dep_delay, dtype: float64
(
  df
  .groupby(by = "origin")["dep_delay"]
  .median()
)
origin
EWR   -1.0
JFK   -1.0
LGA   -3.0
Name: dep_delay, dtype: float64

Group-by Check-in

Check-in 2.2

  • Which part of the code performs the “split by origin” step?

  • Which part performs the “calculate the mean of delays” step?

  • Which part re-combines the results back into one object?

(
  df
  .groupby(by = "origin")["dep_delay"]
  .mean()
)

Standardized Values

Simple Example: Exam Scores

Hermione’s exam scores are:

  • Potions class: 77/100

  • Charms class: 95/100

  • Herbology class: 90/100

In which class did she do best?

But wait!

The class means are:

  • Potions class: 75/100

  • Charms class: 85/100

  • Herbology class: 85/100

In which class did she do best?

But wait!

The class standard deviations are:

  • Potions class: 2 points

  • Charms class: 5 points

  • Herbology class: 1 point

In which class did she do best?
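One way to settle this is to measure each score in standard deviations above its class mean; this is the z-score, defined formally in the next section. A quick sketch with her numbers:

import pandas as pd

scores = pd.DataFrame({
    "subject": ["Potions", "Charms", "Herbology"],
    "score":   [77, 95, 90],
    "mean":    [75, 85, 85],
    "sd":      [2, 5, 1],
})

# z-score: how many standard deviations above the class mean?
scores["z"] = (scores["score"] - scores["mean"]) / scores["sd"]
scores
#      subject  score  mean  sd    z
# 0    Potions     77    75   2  1.0
# 1     Charms     95    85   5  2.0
# 2  Herbology     90    85   1  5.0

Relative to her classmates, Herbology (z = 5) was her best performance, even though Charms had the highest raw score.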

Different variabilities by origin

In addition to having different centers, the three origins also have different spreads.

(
  df
  .groupby("origin")["dep_delay"]
  .std()
)
origin
EWR    41.323704
JFK    39.035071
LGA    39.993021
Name: dep_delay, dtype: float64


In general, flights from "EWR" have departure delays that are the furthest from the mean.

Standardized values

  • We standardize values by subtracting the mean and dividing by the standard deviation.

  • This tells us how much better/worse than typical values our target value is.

  • This is also called the z-score. \[z_i = \frac{x_i - \bar{x}}{s_x}\]

Standardized values

Suppose you fly from LGA and your flight is 40 minutes late. Your friend flies from JFK and their flight is 30 minutes late.

Who got “unluckier”? Let’s plug in the group means and standard deviations we calculated earlier.


You?

round((40 - 10.35) / 39.99, 3)
0.741

Your friend?

round((30 - 12.11) / 39.04, 3)
0.458
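Rather than plugging in each group's mean and standard deviation by hand, we can standardize every flight at once. One reasonable approach is a sketch using groupby() with transform():

# Standardize each delay relative to its own origin airport
group = df.groupby('origin')['dep_delay']
df['delay_z'] = (df['dep_delay'] - group.transform('mean')) / group.transform('std')

# Sanity check: within each origin, delay_z now has mean ~0 and std ~1
df.groupby('origin')['delay_z'].agg(['mean', 'std'])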

Activity 2.2

Do the different airlines have different patterns of flight delays?

  • Make a plot to answer the question.

  • Calculate values to answer the question.

  • The first row is a flight from EWR to CLT on US Airways. The second row is a flight from LGA to IAH on United Airlines. Which one was a “more extreme” delay?

Relationships Between Quantitative Variables

Did older passengers pay a higher fare on the Titanic?

To visualize two quantitative variables, we make a scatterplot (or point geometry).

Code
(
  ggplot(data = df_titanic, mapping = aes(x = 'age', y = 'fare')) + 
  geom_point() +
  labs(x = "Age of Passenger", 
       y = "Fare Paid on Titanic")
)

Scatterplots

Notice

  • The explanatory variable was on the x-axis.

  • The response variable was on the y-axis.

  • “If you are older, you pay more” not “If you pay more, you get older”.

(
  ggplot(data = df_titanic, 
         mapping = aes(x = 'age', y = 'fare')
         ) + 
  geom_point() +
  labs(x = "Age of Passenger", 
       y = "Fare Paid on Titanic")
)

Making a Clearer Plot

Did you notice how difficult it was to pick out each point?

Point Size

Code
(
  ggplot(data = df_titanic, 
         mapping = aes(x = 'age', y = 'fare')
         ) + 
  geom_jitter(size = 0.5) +
  labs(x = "Age of Passenger", 
       y = "Fare Paid on Titanic")
)

Transparency

Code
(
  ggplot(data = df_titanic, 
         mapping = aes(x = 'age', y = 'fare')
         ) + 
  geom_point(alpha = 0.5) +
  labs(x = "Age of Passenger", 
       y = "Fare Paid on Titanic")
)

Spicing Things Up

How could we make this more interesting?

  • Use a log-transformation for fare because it is very skewed.

  • Add in a third variable, pclass. How might you do this?
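One possible approach, sketched below: map pclass to color (wrapped in factor() so plotnine treats it as categorical) and put fare on a log scale. Note that fares of 0 will be dropped by the log scale.

(
  ggplot(data = df_titanic, 
         mapping = aes(x = 'age', y = 'fare', color = 'factor(pclass)')
         ) + 
  geom_jitter(alpha = 0.5) +
  scale_y_log10() +
  labs(x = "Age of Passenger", 
       y = "Fare Paid on Titanic (log scale)",
       color = "Passenger Class")
)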

Challenge

Can you re-create this plot?

Describing a Scatterplot

Let’s look at just third class:

Code
is_third = df_titanic['pclass'] == 3
df_third = df_titanic[is_third]

(
  ggplot(df_third, aes(x = 'age', y = 'fare')) + 
  geom_jitter(alpha = 0.8) + 
  theme_bw()
)

Describing the Relationship

Strength

Not very strong: the points don’t follow a clear pattern.

Direction

Slightly negative: When age was higher, fare was a little lower.

Shape

Not very linear: the points don’t form a straight line.

Correlation

What if we want a numerical summary of the relationship between variables?

  • Do “older than average” people pay “higher than average” fares?

    • In other words, when the z-score of age was high, was the z-score of fare also high?
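In fact, the correlation is (up to the n - 1 divisor) just the average product of the two z-scores. A sketch that computes it by hand, which should match pandas’ built-in .corr() used below:

# Correlation by hand: average product of the z-scores
pairs = df_third[['age', 'fare']].dropna()
z_age = (pairs['age'] - pairs['age'].mean()) / pairs['age'].std()
z_fare = (pairs['fare'] - pairs['fare'].mean()) / pairs['fare'].std()

(z_age * z_fare).sum() / (len(pairs) - 1)
# about -0.24, matching df_third[['age', 'fare']].corr()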

Age & Fare Correlation

Code
mean_age = df_third['age'].mean()
mean_fare = df_third['fare'].mean()

(
  ggplot(data = df_third, mapping = aes(x = 'age', y = 'fare')) + 
  geom_jitter(alpha = 0.8) + 
  geom_vline(xintercept = mean_age, color = "red", linetype = "dashed") + 
  geom_hline(yintercept = mean_fare, color = "red", linetype = "dashed") + 
  labs(x = "Age of Passenger", 
       y = "Titanic Fare Paid")
  theme_bw()
)

Correlation

Interpret this result:

df_third[['age', 'fare']].corr()
           age      fare
age   1.000000 -0.238137
fare -0.238137  1.000000


Age and fare are weakly negatively correlated (about -0.24).

Can you think of an explanation for this?

Correlation is not the Relationship

A 3x7 grid of scatterplots, each labeled with its correlation coefficient. The first row shows linear relationships with correlations running from 1.0 down to -1.0 in steps of 0.2: the points form a tight upward-sloping line at 1.0, spread out as the correlation approaches 0, and tighten into a downward-sloping line at -1.0. The second row shows straight lines of different slopes, illustrating that correlation measures the strength and direction of a linear relationship, not its slope. The third row shows strongly patterned but non-linear shapes (a sine wave, a diamond, a parabola, an hourglass, a circle, and clustered points), all with correlation 0, demonstrating that a correlation of zero does not mean there is no relationship.

Just for fun: Guess the Correlation Game

Takeaways

Takeaways

  • Plot quantitative variables across groups with overlapping density plots, boxplots, or by facetting.

  • Summarize quantitative variables across groups by using groupby() and then calculating summary statistics.

  • Know what split-apply-combine means.

  • Plot relationships between quantitative variables with a scatterplot.

  • Describe the strength, direction, and shape of the relationship displayed in a scatterplot.

  • Summarize relationships between quantitative variables with the correlation coefficient.