Visualizing and Comparing Categorical Variables

The story so far…

Getting and Prepping Data

import pandas as pd

df = pd.read_csv("data/titanic.csv")


df["Pclass"] = df["Pclass"].astype("category")
df["Survived"] = df["Survived"].astype("category")
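To see what the conversion to a categorical type changes, here is a minimal sketch on a hypothetical four-row frame. The `toy` data and its `Fare` values are made up for illustration and are not read from data/titanic.csv.

```python
import pandas as pd

# Hypothetical toy rows standing in for the Titanic file
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 2],
    "Survived": [0, 1, 1, 0],
    "Fare": [7.25, 71.28, 7.93, 13.00],
})

# Stored as numbers, pandas would happily average the class codes
print(toy["Pclass"].mean())

# Convert the coded columns so pandas treats them as labels
toy["Pclass"] = toy["Pclass"].astype("category")
toy["Survived"] = toy["Survived"].astype("category")

print(toy["Pclass"].dtype)                      # category
print(toy["Pclass"].cat.categories.tolist())    # [1, 2, 3]

# describe() summarizes only numeric columns by default,
# so the categorical columns drop out of the summary
numeric_summary = toy.describe()
print(numeric_summary.columns.tolist())         # ['Fare']
```

This is likely why Survived and Pclass do not appear in the df.describe() output later on.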

Thinking About Variable Types

PassengerId  Survived  Pclass  Name                                                 Sex     Age  SibSp  Parch  Ticket            Fare     Cabin  Embarked
1            0         3       Braund, Mr. Owen Harris                              male    22   1      0      A/5 21171         7.2500   NaN    S
2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Thayer)  female  38   1      0      PC 17599          71.2833  C85    C
3            1         3       Heikkinen, Miss. Laina                               female  26   0      0      STON/O2. 3101282  7.9250   NaN    S
4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)         female  35   1      0      113803            53.1000  C123   S
5            0         3       Allen, Mr. William Henry                             male    35   0      0      373450            8.0500   NaN    S
6            0         3       Moran, Mr. James                                     male    NaN  0      0      330877            8.4583   NaN    Q

Accessing Rows and Columns

df.iloc[5]
PassengerId                   6
Survived                      0
Pclass                        3
Name           Moran, Mr. James
Sex                        male
Age                         NaN
SibSp                         0
Parch                         0
Ticket                   330877
Fare                     8.4583
Cabin                       NaN
Embarked                      Q
Name: 5, dtype: object
df["Name"]
0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object
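The two access patterns above can be compared side by side. This is a small sketch on a hypothetical three-row frame (only the names are borrowed from the output above), showing selection by position, by label, and with a list of labels.

```python
import pandas as pd

# Hypothetical three-row frame, not the full Titanic data
toy = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina", "Moran, Mr. James"],
    "Age": [22.0, 26.0, None],
})

row = toy.iloc[2]                  # one row, selected by integer position
col = toy["Name"]                  # one column, selected by label
cell = toy.loc[2, "Name"]          # one value, selected by row label and column label
two_cols = toy[["Name", "Age"]]    # a list of labels returns a DataFrame

print(row["Name"])                                    # Moran, Mr. James
print(type(col).__name__, type(two_cols).__name__)    # Series DataFrame
```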

Quick Summary of Quantitative Variables

df.describe()
       PassengerId         Age       SibSp       Parch        Fare
count   891.000000  714.000000  891.000000  891.000000  891.000000
mean    446.000000   29.699118    0.523008    0.381594   32.204208
std     257.353842   14.526497    1.102743    0.806057   49.693429
min       1.000000    0.420000    0.000000    0.000000    0.000000
25%     223.500000   20.125000    0.000000    0.000000    7.910400
50%     446.000000   28.000000    0.000000    0.000000   14.454200
75%     668.500000   38.000000    1.000000    0.000000   31.000000
max     891.000000   80.000000    8.000000    6.000000  512.329200

Summarizing Categorical Variables

The list of percents for each category is called the distribution of the variable.

df["Pclass"].value_counts()
Pclass
3    491
1    216
2    184
Name: count, dtype: int64
df["Pclass"].value_counts(normalize = True)
Pclass
3    0.551066
1    0.242424
2    0.206510
Name: proportion, dtype: float64
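As a quick check on what normalize = True does, the sketch below rebuilds the class column from the counts shown above and computes the proportions by hand. The `pclass` Series is a reconstruction for illustration, not the original data frame.

```python
import pandas as pd

# Class labels repeated according to the counts in the output above
pclass = pd.Series([3] * 491 + [1] * 216 + [2] * 184, name="Pclass")

counts = pclass.value_counts()

# A distribution is each count divided by the total number of rows
by_hand = counts / counts.sum()

# normalize=True performs the same division
built_in = pclass.value_counts(normalize=True)

print(by_hand.round(6))
```

Because every passenger falls in exactly one class, the proportions add up to 1.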

Visualizing One Categorical Variable

The Grammar of Graphics

The Grammar of Graphics (GoG) is a framework for creating data visualizations.

A visualization consists of:

  • The aesthetic: Which variables are dictating which plot elements.

  • The geometry: What shape of plot you are making.

  • The theme: Other choices about the appearance.

A picture demonstrating the central idea of the grammar of graphics, that a visualization is comprised of layers. Each layer is displayed as a 3-D square in a different color with text written next to it. The first layer is the data, followed by the aesthetics, then the geometries, then facets, then statistics, then coordinates, then finally a theme.

Penguins!

from palmerpenguins import load_penguins

penguins = load_penguins()

A picture of an Adelie penguin. The penguin looks like a standard penguin, but has a very short bill.

A picture of a Chinstrap penguin. You can tell the penguin is a Chinstrap penguin because it has a small black line underneath its jawline that looks like a strap.

Example plotnine Code

from plotnine import ggplot, geom_point, aes, geom_boxplot, labs

(
  ggplot(penguins, mapping = aes(x = "species", 
                                 y = "bill_length_mm", 
                                 fill = "sex")
        ) +
  geom_boxplot() + 
  labs(x = "Species", 
       y = "Bill Length (mm)", 
       fill = "Penguin Sex")
)

Take 90 seconds

Draw what plot you think this code would produce.

Revealed!

Aesthetics & Geometries

The Grammar of Graphics framework maps variables from the data to aesthetics in the plot.


What aesthetics are variables mapped onto in the plot?

The GoG also uses different geometries to represent the data.


What shape(s) are used to represent the data / observations in the plot?

plotnine

The plotnine library implements the grammar of graphics in Python.

  • The aes() function is the place to map variables to plot aesthetics.
    • x, y, and fill are three possible aesthetics that can be specified
  • A variety of geom_XXX() functions allow for different plotting shapes (e.g. boxplot, histogram, etc.)
    • Aesthetics can differ based on the geom you choose!

Themes

from plotnine import theme_bw

(
  ggplot(penguins, aes(x = "species", 
                       y = "bill_length_mm", 
                       fill = "sex")
                       ) + 
  geom_boxplot() + 
   labs(x = "Species", 
        y = "Bill Length (mm)", 
        fill = "Penguin Sex") +
  theme_bw()
)

Check-In

What are the aesthetics and geometry in the cartoon plot below?

A graph showing the 'urge to try running up the down escalator' (y-axis) against age (x-axis). The y-axis ranges from weak to strong, and the x-axis spans ages 0 to 24. Two lines are plotted: 'What I was supposed to feel,' which peaks at age 10 and declines steeply thereafter; and 'What I've actually felt,' which remains high and relatively flat after age 10. Stick figures are drawn on the graph to illustrate the difference, with labels pointing to key points.

An XKCD comic

Bar Plots

To visualize the distribution of a categorical variable, we should use a bar plot.

from plotnine import *

(
  ggplot(data = df, mapping = aes(x = "Pclass")) + 
  geom_bar() + 
  labs(x = "Class of Passenger on Titanic") +
  theme_bw()
)

Calculating Percents

pclass_dist = (
  df['Pclass']
  .value_counts(normalize = True)
  .reset_index()
  )
  
pclass_dist
  Pclass  proportion
0      3    0.551066
1      1    0.242424
2      2    0.206510

Why reset the index? What does that do?
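One way to explore this question is to look at the object before and after the call. The sketch below rebuilds the class column from the counts shown earlier (a reconstruction, not the original data) and assumes pandas 2.0 or later for the column names it prints.

```python
import pandas as pd

# Class labels repeated according to the counts shown earlier
pclass = pd.Series([3] * 491 + [1] * 216 + [2] * 184, name="Pclass")

as_series = pclass.value_counts(normalize=True)
print(as_series.index.tolist())    # the classes live in the index: [3, 1, 2]

as_frame = as_series.reset_index()
print(as_frame.columns.tolist())   # the classes are now an ordinary column
print(as_frame.index.tolist())     # and the rows get a fresh index: [0, 1, 2]
```

Since aes() refers to columns by name, having the class labels in a regular column is what lets the next plot map them to x.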

Percents on Plots

(
  ggplot(data = pclass_dist, 
         mapping = aes(x = "Pclass", y = "proportion")) + 
  geom_col() + ### notice this change to a column plot!
  labs(x = "Class of Passenger on Titanic") +
  theme_bw()
)

Tip

Technically, you could still use geom_bar(), but you would need to specify that you didn’t want it to use stat = "count" (the default). You’ve already calculated the proportions, so you would use geom_bar(stat = "identity").

Visualizing Two Categorical Variables

Option 1: Stacked Bar Plot

(
  ggplot(data = df, mapping = aes(x = "Pclass", fill = "Sex")) + 
  geom_bar(position = "stack") + 
  labs(x = "Class of Passenger on Titanic", 
       fill = "Sex of Passenger") +
  theme_bw()
)

Option 1: Stacked Bar Plot

What are some pros and cons of the stacked bar plot?

Pros

  • We can still see the total counts in each class
  • We can easily compare the male counts in each class, since those bars are on the bottom.

Cons

  • It is hard to compare the female counts, since those bars are stacked on top.
  • It is hard to estimate the distributions.

Option 2: Side-by-Side Bar Plot

(
  ggplot(data = df, mapping = aes(x = "Pclass", fill = "Sex")) + 
  geom_bar(position = "dodge") + 
  labs(x = "Class of Passenger on Titanic", 
       fill = "Sex of Passenger") +
  theme_bw()
)

Option 2: Side-by-Side Bar Plot

What are some pros and cons of the side-by-side bar plot?

Pros

  • We can easily compare the female counts in each class.

  • We can easily compare the male counts in each class.

  • We can easily see the counts of each sex within each class.

Cons

  • It is hard to see total counts in each class.

  • It is hard to estimate the distributions.

Option 3: Stacked Percentage Bar Plot

(
  ggplot(data = df, mapping = aes(x = "Pclass", fill = "Sex")) + 
  geom_bar(position = "fill") +
  labs(x = "Class of Passenger on Titanic", 
       fill = "Sex of Passenger") +
  theme_bw()
)

Option 3: Stacked Percentage Bar Plot

What are some pros and cons of the stacked percentage bar plot?

Pros

  • This is the best way to compare sex balance across classes!

  • This is the option I use the most, because it can answer “Are you more likely to find ______ in ______ ?” type questions.

Cons

  • We can no longer see any counts!

Activity 1.2

Choose one of the plots from lecture so far and “upgrade” it.

You can do this by:

  • Finding and using a different theme

  • Trying different variables

  • Trying different geometries

  • Using + scale_fill_manual() to change the colors being used

Tip

  • You will need to use the plotnine documentation and online resources!

  • Check out https://www.data-to-viz.com/ for ideas and example code.

  • Ask GenAI questions like, “What do I add to a plotnine bar plot to change the colors?” (But of course, make sure you understand the code you use!)

Joint Distributions

Two Categorical Variables

df[["Pclass", "Sex"]].value_counts()
Pclass  Sex   
3       male      347
        female    144
1       male      122
2       male      108
1       female     94
2       female     76
Name: count, dtype: int64



But this is a little hard to read…

Two-way Table

(
  df[["Pclass", "Sex"]]
  .value_counts()
  .unstack()
  )
Sex     female  male
Pclass              
1           94   122
2           76   108
3          144   347


Pivot Table

Essentially, unstack() has pivoted the Sex column from long format (where all of the values sit in one column) to wide format (where each value has its own column).
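The long-to-wide pivot can be tried on its own. This sketch types in the counts from the value_counts() output above as a Series with a two-level index; the variable names `long`, `wide`, and `wide_by_class` are chosen here for illustration.

```python
import pandas as pd

# Counts copied from the value_counts() output above
index = pd.MultiIndex.from_tuples(
    [(3, "male"), (3, "female"), (1, "male"),
     (2, "male"), (1, "female"), (2, "female")],
    names=["Pclass", "Sex"],
)
long = pd.Series([347, 144, 122, 108, 94, 76], index=index, name="count")

# unstack() moves the innermost index level (Sex) into the columns
wide = long.unstack()
print(wide)

# level= chooses a different index level to become the columns
wide_by_class = long.unstack(level="Pclass")
print(wide_by_class)
```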

Two-way Table - Percents

(
  df[["Pclass", "Sex"]]
  .value_counts(normalize = True)
  .unstack()
  )
Sex       female      male
Pclass                    
1       0.105499  0.136925
2       0.085297  0.121212
3       0.161616  0.389450


All of these values should sum to 1, that is, 100%!
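The claim is easy to verify. The sketch below rebuilds one row per passenger from the two-way table of counts (so `rebuilt` is a stand-in for df, not the original file) and adds up the joint proportions.

```python
import pandas as pd

# Rebuild one row per passenger from the two-way table of counts
counts = {
    (1, "female"): 94, (1, "male"): 122,
    (2, "female"): 76, (2, "male"): 108,
    (3, "female"): 144, (3, "male"): 347,
}
rows = [pair for pair, n in counts.items() for _ in range(n)]
rebuilt = pd.DataFrame(rows, columns=["Pclass", "Sex"])

joint = rebuilt[["Pclass", "Sex"]].value_counts(normalize=True).unstack()

# Every passenger lands in exactly one cell, so the cells sum to 1
total = joint.to_numpy().sum()
print(round(total, 6))    # 1.0

# Summing across a row or down a column collapses one of the variables
print(joint.sum(axis="columns"))    # share of passengers in each class
print(joint.sum(axis="index"))      # share of passengers of each sex
```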

Switching Variable Order

What cross-tabulation would you expect if we changed the order of the variables? In other words, what would happen if "Sex" came first and "Pclass" came second?

(
  df[["Sex", "Pclass"]]
  .value_counts(normalize = True)
  .unstack()
  )
Pclass         1         2         3
Sex                                 
female  0.105499  0.085297  0.161616
male    0.136925  0.121212  0.389450

Interpretation

We call this the joint distribution of the two variables.

Sex       female      male
Pclass                    
1       0.105499  0.136925
2       0.085297  0.121212
3       0.161616  0.389450

Of all the passengers on the Titanic, 11% were female passengers riding in first class.

  • NOT “11% of all females on Titanic…”
  • NOT “11% of all first class passengers…”

Conditional Distribution from Counts

We know that:

  • 314 passengers identified as female

  • Of those 314 passengers, 94 rode in first class

So:

  • 94 / 314 = 30% of female-identifying passengers rode in first class

Here we conditioned on the passenger being female, and then looked at the conditional distribution of Pclass.

Conditional Distribution from Percentages

We know that:

  • 35.2% of all passengers identified as female

  • 10.5% of all passengers identified as female and rode in first class

So:

  • 0.105 / 0.352 = 30% of female-identifying passengers rode in first class

Swapping Variables

We know that:

  • 216 passengers rode in first class

  • Of those 216 passengers, 94 identified as female

So:

  • 94 / 216 = 43.5% of first class passengers identified as female

Here we conditioned on the passenger being in first class, and then looked at the conditional distribution of Sex.
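Both directions of conditioning can be worked out from the same cell of the two-way table of counts shown earlier (94 first class female passengers). The sketch below does the arithmetic in plain Python; the dictionary names are chosen here for illustration.

```python
# Counts copied from the two-way table of Pclass by Sex
female_by_class = {1: 94, 2: 76, 3: 144}
male_by_class = {1: 122, 2: 108, 3: 347}

total_female = sum(female_by_class.values())                  # 314
first_class_total = female_by_class[1] + male_by_class[1]     # 216

# Condition on Sex: among female passengers, what share rode first class?
p_first_given_female = female_by_class[1] / total_female

# Condition on Pclass: among first class passengers, what share were female?
p_female_given_first = female_by_class[1] / first_class_total

print(f"{p_first_given_female:.1%}")    # 29.9%
print(f"{p_female_given_first:.1%}")    # 43.5%
```

The numerator is the same in both cases; only the group we divide by changes.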

Which one to condition on?

This depends on the research question you are trying to answer.

“What class did most female identifying passengers ride in?”

-> Of all female passengers, what is the conditional distribution of class?

“What was the gender breakdown of first class?”

-> Of all first class passengers, what is the conditional distribution of sex?

Calculating in Python

When we study two variables, we call each variable's individual one-variable distribution its marginal distribution.

marginal_class = (
  df['Pclass']
  .value_counts(normalize = True)
  )


marginal_class
Pclass
3    0.551066
1    0.242424
2    0.206510
Name: proportion, dtype: float64
marginal_sex = (
  df['Sex']
  .value_counts(normalize = True)
  )


marginal_sex
Sex
male      0.647587
female    0.352413
Name: proportion, dtype: float64

Calculating in Python

We need to divide the joint distribution (e.g. “11% of passengers were first class female”) by the marginal distribution of the variable we want to condition on (e.g. 35.2% of passengers were female).

joint_class_sex = (
  df[["Pclass", "Sex"]]
  .value_counts(normalize = True)
  .unstack()
  )
  
joint_class_sex.divide(marginal_sex)
Sex       female      male
Pclass                    
1       0.299363  0.211438
2       0.242038  0.187175
3       0.458599  0.601386

Check-In

marginal_sex
Sex
male      0.647587
female    0.352413
Name: proportion, dtype: float64
joint_class_sex
Sex       female      male
Pclass                    
1       0.105499  0.136925
2       0.085297  0.121212
3       0.161616  0.389450


joint_class_sex.divide(marginal_sex)
Sex       female      male
Pclass                    
1       0.299363  0.211438
2       0.242038  0.187175
3       0.458599  0.601386

How do you think divide() works?
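One way to test a guess is to reproduce the result by hand. The sketch below types in the values from the tables above and compares divide() with a column-by-column division; `by_hand` is a name introduced here for illustration.

```python
import pandas as pd

# Values copied from the joint and marginal tables above
joint = pd.DataFrame(
    {"female": [0.105499, 0.085297, 0.161616],
     "male": [0.136925, 0.121212, 0.389450]},
    index=pd.Index([1, 2, 3], name="Pclass"),
)
marginal_sex = pd.Series({"male": 0.647587, "female": 0.352413})

# divide() matches the Series labels to the DataFrame's column labels,
# so the "female" column is divided by the "female" proportion,
# even though "male" comes first in the Series
conditional = joint.divide(marginal_sex)

# The same result spelled out one column at a time
by_hand = pd.DataFrame({
    "female": joint["female"] / marginal_sex["female"],
    "male": joint["male"] / marginal_sex["male"],
})

print(conditional.round(6))
```

The matching is done by label rather than by position, which is why the order of the entries in marginal_sex does not matter.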

Check-In

Should the rows or columns add up to 100%? Why?

Sex       female      male
Pclass                    
1       0.299363  0.211438
2       0.242038  0.187175
3       0.458599  0.601386
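A quick way to check an answer is to add up the table in both directions. This sketch types in the conditional table above and sums it down the columns and across the rows.

```python
import pandas as pd

# Values copied from the conditional table above
conditional = pd.DataFrame(
    {"female": [0.299363, 0.242038, 0.458599],
     "male": [0.211438, 0.187175, 0.601386]},
    index=pd.Index([1, 2, 3], name="Pclass"),
)

column_totals = conditional.sum(axis="index")     # one total per sex
row_totals = conditional.sum(axis="columns")      # one total per class

print(column_totals.round(3))
print(row_totals.round(3))
```

Each column describes one group of passengers (all female passengers, or all male passengers) split across the three classes, so it is the columns that total 1.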

Conditional on Class

joint_class_sex = (
  df[["Sex", "Pclass"]]
  .value_counts(normalize = True)
  .unstack()
  )
  
joint_class_sex.divide(marginal_class)
Pclass         1         2         3
Sex                                 
female  0.435185  0.413043  0.293279
male    0.564815  0.586957  0.706721

What if you get it backwards?

joint_class_sex = (
  df[["Pclass", "Sex"]]
  .value_counts(normalize = True)
  .unstack()
  )
  
joint_class_sex.divide(marginal_class)
         1   2   3  female  male
Pclass                          
1      NaN NaN NaN     NaN   NaN
2      NaN NaN NaN     NaN   NaN
3      NaN NaN NaN     NaN   NaN
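The NaN values come from label matching: by default divide() looks for the Series labels (1, 2, 3) among the column labels ("female", "male") and finds no overlap. The sketch below reproduces this with typed-in values and also shows the axis argument of DataFrame.divide, an alternative to re-ordering the variables that the slides do not use. It assumes plain integer class labels rather than the categorical column in df.

```python
import pandas as pd

# Values copied from the joint and marginal tables above
joint = pd.DataFrame(
    {"female": [0.105499, 0.085297, 0.161616],
     "male": [0.136925, 0.121212, 0.389450]},
    index=pd.Index([1, 2, 3], name="Pclass"),
)
marginal_class = pd.Series({3: 0.551066, 1: 0.242424, 2: 0.206510})

# Default: class labels are matched against the column labels,
# nothing lines up, and every cell becomes NaN
mismatched = joint.divide(marginal_class)
print(mismatched.isna().all().all())    # True

# axis="index" matches the class labels against the row labels instead
by_class = joint.divide(marginal_class, axis="index")
print(by_class.round(6))
```

With axis="index", each row is divided by its class proportion, so the rows of by_class total 1.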

Which plot better answers:

“Did women tend to ride in first class more than men?”

Code
(
  ggplot(df, aes(x = "Pclass", fill = "Sex")) + 
  geom_bar(position = "fill") + 
  labs(x = "Class of Passenger on Titanic", 
       fill = "Sex of Passenger") +
  theme_bw()
)

(
  ggplot(df, aes(x = "Sex", fill = "Pclass")) + 
  geom_bar(position = "fill") + 
  labs(fill = "Class of Passenger on Titanic", 
       x = "Sex of Passenger") +
  theme_bw()
)

Takeaways

  • We use plotnine and the grammar of graphics to make visuals.

  • For two categorical variables, we might use a stacked bar plot, a side-by-side bar plot, or a stacked percentage bar plot, depending on what we are trying to show.

  • The joint distribution of two variables gives the percents in each subcategory.

  • The marginal distribution of a variable is its individual distribution.

  • The conditional distribution of a variable is its distribution among only one category of a different variable.

  • We calculate the conditional distribution by dividing the joint by the marginal.
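The last three takeaways can be tied together in one short sketch. It rebuilds one row per passenger from the two-way table of counts (so `rebuilt` stands in for df) and cross-checks the result against pd.crosstab with normalize="columns", which is a different route than the one taken in the slides.

```python
import pandas as pd

# Rebuild one row per passenger from the two-way table of counts
counts = {
    (1, "female"): 94, (1, "male"): 122,
    (2, "female"): 76, (2, "male"): 108,
    (3, "female"): 144, (3, "male"): 347,
}
rows = [pair for pair, n in counts.items() for _ in range(n)]
rebuilt = pd.DataFrame(rows, columns=["Pclass", "Sex"])

# Joint: share of all passengers in each (class, sex) cell
joint = rebuilt[["Pclass", "Sex"]].value_counts(normalize=True).unstack()

# Marginal: share of all passengers of each sex
marginal_sex = rebuilt["Sex"].value_counts(normalize=True)

# Conditional: joint divided by the marginal we condition on
class_given_sex = joint.divide(marginal_sex)

# Cross-check: crosstab can normalize within each column directly
check = pd.crosstab(rebuilt["Pclass"], rebuilt["Sex"], normalize="columns")

print(class_given_sex.round(3))
print((class_given_sex - check).abs().to_numpy().max() < 1e-12)    # True
```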