import pandas as pd
df = pd.read_csv("data/titanic.csv")

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.2500 | NA | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NA | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.0500 | NA | S |
| 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NA | Q |
0 Braund, Mr. Owen Harris
1 Cumings, Mrs. John Bradley (Florence Briggs Th...
2 Heikkinen, Miss. Laina
3 Futrelle, Mrs. Jacques Heath (Lily May Peel)
4 Allen, Mr. William Henry
...
886 Montvila, Rev. Juozas
887 Graham, Miss. Margaret Edith
888 Johnston, Miss. Catherine Helen "Carrie"
889 Behr, Mr. Karl Howell
890 Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object
PassengerId Age SibSp Parch Fare
count 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 29.699118 0.523008 0.381594 32.204208
std 257.353842 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 38.000000 1.000000 0.000000 31.000000
max 891.000000 80.000000 8.000000 6.000000 512.329200
The list of percents for each category is called the distribution of the variable.
The Grammar of Graphics (GoG) is a framework for creating data visualizations.
A visualization consists of:
The aesthetic: Which variables are dictating which plot elements.
The geometry: What shape of plot you are making.
The theme: Other choices about the appearance.



plotnine Code
Take 90 seconds
Draw what plot you think this code would produce.
The Grammar of Graphics framework maps variables from the data to aesthetics in the plot.
What aesthetics are variables mapped onto in the plot?
The GoG also uses different geometries to represent the data.
What shape(s) are used to represent the data / observations in the plot?
The plotnine library implements the grammar of graphics in Python.
The aes() function is the place to map variables to plot aesthetics.
x, y, and fill are three possible aesthetics that can be specified.
geom_XXX() functions allow for different plotting shapes (e.g. boxplot, histogram, etc.).
geom you choose!
What are the aesthetics and geometry in the cartoon plot below?
An XKCD comic
To visualize the distribution of a categorical variable, we should use a bar plot.
Pclass proportion
0 3 0.551066
1 1 0.242424
2 2 0.206510
Why reset the index? What does that do?

Tip
Technically, you could still use geom_bar(), but you would need to specify that you didn’t want it to use stat = "count" (the default). You’ve already calculated the proportions, so you would use geom_bar(stat = "identity").
What are some pros and cons of the stacked bar plot?
Pros
We can easily compare the male counts in each class, since those bars are on the bottom.
Cons
It is hard to compare the female counts, since those bars are stacked on top.
What are some pros and cons of the side-by-side bar plot?
Pros
We can easily compare the female counts in each class.
We can easily compare the male counts in each class.
We can easily see counts of each within each class.
Cons
It is hard to see total counts in each class.
It is hard to estimate the distributions.
What are some pros and cons of the stacked percentage bar plot?
Pros
This is the best way to compare sex balance across classes!
This is the option I use the most, because it can answer “Are you more likely to find ______ in ______ ?” type questions.
Cons
Choose one of the plots from lecture so far and “upgrade” it.
You can do this by:
Finding and using a different theme
Trying different variables
Trying a different geometry
Using + scale_fill_manual() to change the colors being used
Tip
You will need to use documentation of plotnine and online resources!
Check out https://www.data-to-viz.com/ for ideas and example code.
Ask GenAI questions like, “What do I add to a plotnine bar plot to change the colors?” (But of course, make sure you understand the code you use!)
Pclass Sex
3 male 347
female 144
1 male 122
2 male 108
1 female 94
2 female 76
Name: count, dtype: int64
But this is a little hard to read…
Sex female male
Pclass
1 94 122
2 76 108
3 144 347
Pivot Table
Essentially, unstack() has pivoted the Sex column from long format (where all the values sit in one column) to wide format (where each value has its own column).
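A minimal sketch of the long-to-wide step, assuming a stand-in data frame rebuilt from the counts shown in these slides rather than the full CSV:

```python
import pandas as pd

# Stand-in data (assumption): one row per passenger.
counts = {
    (1, "female"): 94, (1, "male"): 122,
    (2, "female"): 76, (2, "male"): 108,
    (3, "female"): 144, (3, "male"): 347,
}
rows = [(pclass, sex) for (pclass, sex), n in counts.items() for _ in range(n)]
df = pd.DataFrame(rows, columns=["Pclass", "Sex"])

# Long format: one count per (Pclass, Sex) pair, stored in a two-level index.
long_counts = df[["Pclass", "Sex"]].value_counts()

# Wide format: the innermost index level (Sex) becomes the columns.
wide_counts = long_counts.unstack()
print(wide_counts)
```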
Sex female male
Pclass
1 0.105499 0.136925
2 0.085297 0.121212
3 0.161616 0.389450
All of these values should sum to 1, aka, 100%!
What cross-tabulation would you expect if we changed the order of the variables? In other words, what would happen if "Sex" came first and "Pclass" came second?
We call this the joint distribution of the two variables.
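One way the joint distribution could be computed is to normalize the two-variable counts before unstacking. This is a sketch on a stand-in data frame rebuilt from the counts in these slides (an assumption), with a check that the proportions sum to 1.

```python
import pandas as pd

# Stand-in data (assumption): one row per passenger.
counts = {
    (1, "female"): 94, (1, "male"): 122,
    (2, "female"): 76, (2, "male"): 108,
    (3, "female"): 144, (3, "male"): 347,
}
rows = [(pclass, sex) for (pclass, sex), n in counts.items() for _ in range(n)]
df = pd.DataFrame(rows, columns=["Pclass", "Sex"])

# normalize=True divides every count by the total number of rows.
joint = df[["Pclass", "Sex"]].value_counts(normalize=True).unstack()

# Every passenger falls in exactly one cell, so the cells sum to 1.
total = joint.to_numpy().sum()
print(joint)
print(total)
```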
Sex female male
Pclass
1 0.105499 0.136925
2 0.085297 0.121212
3 0.161616 0.389450
Of all the passengers on the Titanic, 11% were female passengers riding in first class.
We know that:
314 passengers identified as female
Of those 314 passengers, 94 rode in first class
So: 94 / 314 ≈ 29.9% of female passengers rode in first class.
Here we conditioned on the passenger being female, and then looked at the conditional distribution of Pclass.
We know that:
35.2% of all passengers identified as female
About 10.5% of all passengers were female and rode in first class
So: 0.1055 / 0.3524 ≈ 0.299, or about 29.9% of female passengers rode in first class.
We know that:
216 passengers rode in first class
Of those 216 passengers, 94 identified as female
So: 94 / 216 ≈ 43.5% of first class passengers were female.
Here we conditioned on the passenger being in first class, and then looked at the conditional distribution of Sex.
This depends on the research question you are trying to answer.
“What class did most female-identifying passengers ride in?”
-> Of all female passengers, what is the conditional distribution of class?
“What was the gender breakdown of first class?”
-> Of all first class passengers, what is the conditional distribution of sex?
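The two research questions correspond to conditioning in two different directions. As a sketch, pd.crosstab() with its normalize argument can produce either one; the data frame is a stand-in rebuilt from the counts in these slides (an assumption), and the variable names are illustrative.

```python
import pandas as pd

# Stand-in data (assumption): one row per passenger.
counts = {
    (1, "female"): 94, (1, "male"): 122,
    (2, "female"): 76, (2, "male"): 108,
    (3, "female"): 144, (3, "male"): 347,
}
rows = [(pclass, sex) for (pclass, sex), n in counts.items() for _ in range(n)]
df = pd.DataFrame(rows, columns=["Pclass", "Sex"])

# Condition on Sex: each Sex column sums to 1 (distribution of class within a sex).
class_given_sex = pd.crosstab(df["Pclass"], df["Sex"], normalize="columns")

# Condition on Pclass: each Pclass row sums to 1 (distribution of sex within a class).
sex_given_class = pd.crosstab(df["Pclass"], df["Sex"], normalize="index")

print(class_given_sex)
print(sex_given_class)
```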
When we study two variables, we call each variable's individual, one-variable distribution the marginal distribution of that variable.
We need to divide the joint distribution (e.g. “about 10.5% of passengers were first class female”) by the marginal distribution of the variable we want to condition on (e.g. 35.2% of passengers were female).
Should the rows or columns add up to 100%? Why?
Sex female male
Pclass
1 0.299363 0.211438
2 0.242038 0.187175
3 0.458599 0.601386
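A table like the one above can be reproduced by dividing the joint distribution by the marginal distribution of Sex. The sketch below assumes a stand-in data frame rebuilt from the counts in these slides; DataFrame.div() with axis="columns" matches each Sex column to its marginal proportion.

```python
import pandas as pd

# Stand-in data (assumption): one row per passenger.
counts = {
    (1, "female"): 94, (1, "male"): 122,
    (2, "female"): 76, (2, "male"): 108,
    (3, "female"): 144, (3, "male"): 347,
}
rows = [(pclass, sex) for (pclass, sex), n in counts.items() for _ in range(n)]
df = pd.DataFrame(rows, columns=["Pclass", "Sex"])

# Joint distribution: proportion of all passengers in each (Pclass, Sex) cell.
joint = df[["Pclass", "Sex"]].value_counts(normalize=True).unstack()

# Marginal distribution of the variable we condition on.
marginal_sex = df["Sex"].value_counts(normalize=True)

# Conditional distribution of Pclass given Sex: joint divided by marginal.
conditional = joint.div(marginal_sex, axis="columns")
print(conditional)
```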
“Did women tend to ride in first class more than men?”

We use plotnine and the grammar of graphics to make visuals.
For two categorical variables, we might use a stacked bar plot, a side-by-side bar plot, or a stacked percentage bar plot - depending on what we are trying to show.
The joint distribution of two variables gives the percents in each subcategory.
The marginal distribution of a variable is its individual distribution.
The conditional distribution of a variable is its distribution among only one category of a different variable.
We calculate the conditional distribution by dividing the joint by the marginal.