import pandas as pd

df = pd.read_csv("https://datasci112.stanford.edu/data/titanic.csv")
# Treat passenger class and survival as categorical rather than numeric
df["pclass"] = df["pclass"].astype("category")
df["survived"] = df["survived"].astype("category")
Week 1 taught me that I need to make some adjustments!
Lab Attendance
Lab attendance is not required. However, if you skip lab and then come to student hours or post questions on Discord about the lab, I will be displeased.
Deadlines
Lab Submissions
If I choose a passenger at random, what is the probability they rode in 1st class?
If I choose a passenger at random, what is the probability they are a woman who rode in first class?
If I choose a woman at random, what is the probability they rode in first class?
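The three questions above differ only in what is counted and what it is divided by. Here is a minimal sketch using a tiny made-up table, not the real Titanic file; the column name `sex`, the values in both columns, and the eight rows are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data (NOT the real Titanic data): 8 passengers.
toy = pd.DataFrame({
    "sex":    ["female", "female", "female", "male", "male", "male", "male", "male"],
    "pclass": [1, 1, 3, 1, 2, 3, 3, 3],
})

is_first = toy["pclass"] == 1
is_woman = toy["sex"] == "female"

# Marginal: P(1st class). The denominator is all passengers.
p_first = is_first.mean()

# Joint: P(woman AND 1st class). The denominator is still all passengers.
p_woman_and_first = (is_woman & is_first).mean()

# Conditional: P(1st class | woman). The denominator is women only.
p_first_given_woman = is_first[is_woman].mean()

print(p_first, p_woman_and_first, p_first_given_woman)
```

The mean of a True/False column is the share of True values, which is why `.mean()` gives a probability here.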
plotnine
We have analyzed a quantitative variable already. Where?
In the Colombia COVID data!
Departamento Edad ... Fecha de diagnóstico Fecha recuperado
0 Bogotá D.C. 19 ... 2020-03-06 2020-03-13
1 Valle del Cauca 34 ... 2020-03-09 2020-03-19
2 Antioquia 50 ... 2020-03-09 2020-03-15
3 Antioquia 55 ... 2020-03-11 2020-03-26
4 Antioquia 25 ... 2020-03-11 2020-03-23
... ... ... ... ... ...
25361 Buenaventura D.E. 48 ... 2020-05-28 NaN
25362 Valle del Cauca 55 ... 2020-05-28 NaN
25363 Buenaventura D.E. 39 ... 2020-05-28 NaN
25364 Valle del Cauca 13 ... 2020-05-28 NaN
25365 Córdoba 0 ... 2020-05-28 NaN
[25366 rows x 10 columns]
To visualize the age variable, we did the following:
Then, we could treat age as categorical and make a barplot:
A histogram uses equal-sized bins to summarize a quantitative variable.
To tweak your histogram, you can change the binwidth:
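To see what changing the binwidth does to the counts behind the bars, here is a rough sketch of the binning step using `numpy.histogram` instead of a plot. The ages are made up for illustration and the helper `bin_counts` is hypothetical; in plotnine the corresponding knob is the `binwidth` argument of `geom_histogram`.

```python
import numpy as np

# Hypothetical ages, made up for illustration (not the Colombia data).
ages = np.array([2, 5, 9, 14, 18, 21, 25, 33, 34, 47, 52, 68])

def bin_counts(values, binwidth):
    """Count how many values fall in equal-width bins that start at 0."""
    # Bin edges 0, binwidth, 2 * binwidth, ... until the maximum is covered.
    edges = np.arange(0, values.max() + binwidth, binwidth)
    counts, edges = np.histogram(values, bins=edges)
    return counts, edges

wide_counts, wide_edges = bin_counts(ages, binwidth=20)
narrow_counts, narrow_edges = bin_counts(ages, binwidth=10)

print(wide_counts)    # fewer, taller bars
print(narrow_counts)  # more, shorter bars
```

Both sets of counts add up to the same 12 observations; a smaller binwidth only spreads them over more bars.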
Recall the distribution of a categorical variable:
The distribution of a quantitative variable is similar:
In this example, we have a limited set of possible values for age: 0, 1, 2, ..., 100.
What if we had a quantitative variable with infinitely many possible values?
About what percent of people in this dataset are below 18?
How would you code it?
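One possible way to code it, sketched on a small made-up stand-in for the `Edad` column (the ten ages and the name `toy_CO` are assumptions, not the real data):

```python
import pandas as pd

# Hypothetical ages standing in for the "Edad" column (not the real data).
toy_CO = pd.DataFrame({"Edad": [0, 13, 19, 25, 34, 39, 48, 50, 55, 55]})

# The comparison gives True/False for each row;
# the mean of those booleans is the share of rows that are True.
share_under_18 = (toy_CO["Edad"] < 18).mean()
percent_under_18 = 100 * share_under_18

print(percent_under_18)
```

The same pattern should work on the full data by swapping `toy_CO` for `df_CO`.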
One summary of the center of a quantitative variable is the mean.
When you hear “The average age is…” or “The average income is…”, this probably refers to the mean.
Suppose we have five people, ages: 4, 84, 12, 27, 7
The mean age is: \[(4 + 84 + 12 + 27 + 7) / 5 = 134 / 5 = 26.8\]
To refer to our data without having to list all the numbers, we use \(x_1, x_2, ..., x_n\)
In the previous example, \(x_1 = 4, x_2 = 84, x_3 = 12, x_4 = 27, x_5 = 7\). So, \(n = 5\).
To add up all the numbers, we use the summation notation: \[ \sum_{i = 1}^5 x_i = 134\]
Therefore, the mean is: \[\bar{x} = \frac{1}{n} \sum_{i = 1}^n x_i\]
Long version: find the sum and the number of observations
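As a sketch, here is the long version on the five ages from the example, next to a short version that uses Python's built-in `statistics` module. The variable names are chosen for illustration.

```python
from statistics import mean

ages = [4, 84, 12, 27, 7]

# Long version: add up the values, count them, then divide.
total = sum(ages)      # the sum of x_i
n = len(ages)          # the number of observations
mean_long = total / n

# Short version: let the library do the same arithmetic.
mean_short = mean(ages)

print(total, n, mean_long, mean_short)
```

With a pandas column, the short version is the `.mean()` method, for example `df_CO["Edad"].mean()`.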
The mean is only one option for summarizing the center of a quantitative variable. It isn’t perfect!
Let’s investigate this.
Open the Activity 2.1 Colab notebook
Read in the Titanic data
Plot the density of ticket prices on the Titanic
Calculate the mean price
See how many people paid more than the mean price
Our fare data was skewed right: most values were small, but a few values were very large.
These large values “pull” the mean up, just as the value 84 pulled the average age up in our previous example.
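To make the "pull" concrete, this small sketch recomputes the mean of the five example ages with and without the value 84, and counts how many ages sit above the mean. It only reuses numbers from the earlier example.

```python
from statistics import mean

ages = [4, 84, 12, 27, 7]
ages_without_84 = [a for a in ages if a != 84]

mean_with = mean(ages)                # includes the one very large value
mean_without = mean(ages_without_84)  # same data with 84 removed

# How many of the five ages are above the mean?
n_above_mean = sum(a > mean_with for a in ages)

print(mean_with, mean_without, n_above_mean)
```

One large value more than doubles the mean here, and only two of the five ages end up above it.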
So, why do we like the mean?
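import pandas as pd
from plotnine import ggplot, aes, geom_line, labs

# df_CO is the Colombia COVID data loaded earlier
cs = range(1, 60)  # candidate centers to try
sum_squared_distances = []

for c in cs:
    # Sum of squared distances between every age and the candidate center
    sum_squared_distances.append(((df_CO["Edad"] - c) ** 2).sum())

res_df = pd.DataFrame({"center": cs, "sq_error": sum_squared_distances})

(
    ggplot(res_df, aes(x = 'center', y = 'sq_error')) +
    geom_line() +
    labs(x = "Mean",
         y = "",
         title = "Changes in Sum of Squared Error Based on Choice of Center")
)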
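The same idea can be checked without a plot. This is a sketch on the five example ages, not the Colombia data. It tries many candidate centers and keeps the one with the smallest sum of squared error; the grid of candidates (0.0, 0.1, ..., 100.0) and the helper `sum_squared_error` are assumptions made for illustration.

```python
ages = [4, 84, 12, 27, 7]

def sum_squared_error(center, values):
    """Sum of squared distances from each value to the candidate center."""
    return sum((x - center) ** 2 for x in values)

# Candidate centers 0.0, 0.1, 0.2, ..., 100.0 (an arbitrary grid).
centers = [i / 10 for i in range(0, 1001)]
errors = [sum_squared_error(c, ages) for c in centers]

# The candidate with the smallest sum of squared error...
best_center = centers[errors.index(min(errors))]

# ...lands on the mean.
sample_mean = sum(ages) / len(ages)

print(best_center, sample_mean)
```

This is one reason to like the mean: among all possible centers, it is the one that makes the sum of squared error as small as possible.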
Another summary of center is the median, which is the “middle” of the sorted values.
To calculate the median of a quantitative variable with values \(x_1, x_2, x_3, ..., x_n\), we do the following steps:
Sort the values from smallest to largest: \[x_{(1)}, x_{(2)}, x_{(3)}, ..., x_{(n)}.\]
The “middle” value depends on whether we have an odd or an even number of observations.
If \(n\) is odd, then the middle value is \(x_{(\frac{n + 1}{2})}\).
If \(n\) is even, then there are two middle values, \(x_{(\frac{n}{2})}\) and \(x_{(\frac{n}{2} + 1)}\).
Note
It is conventional to report the mean of the two values (but you can actually pick any value between them)!
Ages: 4, 84, 12, 7, 27. What is the median?
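The steps above can be sketched as a small function. The name `median_by_hand` is made up for illustration, and the even case follows the convention of averaging the two middle values.

```python
def median_by_hand(values):
    """Median via the steps above: sort, then take the middle value(s)."""
    ordered = sorted(values)  # x_(1), x_(2), ..., x_(n)
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: position (n + 1) / 2, which is index n // 2 when counting from 0.
        return ordered[n // 2]
    # Even n: positions n / 2 and n / 2 + 1 (indices n // 2 - 1 and n // 2).
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2  # conventional choice: the mean of the two

median_odd = median_by_hand([4, 84, 12, 7, 27])  # five ages
median_even = median_by_hand([4, 84, 12, 7])     # four ages

print(median_odd, median_even)
```

Python's `statistics.median` and the pandas `.median()` method use the same convention for an even number of values.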
Median age in the Colombia data:
One measure of spread is the variance.
The variance of a variable whose values are \(x_1, x_2, x_3, ..., x_n\) is calculated using the formula \[\textrm{var(X)} = \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n - 1}\]
Does this look familiar?
It’s the sum of squared error! Well, divided by \(n-1\), the “degrees of freedom”.
Similar to calculating the mean, we could find the variance manually:
np.float64(348.0870469898451)
Notice that the variance isn’t very intuitive. What do we mean by “The spread is 348”?
This is because the variance is built from squared errors, so its units are squared too (here, years squared)!
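As a sketch of the manual calculation, here is the variance formula applied to the five example ages rather than the Colombia data, so the result differs from the 348 above. The last step takes the square root to get the standard deviation, which is back in the original units (years).

```python
import math
from statistics import variance

ages = [4, 84, 12, 27, 7]

n = len(ages)
x_bar = sum(ages) / n

# Numerator: the sum of squared distances from the mean.
sum_sq = sum((x - x_bar) ** 2 for x in ages)

# Divide by n - 1 (the degrees of freedom), not n.
var_by_hand = sum_sq / (n - 1)

# The square root undoes the squaring of the units.
sd_by_hand = math.sqrt(var_by_hand)

print(var_by_hand, variance(ages), sd_by_hand)
```

`statistics.variance` also divides by n - 1, as does the pandas `.var()` method by default, so they agree with the manual result.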
Visualize quantitative variables with histograms or densities.
Summarize the center of a quantitative variable with mean or median.
Describe the shape of a quantitative variable with skew.
Summarize the spread of a quantitative variable with the variance or the standard deviation.