'Mode'] = (
df_songs['Mode']
df_songs[1: 'Major', 0: 'Minor'})
.replace({ )
What does a Mode
of 0 represent?
How did I know that 0
represented a minor key?
The data documentation!
Even if you converted TimeSignature
and Mode
into categorical / string variables, you will not get coefficient estimates for every level of these variables.
Q: What do you get?
A: Adjustments for each group relative to the “baseline” group.
Q: What group is the “baseline” group?
A: The group that comes first alphabetically (has the lowest ASCII representation).
How do you get values for every level of a categorical variable?
What is the accuracy of this model?
this model represents the errors of the model that was fit to the training data, where we know the value of the target variable
Cross-validation is not necessary here!
If you were to use this model to predict, what precision would you expect to get?
predicting involves “testing” data, where we don’t know the value of the target variable
Cross-validation is necessary here!
Each coefficient is associated with the change in the log-odds of a song containing explicit lyrics.
TimeSignature_1
: 1.532310
A 1/4 time signature is associated with an increase of 1.53 in the log-odds of a song containing explicit lyrics.
Speechiness
: 5.185788
An increase of speechiness by 1 unit (making the song almost entirely made up of spoken words) is associated with an increase of 5.18 in the log-odds of a song containing explicit lyrics.
Did you standardize?
If so, your interpretation is not a 1 unit increase! Your interpretation is a 1 standard deviation increase above the mean.
Acquire data and clean it by fixing variable types, dropping or replacing missing data, and looking for other issues.
Explore the dataset by making summaries and plots of one variable.
Establish research questions to answer with this data.
Create visualizations of two or more variables that address simple questions.
Fit predictive models to address more complex questions, and/or to prepare for prediction on future data.
Fit unsupervised models to answer open-ended questions.
Sometimes, information is spread across multiple data sets.
For example, suppose we want to know which manufacturer’s planes made the most flights in November 2013.
One data set contains information about flights in Nov. 2013…
year month day dep_time ... distance hour minute tailnum
0 2013 11 1 2108.0 ... 1167 20 56 N10156
1 2013 11 1 1154.0 ... 541 12 0 N102UW
2 2013 11 1 854.0 ... 946 8 29 N10575
3 2013 11 1 1643.0 ... 594 15 5 N10575
4 2013 11 1 603.0 ... 282 6 0 N11109
... ... ... ... ... ... ... ... ... ...
23295 2013 11 30 1337.0 ... 1076 13 40 N994DL
23296 2013 11 30 802.0 ... 1069 8 7 N995DL
23297 2013 11 30 1544.0 ... 1069 15 50 N995DL
23298 2013 11 30 850.0 ... 762 9 0 N996DL
23299 2013 11 30 1959.0 ... 762 20 0 N999DN
[23300 rows x 18 columns]
…while another contains information about planes.
tailnum year type ... seats speed engine
0 N10156 2004.0 Fixed wing multi engine ... 55 NaN Turbo-fan
1 N102UW 1998.0 Fixed wing multi engine ... 182 NaN Turbo-fan
2 N103US 1999.0 Fixed wing multi engine ... 182 NaN Turbo-fan
3 N104UW 1999.0 Fixed wing multi engine ... 182 NaN Turbo-fan
4 N10575 2002.0 Fixed wing multi engine ... 55 NaN Turbo-fan
... ... ... ... ... ... ... ...
3317 N997AT 2002.0 Fixed wing multi engine ... 100 NaN Turbo-fan
3318 N997DL 1992.0 Fixed wing multi engine ... 142 NaN Turbo-fan
3319 N998AT 2002.0 Fixed wing multi engine ... 100 NaN Turbo-fan
3320 N998DL 1992.0 Fixed wing multi engine ... 142 NaN Turbo-jet
3321 N999DN 1992.0 Fixed wing multi engine ... 142 NaN Turbo-jet
[3322 rows x 9 columns]
Which manufacturer’s planes made the most flights in November 2013?
In order to answer this question we need to join these two data sets together!
A primary key is a column (or a set of columns) that uniquely identifies observations in a data frame.
The primary key is the column(s) you would think of as the index.
A foreign key is a column (or a set of columns) that points to the primary key of another data frame.
Planes are uniquely identified by their tail number (tailnum
).
Each value of the primary key should only appear once, but it could appear many times in a foreign key.
The Pandas function .merge()
can be used to join two DataFrame
s on a key.
year_x month day dep_time ... engines seats speed engine
0 2013 11 1 2108.0 ... 2 55 NaN Turbo-fan
1 2013 11 1 1154.0 ... 2 182 NaN Turbo-fan
2 2013 11 1 854.0 ... 2 55 NaN Turbo-fan
3 2013 11 1 1643.0 ... 2 55 NaN Turbo-fan
4 2013 11 1 603.0 ... 2 55 NaN Turbo-fan
... ... ... ... ... ... ... ... ... ...
23295 2013 11 30 1337.0 ... 2 142 NaN Turbo-jet
23296 2013 11 30 802.0 ... 2 142 NaN Turbo-fan
23297 2013 11 30 1544.0 ... 2 142 NaN Turbo-fan
23298 2013 11 30 850.0 ... 2 142 NaN Turbo-fan
23299 2013 11 30 1959.0 ... 2 142 NaN Turbo-jet
[23300 rows x 26 columns]
Joining two data frames results in a wider data frame, with more columns.
By default, Pandas adds the suffixes _x
and _y
to overlapping column names. , but this can be customized.
Index(['year_x', 'month', 'day', 'dep_time', 'sched_dep_time', 'dep_delay',
'arr_time', 'sched_arr_time', 'arr_delay', 'carrier', 'flight',
'origin', 'dest', 'air_time', 'distance', 'hour', 'minute', 'tailnum',
'year_y', 'type', 'manufacturer', 'model', 'engines', 'seats', 'speed',
'engine'],
dtype='object')
But this can be customized!
Index(['year_flight', 'month', 'day', 'dep_time', 'sched_dep_time',
'dep_delay', 'arr_time', 'sched_arr_time', 'arr_delay', 'carrier',
'flight', 'origin', 'dest', 'air_time', 'distance', 'hour', 'minute',
'tailnum', 'year_plane', 'type', 'manufacturer', 'model', 'engines',
'seats', 'speed', 'engine'],
dtype='object')
Which manufacturer’s planes made the most flights in November 2013?
manufacturer
BOEING 6557
EMBRAER 5175
AIRBUS 3954
AIRBUS INDUSTRIE 3456
BOMBARDIER INC 2632
MCDONNELL DOUGLAS AIRCRAFT CO 811
MCDONNELL DOUGLAS 330
CANADAIR 122
MCDONNELL DOUGLAS CORPORATION 121
CESSNA 36
CIRRUS DESIGN CORP 21
ROBINSON HELICOPTER CO 19
BARKER JACK L 15
PIPER 9
FRIEDEMANN JON 8
SIKORSKY 6
BELL 5
DEHAVILLAND 4
AMERICAN AIRCRAFT INC 3
AGUSTA SPA 3
LEARJET INC 2
LAMBERT RICHARD 2
DOUGLAS 2
KILDALL GARY 2
AVIAT AIRCRAFT INC 1
MARZ BARRY 1
PAIR MIKE E 1
LEBLANC GLENN T 1
STEWART MACO 1
Name: count, dtype: int64
What weather factors are related to flight delays?
Here is a data set containing hourly weather data at each airport in 2013:
airport year month day ... wind_gust precip pressure visib
0 EWR 2013 1 1 ... NaN 0.0 1012.0 10.0
1 EWR 2013 1 1 ... NaN 0.0 1012.3 10.0
2 EWR 2013 1 1 ... NaN 0.0 1012.5 10.0
3 EWR 2013 1 1 ... NaN 0.0 1012.2 10.0
4 EWR 2013 1 1 ... NaN 0.0 1011.9 10.0
... ... ... ... ... ... ... ... ... ...
26110 LGA 2013 12 30 ... 21.86482 0.0 1017.1 10.0
26111 LGA 2013 12 30 ... 21.86482 0.0 1018.8 10.0
26112 LGA 2013 12 30 ... 23.01560 0.0 1019.5 10.0
26113 LGA 2013 12 30 ... NaN 0.0 1019.9 10.0
26114 LGA 2013 12 30 ... NaN 0.0 1020.9 10.0
[26115 rows x 14 columns]
What is / are the primary key(s) of this dataset?
airport year month day hour wind_gust precip pressure visib
0 EWR 2013 1 1 1 NaN 0.0 1012.0 10.0
1 EWR 2013 1 1 2 NaN 0.0 1012.3 10.0
2 EWR 2013 1 1 3 NaN 0.0 1012.5 10.0
3 EWR 2013 1 1 4 NaN 0.0 1012.2 10.0
4 EWR 2013 1 1 5 NaN 0.0 1011.9 10.0
... ... ... ... ... ... ... ... ... ...
26110 LGA 2013 12 30 14 21.86482 0.0 1017.1 10.0
26111 LGA 2013 12 30 15 21.86482 0.0 1018.8 10.0
26112 LGA 2013 12 30 16 23.01560 0.0 1019.5 10.0
26113 LGA 2013 12 30 17 NaN 0.0 1019.9 10.0
26114 LGA 2013 12 30 18 NaN 0.0 1020.9 10.0
[26115 rows x 9 columns]
Let’s start by looking at flights out of JFK only, for simplicity.
We need to join to the weather data on year, month, day, and hour.
year month day dep_time ... wind_gust precip pressure visib
0 2013 11 1 1154.0 ... NaN 0.00 1002.8 10.0
1 2013 11 1 2055.0 ... NaN 0.00 1005.0 10.0
2 2013 11 1 1814.0 ... NaN 0.00 1003.9 10.0
3 2013 11 1 1014.0 ... NaN 0.07 NaN 10.0
4 2013 11 1 1852.0 ... NaN 0.00 1003.9 10.0
... ... ... ... ... ... ... ... ... ...
7396 2013 11 30 2009.0 ... NaN 0.00 NaN 10.0
7397 2013 11 30 2047.0 ... NaN 0.00 1031.9 10.0
7398 2013 11 30 1841.0 ... NaN 0.00 1033.6 10.0
7399 2013 11 30 802.0 ... NaN 0.00 1042.1 10.0
7400 2013 11 30 1544.0 ... NaN 0.00 1036.1 10.0
[7401 rows x 28 columns]
Hmmmm….where did all the data go?
Sometimes, the join keys have different names in the two data sets.
This frequently happens if the data sets come from different sources.
For example, if we want to join the (entire) flights data to the weather data, we would need to include the airport in the key.
But the airport is called origin
in df_flights
and airport
in df_weather
.
The .merge()
function provides left_on =
and right_on =
arguments for specifying different column names in the left (first) and right (second) data sets.
Now we can visualize how rain impacts delays at each airport!
But first, let’s find the mean amount of rain for each departure delay to clean up our plot.
The data below contains counts of names for babies born in 1920 and 2020:
1920
Name Sex Count
0 Mary F 70975
1 Dorothy F 36645
2 Helen F 35098
3 Margaret F 27997
4 Ruth F 26100
... ... .. ...
10751 Zearl M 5
10752 Zeferino M 5
10753 Zeke M 5
10754 Zera M 5
10755 Zygmont M 5
[10756 rows x 3 columns]
2020
Name Sex Count
0 Olivia F 17641
1 Emma F 15656
2 Ava F 13160
3 Charlotte F 13065
4 Sophia F 13036
... ... .. ...
31448 Zykell M 5
31449 Zylus M 5
31450 Zymari M 5
31451 Zyn M 5
31452 Zyran M 5
[31453 rows x 3 columns]
We can merge these two data sets on a primary key…
Name Sex Count_1920 Count_2020
0 Mary F 70975 2210
1 Dorothy F 36645 562
2 Helen F 35098 721
3 Margaret F 27997 2190
4 Ruth F 26100 1323
... ... .. ... ...
4473 Whitt M 5 23
4474 Wyley M 5 6
4475 Xavier M 5 3876
4476 York M 5 14
4477 Zeke M 5 382
[4478 rows x 4 columns]
… but what happened to some of the names?
Why isn’t Maya in the joined data? It’s in the 2020 data…
How does the merge()
function determine which keys get kept?
By default, in order to appear in the joined data, a key must be present in both tables.
This is actually what is called an inner join, but there are other types of joins!
By default, Pandas does an inner join, which only keeps keys that are present in both tables.
An outer join keeps any key that is present in either table.
A left join keeps all keys in the left table, even if they are not in the right table. But any keys that are only in the right table are dropped.
A right join keeps all keys in the right table, even if they are not in the left table. But any keys that are only in the left table are dropped.
We can customize the type of join using the how =
parameter of .merge()
. By default, how = "inner"
.
Note the missing values for other columns, like Count_1920
!
What other type of join would have produced this output in the Maya row?
Note the missing values for other columns, like Count_1920
!
What other type of join would have produced this output in the Maya row?
Which type of join would be best suited for each case?
Inner, outer, left, and right are known as mutating joins, because they create new combined data sets.
There are two other types of joins that we use for filtering to get rid of some rows:
A semi-join tells us which keys in the left are present in the right.
An anti-join tells us which keys in the left are not present in the right.
Which names existed in 1920 but don’t in 2020?
In Pandas, we can’t do these using .merge()
, but we can use the isin()
function!
A primary key is one or more columns that uniquely identify the rows.
We can join (a.k.a. merge) data sets if they share a primary key, or if one has a foreign key.
The default of .merge()
is an inner join: only keys in both data sets are kept.
We can instead specify a left join, right join, or outer join; think about which rows we want to keep.
Filtering joins like anti-join and semi-join can help you answer questions about the data.
Use .isin()
to see which keys in one dataset exist in the other.