Combining Datasets

Lab 7 Debrief

Cleaning Variables

What does a Mode of 0 represent?

df_songs['Mode'] = (
  df_songs['Mode']
  .replace({1: 'Major', 0: 'Minor'})
  )

How did I know that 0 represented a minor key?

Obtaining Coefficients for Every Variable Level

Even if you converted TimeSignature and Mode into categorical / string variables, you will not get coefficient estimates for every level of these variables.

Q: What do you get?

A: Adjustments for each group relative to the “baseline” group.

Q: What group is the “baseline” group?

A: The group that comes first alphabetically (has the lowest ASCII representation).
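We can see this treatment coding on a toy example (hypothetical data, not the songs dataset): with `pd.get_dummies(drop_first=True)`, the alphabetically first level is dropped and becomes the baseline.

```python
import pandas as pd

# Toy data (hypothetical): two Mode levels, "Major" and "Minor"
df = pd.DataFrame({"Mode": ["Minor", "Major", "Minor", "Major"]})

# drop_first=True mimics the "baseline" behavior: the first level
# alphabetically ("Major") is dropped and becomes the reference group
dummies = pd.get_dummies(df["Mode"], drop_first=True)
print(dummies.columns.tolist())  # only 'Minor' remains; 'Major' is baseline
```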

Obtaining Coefficients for Every Variable Level

How do you get values for every level of a categorical variable?

One-hot Encoding!

ct = make_column_transformer(
    (OneHotEncoder(handle_unknown = "ignore"),
     ["TimeSignature", "Mode"]
     ),
    remainder = "passthrough"
)

Penalized Logistic Regression is the Default!

pipeline = make_pipeline(
    ct,
    LogisticRegression(penalty = None)
)

Model Error versus Testing Error

What is the accuracy of this model?


This accuracy reflects the errors of the model on the training data, where we already know the value of the target variable.

Cross-validation is not necessary here!

If you were to use this model to predict on new data, what accuracy would you expect to get?


Predicting involves “testing” data, where we don’t know the value of the target variable.

Cross-validation is necessary here!

Interpreting Coefficients

Each coefficient is associated with the change in the log-odds of a song containing explicit lyrics.

TimeSignature_1: 1.532310

A 1/4 time signature is associated with an increase of 1.53 in the log-odds of a song containing explicit lyrics.


Speechiness: 5.185788

An increase of 1 unit in speechiness (going from a song with no spoken words to one made up entirely of spoken words) is associated with an increase of 5.19 in the log-odds of a song containing explicit lyrics.

Did you standardize?

If so, your interpretation is not a 1-unit increase! Your interpretation is a 1-standard-deviation increase in the original variable.
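A handy follow-up: exponentiating a log-odds coefficient gives an odds ratio, which is often easier to communicate. A sketch using the coefficients from the previous slide:

```python
import numpy as np

# The slide's coefficients (log-odds scale)
coefs = {"TimeSignature_1": 1.532310, "Speechiness": 5.185788}

# Exponentiating a log-odds coefficient gives an odds ratio:
# exp(1.53) is about 4.6, so the odds of explicit lyrics are roughly
# 4.6 times higher for a 1/4 time signature (relative to the baseline)
odds_ratios = {name: np.exp(b) for name, b in coefs.items()}
print({k: round(v, 2) for k, v in odds_ratios.items()})
```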

The story so far…

Data analysis: The Whole Game

  • Acquire data and clean it by fixing variable types, dropping or replacing missing data, and looking for other issues.

  • Explore the dataset by making summaries and plots of one variable.

  • Establish research questions to answer with this data.

  • Create visualizations of two or more variables that address simple questions.

  • Fit predictive models to address more complex questions, and/or to prepare for prediction on future data.

  • Fit unsupervised models to answer open-ended questions.

Joining Datasets

Example: Planes and Flights

Sometimes, information is spread across multiple data sets.

For example, suppose we want to know which manufacturer’s planes made the most flights in November 2013.

Example: Information on Flights

One data set contains information about flights in Nov. 2013…

Code
import pandas as pd
data_dir = "https://datasci112.stanford.edu/data/nycflights13"
df_flights = pd.read_csv(f"{data_dir}/flights11.csv")
       year  month  day  dep_time  ...  distance  hour  minute  tailnum
0      2013     11    1    2108.0  ...      1167    20      56   N10156
1      2013     11    1    1154.0  ...       541    12       0   N102UW
2      2013     11    1     854.0  ...       946     8      29   N10575
3      2013     11    1    1643.0  ...       594    15       5   N10575
4      2013     11    1     603.0  ...       282     6       0   N11109
...     ...    ...  ...       ...  ...       ...   ...     ...      ...
23295  2013     11   30    1337.0  ...      1076    13      40   N994DL
23296  2013     11   30     802.0  ...      1069     8       7   N995DL
23297  2013     11   30    1544.0  ...      1069    15      50   N995DL
23298  2013     11   30     850.0  ...       762     9       0   N996DL
23299  2013     11   30    1959.0  ...       762    20       0   N999DN

[23300 rows x 18 columns]

Example: Information on Planes

…while another contains information about planes.

Code
df_planes = pd.read_csv(f"{data_dir}/planes.csv")
     tailnum    year                     type  ... seats speed     engine
0     N10156  2004.0  Fixed wing multi engine  ...    55   NaN  Turbo-fan
1     N102UW  1998.0  Fixed wing multi engine  ...   182   NaN  Turbo-fan
2     N103US  1999.0  Fixed wing multi engine  ...   182   NaN  Turbo-fan
3     N104UW  1999.0  Fixed wing multi engine  ...   182   NaN  Turbo-fan
4     N10575  2002.0  Fixed wing multi engine  ...    55   NaN  Turbo-fan
...      ...     ...                      ...  ...   ...   ...        ...
3317  N997AT  2002.0  Fixed wing multi engine  ...   100   NaN  Turbo-fan
3318  N997DL  1992.0  Fixed wing multi engine  ...   142   NaN  Turbo-fan
3319  N998AT  2002.0  Fixed wing multi engine  ...   100   NaN  Turbo-fan
3320  N998DL  1992.0  Fixed wing multi engine  ...   142   NaN  Turbo-jet
3321  N999DN  1992.0  Fixed wing multi engine  ...   142   NaN  Turbo-jet

[3322 rows x 9 columns]

Example: Planes and Flights

Which manufacturer’s planes made the most flights in November 2013?

In order to answer this question we need to join these two data sets together!

Joining on a Key

Keys

  • A primary key is a column (or a set of columns) that uniquely identifies observations in a data frame.

  • The primary key is the column(s) you would think of as the index.

  • A foreign key is a column (or a set of columns) that points to the primary key of another data frame.

  • Planes are uniquely identified by their tail number (tailnum).

Joining on a Key

Each value of the primary key should only appear once, but it could appear many times in a foreign key.

(
  df_planes['tailnum']
  .value_counts()
)
tailnum
N999DN    1
N10156    1
N102UW    1
N103US    1
N104UW    1
         ..
N11192    1
N11193    1
N11194    1
N11199    1
N111US    1
Name: count, Length: 3322, dtype: int64
(
  df_flights['tailnum']
  .value_counts()
)
tailnum
N353JB    48
N955UW    47
N184JB    47
N281JB    42
N375JB    42
          ..
N8319F     1
N8683B     1
N919WN     1
N119US     1
N395HA     1
Name: count, Length: 2600, dtype: int64
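A quicker check than scanning `.value_counts()` is the `.is_unique` attribute (sketched here with small stand-in frames; with the real data you would use df_planes and df_flights directly):

```python
import pandas as pd

# Small stand-ins for the planes and flights tables
df_planes_demo = pd.DataFrame({"tailnum": ["N10156", "N102UW", "N10575"]})
df_flights_demo = pd.DataFrame({"tailnum": ["N10156", "N10156", "N102UW"]})

# Primary key: every value appears exactly once
print(df_planes_demo["tailnum"].is_unique)   # True

# Foreign key: repeated values are expected
print(df_flights_demo["tailnum"].is_unique)  # False
```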

Joining on a Key

The Pandas function .merge() can be used to join two DataFrames on a key.

df_joined = df_flights.merge(df_planes, on = "tailnum")
       year_x  month  day  dep_time  ...  engines  seats  speed     engine
0        2013     11    1    2108.0  ...        2     55    NaN  Turbo-fan
1        2013     11    1    1154.0  ...        2    182    NaN  Turbo-fan
2        2013     11    1     854.0  ...        2     55    NaN  Turbo-fan
3        2013     11    1    1643.0  ...        2     55    NaN  Turbo-fan
4        2013     11    1     603.0  ...        2     55    NaN  Turbo-fan
...       ...    ...  ...       ...  ...      ...    ...    ...        ...
23295    2013     11   30    1337.0  ...        2    142    NaN  Turbo-jet
23296    2013     11   30     802.0  ...        2    142    NaN  Turbo-fan
23297    2013     11   30    1544.0  ...        2    142    NaN  Turbo-fan
23298    2013     11   30     850.0  ...        2    142    NaN  Turbo-fan
23299    2013     11   30    1959.0  ...        2    142    NaN  Turbo-jet

[23300 rows x 26 columns]
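`.merge()` can also verify the key relationship for you: passing `validate = "m:1"` raises an error if the right table’s key is not unique. A sketch with toy stand-ins for the flights and planes tables:

```python
import pandas as pd

# Toy stand-ins: many flights per plane (left), one row per plane (right)
flights = pd.DataFrame({"tailnum": ["N1", "N1", "N2"], "dep_delay": [5, -2, 10]})
planes = pd.DataFrame({"tailnum": ["N1", "N2"], "seats": [55, 182]})

# validate="m:1" raises a MergeError if tailnum is not unique in the
# right table, catching accidental duplicate keys early
joined = flights.merge(planes, on="tailnum", validate="m:1")
print(len(joined))
```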

Overlapping Column Names

  • Joining two data frames results in a wider data frame, with more columns.

  • By default, Pandas adds the suffixes _x and _y to overlapping column names.

df_joined.columns
Index(['year_x', 'month', 'day', 'dep_time', 'sched_dep_time', 'dep_delay',
       'arr_time', 'sched_arr_time', 'arr_delay', 'carrier', 'flight',
       'origin', 'dest', 'air_time', 'distance', 'hour', 'minute', 'tailnum',
       'year_y', 'type', 'manufacturer', 'model', 'engines', 'seats', 'speed',
       'engine'],
      dtype='object')

Overlapping Column Names

But this can be customized!

df_joined = df_flights.merge(df_planes, on = "tailnum",
                             suffixes = ("_flight", "_plane")
                             )
df_joined.columns
Index(['year_flight', 'month', 'day', 'dep_time', 'sched_dep_time',
       'dep_delay', 'arr_time', 'sched_arr_time', 'arr_delay', 'carrier',
       'flight', 'origin', 'dest', 'air_time', 'distance', 'hour', 'minute',
       'tailnum', 'year_plane', 'type', 'manufacturer', 'model', 'engines',
       'seats', 'speed', 'engine'],
      dtype='object')

Analyzing the Joined Data

Which manufacturer’s planes made the most flights in November 2013?

df_joined["manufacturer"].value_counts()
manufacturer
BOEING                           6557
EMBRAER                          5175
AIRBUS                           3954
AIRBUS INDUSTRIE                 3456
BOMBARDIER INC                   2632
MCDONNELL DOUGLAS AIRCRAFT CO     811
MCDONNELL DOUGLAS                 330
CANADAIR                          122
MCDONNELL DOUGLAS CORPORATION     121
CESSNA                             36
CIRRUS DESIGN CORP                 21
ROBINSON HELICOPTER CO             19
BARKER JACK L                      15
PIPER                               9
FRIEDEMANN JON                      8
SIKORSKY                            6
BELL                                5
DEHAVILLAND                         4
AMERICAN AIRCRAFT INC               3
AGUSTA SPA                          3
LEARJET INC                         2
LAMBERT RICHARD                     2
DOUGLAS                             2
KILDALL GARY                        2
AVIAT AIRCRAFT INC                  1
MARZ BARRY                          1
PAIR MIKE E                         1
LEBLANC GLENN T                     1
STEWART MACO                        1
Name: count, dtype: int64

Joining on Multiple Keys

Example: Weather and Flights

What weather factors are related to flight delays?

Here is a data set containing hourly weather data at each airport in 2013:

Code
df_weather = pd.read_csv(f"{data_dir}/weather.csv")
      airport  year  month  day  ...  wind_gust  precip  pressure  visib
0         EWR  2013      1    1  ...        NaN     0.0    1012.0   10.0
1         EWR  2013      1    1  ...        NaN     0.0    1012.3   10.0
2         EWR  2013      1    1  ...        NaN     0.0    1012.5   10.0
3         EWR  2013      1    1  ...        NaN     0.0    1012.2   10.0
4         EWR  2013      1    1  ...        NaN     0.0    1011.9   10.0
...       ...   ...    ...  ...  ...        ...     ...       ...    ...
26110     LGA  2013     12   30  ...   21.86482     0.0    1017.1   10.0
26111     LGA  2013     12   30  ...   21.86482     0.0    1018.8   10.0
26112     LGA  2013     12   30  ...   23.01560     0.0    1019.5   10.0
26113     LGA  2013     12   30  ...        NaN     0.0    1019.9   10.0
26114     LGA  2013     12   30  ...        NaN     0.0    1020.9   10.0

[26115 rows x 14 columns]

Identifying the Primary Key

What is / are the primary key(s) of this dataset?

      airport  year  month  day  hour  wind_gust  precip  pressure  visib
0         EWR  2013      1    1     1        NaN     0.0    1012.0   10.0
1         EWR  2013      1    1     2        NaN     0.0    1012.3   10.0
2         EWR  2013      1    1     3        NaN     0.0    1012.5   10.0
3         EWR  2013      1    1     4        NaN     0.0    1012.2   10.0
4         EWR  2013      1    1     5        NaN     0.0    1011.9   10.0
...       ...   ...    ...  ...   ...        ...     ...       ...    ...
26110     LGA  2013     12   30    14   21.86482     0.0    1017.1   10.0
26111     LGA  2013     12   30    15   21.86482     0.0    1018.8   10.0
26112     LGA  2013     12   30    16   23.01560     0.0    1019.5   10.0
26113     LGA  2013     12   30    17        NaN     0.0    1019.9   10.0
26114     LGA  2013     12   30    18        NaN     0.0    1020.9   10.0

[26115 rows x 9 columns]

Verifying Primary Key

(
  df_weather
  .groupby(["airport", "year", "month", "day", "hour"])
  .size()
)
airport  year  month  day  hour
EWR      2013  1      1    1       1
                           2       1
                           3       1
                           4       1
                           5       1
                                  ..
LGA      2013  12     30   14      1
                           15      1
                           16      1
                           17      1
                           18      1
Length: 26112, dtype: int64
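An equivalent, more direct check uses `.duplicated()` on the candidate key columns (sketched with a toy stand-in for df_weather):

```python
import pandas as pd

# Toy stand-in for df_weather; in practice, pass the real frame
weather = pd.DataFrame({
    "airport": ["EWR", "EWR", "JFK"],
    "year": [2013, 2013, 2013],
    "month": [1, 1, 1],
    "day": [1, 1, 1],
    "hour": [1, 2, 1],
})

key = ["airport", "year", "month", "day", "hour"]

# Count rows whose key combination has already appeared
n_dupes = weather.duplicated(subset=key).sum()
print(n_dupes)  # 0 duplicate key combinations: a valid primary key
```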

A Key with Multiple Columns

Let’s start by looking at flights out of JFK only, for simplicity.

df_flights_jfk = df_flights[df_flights["origin"] == "JFK"]
df_weather_jfk = df_weather[df_weather["airport"] == "JFK"]

A Key with Multiple Columns

We need to join to the weather data on year, month, day, and hour.

df_jfk = df_flights_jfk.merge(df_weather_jfk, 
                              on = ("year", "month", "day", "hour")
                              )
      year  month  day  dep_time  ...  wind_gust  precip  pressure  visib
0     2013     11    1    1154.0  ...        NaN    0.00    1002.8   10.0
1     2013     11    1    2055.0  ...        NaN    0.00    1005.0   10.0
2     2013     11    1    1814.0  ...        NaN    0.00    1003.9   10.0
3     2013     11    1    1014.0  ...        NaN    0.07       NaN   10.0
4     2013     11    1    1852.0  ...        NaN    0.00    1003.9   10.0
...    ...    ...  ...       ...  ...        ...     ...       ...    ...
7396  2013     11   30    2009.0  ...        NaN    0.00       NaN   10.0
7397  2013     11   30    2047.0  ...        NaN    0.00    1031.9   10.0
7398  2013     11   30    1841.0  ...        NaN    0.00    1033.6   10.0
7399  2013     11   30     802.0  ...        NaN    0.00    1042.1   10.0
7400  2013     11   30    1544.0  ...        NaN    0.00    1036.1   10.0

[7401 rows x 28 columns]

How does rain affect departure delays?

Code
from plotnine import *

(
  ggplot(data = df_jfk, 
         mapping = aes(x = "precip", y = "dep_delay")) +
  geom_point(alpha = 0.2) +
  labs(x = "Precipitation (inches)", 
       y = "Departure Delay (minutes)") +
  theme_bw()
)

Hmmm… where did all the data go?

How does rain affect departure delays?

Code
(
  ggplot(data = df_jfk, 
         mapping = aes(x = "precip", y = "dep_delay")) +
  geom_jitter(alpha = 0.2, width = 0.1) +
  labs(x = "Precipitation (inches)", 
       y = "Departure Delay (minutes)") +
  theme_bw()
)

Joining on Keys with Different Names

  • Sometimes, the join keys have different names in the two data sets.

  • This frequently happens if the data sets come from different sources.

  • For example, if we want to join the (entire) flights data to the weather data, we would need to include the airport in the key.

  • But the airport is called origin in df_flights and airport in df_weather.

  • The .merge() function provides left_on = and right_on = arguments for specifying different column names in the left (first) and right (second) data sets.

Joining on Keys with Different Names

df_flights_weather = df_flights.merge(
    df_weather,
    left_on = ("origin", "year", "month", "day", "hour"),
    right_on = ("airport", "year", "month", "day", "hour")
    )
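One caveat with `left_on = ` / `right_on = `: both key columns are kept in the result, so you usually want to drop the redundant one. A toy sketch:

```python
import pandas as pd

# Toy stand-ins for the flights and weather tables
flights = pd.DataFrame({"origin": ["JFK", "LGA"], "dep_delay": [5, 10]})
weather = pd.DataFrame({"airport": ["JFK", "LGA"], "precip": [0.0, 0.1]})

# After a left_on/right_on merge, BOTH key columns appear in the result
joined = flights.merge(weather, left_on="origin", right_on="airport")

# 'origin' and 'airport' now duplicate each other; drop one
joined = joined.drop(columns="airport")
print(joined.columns.tolist())
```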

Rain and delays by airport

Now we can visualize how rain impacts delays at each airport!


But first, let’s find the mean departure delay for each amount of rain at each airport, to clean up our plot.

df_rain = (
  df_flights_weather
  .groupby(["airport", "precip"])["dep_delay"]
  .mean()
  .reset_index()
  )

Rain and delays by airport

Code
(
  ggplot(data = df_rain, 
         mapping = aes(x = "precip", y = "dep_delay", color = "airport")) +
  geom_point() +
  geom_line() +
  labs(x = "Precipitation (inches)", 
       y = "Departure Delay (minutes)") +
  theme_bw() +
  theme(legend_position = "top")
)

Joins with Missing Keys

Example: Baby names

The data below contains counts of names for babies born in 1920 and 2020:

Code
data_dir = "http://dlsun.github.io/pods/data/names/"

df_1920 = pd.read_csv(data_dir + "yob1920.txt", header = None,
                      names = ["Name", "Sex", "Count"]
                      )
df_2020 = pd.read_csv(data_dir + "yob2020.txt", header = None,
                      names = ["Name", "Sex", "Count"]
                      )

1920

           Name Sex  Count
0          Mary   F  70975
1       Dorothy   F  36645
2         Helen   F  35098
3      Margaret   F  27997
4          Ruth   F  26100
...         ...  ..    ...
10751     Zearl   M      5
10752  Zeferino   M      5
10753      Zeke   M      5
10754      Zera   M      5
10755   Zygmont   M      5

[10756 rows x 3 columns]

2020

            Name Sex  Count
0         Olivia   F  17641
1           Emma   F  15656
2            Ava   F  13160
3      Charlotte   F  13065
4         Sophia   F  13036
...          ...  ..    ...
31448     Zykell   M      5
31449      Zylus   M      5
31450     Zymari   M      5
31451        Zyn   M      5
31452      Zyran   M      5

[31453 rows x 3 columns]

Joins

We can merge these two data sets on a primary key…

df_joined = (
  df_1920
  .merge(df_2020, 
         on = ["Name", "Sex"], 
         suffixes = ("_1920", "_2020")
         )
  )
          Name Sex  Count_1920  Count_2020
0         Mary   F       70975        2210
1      Dorothy   F       36645         562
2        Helen   F       35098         721
3     Margaret   F       27997        2190
4         Ruth   F       26100        1323
...        ...  ..         ...         ...
4473     Whitt   M           5          23
4474     Wyley   M           5           6
4475    Xavier   M           5        3876
4476      York   M           5          14
4477      Zeke   M           5         382

[4478 rows x 4 columns]

Missing Keys?

… but what happened to some of the names?

df_joined[df_joined["Name"] == "Maya"]
Empty DataFrame
Columns: [Name, Sex, Count_1920, Count_2020]
Index: []

Missing Keys?

Why isn’t Maya in the joined data? It’s in the 2020 data…

df_2020[df_2020["Name"] == "Maya"]
       Name Sex  Count
60     Maya   F   3724
28914  Maya   M      6

…but it’s not in the 1920 data.

df_1920[df_1920["Name"] == "Maya"]
Empty DataFrame
Columns: [Name, Sex, Count]
Index: []

Missing keys


How does the .merge() function determine which keys get kept?

By default, in order to appear in the joined data, a key must be present in both tables.

This is actually what is called an inner join, but there are other types of joins!

Other Types of Joins

Types of Joins

  • By default, Pandas does an inner join, which only keeps keys that are present in both tables.

  • An outer join keeps any key that is present in either table.

  • A left join keeps all keys in the left table, even if they are not in the right table. But any keys that are only in the right table are dropped.

  • A right join keeps all keys in the right table, even if they are not in the left table. But any keys that are only in the left table are dropped.

Types of Joins

A Venn diagram illustrating four types of joins using two overlapping circles in each diagram. The highlighted (green) areas represent the included data for each join type. Inner Join: Only the overlapping region of both circles is highlighted, representing the common data between the two tables. Left Join: The entire left circle, including its overlap with the right circle, is highlighted, representing all data from the left table and any matching data from the right table. Right Join: The entire right circle, including its overlap with the left circle, is highlighted, representing all data from the right table and any matching data from the left table. Outer Join: Both entire circles are highlighted, representing all data from both tables, including matches and non-matches.

Types of Joins

We can customize the type of join using the how = parameter of .merge(). By default, how = "inner".

df_joined_outer = (
  df_1920
  .merge(df_2020, 
         on = ["Name", "Sex"],
         suffixes = ("_1920", "_2020"), 
         how = "outer")
    )


df_joined_outer[df_joined_outer["Name"] == "Maya"]
       Name Sex  Count_1920  Count_2020
24999  Maya   F         NaN      3724.0
25000  Maya   M         NaN         6.0

Types of Joins

  • Note the missing values for other columns, like Count_1920!

  • What other type of join would have produced this output in the Maya row?

Types of Joins

  • Note the missing values for other columns, like Count_1920!

  • What other type of join would have produced this output in the Maya row?

df_joined_right = (
  df_1920
  .merge(df_2020, 
         on = ["Name", "Sex"],
         suffixes = ("_1920", "_2020"), 
         how = "right")
    )


df_joined_right[df_joined_right["Name"] == "Maya"]
       Name Sex  Count_1920  Count_2020
60     Maya   F         NaN        3724
28914  Maya   M         NaN           6
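To see where each row came from, `.merge()` also accepts `indicator = True`, which adds a `_merge` column marking each row as left_only, right_only, or both. A toy sketch:

```python
import pandas as pd

# Toy stand-ins for the 1920 and 2020 name tables
df_a = pd.DataFrame({"Name": ["Mary", "Zeke"], "Count": [70975, 5]})
df_b = pd.DataFrame({"Name": ["Mary", "Maya"], "Count": [2210, 3724]})

# indicator=True records each row's provenance in a _merge column
joined = df_a.merge(df_b, on="Name", how="outer",
                    suffixes=("_1920", "_2020"), indicator=True)
print(sorted(joined["_merge"].astype(str)))
# ['both', 'left_only', 'right_only']
```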

Quick Quiz

Which type of join would be best suited for each case?

  1. We want to determine the names that have increased in popularity the most between 1920 and 2020.
  2. We want to graph the popularity of names over time.
  3. We want to determine the names that have decreased in popularity the most between 1920 and 2020.

Filtering Joins

Filtering Joins

Inner, outer, left, and right are known as mutating joins, because they create new combined data sets.

  • There are two other types of joins that we use for filtering to get rid of some rows:

  • A semi-join tells us which keys in the left are present in the right.

  • An anti-join tells us which keys in the left are not present in the right.

Filtering Joins

A diagram illustrating the concept of a semi-join. On the left, two tables are shown: The first table (x) has two columns, with labeled values (x1, x2, x3) in the first column and numerical values (1, 2, 3) in the second column. The second table (y) has two columns, where the second column contains numerical values (4, 2, 1), and the first column has values (y3, y2, y1). Lines connect matching values between the two tables: 1 from x is found in y, as is 2, but 3 is not present in y. The result of the semi-join is shown in a table on the right, containing only the rows from x where a match was found in y. The output table has two columns: key and val_x, and it retains only the rows with keys 1 and 2 from x (x1 and x2), omitting x3 since 3 is not present in y.

A diagram illustrating the concept of an anti-join (anti_join(x, y)). On the left, two tables are shown: The first table (x) has two columns, with labeled values (x1, x2, x3) in the first column and numerical values (1, 2, 3) in the second column. The second table (y) has two columns, where the second column contains numerical values (4, 2, 1), and the first column has values (y3, y2, y1). Lines connect matching values between the two tables: 1 from x is found in y, as is 2, but 3 is not present in y. Red lines indicate rows that have matches, which are excluded in the anti-join. The result of the anti-join is shown in a table on the right, containing only the row from x where no match was found in y. The output table has two columns: key and val_x, and it retains only the row with key 3 from x (x3), omitting x1 and x2 since they had matches in y.

Filtering Joins

Which names existed in 1920 but don’t in 2020?

In Pandas, we can’t do these using .merge(), but we can use the .isin() method!

in_both = df_1920['Name'].isin(df_2020['Name'])
df_1920.loc[~in_both, 'Name']
67          Myrtle
245            Sue
257         Nannie
284         Virgie
300      Bernadine
           ...    
10748         Ynes
10750     Zaragoza
10751        Zearl
10752     Zeferino
10755      Zygmont
Name: Name, Length: 5638, dtype: object
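The same `.isin()` trick gives a semi-join if you keep the matches instead of dropping them (sketched with toy stand-in frames):

```python
import pandas as pd

# Toy stand-ins for the 1920 and 2020 name tables
names_1920 = pd.DataFrame({"Name": ["Mary", "Myrtle", "Zeke"]})
names_2020 = pd.DataFrame({"Name": ["Mary", "Maya", "Zeke"]})

in_both = names_1920["Name"].isin(names_2020["Name"])
semi = names_1920[in_both]   # semi-join: keys with a match in 2020
anti = names_1920[~in_both]  # anti-join: keys without a match in 2020
print(semi["Name"].tolist(), anti["Name"].tolist())
```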

Takeaways

Takeaways

  • A primary key is one or more columns that uniquely identify the rows.

  • We can join (a.k.a. merge) data sets if they share a primary key, or if one has a foreign key.

  • The default of .merge() is an inner join: only keys in both data sets are kept.

  • We can instead specify a left join, right join, or outer join; think about which rows we want to keep.

  • Filtering joins like anti-join and semi-join can help you answer questions about the data.

  • Use .isin() to see which keys in one dataset exist in the other.