Dummy Variables and Column Transformers

The story last week…

Distances

We measure similarity between observations by calculating distances.

  • Euclidean distance: sum of squared differences, then square root

  • Manhattan distance: sum of absolute differences
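
In symbols, for two observations \(x\) and \(y\) with \(p\) features each:

\[ d_{\text{Euclidean}}(x, y) = \sqrt{\sum_{j=1}^{p} (x_j - y_j)^2} \qquad d_{\text{Manhattan}}(x, y) = \sum_{j=1}^{p} |x_j - y_j| \]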

scikit-learn

Use the pairwise_distances() function from sklearn.metrics to get back a 2D numpy array of distances.
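
For example, a minimal sketch (the two-row array X here is made up for illustration):

import numpy as np
from sklearn.metrics import pairwise_distances

X = np.array([[1.0, 2.0],
              [4.0, 6.0]])

# Entry [i, j] is the distance between row i and row j
pairwise_distances(X)                      # Euclidean (the default): [[0., 5.], [5., 0.]]
pairwise_distances(X, metric="manhattan")  # [[0., 7.], [7., 0.]]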

Scaling

It is important that all our features be on the same scale for distances to be meaningful.

  • Standardize: Subtract the mean (of the column) and divide by the standard deviation (of the column).

  • MinMax: Subtract the minimum value, divide by the range.
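
In symbols, for a column \(x\) with mean \(\bar{x}\), standard deviation \(s_x\), minimum \(x_{\min}\), and maximum \(x_{\max}\):

\[ x_{\text{std}} = \frac{x - \bar{x}}{s_x} \qquad x_{\text{minmax}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \]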

scikit-learn

Follow the specify - fit - transform code structure. In the specify step, create a StandardScaler() or MinMaxScaler() object.
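
For example, a minimal sketch of the pattern with MinMaxScaler (df_num here is a hypothetical dataframe of numeric columns):

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()                    # specify
scaler.fit(df_num)                         # fit: learns each column's min and max
df_num_scaled = scaler.transform(df_num)   # transform: rescales each column to [0, 1]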

Recall: Ames Housing data

import pandas as pd

df = pd.read_table("https://datasci112.stanford.edu/data/housing.tsv")
features = ["Gr Liv Area", "Bedroom AbvGr", "Full Bath", "Half Bath", "Bldg Type", "Neighborhood"]
df[features].head()
   Gr Liv Area  Bedroom AbvGr  Full Bath  Half Bath Bldg Type Neighborhood
0         1656              3          1          0      1Fam        NAmes
1          896              2          1          0      1Fam        NAmes
2         1329              3          1          1      1Fam        NAmes
3         2110              3          2          1      1Fam        NAmes
4         1629              3          2          1      1Fam      Gilbert

Distances and Categorical Variables

What about categorical variables?

Suppose we want to include the variable Bldg Type in our distance calculation…

df["Bldg Type"].value_counts()
Bldg Type
1Fam      2425
TwnhsE     233
Duplex     109
Twnhs      101
2fmCon      62
Name: count, dtype: int64


Then we need a way to calculate \((\texttt{1Fam} - \texttt{Twnhs})^2\).

Converting to Binary

Let’s instead think about a variable that summarizes whether or not an observation is a single-family home.

df["is_single_fam"] = df["Bldg Type"] == "1Fam"
df["is_single_fam"].value_counts()
is_single_fam
True     2425
False     505
Name: count, dtype: int64

What does a value of True represent? False?

Dummy Variables

When we transform a categorical variable into a binary (True / False) variable, we call the result a dummy variable, or we say the variable has been one-hot-encoded.

Remember that computers interpret logical values (True / False) the same as 1 / 0:

df["is_single_fam"] = df["is_single_fam"].astype("int")
df["is_single_fam"].value_counts()
is_single_fam
1    2425
0     505
Name: count, dtype: int64

Now we can do math! With 0/1 values, a squared difference like \((1 - 0)^2 = 1\) records a mismatch on the category, while \((1 - 1)^2 = 0\) records a match.

Specify

from sklearn.preprocessing import StandardScaler
from sklearn.metrics import pairwise_distances

scaler = StandardScaler()

Fit

df_orig = df[['Gr Liv Area', 'Bedroom AbvGr', 'is_single_fam']]
scaler.fit(df_orig)
StandardScaler()

Transform

df_scaled = scaler.transform(df_orig)

Calculating Distances

# Distances from house 1707 (our reference house) to every house
dists = pairwise_distances(df_scaled[[1707]], df_scaled)

# Sort by distance and keep the nine nearest neighbors;
# position 0 is house 1707 itself, so we skip it
best = (
  dists
  .argsort()
  .flatten()
  [1:10]
  )
df_orig.iloc[best]
      Gr Liv Area  Bedroom AbvGr  is_single_fam
160          2978              5              1
909          3082              5              1
1288         2792              5              1
2350         2784              5              1
253          3222              5              1
585          2640              5              1
2027         2526              5              1
2330         3390              5              1
436          2520              5              1

Looking back…

Where have you seen one-hot-encoded variables already?

Let’s reset the dataset now…

df = pd.read_table("https://datasci112.stanford.edu/data/housing.tsv")

Dummifying Variables


  • What if we don’t just want to study is_single_fam, but rather, all categories of the Bldg Type variable?

  • In principle, we just make dummy variables for each category: is_single_fam, is_twnhse, etc.

  • Each category becomes one column, with 0’s and 1’s to show if the observation in that row matches that category.

  • That sounds pretty tedious, especially if you have a lot of categories…

  • Luckily, we have shortcuts in both pandas and sklearn!

Dummifying in Pandas

pd.get_dummies(df[["Bldg Type"]])
      Bldg Type_1Fam  Bldg Type_2fmCon  ...  Bldg Type_Twnhs  Bldg Type_TwnhsE
0               True             False  ...            False             False
1               True             False  ...            False             False
2               True             False  ...            False             False
3               True             False  ...            False             False
4               True             False  ...            False             False
...              ...               ...  ...              ...               ...
2925            True             False  ...            False             False
2926            True             False  ...            False             False
2927            True             False  ...            False             False
2928            True             False  ...            False             False
2929            True             False  ...            False             False

[2930 rows x 5 columns]


Some things to notice here…

  1. What is the naming convention for the new columns?

  2. Does this change the original dataframe df? If not, what would you need to do to add this information back in? (One approach is sketched after this list.)

  3. What happens if you put the whole dataframe into the get_dummies function? What problems might arise from this?
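
On question 2: pd.get_dummies() returns a new dataframe and leaves df unchanged. One possible way to attach the new columns back (a sketch, not the only approach):

dummies = pd.get_dummies(df[["Bldg Type"]])
df = pd.concat([df, dummies], axis=1)   # column-bind the dummies onto df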

Dummifying in sklearn

Specify

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()

Fit

encoder.fit(df[["Bldg Type"]])
OneHotEncoder()

Transform

df_bldg = encoder.transform(df[["Bldg Type"]])
df_bldg
<Compressed Sparse Row sparse matrix of dtype 'float64'
    with 2930 stored elements and shape (2930, 5)>

Dummifying in sklearn

df_bldg.todense()
matrix([[1., 0., 0., 0., 0.],
        [1., 0., 0., 0., 0.],
        [1., 0., 0., 0., 0.],
        ...,
        [1., 0., 0., 0., 0.],
        [1., 0., 0., 0., 0.],
        [1., 0., 0., 0., 0.]], shape=(2930, 5))
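
The dense matrix has no column labels. To recover them, a sketch using the fitted encoder's get_feature_names_out() method (with .toarray(), which returns an ordinary array rather than a matrix):

import pandas as pd

pd.DataFrame(
    df_bldg.toarray(),
    columns=encoder.get_feature_names_out()
)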

Things to notice:

  1. What object type was the result?

  2. Does this change the original dataframe df? If not, what would you need to do to add this information back in?

  3. What pros and cons do you see for the pandas approach vs the sklearn approach?

Column Transformers

Preprocessing

So far, we have seen two preprocessing steps that might need to happen before doing an analysis of distances:

  1. Scaling the quantitative variables
  2. Dummifying the categorical variables

Preprocessing steps are things you do only to make the following analysis/visualization better.

  • This is not the same as data cleaning, where you make changes to fix the data (e.g., changing data types).
  • This is not the same as data wrangling, where you change the structure of the data (e.g., adding or deleting rows or columns).

Lecture 4.1 Quiz

Identify the following as cleaning, wrangling, or preprocessing:

  1. Removing the $ symbol from a column and converting it to numeric.

  2. Narrowing your data down to only first class Titanic passengers, because you are not studying the others.

  3. Converting a Zip Code variable from numeric to categorical using .astype().

  4. Creating a new column called n_investment that counts the number of people who invested in a project.

  5. Log-transforming a column because it is very skewed.

Preprocessing in sklearn

  • Unlike cleaning and wrangling, the preprocessing steps are “temporary” changes to the dataframe.
  • It would be nice if we could trigger these changes as part of our analysis, instead of doing them “by hand”.

    • This is why the specify - fit - transform process is useful!

    • We will first specify all our preprocessing steps.

    • Then we will fit all the preprocessing steps at once.

    • Then we will save the transform step for only when we need it.

Column Transformers – Specify

from sklearn.compose import make_column_transformer


preproc = make_column_transformer(
  (OneHotEncoder(), ["Bldg Type", "Neighborhood"]),
    remainder = "passthrough")

Column Transformers – Fit

preproc.fit(df[features])
ColumnTransformer(remainder='passthrough',
                  transformers=[('onehotencoder', OneHotEncoder(),
                                 ['Bldg Type', 'Neighborhood'])])

Column Transformers – Transform

preproc.transform(df[features])
<Compressed Sparse Row sparse matrix of dtype 'float64'
    with 15717 stored elements and shape (2930, 37)>
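
Where do the 37 columns come from? Bldg Type contributes 5 dummy columns (its 5 categories), the 4 remaining numeric features pass through, so Neighborhood must account for the other 28 (5 + 28 + 4 = 37). You can check the names on the fitted transformer:

preproc.get_feature_names_out()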

Things to notice…

  1. What submodule did we import make_column_transformer from?

  2. What are the two arguments to the make_column_transformer() function? What object structures are they?

  3. What happens if you fit and transform on the whole dataset, not just df[features]? Why might this be useful?

Lecture Activity 4.2

Try the following:

  1. What happens if you change remainder = "passthrough" to remainder = "drop"?

  2. What happens if you add the argument sparse_output = False to the OneHotEncoder() function?

  3. What happens if you add this line before the transform step: preproc.set_output(transform = "pandas")? (Keep sparse_output = False when you try this.)

Multiple Preprocessing Steps

Why are column transformers so useful? We can do multiple preprocessing steps at once!

from sklearn.preprocessing import StandardScaler

preproc = make_column_transformer(
        (StandardScaler(), ["Gr Liv Area"]),
        (OneHotEncoder(sparse_output = False), ["Bldg Type", "Neighborhood"]),
        remainder = "passthrough")

Fit!

(
  preproc
  .fit(df[features])
  .set_output(transform = "pandas")
)
ColumnTransformer(remainder='passthrough',
                  transformers=[('standardscaler', StandardScaler(),
                                 ['Gr Liv Area']),
                                ('onehotencoder',
                                 OneHotEncoder(sparse_output=False),
                                 ['Bldg Type', 'Neighborhood'])])

Transform!

df_transformed = preproc.transform(df[features])
df_transformed
      standardscaler__Gr Liv Area  ...  remainder__Half Bath
0                        0.309265  ...                     0
1                       -1.194427  ...                     0
2                       -0.337718  ...                     1
3                        1.207523  ...                     1
4                        0.255844  ...                     1
...                           ...  ...                   ...
2925                    -0.982723  ...                     0
2926                    -1.182556  ...                     0
2927                    -1.048015  ...                     0
2928                    -0.219006  ...                     0
2929                     0.989884  ...                     1

[2930 rows x 37 columns]

Finding All Categorical Variables

What if we want to tell sklearn, “Please dummify every categorical variable”? Use a selector instead of exact column names!

import numpy as np
from sklearn.compose import make_column_selector

preproc = make_column_transformer(
    (StandardScaler(),  
     make_column_selector(dtype_include = np.number)
     ),
    (OneHotEncoder(sparse_output = False), 
     make_column_selector(dtype_include = object)
     ),
    remainder = "passthrough")

Fit!

(
  preproc
  .fit(df[features])
  .set_output(transform = "pandas")
)
ColumnTransformer(remainder='passthrough',
                  transformers=[('standardscaler', StandardScaler(),
                                 <sklearn.compose._column_transformer.make_column_selector object at 0x115366410>),
                                ('onehotencoder',
                                 OneHotEncoder(sparse_output=False),
                                 <sklearn.compose._column_transformer.make_column_selector object at 0x115366fb0>)])

Transform!

df_transformed = preproc.transform(df[features])
df_transformed
      standardscaler__Gr Liv Area  ...  onehotencoder__Neighborhood_Veenker
0                        0.309265  ...                                  0.0
1                       -1.194427  ...                                  0.0
2                       -0.337718  ...                                  0.0
3                        1.207523  ...                                  0.0
4                        0.255844  ...                                  0.0
...                           ...  ...                                  ...
2925                    -0.982723  ...                                  0.0
2926                    -1.182556  ...                                  0.0
2927                    -1.048015  ...                                  0.0
2928                    -0.219006  ...                                  0.0
2929                     0.989884  ...                                  0.0

[2930 rows x 37 columns]

Think about it

  • What are the advantages of using a selector?

  • What are the possible disadvantages of using a selector?

  • Does the order matter when using selectors? Try switching the steps and see what happens!

Takeaways


  • We dummify or one-hot-encode categorical variables to make them numbers.

  • We can do this with pd.get_dummies() or with OneHotEncoder() from sklearn.

  • Column Transformers let us apply multiple preprocessing steps at the same time.

    • Think about which variables you want to apply the steps to
    • Think about options for the steps, like sparseness
    • Think about passthrough in your transformer
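
Putting it all together, a minimal end-to-end sketch of this lecture's workflow (house 1707 is the same reference house used earlier):

import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.metrics import pairwise_distances

df = pd.read_table("https://datasci112.stanford.edu/data/housing.tsv")
features = ["Gr Liv Area", "Bedroom AbvGr", "Full Bath", "Half Bath",
            "Bldg Type", "Neighborhood"]

# Specify: scale the numeric columns, dummify the categorical ones
preproc = make_column_transformer(
    (StandardScaler(), make_column_selector(dtype_include=np.number)),
    (OneHotEncoder(sparse_output=False), make_column_selector(dtype_include=object)))

# Fit and transform in one step, then find the nine nearest neighbors
X = preproc.fit_transform(df[features])
dists = pairwise_distances(X[[1707]], X)
best = dists.argsort().flatten()[1:10]   # position 0 is house 1707 itself
df[features].iloc[best]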