Unsupervised Learning with K-Means

Plan for the rest of the quarter

Weeks 8, 9, & 10

Week 8

  • Last week of models!
    • Statistical Learning Model: K-Means
    • Data Model: Joins

Week 9

  • Data Ethics
    • Data Context
    • Model Ethics
  • Project Work Sessions

Week 10

  • Final Posters
  • Practice Final
  • Exam 2

The story so far…

Predictive Modeling

  • In predictive modeling, a.k.a. supervised machine learning, we have a target variable we want to predict.

  • We expect to have observations where we know the predictors but not the target.

  • Our goal is to choose a modeling procedure to guess the value of the target variable based on the predictors.

  • We use cross-validation to estimate the test error of various procedure options.

  • We might compare different:

  1. feature sets
  2. preprocessing choices
  3. model specifications / algorithms
  4. tuning parameters

Unsupervised Learning

Unsupervised Learning

  • In unsupervised situations, we do not have a target variable \(y\).

  • We do still have features that we observe.

  • Our goal: find interesting structure in the features we do observe.

Supervised vs Unsupervised

Think of children playing with Legos. They might be supervised by parents who help them follow instructions, or they might be left alone to build whatever they want!

Clustering

  • The most common kind of unsupervised learning is clustering.

  • The goal is to use the observed features (columns) to sort the observations (rows) into similar clusters (groups).

  • For example: Suppose I take all of your grades in the gradebook as features and then use these to find clusters of students. These clusters might represent…

    • people who studied together
    • people who are in the same section
    • people who have the same major or background
    • … or none of the above!

Applications for Clustering

  • Ecology: An ecologist wants to group organisms into types to define different species. (rows = organisms; features = habitat, size, etc.)

  • Biology: A geneticist wants to know which groups of genes tend to be activated at the same time. (rows = genes; features = activation at certain times)

  • Market Segmentation: A business wants to group their customers into types. (rows = customers, features = age, location, etc.)

  • Language: A linguist might want to identify different uses of ambiguous words like “set” or “run”. (rows = words; features = other words they are used with)

  • Documents: A historian might want to find groups of articles that are on similar topics. (rows = articles; features = tf-idf transformed n-grams)

K-Means

The K-Means Algorithm

Idea: Two observations are similar if they are close in distance.

Does this sound familiar?!


Q: How do we find “groups” of observations?

A: We should look for groups of observations that are all close to the same centroid (the average point of a group).

The K-Means Algorithm

Procedure (3-means):

  1. Choose 3 random observations to be the initial centroids.

  2. For each observation, determine which is the closest centroid.

  3. Create 3 clusters based on the closest centroid.

  4. Find the new centroid of each cluster.

  5. Repeat steps 2–4 until the clusters stop changing.
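Here is a minimal self-contained sketch of this loop in plain NumPy, on made-up 2-D data (all names here are illustrative, and the sketch ignores the empty-cluster edge case):

import numpy as np

rng = np.random.default_rng(1234)
X = rng.normal(size = (300, 2))  # stand-in data; any (n, 2) array works

# 1. choose 3 random observations as the initial centroids
centroids = X[rng.choice(len(X), size = 3, replace = False)]

while True:
  # 2. distance from every observation to every centroid
  dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis = 2)
  # 3. assign each observation to its closest centroid
  labels = dists.argmin(axis = 1)
  # 4. recompute each centroid as the mean of its cluster
  new_centroids = np.array([X[labels == k].mean(axis = 0) for k in range(3)])
  # 5. stop once the centroids (and hence the clusters) no longer change
  if np.allclose(new_centroids, centroids):
    break
  centroids = new_centroids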

Example: Penguin Data

Code
df_penguins = pd.read_csv("https://dlsun.github.io/stats112/data/penguins.csv")
df_penguins
       species     island  bill_length_mm  ...  body_mass_g     sex  year
0       Adelie  Torgersen            39.1  ...       3750.0    male  2007
1       Adelie  Torgersen            39.5  ...       3800.0  female  2007
2       Adelie  Torgersen            40.3  ...       3250.0  female  2007
3       Adelie  Torgersen             NaN  ...          NaN     NaN  2007
4       Adelie  Torgersen            36.7  ...       3450.0  female  2007
..         ...        ...             ...  ...          ...     ...   ...
339  Chinstrap      Dream            55.8  ...       4000.0    male  2009
340  Chinstrap      Dream            43.5  ...       3400.0  female  2009
341  Chinstrap      Dream            49.6  ...       3775.0    male  2009
342  Chinstrap      Dream            50.8  ...       4100.0    male  2009
343  Chinstrap      Dream            50.2  ...       3775.0  female  2009

[344 rows x 8 columns]

Penguin Data: Investigating Missing Values

It looks like there are missing values in these data…

df_penguins.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object 
 7   year               344 non-null    int64  
dtypes: float64(4), int64(1), object(3)
memory usage: 21.6+ KB

Penguin Data: Removing Missing Values

For this analysis, we are interested in a penguin’s bill length and flipper length, so I will drop the rows with missing values in those columns.

df_penguins = (
  df_penguins
  .dropna(subset = ["bill_length_mm", "flipper_length_mm"])
  )

Penguin Data: Plot

Code
(
  ggplot(data = df_penguins, 
         mapping = aes(x = "bill_length_mm", y = "flipper_length_mm")) + 
         geom_point() + 
         theme_bw() +
         labs(x = "Bill Length (mm)", 
              y = "Flipper Length (mm)"
              )
  )

Step 0: Standardize the data



Why is this important? K-means groups observations by Euclidean distance, so a feature measured on a larger scale would dominate the distance calculation.
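We can see the scale difference directly (a quick check, using the df_penguins columns we kept above):

# the feature with the larger spread dominates unscaled Euclidean distances
df_penguins[["bill_length_mm", "flipper_length_mm"]].std()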

Step 0: Standardize the data

Specify

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().set_output(transform = "pandas")

Fit & Transform

df_scaled = (
  scaler
  .fit_transform(df_penguins[["bill_length_mm", "flipper_length_mm"]])
  )

df_scaled.head()
   bill_length_mm  flipper_length_mm
0       -0.884499          -1.418347
1       -0.811126          -1.062250
2       -0.664380          -0.421277
4       -1.324737          -0.563715
5       -0.847812          -0.777373

Step 1: Choose 3 random points to be centroids

centroids = df_scaled.sample(n = 3, random_state = 1234)
centroids.index = ["orange", "purple", "green"]

centroids
        bill_length_mm  flipper_length_mm
orange       -0.425917          -0.634935
purple        0.674678          -0.421277
green        -0.077396           0.860670

Step 1: Choose 3 random points to be centroids

Code
(
  ggplot(data = df_scaled, 
         mapping = aes(x = "bill_length_mm", y = "flipper_length_mm")) + 
         geom_point() + 
         geom_point(data = centroids, color = centroids.index, size = 4) +
         theme_bw() +
         labs(x = "Bill Length (mm)", 
              y = "Flipper Length (mm)"
              )
  )

Step 2: Assign each point to nearest centroid

from sklearn.metrics import pairwise_distances

dists = pairwise_distances(df_scaled, centroids)
dists[1:5]
array([[0.57531222, 1.61816529, 2.0581506 ],
       [0.32017798, 1.3390574 , 1.40994282],
       [0.90163652, 2.00448173, 1.89333953],
       [0.44529088, 1.5635793 , 1.8101736 ]])

Step 2: Assign each point to nearest centroid

closest_centroid = dists.argmin(axis = 1)
closest_centroid
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 2, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1])

axis = 1 operates on each row, so argmin returns, for each observation, the position of the closest centroid.

Step 2: Assign each point to nearest centroid

df_scaled.index = centroids.index[closest_centroid]
df_scaled.head(n = 10)
        bill_length_mm  flipper_length_mm
orange       -0.884499          -1.418347
orange       -0.811126          -1.062250
orange       -0.664380          -0.421277
orange       -1.324737          -0.563715
orange       -0.847812          -0.777373
orange       -0.921185          -1.418347
orange       -0.866155          -0.421277
orange       -1.801661          -0.563715
orange       -0.352544          -0.777373
orange       -1.122961          -1.062250

Step 2: Assign each point to nearest centroid

Code
(
  ggplot(data = df_scaled, 
         mapping = aes(x = "bill_length_mm", y = "flipper_length_mm")) + 
         geom_point(color = df_scaled.index) + 
         geom_point(data = centroids, color = centroids.index, size = 3) +
         theme_bw() +
         labs(x = "Bill Length (mm)", 
              y = "Flipper Length (mm)"
              )
  )

Step 3: Find new centroids

centroids = df_scaled.groupby(df_scaled.index).mean()
centroids
        bill_length_mm  flipper_length_mm
green         0.628892           1.140501
orange       -0.957138          -0.821529
purple        0.980022          -0.332526


Warning

Are these centroids observations in df_scaled? No: each new centroid is the average of its cluster, so it is generally not an actual observation.

Step 3: Find new centroids

Code
(
  ggplot(data = df_scaled, 
         mapping = aes(x = "bill_length_mm", y = "flipper_length_mm")) + 
         geom_point(color = df_scaled.index) + 
         geom_point(data = centroids, color = centroids.index, size = 3) +
         theme_bw() +
         labs(x = "Bill Length (mm)", 
              y = "Flipper Length (mm)"
              )
  )

Step 4: Repeat over and over!

for i in range(1, 6):
  dists = pairwise_distances(df_scaled, centroids)
  closest_centroid = dists.argmin(axis = 1)
  df_scaled.index = centroids.index[closest_centroid]
  centroids = df_scaled.groupby(df_scaled.index).mean()
  print(centroids)
        bill_length_mm  flipper_length_mm
green         0.656908           1.142766
orange       -0.960807          -0.817256
purple        0.938075          -0.370088
        bill_length_mm  flipper_length_mm
green         0.666589           1.147791
orange       -0.958236          -0.808502
purple        0.938075          -0.370088
        bill_length_mm  flipper_length_mm
green         0.666589           1.147791
orange       -0.958236          -0.808502
purple        0.938075          -0.370088
        bill_length_mm  flipper_length_mm
green         0.666589           1.147791
orange       -0.958236          -0.808502
purple        0.938075          -0.370088
        bill_length_mm  flipper_length_mm
green         0.666589           1.147791
orange       -0.958236          -0.808502
purple        0.938075          -0.370088

At what point do we stop finding new centroids / reassigning points?
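One standard stopping rule, sketched with the objects defined above: keep iterating until the centroids no longer change. (Exact equality is fine here, because once the assignments stop changing, the group means repeat exactly.)

prev_centroids = None
while prev_centroids is None or not centroids.equals(prev_centroids):
  prev_centroids = centroids
  dists = pairwise_distances(df_scaled, centroids)
  closest_centroid = dists.argmin(axis = 1)
  df_scaled.index = centroids.index[closest_centroid]
  centroids = df_scaled.groupby(df_scaled.index).mean()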

K-means in sklearn

Specify

from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

features = ["bill_length_mm", "flipper_length_mm"]

model = KMeans(n_clusters = 3, random_state = 1234)

pipeline = make_pipeline(
    StandardScaler(),
    model
)

Fit

pipeline.fit(df_penguins[features]);
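Once fit, the pipeline can also produce cluster assignments directly; predict scales the features and then assigns each row to the nearest learned centroid (a quick sketch):

# cluster labels for the training data; should match model.labels_ below
pipeline.predict(df_penguins[features])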

K-means in sklearn

clusters = model.labels_
clusters
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 0, 2, 2, 2, 2, 2, 2, 2, 0, 2, 0, 2,
       2, 2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 0, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2], dtype=int32)

Interpreting K-Means

The key takeaway here is the cluster centers:

centroids = model.cluster_centers_
centroids
array([[-0.95823619, -0.80850204],
       [ 0.66658932,  1.14779076],
       [ 0.93807532, -0.37008779]])
  • Cluster 0 has a short bill and short flippers.

  • Cluster 1 has a medium bill and long flippers.

  • Cluster 2 has a long bill and fairly average flippers.
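Because the centers are in standardized units, we can map them back to millimeters with the scaler inside the pipeline (a sketch; "standardscaler" is the step name that make_pipeline assigns automatically):

# undo the standardization to read the centers in original units (mm)
fitted_scaler = pipeline.named_steps["standardscaler"]
fitted_scaler.inverse_transform(model.cluster_centers_)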

Interpreting K-Means

We also might check if these clusters match any labels that we already know:

Code
results = pd.DataFrame({
  "cluster": clusters,
  "species": df_penguins['species']
})

(
  results
  .groupby("species")["cluster"]
  .value_counts()
  .unstack()
  .fillna(0)
  )
cluster        0      1     2
species                      
Adelie     146.0    1.0   4.0
Chinstrap    5.0    4.0  59.0
Gentoo       0.0  122.0   1.0

Cluster 0 is almost entirely Adelie, cluster 1 is almost entirely Gentoo, and cluster 2 is mostly Chinstrap: the clusters recover the species quite well.

Your Turn

Activity

  1. Fit a 3-means model using all the numeric features in the penguins data.
  • Describe what each cluster represents.

  • Do these clusters match up to the species?

  2. Next, fit a 5-means model.
  • Do those clusters match up to species and island?

Takeaways

Takeaways

  • Unsupervised learning is a way to find structure in data.

  • K-means is the most common clustering method.

  • We have to choose K ahead of time.

This is a big problem!

Why can’t we tune K the way we tuned other modeling choices? (Hint: with no target variable, there is no test error to cross-validate.)