Classification

Final Projects

Project Proposal - Due Sunday, February 23

  1. Your group member names.

  2. Information about the dataset(s) you intend to analyze:

    • Where are the data located?
    • Who collected the data and why?
    • What information (variables) are in the dataset?
  3. Research Questions: You should have one primary research question and a few secondary questions.

  4. Preliminary exploration of your dataset(s): A few simple plots or summary statistics that relate to the variables you plan to study.

Who are you working with?

Please take 3 minutes to fill out this form:

https://forms.gle/QMsNkkY1P7KQbzJq9

The story so far…

Choosing a Best Model

  • We select a best model - aka best prediction procedure - by cross-validation.

  • Feature selection: Which predictors should we include, and how should we preprocess them?

  • Model selection: Should we use Linear Regression or KNN or Decision Trees or something else?

  • Hyperparameter tuning: Choosing model-specific settings, like \(k\) for KNN.

  • Each candidate is a pipeline; use GridSearchCV() or cross_val_score() to score the options, as sketched below.
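For example, two candidates could be compared like this (a minimal sketch; X_train, y_train, and the candidate models here are hypothetical):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score

# Two candidate pipelines: different models, same preprocessing.
candidates = {
    "linear": make_pipeline(StandardScaler(), LinearRegression()),
    "knn_5": make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors = 5)),
}

# Score each candidate with 10-fold cross-validation and compare.
for name, pipe in candidates.items():
    print(name, cross_val_score(pipe, X_train, y_train, cv = 10).mean())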

Case Study: Breast Tissue Classification

Breast Tissue Classification

Electrical signals can be used to detect whether tissue is cancerous.

A medical illustration showing a breast cancer detection procedure using electrical impedance scanning. A patient is lying down while a scan probe is placed on the breast. An inset diagram highlights how the probe detects differences in impedance between normal breast adipose tissue (high impedance) and malignant lesions (low impedance). The probe is connected to a computer displaying a grid with white dots, likely representing detected areas of concern.

Analysis Goal

The goal is to determine whether a sample of breast tissue is:

Not Cancerous

  1. connective tissue
  2. adipose tissue
  3. glandular tissue

Cancerous

  1. carcinoma
  2. fibro-adenoma
  3. mastopathy

Reading in the Data

import pandas as pd
breast_df = pd.read_csv("https://datasci112.stanford.edu/data/BreastTissue.csv")
     Case # Class           I0  ...      Max IP          DR            P
0         1   car   524.794072  ...   60.204880  220.737212   556.828334
1         2   car   330.000000  ...   69.717361   99.084964   400.225776
2         3   car   551.879287  ...   77.793297  253.785300   656.769449
3         4   car   380.000000  ...   88.758446  105.198568   493.701814
4         5   car   362.831266  ...   69.389389  103.866552   424.796503
..      ...   ...          ...  ...         ...         ...          ...
101     102   adi  2000.000000  ...  204.090347  478.517223  2088.648870
102     103   adi  2600.000000  ...  418.687286  977.552367  2664.583623
103     104   adi  1600.000000  ...  103.732704  432.129749  1475.371534
104     105   adi  2300.000000  ...  178.691742   49.593290  2480.592151
105     106   adi  2600.000000  ...  154.122604  729.368395  2545.419744

[106 rows x 11 columns]

Variables of Interest

We will focus on two features:

  • \(I_0\): impedivity at 0 kHz,
  • \(PA_{500}\): phase angle at 500 kHz.

Visualizing the Data

Code
from plotnine import *

(
  ggplot(data = breast_df, 
         mapping = aes(x = "I0", y = "PA500", color = "Class")) +
  geom_point(size = 2) +
  theme_bw() +
  theme(legend_position = "top")
    )

K-Nearest Neighbors Classification

K-Nearest Neighbors

What would we predict for a new tissue sample with an \(I_0\) of 400 and a \(PA_{500}\) of 0.18?

X_train = breast_df[["I0", "PA500"]]
y_train = breast_df["Class"]

X_unknown = pd.DataFrame({"I0": [400], "PA500": [.18]})
X_unknown
    I0  PA500
0  400   0.18

K-Nearest Neighbors

Code
from plotnine import *

(
  ggplot() + 
  geom_point(data = breast_df, 
             mapping = aes(x = "I0", y = "PA500", color = "Class")) +
  geom_point(data = X_unknown, 
             mapping = aes(x = "I0", y = "PA500"), size = 3) +
  theme_bw() +
  theme(legend_position = "top")
)

K-Nearest Neighbors

This process is almost identical to KNN Regression:

Specify

from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors = 5, 
                         metric = "euclidean")
    )

Fit

pipeline = pipeline.fit(X_train, y_train)

Predict

pipeline.predict(X_unknown)
array(['car'], dtype=object)

Probabilities

For which of these two unknown points would we be more confident in our prediction?

Code
X_unknown = pd.DataFrame({"I0": [400, 2200], "PA500": [.18, 0.05]})

(
  ggplot() + 
  geom_point(data = breast_df, 
             mapping = aes(x = "I0", y = "PA500", color = "Class"), 
             size = 2) +
  geom_point(data = X_unknown, 
             mapping = aes(x = "I0", y = "PA500"), size = 3) +
  theme_bw() +
  theme(legend_position = "top")
)

Probabilities

Instead of returning a single predicted class, we can ask sklearn to return the predicted probabilities for each class.

pipeline.predict_proba(X_unknown)
array([[0. , 0.6, 0. , 0.2, 0. , 0.2],
       [1. , 0. , 0. , 0. , 0. , 0. ]])
pipeline.classes_
array(['adi', 'car', 'con', 'fad', 'gla', 'mas'], dtype=object)

Tip

How did Scikit-Learn calculate these predicted probabilities?
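One way to see where they come from (a sketch using the pipeline fitted above): with the default uniform weights, each “probability” is the fraction of the \(k = 5\) nearest neighbors that belong to that class.

scaler = pipeline.named_steps["standardscaler"]
knn = pipeline.named_steps["kneighborsclassifier"]

# Indices of the 5 nearest training points to each unknown point, in scaled space.
_, neighbor_idx = knn.kneighbors(scaler.transform(X_unknown))

# Tallying the neighbors' classes reproduces the rows of predict_proba().
for idx in neighbor_idx:
    print(y_train.iloc[idx].value_counts(normalize = True))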

Cross-Validation for Classification

We need a different scoring method for classification.

A simple scoring method is accuracy:

\[\text{accuracy} = \frac{\text{# correct predictions}}{\text{# predictions}}\]

Cross-Validation for Classification

from sklearn.model_selection import cross_val_score

scores = cross_val_score(
    pipeline, X_train, y_train,
    scoring = "accuracy",
    cv = 10)
    
scores
array([0.63636364, 0.81818182, 0.45454545, 0.54545455, 0.63636364,
       0.54545455, 0.5       , 0.6       , 0.4       , 0.7       ])

Cross-Validation for Classification

As before, we can get an overall estimate of test accuracy by averaging the cross-validation accuracies:

scores.mean()
np.float64(0.5836363636363637)


But! Accuracy is not always the best measure of a classification model!

Confusion Matrix

from sklearn.metrics import confusion_matrix

y_train_predicted = pipeline.predict(X_train)

confusion_matrix(y_train, y_train_predicted)
array([[20,  0,  2,  0,  0,  0],
       [ 0, 20,  0,  0,  0,  1],
       [ 2,  0, 11,  1,  0,  0],
       [ 0,  0,  0, 13,  1,  1],
       [ 0,  0,  0,  3, 12,  1],
       [ 0,  2,  0,  8,  3,  5]])

Remember, we fit this pipeline earlier:

pipeline = pipeline.fit(X_train, y_train)

Confusion Matrix with Classes

pd.DataFrame(confusion_matrix(y_train, y_train_predicted), 
             columns = pipeline.classes_, 
             index = pipeline.classes_)
     adi  car  con  fad  gla  mas
adi   20    0    2    0    0    0
car    0   20    0    0    0    1
con    2    0   11    1    0    0
fad    0    0    0   13    1    1
gla    0    0    0    3   12    1
mas    0    2    0    8    3    5

What group(s) were the hardest to predict?
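One way to quantify this (a sketch using the confusion matrix above): divide the diagonal by each row total to get the fraction of each class that was predicted correctly.

import numpy as np

# Diagonal = correct predictions per class; row totals = actual counts per class.
cm = confusion_matrix(y_train, y_train_predicted)
print(pd.Series(np.diag(cm) / cm.sum(axis = 1), index = pipeline.classes_))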

Activity

Activity

Use a grid search and the accuracy score to find the best k-value for this modeling problem.
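Here is a minimal sketch of one way to set this up (the range of candidate k values is an assumption):

from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(
    pipeline,
    param_grid = {"kneighborsclassifier__n_neighbors": list(range(1, 31))},
    scoring = "accuracy",
    cv = 10)

grid_search.fit(X_train, y_train)
print(grid_search.best_params_, grid_search.best_score_)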

Classification Metrics

Case Study: Credit Card Fraud

We have a data set of credit card transactions from Vesta.

df_fraud = pd.read_csv("https://datasci112.stanford.edu/data/fraud.csv")
            card4   card6 P_emaildomain  ...    C13    C14  isFraud
0            visa   debit     gmail.com  ...  637.0  114.0        0
1            visa   debit                ...    3.0    1.0        0
2            visa   debit     yahoo.com  ...    4.0    1.0        1
3            visa   debit   hotmail.com  ...    0.0    0.0        0
4            visa   debit     gmail.com  ...   20.0    1.0        0
...           ...     ...           ...  ...    ...    ...      ...
59049  mastercard   debit     gmail.com  ...    1.0    1.0        0
59050  mastercard  credit     yahoo.com  ...    1.0    1.0        0
59051  mastercard   debit    icloud.com  ...   15.0    2.0        0
59052        visa   debit     gmail.com  ...    1.0    1.0        1
59053  mastercard   debit                ...   84.0   17.0        0

[59054 rows x 19 columns]

Goal: Predict isFraud, where 1 indicates a fraudulent transaction.

Classification Model

We can use \(k\)-nearest neighbors for classification:

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer

ct = make_column_transformer(
        (OneHotEncoder(handle_unknown = "ignore", sparse_output = False), 
        ["card4", "card6", "P_emaildomain"]),
        remainder = "passthrough")
        
pipeline = make_pipeline(
  ct,
  StandardScaler(),
  KNeighborsClassifier(n_neighbors = 5)
  )

What is this transformer doing? What about the pipeline?
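To peek at what the transformer produces, a quick sketch (the name X_fraud is just for illustration):

# The three categorical columns are one-hot encoded; the rest pass through unchanged.
X_fraud = df_fraud.drop("isFraud", axis = "columns")
print(X_fraud.shape)
print(ct.fit_transform(X_fraud).shape)   # many more columns after one-hot encoding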

Training a Classifier

Isolating X and y for training data

X_train = df_fraud.drop("isFraud", axis = "columns")
y_train = df_fraud["isFraud"]
cross_val_score(
    pipeline,
    X = X_train, 
    y = y_train,
    scoring = "accuracy",
    cv = 10
    ).mean()
np.float64(0.9681816479631644)

How is the accuracy so high????

A Closer Look

Let’s take a closer look at the labels.

y_train.value_counts()
isFraud
0    56935
1     2119
Name: count, dtype: int64


The vast majority of transactions aren’t fraudulent!

Imbalanced Data

If we just predicted that every transaction is normal, the accuracy would be \(1 - \frac{2119}{59054} = 0.964\) or 96.4%.
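A quick check of this baseline (isFraud is 0/1, so its mean is the fraud rate):

print(1 - y_train.mean())   # approximately 0.964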


Even though such predictions would be accurate overall, they would be wrong for every fraudulent transaction.

A good model is “accurate for every class.”

Precision and Recall

We need a score that measures “accuracy for class \(c\)”!

There are at least two reasonable definitions:

Precision: \(P(\text{correct } | \text{ predicted class } c)\)

Among the observations that were predicted to be in class \(c\), what proportion actually were?

Recall: \(P(\text{correct } | \text{ actual class } c)\).

Among the observations that were actually in class \(c\), what proportion were predicted to be in class \(c\)?

Precision and Recall by Hand

A confusion matrix diagram with a 2x2 grid representing the performance of a classification model. The x-axis is labeled 'Actual Values' with categories 'Positive (1)' and 'Negative (0)'. The y-axis is labeled 'Predicted Values' with categories 'Positive (1)' and 'Negative (0)'. The four quadrants are labeled: 'TP' (True Positives) in the top-left, 'FP' (False Positives) in the top-right, 'FN' (False Negatives) in the bottom-left, and 'TN' (True Negatives) in the bottom-right.


Precision is calculated as \(\frac{\text{TP}}{\text{TP} + \text{FP}}\).



Recall is calculated as \(\frac{\text{TP}}{\text{TP} + \text{FN}}\).

Precision and Recall by Hand

To check our understanding of these definitions, let’s calculate a few precisions and recalls by hand.

But first we need to get the confusion matrix!


Code
pipeline.fit(X_train, y_train);
y_train_ = pipeline.predict(X_train)
confusion_matrix(y_train, y_train_)
array([[56814,   121],
       [ 1519,   600]])

Now Let’s Calculate!

Code
conf_mat = pd.DataFrame(confusion_matrix(y_train, y_train_), 
             columns = pipeline.classes_, 
             index = pipeline.classes_)
             
conf_mat["Total"] = conf_mat.sum(axis=1)
conf_mat.loc["Total"] = conf_mat.sum()

conf_mat
           0    1  Total
0      56814  121  56935
1       1519  600   2119
Total  58333  721  59054
  • What is the (training) accuracy?

  • What’s the precision for fraudulent transactions?

  • What’s the recall for fraudulent transactions?
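Once you have calculated these by hand, here is a sketch of how to check the answers with Scikit-Learn, using the training predictions from above:

from sklearn.metrics import accuracy_score, precision_score, recall_score

print(accuracy_score(y_train, y_train_))                  # overall training accuracy
print(precision_score(y_train, y_train_, pos_label = 1))  # precision for fraud (class 1)
print(recall_score(y_train, y_train_, pos_label = 1))     # recall for fraud (class 1)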

Trade-Off Between Precision and Recall

Can you imagine a classifier that always has 100% recall for class \(c\), no matter the data?

In general, if the model classifies more observations as \(c\),

  • recall (for class \(c\)) \(\uparrow\)

  • precision (for class \(c\)) \(\downarrow\)

How do we compare two classifiers, if one has higher precision and the other has higher recall?

F1 Score

The F1 score combines precision and recall into a single score:

\[\text{F1 score} = \text{harmonic mean of precision and recall}\] \[= \frac{2} {\left( \frac{1}{\text{precision}} + \frac{1}{\text{recall}}\right)}\]

  • To achieve a high F1 score, both precision and recall have to be high.

  • If either is low, then the harmonic mean will be low.
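A tiny illustration with hypothetical numbers: high precision cannot rescue low recall, because the harmonic mean is pulled toward the smaller value.

precision, recall = 0.9, 0.1           # hypothetical values
f1 = 2 / (1 / precision + 1 / recall)
print(f1)                              # 0.18, much closer to 0.1 than to 0.9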

Estimating Test Precision, Recall, and F1

  • Remember that each class has its own precision, recall, and F1.

  • But Scikit-Learn requires that the scoring parameter be a single metric.

  • For this, we can average each metric over the classes (see the sketch below):

    • "precision_macro"
    • "recall_macro"
    • "f1_macro"

F1 Score

cross_val_score(
    pipeline,
    X = X_train,
    y = y_train,
    scoring = "f1_macro",
    cv = 10
    ).mean()
np.float64(0.647384563849488)

Precision-Recall Curve

Another way to illustrate the trade-off between precision and recall is to graph the precision-recall curve.

First, we need the predicted probabilities.

y_train_probs_ = pipeline.predict_proba(X_train)
y_train_probs_
array([[1. , 0. ],
       [1. , 0. ],
       [0.4, 0.6],
       ...,
       [1. , 0. ],
       [0.8, 0.2],
       [1. , 0. ]], shape=(59054, 2))

Precision-Recall Curve

  • By default, Scikit-Learn classifies a transaction as fraud if the predicted probability of fraud is \(> 0.5\).

  • What if we instead used a threshold \(t\) other than \(0.5\)?

  • Depending on which \(t\) we pick, we’ll get a different precision and recall.

We can graph this trade-off!
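A minimal sketch of how the curve could be computed with precision_recall_curve(); the Colab on the next slide walks through this together.

from sklearn.metrics import precision_recall_curve

# Precision and recall at every threshold t, using the predicted fraud probabilities.
precision, recall, thresholds = precision_recall_curve(
    y_train, y_train_probs_[:, 1])

pr_df = pd.DataFrame({"precision": precision, "recall": recall})

(
  ggplot(data = pr_df, mapping = aes(x = "recall", y = "precision")) +
  geom_line() +
  theme_bw()
)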

Precision-Recall Curve

Let’s graph the precision-recall curve together in a Colab.

https://colab.research.google.com/drive/1T-0iQOQZFldHNmOXdZf4GU0b8j3kWMc_?usp=sharing

Takeaways

Takeaways

  • We can do KNN for classification by letting the nearest neighbors “vote.”

  • The proportion of votes among the neighbors serves as a “probability.”

  • A classification model must be evaluated differently than a regression model.

  • One possible metric is accuracy, but this is a bad choice in situations with imbalanced data.

  • Precision measures “if we say it’s in Class A, is it really?”

  • Recall measures “if it’s really in Class A, did we find it?”

  • The F1 score is a balance (the harmonic mean) of precision and recall.

  • The macro F1 score averages the F1 scores of all classes.