Logistic Regression

The story this week…

Classification

  • We can do KNN for Classification by letting the nearest neighbors “vote”.

  • The proportion of votes can be interpreted as a “probability”.

  • A classification model must be evaluated differently than a regression model.

  • One possible metric is accuracy, but this is a bad choice in situations with imbalanced data.

  • Precision measures “if we say it’s in Class A, is it really?”

  • Recall measures “if it’s really in Class A, did we find it?”

  • F1 Score is a balance of precision and recall.

  • Macro F1 Score averages the F1 scores of all classes.
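
As a quick reminder of how these metrics look in code, here is a sketch with made-up labels (the sklearn.metrics functions below compute each metric directly):

from sklearn.metrics import precision_score, recall_score, f1_score

# Made-up true and predicted labels, just to illustrate the metrics.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]

precision_score(y_true, y_pred)              # of predicted 1s, how many are truly 1?
recall_score(y_true, y_pred)                 # of true 1s, how many did we find?
f1_score(y_true, y_pred)                     # harmonic mean of precision and recall
f1_score(y_true, y_pred, average = "macro")  # average of the F1 scores of both classes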

Revisiting the Breast Cancer Data

Breast Tissue Classification

Electrical signals can be used to detect whether tissue is cancerous.

[Figure: electrical impedance scanning for breast cancer detection. A scan probe on the breast, connected to a computer display, detects the difference between normal adipose tissue (high impedance) and malignant lesions (low impedance).]

Analysis Goal

The goal is to determine whether a sample of breast tissue is:

Not Cancerous

  1. connective tissue
  2. adipose tissue
  3. glandular tissue

Cancerous

  1. carcinoma
  2. fibro-adenoma
  3. mastopathy

Binary response: Cancer or Not

Let’s read the data, and also make a new variable called “Cancerous”.

import pandas as pd

df = pd.read_csv("https://datasci112.stanford.edu/data/BreastTissue.csv")

# "car", "fad", "mas" = carcinoma, fibro-adenoma, mastopathy (the cancerous classes)
cancer_levels = ["car", "fad", "mas"]
df['Cancerous'] = df['Class'].isin(cancer_levels)
   Case # Class          I0  ...          DR           P  Cancerous
0       1   car  524.794072  ...  220.737212  556.828334       True
1       2   car  330.000000  ...   99.084964  400.225776       True
2       3   car  551.879287  ...  253.785300  656.769449       True
3       4   car  380.000000  ...  105.198568  493.701814       True
4       5   car  362.831266  ...  103.866552  424.796503       True

[5 rows x 12 columns]

Why not use “regular” regression?

You should NOT use ordinary regression for a classification problem! This section exists to show you why it does not work.

Counter-Example: Linear Regression

We know that in computers, True = 1 and False = 0. So, why not convert our response variable, Cancerous, to numbers and fit a regression?

df['Cancerous'] = df['Cancerous'].astype('int')
df.head()
   Case # Class          I0  ...          DR           P  Cancerous
0       1   car  524.794072  ...  220.737212  556.828334          1
1       2   car  330.000000  ...   99.084964  400.225776          1
2       3   car  551.879287  ...  253.785300  656.769449          1
3       4   car  380.000000  ...  105.198568  493.701814          1
4       5   car  362.831266  ...  103.866552  424.796503          1

[5 rows x 12 columns]

Counter-Example: Linear Regression

from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(
  LinearRegression()
  )

pipeline = pipeline.fit(X = df[["I0", "PA500"]],
                        y = df['Cancerous'])

Counter-Example: Linear Regression

Problem 1: Did we get “reasonable” predictions?

pred_cancer = pipeline.predict(df[["I0", "PA500"]])


pred_cancer.min()
np.float64(-0.2666656752041747)
pred_cancer.max()
np.float64(1.144397886160058)

Counter-Example: Linear Regression

Problem 2: How do we translate these predictions into categories???

pred_cancer = pipeline.predict(df[["I0", "PA500"]])
pred_cancer
array([ 0.74019045,  0.89022436,  0.82501565,  0.90210945,  0.82407215,
        0.70880368,  0.73084215,  0.75641256,  0.81285854,  0.84265773,
        1.05274916,  0.83203994,  0.82221736,  0.98879611,  0.84321428,
        1.14439789,  0.87065951,  0.82357191,  0.88721004,  0.86017708,
        0.75973787,  0.56958917,  0.50364921,  0.84044159,  0.66291026,
        0.5291845 ,  0.58770322,  0.58565156,  0.53944124,  0.54086431,
        0.626959  ,  0.60392716,  0.84459104,  0.65514535,  0.83767168,
        0.73230038,  0.82568408,  0.75333313,  0.54012943,  0.54183212,
        0.50716066,  0.70881584,  0.52734464,  0.66456375,  0.73629418,
        0.85237593,  0.84451578,  0.62061155,  0.60509501,  0.47440789,
        0.78797208,  0.74240828,  0.91841366,  0.71160435,  0.63338505,
        0.7122256 ,  0.82410488,  0.55742465,  0.62545421,  0.73912902,
        0.73912902,  0.70136245,  0.78113432,  0.77907358,  0.51597274,
        0.49247652,  0.65694204,  0.72574607,  0.76292245,  0.82501906,
        0.04266066,  0.24283615,  0.30785716,  0.52977377,  0.12835422,
        0.35012437,  0.39042852,  0.34447888,  0.10874534,  0.22844426,
        0.35125209,  0.25457208,  0.12820727,  0.18275452, -0.06826015,
       -0.02235371,  0.05662524, -0.02161184,  0.03275109, -0.03722978,
        0.05434445, -0.11400134,  0.0985149 , -0.01465065,  0.05067869,
        0.04211471,  0.05774736, -0.26666568, -0.13914905, -0.12585178,
       -0.02264956,  0.06060737,  0.04839979,  0.12583274, -0.17299233,
       -0.22474136])
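
One ad hoc fix is to threshold the predictions, as in the sketch below (the 0.5 cutoff is an arbitrary choice). But this patch does nothing about the problems that follow.

# Convert the numeric predictions to 0/1 by thresholding at 0.5.
pred_category = (pred_cancer > 0.5).astype(int)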

Counter-Example: Linear Regression

Problem 3: Was the relationship really linear???

from plotnine import *

(
  ggplot(data = df, 
         mapping = aes(x = "I0", y = "Cancerous")) + 
  geom_point() + 
  geom_smooth(method = "lm", se = False) + 
  theme_bw()
)

Counter-Example: Linear Regression

Problem 4: Are the errors really random???

residuals = df['Cancerous'] - pred_cancer

(
  ggplot(data = df, 
         mapping = aes(x = "I0", y = residuals)) + 
  geom_point() +
  theme_bw() +
  labs(y = "Linear Regression Residuals")
  ) 

Counter-Example: Linear Regression

Problem 5: Are the errors normally distributed???

(
  ggplot(data = df, 
         mapping = aes(x = residuals)) + 
  geom_density() +
  theme_bw() +
  labs(x = "Residual from Linear Regression Model")
  )

Logistic Regression

Logistic Regression

Idea: Instead of predicting 0 or 1, try to predict the probability of cancer.

  • Problem: We don’t observe probabilities before diagnosis; we only know if that person ended up with cancer or not.

  • Solution: (Fancy statistics and math.)

  • Why is it called Logistic Regression?

  • Because the “fancy math” uses a logistic function in it.
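
A glimpse of that logistic function (a sketch; the input values below are arbitrary): it squashes any real number, such as a linear combination of the predictors, into the interval (0, 1), so the output can be read as a probability.

import numpy as np

def logistic(z):
    # The logistic (sigmoid) function maps any real number into (0, 1).
    return 1 / (1 + np.exp(-z))

# A linear combination of predictors can be any real number...
z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
# ...but its logistic transform is always a valid probability.
logistic(z).round(3)  # approximately [0.018, 0.269, 0.5, 0.731, 0.982]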

Logistic Regression

What you need to know:

  • It’s used for binary classification problems.

  • The predicted values are the “log-odds” of having cancer, i.e.

\[\text{log-odds} = \log \left(\frac{p}{1-p}\right)\]

  • We are more interested in the predicted probabilities.

  • As with KNN, we predict categories by choosing a threshold.

  • By default, if \(p > 0.5\), we predict cancer.
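
Here \(p\) can be recovered from the log-odds by inverting the formula above:

\[p = \frac{e^{\text{log-odds}}}{1 + e^{\text{log-odds}}} = \frac{1}{1 + e^{-\text{log-odds}}}\]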

Logistic Regression in sklearn

from sklearn.linear_model import LogisticRegression

pipeline = make_pipeline(
  LogisticRegression(penalty = None)  # penalty = None: no regularization
  )

pipeline.fit(X = df[["I0", "PA500"]], 
             y = df['Cancerous']
             );

Logistic Regression in sklearn

pred_cancer = pipeline.predict(df[["I0", "PA500"]])
pred_cancer
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1,
       1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0,
       1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

Precision and Recall Revisited

Confusion Matrix

from sklearn.metrics import confusion_matrix

pd.DataFrame(
  confusion_matrix(df['Cancerous'], pred_cancer), 
  columns = pipeline.classes_, 
  index = pipeline.classes_)
    0   1
0  38  14
1   3  51
  • Calculate the precision for predicting cancer.

  • Calculate the recall for predicting cancer.

  • Calculate the precision for predicting non-cancer.

  • Calculate the recall for predicting non-cancer.
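
To check your answers in code, here is a minimal sketch using the matrix above (rows are the true classes, columns are the predictions):

conf_mat = confusion_matrix(df['Cancerous'], pred_cancer)

# Precision for cancer: of everything predicted 1, how much is truly 1?
precision_cancer = conf_mat[1, 1] / conf_mat[:, 1].sum()

# Recall for cancer: of everything truly 1, how much was predicted 1?
recall_cancer = conf_mat[1, 1] / conf_mat[1, :].sum()

# For non-cancer, the 0 entries play the same roles.
precision_noncancer = conf_mat[0, 0] / conf_mat[:, 0].sum()
recall_noncancer = conf_mat[0, 0] / conf_mat[0, :].sum()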

Threshold

What if we had used different cutoffs besides \(p > 0.5\)?

# Each row: [P(not cancerous), P(cancerous)]
prob_cancer = pipeline.predict_proba(df[["I0", "PA500"]])
prob_cancer.round(2)[1:10]
array([[0.11, 0.89],
       [0.19, 0.81],
       [0.11, 0.89],
       [0.16, 0.84],
       [0.27, 0.73],
       [0.22, 0.78],
       [0.2 , 0.8 ],
       [0.18, 0.82],
       [0.15, 0.85]])

Higher Threshold

What if we had used \(p > 0.7\)?

prob_cancer = pipeline.predict_proba(df[["I0", "PA500"]])

pred_cancer_70 = prob_cancer[:, 1] > .7
pred_cancer_70[1:10]
array([ True,  True,  True,  True,  True,  True,  True,  True,  True])

Higher Threshold

What if we had used \(p > 0.7\)?

conf_mat = confusion_matrix(df['Cancerous'], pred_cancer_70)
pd.DataFrame(conf_mat, 
             columns = pipeline.classes_, 
             index = pipeline.classes_)
    0   1
0  41  11
1  18  36
precision_1 = conf_mat[1,1] / conf_mat[:,1].sum()
precision_1
np.float64(0.7659574468085106)
recall_1 = conf_mat[1,1] / conf_mat[1, :].sum()
recall_1
np.float64(0.6666666666666666)

Lower Threshold

What if we had used \(p > 0.2\)?

prob_cancer = pipeline.predict_proba(df[["I0", "PA500"]])
pred_cancer_20 = prob_cancer[:,1] > .2
pred_cancer_20[1:10]
array([ True,  True,  True,  True,  True,  True,  True,  True,  True])

Lower Threshold

conf_mat = confusion_matrix(df['Cancerous'], pred_cancer_20)
pd.DataFrame(conf_mat, 
             columns = pipeline.classes_, 
             index = pipeline.classes_)
    0   1
0  33  19
1   0  54
precision_1 = conf_mat[1,1] / conf_mat[:,1].sum()
precision_1
np.float64(0.7397260273972602)
recall_1 = conf_mat[1,1] / conf_mat[1, :].sum()
recall_1
np.float64(1.0)

Precision-Recall Curve

from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(
    df['Cancerous'], prob_cancer[:, 1])

df_pr = pd.DataFrame({
  "precision": precision,
  "recall": recall
})
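
precision_recall_curve also returns the thresholds themselves. One possible use, sketched below with an arbitrary recall target of 0.95, is to find the highest cutoff that still achieves the recall you need:

import numpy as np

# precision and recall have one more entry than thresholds, hence [:-1].
keeps_recall = recall[:-1] >= 0.95
thresholds[keeps_recall].max()  # highest cutoff with recall of at least 0.95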

Precision-Recall Curve

(
  ggplot(data = df_pr, 
         mapping = aes(x = "recall", y = "precision")) +
  geom_line() + 
  theme_bw() + 
  labs(x = "Recall", 
       y = "Precision")
)

Your turn

Activity

Suppose you want to predict Cancer vs. No Cancer from breast tissue using a Logistic Regression. Should you use…

  • Just I0 and PA500?

  • Just DA and P?

  • I0, PA500, DA, and P?

  • or all predictors?

Use cross-validation (cross_val_score()) with 10 folds and the Macro F1 Score (scoring = "f1_macro") to decide!

Then, fit your final model and report the confusion matrix.
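
A starter sketch for this activity (the dictionary below is one possible scaffold, and the “all predictors” option assumes we drop the non-predictor columns):

from sklearn.model_selection import cross_val_score

feature_sets = {
  "I0 + PA500": df[["I0", "PA500"]],
  "DA + P": df[["DA", "P"]],
  "I0 + PA500 + DA + P": df[["I0", "PA500", "DA", "P"]],
  "all predictors": df.drop(columns = ["Case #", "Class", "Cancerous"]),
}

# Compare feature sets by 10-fold cross-validated Macro F1.
for name, X in feature_sets.items():
  pipe = make_pipeline(LogisticRegression(penalty = None))
  scores = cross_val_score(pipe, X, df['Cancerous'],
                           cv = 10, scoring = "f1_macro")
  print(name, scores.mean().round(3))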

Interpreting Logistic Regression

Looking at Coefficients

pd.DataFrame({
  "Coefficients": pipeline['logisticregression'].coef_[0],
  "Column": ["I0", "PA500"]
  })
   Coefficients Column
0     -0.003078     I0
1     11.737963  PA500
  • “For every unit of I0 higher, we predict 0.003 lower log-odds of cancer.”

  • “For every unit of PA500 higher, we predict 11.74 higher log-odds of cancer.”
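
A quick numeric check of this interpretation (a sketch; the two rows are arbitrary points that differ by exactly 1 unit of PA500, with I0 held fixed):

# decision_function() returns the predicted log-odds for each row.
check = pd.DataFrame({"I0": [400.0, 400.0], "PA500": [0.15, 1.15]})
log_odds = pipeline.decision_function(check)
log_odds[1] - log_odds[0]  # equals the PA500 coefficient, about 11.74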

Feature Importance

  • Does this mean that PA500 is more important than I0?
(
  ggplot(data = df, 
         mapping = aes(x = "PA500", 
                       group = "Cancerous", 
                       fill = "Cancerous")) + 
  geom_density(alpha = 0.5, show_legend = False) +
  theme_bw()
) 

(
  ggplot(data = df, 
         mapping = aes(x = "I0", 
                       group = "Cancerous", 
                       fill = "Cancerous")) + 
  geom_density(alpha = 0.5, show_legend = False) + 
  theme_bw()
)

Standardization

  • Does this mean that PA500 is more important than I0?

  • Not necessarily. They have different units and so the coefficients mean different things.

  • “For every 1000 units of I0 higher, we predict 3.08 lower log-odds of cancer.”

  • “For every 0.1 unit of PA500 higher, we predict 1.17 higher log-odds of cancer.”

  • What if we had standardized I0 and PA500?

Standardization

from sklearn.preprocessing import StandardScaler

pipeline2 = make_pipeline(
  StandardScaler(),
  LogisticRegression(penalty = None)
  )

pipeline2 = pipeline2.fit(df[["I0", "PA500"]], df['Cancerous'])

pd.DataFrame({
  "Coefficients": pipeline2['logisticregression'].coef_[0],
  "Column": ["I0", "PA500"]
  })
   Coefficients Column
0     -2.309090     I0
1      0.801477  PA500

Standardization

   Coefficients Column
0     -2.309090     I0
1      0.801477  PA500
  • “For every standard deviation above the mean someone’s I0 is, we predict 2.3 lower log-odds of cancer.”
  • “For every standard deviation above the mean someone’s PA500 is, we predict 0.80 higher log-odds of cancer.”

Standardization: Do you need it?

But does this approach change our predictions?

old_probs = pipeline.predict_proba(df[["I0", "PA500"]])
new_probs = pipeline2.predict_proba(df[["I0", "PA500"]])

pd.DataFrame({
  "without_stdize": old_probs[:,1], 
  "with_stdize": new_probs[:,1]
  }).head(10)
   without_stdize  with_stdize
0        0.736188     0.736226
1        0.889803     0.889822
2        0.813277     0.813319
3        0.890781     0.890804
4        0.842957     0.842978
5        0.731660     0.731678
6        0.775454     0.775463
7        0.802116     0.802126
8        0.816982     0.817014
9        0.847675     0.847702

Standardization: Do you need it?

  • Standardizing will not change the predictions for Linear or Logistic Regression!

    • This is because the coefficients are chosen relative to the units of the predictors. (Unlike in KNN!)
  • Advantage of not standardizing: More interpretable coefficients

    • “For each unit of…” instead of “For each sd above the mean…”
  • Advantage of standardizing: Compare relative importance of predictors

  • It’s up to you!

    • Don’t use cross-validation to decide - you’ll get the same metrics for both!

Your turn

Activity

For your Logistic Regression using all predictors, which variable was the most important?

How would you interpret the coefficient?

Takeaways

Takeaways

  • To fit a regression model (i.e., coefficients times predictors) to a categorical response, we use Logistic Regression.

  • Coefficients are interpreted as “One unit increase in predictor is associated with a [something] increase in the log-odds of Category 1.”

  • We still use cross-validated metrics to decide between KNN and Logistic Regression, and between different feature sets.

  • We still report confusion matrices and sometimes precision-recall curves of our final model.