Logistic Regression

The story this week…

Classification

  • We can do KNN for Classification by letting the nearest neighbors “vote”.

  • The proportion of votes can be interpreted as a “probability”.

  • A classification model must be evaluated differently than a regression model.

  • One possible metric is accuracy, but this is a bad choice in situations with imbalanced data.

  • Precision measures “if we say it’s in Class A, is it really?”

  • Recall measures “if it’s really in Class A, did we find it?”

  • F1 Score is a balance of precision and recall.

  • Macro F1 Score averages the F1 scores of all classes.
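
As a quick reminder of how these metrics look in code, here is a sketch with made-up labels (the sklearn.metrics functions below compute each metric directly):

from sklearn.metrics import precision_score, recall_score, f1_score

# Made-up true and predicted labels, just to illustrate the metrics.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]

precision_score(y_true, y_pred)              # of predicted 1s, how many are truly 1?
recall_score(y_true, y_pred)                 # of true 1s, how many did we find?
f1_score(y_true, y_pred)                     # harmonic mean of precision and recall
f1_score(y_true, y_pred, average = "macro")  # average of the F1 scores of both classes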

Revisiting the Breast Cancer Data

Breast Tissue Classification

Electrical signals can be used to detect whether tissue is cancerous.

[Figure: electrical impedance scanning for breast cancer detection. A scan probe on the breast, connected to a computer display, detects the difference between normal adipose tissue (high impedance) and malignant lesions (low impedance).]

Analysis Goal

The goal is to determine whether a sample of breast tissue is:

Not Cancerous

  1. connective tissue
  2. adipose tissue
  3. glandular tissue

Cancerous

  1. carcinoma
  2. fibro-adenoma
  3. mastopathy

Binary response: Cancer or Not

Let’s read the data, and also make a new variable called “Cancerous”.

import pandas as pd

df = pd.read_csv("https://datasci112.stanford.edu/data/BreastTissue.csv")

# "car", "fad", "mas" = carcinoma, fibro-adenoma, mastopathy (the cancerous classes)
cancer_levels = ["car", "fad", "mas"]
df['Cancerous'] = df['Class'].isin(cancer_levels)
   Case # Class          I0  ...          DR           P  Cancerous
0       1   car  524.794072  ...  220.737212  556.828334       True
1       2   car  330.000000  ...   99.084964  400.225776       True
2       3   car  551.879287  ...  253.785300  656.769449       True
3       4   car  380.000000  ...  105.198568  493.701814       True
4       5   car  362.831266  ...  103.866552  424.796503       True

[5 rows x 12 columns]

Why not use “regular” regression?

You should NOT use ordinary regression for a classification problem! This section exists to show you why it does not work.

Counter-Example: Linear Regression

We know that in computers, True = 1 and False = 0. So, why not convert our response variable, Cancerous, to numbers and fit a regression?

df['Cancerous'] = df['Cancerous'].astype('int')
df.head()
   Case # Class          I0  ...          DR           P  Cancerous
0       1   car  524.794072  ...  220.737212  556.828334          1
1       2   car  330.000000  ...   99.084964  400.225776          1
2       3   car  551.879287  ...  253.785300  656.769449          1
3       4   car  380.000000  ...  105.198568  493.701814          1
4       5   car  362.831266  ...  103.866552  424.796503          1

[5 rows x 12 columns]

Counter-Example: Linear Regression

from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(
  LinearRegression()
  )

pipeline = pipeline.fit(X = df[["I0", "PA500"]],
                        y = df['Cancerous'])

Counter-Example: Linear Regression

Problem 1: Did we get “reasonable” predictions?

pred_cancer = pipeline.predict(df[["I0", "PA500"]])


pred_cancer.min()
np.float64(-0.2666656752041747)
pred_cancer.max()
np.float64(1.144397886160058)

Counter-Example: Linear Regression

Problem 2: How do we translate these predictions into categories???

pred_cancer = pipeline.predict(df[["I0", "PA500"]])
pred_cancer
array([ 0.74019045,  0.89022436,  0.82501565,  0.90210945,  0.82407215,
        0.70880368,  0.73084215,  0.75641256,  0.81285854,  0.84265773,
        1.05274916,  0.83203994,  0.82221736,  0.98879611,  0.84321428,
        1.14439789,  0.87065951,  0.82357191,  0.88721004,  0.86017708,
        0.75973787,  0.56958917,  0.50364921,  0.84044159,  0.66291026,
        0.5291845 ,  0.58770322,  0.58565156,  0.53944124,  0.54086431,
        0.626959  ,  0.60392716,  0.84459104,  0.65514535,  0.83767168,
        0.73230038,  0.82568408,  0.75333313,  0.54012943,  0.54183212,
        0.50716066,  0.70881584,  0.52734464,  0.66456375,  0.73629418,
        0.85237593,  0.84451578,  0.62061155,  0.60509501,  0.47440789,
        0.78797208,  0.74240828,  0.91841366,  0.71160435,  0.63338505,
        0.7122256 ,  0.82410488,  0.55742465,  0.62545421,  0.73912902,
        0.73912902,  0.70136245,  0.78113432,  0.77907358,  0.51597274,
        0.49247652,  0.65694204,  0.72574607,  0.76292245,  0.82501906,
        0.04266066,  0.24283615,  0.30785716,  0.52977377,  0.12835422,
        0.35012437,  0.39042852,  0.34447888,  0.10874534,  0.22844426,
        0.35125209,  0.25457208,  0.12820727,  0.18275452, -0.06826015,
       -0.02235371,  0.05662524, -0.02161184,  0.03275109, -0.03722978,
        0.05434445, -0.11400134,  0.0985149 , -0.01465065,  0.05067869,
        0.04211471,  0.05774736, -0.26666568, -0.13914905, -0.12585178,
       -0.02264956,  0.06060737,  0.04839979,  0.12583274, -0.17299233,
       -0.22474136])
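
One ad hoc fix is to threshold the predictions, as in the sketch below (the 0.5 cutoff is an arbitrary choice). But this patch does nothing about the problems that follow.

# Convert the numeric predictions to 0/1 by thresholding at 0.5.
pred_category = (pred_cancer > 0.5).astype(int)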

Counter-Example: Linear Regression

Problem 3: Was the relationship really linear???

from plotnine import *

(
  ggplot(data = df, 
         mapping = aes(x = "I0", y = "Cancerous")) + 
  geom_point() + 
  geom_smooth(method = "lm", se = False) + 
  theme_bw()
)

Counter-Example: Linear Regression

Problem 4: Are the errors really random???

residuals = df['Cancerous'] - pred_cancer

(
  ggplot(data = df, 
         mapping = aes(x = "I0", y = residuals)) + 
  geom_point() +
  theme_bw() +
  labs(y = "Linear Regression Residuals")
  ) 

Counter-Example: Linear Regression

Problem 5: Are the errors normally distributed???

(
  ggplot(data = df, 
         mapping = aes(x = residuals)) + 
  geom_density() +
  theme_bw() +
  labs(x = "Residual from Linear Regression Model")
  )

Logistic Regression

Logistic Regression

Idea: Instead of predicting 0 or 1, try to predict the probability of cancer.

  • Problem: We don’t observe probabilities before diagnosis; we only know if that person ended up with cancer or not.

  • Solution: (Fancy statistics and math.)

  • Why is it called Logistic Regression?

  • Because the “fancy math” uses a logistic function in it.
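
A glimpse of that logistic function (a sketch; the input values below are arbitrary): it squashes any real number, such as a linear combination of the predictors, into the interval (0, 1), so the output can be read as a probability.

import numpy as np

def logistic(z):
    # The logistic (sigmoid) function maps any real number into (0, 1).
    return 1 / (1 + np.exp(-z))

# A linear combination of predictors can be any real number...
z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
# ...but its logistic transform is always a valid probability.
logistic(z).round(3)  # approximately [0.018, 0.269, 0.5, 0.731, 0.982]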

Logistic Regression

What you need to know:

  • It’s used for binary classification problems.

  • The predicted values are the “log-odds” of having cancer, i.e.

\[\text{log-odds} = \log \left(\frac{p}{1-p}\right)\]

  • We are more interested in the predicted probabilities.

  • As with KNN, we predict categories by choosing a threshold.

  • By default, if \(p > 0.5\), we predict cancer.
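
Here \(p\) can be recovered from the log-odds by inverting the formula above:

\[p = \frac{e^{\text{log-odds}}}{1 + e^{\text{log-odds}}} = \frac{1}{1 + e^{-\text{log-odds}}}\]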

Logistic Regression in sklearn

from sklearn.linear_model import LogisticRegression

pipeline = make_pipeline(
  LogisticRegression(penalty = None)  # penalty = None: no regularization
  )

pipeline.fit(X = df[["I0", "PA500"]], 
             y = df['Cancerous']
             );

Logistic Regression in sklearn

pred_cancer = pipeline.predict(df[["I0", "PA500"]])
pred_cancer
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1,
       1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0,
       1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

Precision and Recall Revisited

Confusion Matrix

from sklearn.metrics import confusion_matrix

pd.DataFrame(
  confusion_matrix(df['Cancerous'], pred_cancer), 
  columns = pipeline.classes_, 
  index = pipeline.classes_)
    0   1
0  38  14
1   3  51
  • Calculate the precision for predicting cancer.

  • Calculate the recall for predicting cancer.

  • Calculate the precision for predicting non-cancer.

  • Calculate the recall for predicting non-cancer.
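
To check your answers in code, here is a minimal sketch using the matrix above (rows are the true classes, columns are the predictions):

conf_mat = confusion_matrix(df['Cancerous'], pred_cancer)

# Precision for cancer: of everything predicted 1, how much is truly 1?
precision_cancer = conf_mat[1, 1] / conf_mat[:, 1].sum()

# Recall for cancer: of everything truly 1, how much was predicted 1?
recall_cancer = conf_mat[1, 1] / conf_mat[1, :].sum()

# For non-cancer, the 0 entries play the same roles.
precision_noncancer = conf_mat[0, 0] / conf_mat[:, 0].sum()
recall_noncancer = conf_mat[0, 0] / conf_mat[0, :].sum()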

Threshold

What if we had used different cutoffs besides \(p > 0.5\)?

# Each row: [P(not cancerous), P(cancerous)]
prob_cancer = pipeline.predict_proba(df[["I0", "PA500"]])
prob_cancer.round(2)[1:10]
array([[0.11, 0.89],
       [0.19, 0.81],
       [0.11, 0.89],
       [0.16, 0.84],
       [0.27, 0.73],
       [0.22, 0.78],
       [0.2 , 0.8 ],
       [0.18, 0.82],
       [0.15, 0.85]])

Higher Threshold

What if we had used \(p > 0.7\)?

prob_cancer = pipeline.predict_proba(df[["I0", "PA500"]])

pred_cancer_70 = prob_cancer[:, 1] > .7
pred_cancer_70[1:10]
array([ True,  True,  True,  True,  True,  True,  True,  True,  True])

Higher Threshold

What if we had used \(p > 0.7\)?

conf_mat = confusion_matrix(df['Cancerous'], pred_cancer_70)
pd.DataFrame(conf_mat, 
             columns = pipeline.classes_, 
             index = pipeline.classes_)
    0   1
0  41  11
1  18  36
precision_1 = conf_mat[1,1] / conf_mat[:,1].sum()
precision_1
np.float64(0.7659574468085106)
recall_1 = conf_mat[1,1] / conf_mat[1, :].sum()
recall_1
np.float64(0.6666666666666666)

Lower Threshold

What if we had used \(p > 0.2\)?

prob_cancer = pipeline.predict_proba(df[["I0", "PA500"]])
pred_cancer_20 = prob_cancer[:,1] > .2
pred_cancer_20[1:10]
array([ True,  True,  True,  True,  True,  True,  True,  True,  True])

Lower Threshold

conf_mat = confusion_matrix(df['Cancerous'], pred_cancer_20)
pd.DataFrame(conf_mat, 
             columns = pipeline.classes_, 
             index = pipeline.classes_)
    0   1
0  33  19
1   0  54
precision_1 = conf_mat[1,1] / conf_mat[:,1].sum()
precision_1
np.float64(0.7397260273972602)
recall_1 = conf_mat[1,1] / conf_mat[1, :].sum()
recall_1
np.float64(1.0)

Precision-Recall Curve

from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(
    df['Cancerous'], prob_cancer[:, 1])

df_pr = pd.DataFrame({
  "precision": precision,
  "recall": recall
})
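
precision_recall_curve also returns the thresholds themselves. One possible use, sketched below with an arbitrary recall target of 0.95, is to find the highest cutoff that still achieves the recall you need:

import numpy as np

# precision and recall have one more entry than thresholds, hence [:-1].
keeps_recall = recall[:-1] >= 0.95
thresholds[keeps_recall].max()  # highest cutoff with recall of at least 0.95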

Precision-Recall Curve

(
  ggplot(data = df_pr, 
         mapping = aes(x = "recall", y = "precision")) +
  geom_line() + 
  theme_bw() + 
  labs(x = "Recall", 
       y = "Precision")
)

Your turn

Activity

Suppose you want to predict Cancer vs. No Cancer from breast tissue using a Logistic Regression. Should you use…

  • Just I0 and PA500?

  • Just DA and P?

  • I0, PA500, DA, and P?

  • or all predictors?

Use cross-validation (cross_val_score()) with 10 folds and the Macro F1 Score (scoring = "f1_macro") to decide!

Then, fit your final model and report the confusion matrix.
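
A starter sketch for this activity (the dictionary below is one possible scaffold, and the “all predictors” option assumes we drop the non-predictor columns):

from sklearn.model_selection import cross_val_score

feature_sets = {
  "I0 + PA500": df[["I0", "PA500"]],
  "DA + P": df[["DA", "P"]],
  "I0 + PA500 + DA + P": df[["I0", "PA500", "DA", "P"]],
  "all predictors": df.drop(columns = ["Case #", "Class", "Cancerous"]),
}

# Compare feature sets by 10-fold cross-validated Macro F1.
for name, X in feature_sets.items():
  pipe = make_pipeline(LogisticRegression(penalty = None))
  scores = cross_val_score(pipe, X, df['Cancerous'],
                           cv = 10, scoring = "f1_macro")
  print(name, scores.mean().round(3))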

Interpreting Logistic Regression

Looking at Coefficients

pd.DataFrame({
  "Coefficients": pipeline['logisticregression'].coef_[0],
  "Column": ["I0", "PA500"]
  })
   Coefficients Column
0     -0.003078     I0
1     11.737963  PA500
  • “For every unit of I0 higher, we predict 0.003 lower log-odds of cancer.”

  • “For every unit of PA500 higher, we predict 11.74 higher log-odds of cancer.”
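
A quick numeric check of this interpretation (a sketch; the two rows are arbitrary points that differ by exactly 1 unit of PA500, with I0 held fixed):

# decision_function() returns the predicted log-odds for each row.
check = pd.DataFrame({"I0": [400.0, 400.0], "PA500": [0.15, 1.15]})
log_odds = pipeline.decision_function(check)
log_odds[1] - log_odds[0]  # equals the PA500 coefficient, about 11.74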

Feature Importance

  • Does this mean that PA500 is more important than I0?
(
  ggplot(data = df, 
         mapping = aes(x = "PA500", 
                       group = "Cancerous", 
                       fill = "Cancerous")) + 
  geom_density(alpha = 0.5, show_legend = False) +
  theme_bw()
) 

(
  ggplot(data = df, 
         mapping = aes(x = "I0", 
                       group = "Cancerous", 
                       fill = "Cancerous")) + 
  geom_density(alpha = 0.5, show_legend = False) + 
  theme_bw()
)

Standardization

  • Does this mean that PA500 is more important than I0?

  • Not necessarily. They have different units and so the coefficients mean different things.

  • “For every 1000 units of I0 higher, we predict 3.08 lower log-odds of cancer.”

  • “For every 0.1 unit of PA500 higher, we predict 1.17 higher log-odds of cancer.”

  • What if we had standardized I0 and PA500?

Standardization

from sklearn.preprocessing import StandardScaler

pipeline2 = make_pipeline(
  StandardScaler(),
  LogisticRegression(penalty = None)
  )

pipeline2 = pipeline2.fit(df[["I0", "PA500"]], df['Cancerous'])

pd.DataFrame({
  "Coefficients": pipeline2['logisticregression'].coef_[0],
  "Column": ["I0", "PA500"]
  })
   Coefficients Column
0     -2.309090     I0
1      0.801477  PA500

Standardization

   Coefficients Column
0     -2.309090     I0
1      0.801477  PA500
  • “For every standard deviation above the mean someone’s I0 is, we predict 2.3 lower log-odds of cancer.”
  • “For every standard deviation above the mean someone’s PA500 is, we predict 0.80 higher log-odds of cancer.”

Standardization: Do you need it?

But does this approach change our predictions?

old_probs = pipeline.predict_proba(df[["I0", "PA500"]])
new_probs = pipeline2.predict_proba(df[["I0", "PA500"]])

pd.DataFrame({
  "without_stdize": old_probs[:,1], 
  "with_stdize": new_probs[:,1]
  }).head(10)
   without_stdize  with_stdize
0        0.736188     0.736226
1        0.889803     0.889822
2        0.813277     0.813319
3        0.890781     0.890804
4        0.842957     0.842978
5        0.731660     0.731678
6        0.775454     0.775463
7        0.802116     0.802126
8        0.816982     0.817014
9        0.847675     0.847702

Standardization: Do you need it?

  • Standardizing will not change the predictions for Linear or Logistic Regression!

    • This is because the coefficients are chosen relative to the units of the predictors. (Unlike in KNN!)
  • Advantage of not standardizing: More interpretable coefficients

    • “For each unit of…” instead of “For each sd above the mean…”
  • Advantage of standardizing: Compare relative importance of predictors

  • It’s up to you!

    • Don’t use cross-validation to decide - you’ll get the same metrics for both!

Your turn

Activity

For your Logistic Regression using all predictors, which variable was the most important?

How would you interpret the coefficient?

Takeaways

Takeaways

  • To fit a regression model (i.e., coefficients times predictors) to a categorical response, we use Logistic Regression.

  • Coefficients are interpreted as “One unit increase in predictor is associated with a [something] increase in the log-odds of Category 1.”

  • We still use cross-validated metrics to decide between KNN and Logistic Regression, and between different feature sets.

  • We still report confusion matrices and sometimes precision-recall curves of our final model.