import pandas as pd

df = pd.read_csv("https://dlsun.github.io/pods/data/bordeaux.csv")
df

# Wines with known prices (vintages before 1981) vs. unknown prices
known = df["year"] < 1981
df_known = df[known]

unknown = df["year"] > 1980
df_unknown = df[unknown]
It will take place on Saturday, March 15, from 1:10pm to 2:30pm, in our classroom!
Fill out this Google Form and Dr. T will help you!
We assume some process \(f\) is generating our target variable:
\[\text{target} = f(\text{predictors}) + \text{noise}\]
Our goal is to come up with an approximation of \(f\).
We don’t need to know how well our model does on training data.
We want to know how well it will do on test data.
In general, test error \(>\) training error.
Analogy: A professor posts a practice exam before an exam.
If the actual exam is the same as the practice exam, how many points will students miss? That’s training error.
If the actual exam is different from the practice exam, how many points will students miss? That’s test error.
It’s always easier to answer questions that you’ve seen before than questions you haven’t seen.
For each model proposed:
Establish a pipeline with transformers and a model.
Fit the pipeline on the training data (with known outcome)
Predict with the fitted pipeline on test data (with known outcome)
Evaluate our success; i.e., measure noise “left over”
Then:
Select the best model
Fit on all the data
Predict on any future data (with unknown outcome)
We assume that the target (\(Y\)) is generated from an equation of the predictor (\(X\)), plus random noise (\(\epsilon\))
\[Y = \beta_0 + \beta_1 X + \epsilon\]
Goal: Use observations \((x_1, y_1), ..., (x_n, y_n)\) to estimate \(\beta_0\) and \(\beta_1\).
What are these parameters???
In Statistics, we use \(\beta_0\) to represent the population intercept and \(\beta_1\) to represent the slope. By “population” we mean the true slope of the line for every observation in the population of interest.
What is the “best” choice of \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\) (the estimates of \(\beta_0\) and \(\beta_1\))?
The ones that are statistically most justified, under certain assumptions about \(Y\) and \(X\)?
The ones that are “closest to” the observed points?
Consider four possible regression equations:
\[\text{price} = 25 + 0*\text{age}\]
\[\text{price} = 0 + 10*\text{age}\]
\[\text{price} = 20 + 1*\text{age}\]
\[\text{price} = -40 + 3*\text{age}\]
Which one do you think will be “closest” to the points on the scatterplot?
# plotnine for the plot; currency_format comes from mizani
# (the mizani import path may differ across versions)
from plotnine import ggplot, aes, geom_point, geom_abline, labs, scale_y_continuous
from mizani.formatters import currency_format

(
ggplot(data = df_known, mapping = aes(x = "age", y = "price")) +
geom_point() +
labs(x = "Age of Wine (Years Since 1992)",
y = "Price of Wine (in 1992 USD)") +
scale_y_continuous(labels = currency_format(precision = 0)) +
geom_abline(intercept = 25,
slope = 0,
color = "blue",
linetype = "solid") +
geom_abline(intercept = 0,
slope = 1,
color = "orange",
linetype = "dashed") +
geom_abline(intercept = 20,
slope = 1,
color = "green") +
geom_abline(intercept = -40,
slope = 3,
color = "magenta")
)
It’s clear that some of these lines are better than others.
How to choose the best? Math!
We’ll let the computer do it for us.
Caution
The estimated slope and intercept are calculated from the training data at the .fit() step.
Linear regression in sklearn
Specify
Fit
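A minimal sketch of the specify and fit steps (assuming price is regressed on age from df_known; the variable names are ours):

from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Specify: a pipeline whose only step is a linear regression
pipeline = make_pipeline(LinearRegression())

# Fit: estimate the intercept and slope from the training data
pipeline.fit(X=df_known[["age"]], y=df_known["price"])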
Pipeline(steps=[('linearregression', LinearRegression())])
To predict from a linear regression, we just plug in the values to the equation…
.predict()
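For example, a minimal sketch (assuming the fitted pipeline above and the df_unknown wines from earlier):

# Plug the ages of the unknown-price wines into the fitted equation
pipeline.predict(X=df_unknown[["age"]])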
Q: Is there only one “best” regression line?
A: No, you can justify many choices of slope and intercept! But there is a generally accepted approach called Least Squares Regression that we will always use.
Q: How do you know which variables to include in the equation?
A: Try them all, and see what predicts best.
Q: How do you know whether to use linear regression or KNN to predict?
A: Try them both, and see what predicts best!
We saw that a “fair” way to evaluate models was to randomly split into training and test sets.
But what if this randomness was misleading? (e.g., a major outlier in one of the sets)
What do we usually do in Statistics to address randomness? Take many samples and compute an average!
A resampling method takes many random test / training splits and averages the resulting metrics.
Import all our functions:
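A minimal sketch of the imports used below (the exact set is an assumption based on the functions that appear in this section):

import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler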
Use train_test_split() to create X_train, X_test, y_train, and y_test (from df_known); a sketch of all of these steps follows below.
Specify
Fit for Training Data
Predict for Test Data
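A minimal sketch of these steps for a linear regression of price on age (using the imports above; the test size and random seed are assumptions):

# Split the known-price wines into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    df_known[["age"]], df_known["price"], test_size=0.25, random_state=42
)

# Specify
pipeline = make_pipeline(LinearRegression())

# Fit for Training Data
pipeline.fit(X_train, y_train)

# Predict for Test Data
y_pred = pipeline.predict(X_test)

# Evaluate: root mean squared error on the test set
rmse = mean_squared_error(y_test, y_pred) ** 0.5
rmse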
It makes sense to do test / training many times…
But! Remember the original reason for test / training: we don’t want to use the same data in fitting and evaluation.
Idea: Let’s make sure that each observation only gets to be in the test set once
sklearn
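A minimal sketch with cross_val_score (using the imports above; the predictor column is an assumption, and cv=10 matches the ten scores discussed below):

pipeline = make_pipeline(LinearRegression())

# 10-fold cross-validation: each observation lands in the test set exactly once
scores = cross_val_score(
    pipeline,
    X=df_known[["age"]],
    y=df_known["price"],
    cv=10,
    scoring="neg_root_mean_squared_error",
)
scores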
What are these numbers?
sklearn chooses a default metric for you based on the model.
In the case of regression, the default metric is R-Squared.
Why use negative root mean squared error?
To be consistent! We will always want to maximize this score.
Larger R-Squared values explain more of the variance in the response (y).
A larger (less negative) negative RMSE means a smaller RMSE, i.e., less “leftover” variance in y.
Since we have 10 different values, what would you expect us to do?
sklearn
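For example, averaging the ten scores from the sketch above:

# One summary number: the mean cross-validated score
scores.mean()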
What if you want MSE?
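A minimal sketch, swapping in the built-in scoring string for (negative) MSE:

scores_mse = cross_val_score(
    pipeline,
    X=df_known[["age"]],
    y=df_known["price"],
    cv=10,
    scoring="neg_mean_squared_error",
)
scores_mse.mean()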
Q: How many cross validations should we do?
A: It doesn’t matter much! Typical choices are 5 or 10.
A: Think about the trade-offs:
Q: What metric should we use?
A: This is also your choice! What captures your idea of a “successful prediction”? MSE / RMSE is a good default, but you might find other options that are better!
Q: I took a statistics class before, and I remember some things like “adjusted R-Squared” or “AIC” for model selection. What about those?
A: Those are Old School, from a time when computers were not powerful enough to do cross-validation. Modern data science uses resampling!
Use cross-validation to choose between Linear Regression and KNN with k = 7 based on "neg_mean_squared_error", for:
Re-run #1, but instead use mean absolute error. (You will need to look in the documentation of cross_val_score() for this!)
GridSearchCV()
In previous classes, we tried many different values of \(k\) for KNN.
We also mentioned using absolute distance (Manhattan) instead of Euclidean distance.
Now, we would like to use cross-validation to decide between these options.
sklearn provides a nice shortcut for this!
GridSearchCV()
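A minimal sketch of the setup that produces the fitted object shown below (the predictor and target columns and the variable names are assumptions; the pipeline, parameter grid, scoring, and cv match the output):

# Preprocess: standardize the five quantitative predictors
ct = make_column_transformer(
    (StandardScaler(), ["summer", "har", "sep", "win", "age"]),
)
pipeline = make_pipeline(ct, KNeighborsRegressor())

# Hyperparameter combinations to try: 2 metrics x 6 values of k
param_grid = {
    "kneighborsregressor__metric": ["euclidean", "manhattan"],
    "kneighborsregressor__n_neighbors": range(1, 7),
}

# Cross-validate every combination and keep the best one
gscv = GridSearchCV(pipeline, param_grid=param_grid,
                    scoring="neg_mean_squared_error", cv=5)
gscv.fit(df_known[["summer", "har", "sep", "win", "age"]], df_known["price"])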
GridSearchCV(cv=5, estimator=Pipeline(steps=[('columntransformer', ColumnTransformer(transformers=[('standardscaler', StandardScaler(), ['summer', 'har', 'sep', 'win', 'age'])])), ('kneighborsregressor', KNeighborsRegressor())]), param_grid={'kneighborsregressor__metric': ['euclidean', 'manhattan'], 'kneighborsregressor__n_neighbors': range(1, 7)}, scoring='neg_mean_squared_error')
Pipeline(steps=[('columntransformer', ColumnTransformer(transformers=[('standardscaler', StandardScaler(), ['summer', 'har', 'sep', 'win', 'age'])])), ('kneighborsregressor', KNeighborsRegressor(metric='manhattan', n_neighbors=4))])
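The table below looks like the full cv_results_ viewed as a DataFrame; a minimal sketch of producing it (variable names follow the sketch above):

# One row per hyperparameter combination, with timing and score columns
results = pd.DataFrame(gscv.cv_results_)
results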
mean_fit_time std_fit_time ... std_test_score rank_test_score
0 0.005399 0.000345 ... 219.338831 12
1 0.005209 0.000147 ... 209.511631 9
2 0.005149 0.000082 ... 178.819362 6
3 0.005243 0.000175 ... 217.989122 11
4 0.005174 0.000088 ... 212.797774 7
5 0.005194 0.000125 ... 205.240907 3
6 0.005265 0.000183 ... 227.978005 10
7 0.005110 0.000065 ... 175.175403 2
8 0.005226 0.000287 ... 174.981203 5
9 0.005146 0.000161 ... 175.186322 1
10 0.005152 0.000091 ... 200.091917 4
11 0.005114 0.000090 ... 232.710455 8
[12 rows x 15 columns]
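And a sketch of keeping just the hyperparameters and their mean test scores, matching the second table (the exact column selection is an assumption):

results[["param_kneighborsregressor__metric",
         "param_kneighborsregressor__n_neighbors",
         "mean_test_score"]]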
param_kneighborsregressor__metric ... mean_test_score
0 euclidean ... -321.833333
1 euclidean ... -292.281667
2 euclidean ... -278.301481
3 euclidean ... -308.950417
4 euclidean ... -279.847733
5 euclidean ... -259.804074
6 manhattan ... -305.793333
7 manhattan ... -254.886667
8 manhattan ... -274.401481
9 manhattan ... -232.973333
10 manhattan ... -267.988800
11 manhattan ... -282.869444
[12 rows x 3 columns]
You have now encountered three types of decisions for finding your best model:
Which predictors should we include, and how should we preprocess them? (Feature selection)
Should we use Linear Regression or KNN or something else? (Model selection)
Which value of \(k\) should we use? (Hyperparameter tuning)
Think of this like a college sports bracket:
Gather all your candidate pipelines (combinations of column transformers and model specifications)
Tune each pipeline with cross-validation (regional championships!)
Determine the best model type for each feature set (state championships!)
Determine the best pipeline (national championships!)