Cross-Validation and Grid Search

More Information About Final Projects!

Final Project Presentations

Will take place on Saturday, March 15 from 1:10pm to 2:30pm, in our classroom!

If you are having trouble finding a group to work with, fill out this Google Form and Dr. T will help you!

The story so far…

Modeling

We assume some process \(f\) is generating our target variable:

\[\text{target} = f(\text{predictors}) + \text{noise}\]


Our goal is to come up with an approximation of \(f\).
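A toy simulation of this setup (the \(f\) and noise level here are invented for illustration; in practice \(f\) is unknown):

import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 30, size = 100)         # predictor
f = lambda t: 3 + 2 * t                    # the generating process (unknown in practice)
y = f(x) + rng.normal(0, 5, size = 100)    # target = f(predictors) + noise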

Test Error vs Training Error

  • We don’t really care how well our model does on training data.

  • We want to know how well it will do on test data.

  • In general, test error \(>\) training error.

Analogy: A professor posts a practice exam before an exam.

  • If the actual exam is the same as the practice exam, how many points will students miss? That’s training error.

  • If the actual exam is different from the practice exam, how many points will students miss? That’s test error.

It’s always easier to answer questions that you’ve seen before than questions you haven’t seen.

Modeling Procedure

For each model proposed:

  1. Establish a pipeline with transformers and a model.

  2. Fit the pipeline on the training data (with known outcome)

  3. Predict with the fitted pipeline on test data (with known outcome)

  4. Evaluate our success; i.e., measure noise “left over”

Then:

  1. Select the best model

  2. Fit on all the data

  3. Predict on any future data (with unknown outcome; see the sketch below)
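As a compact sketch of the whole procedure (X_train, X_test, y_train, and y_test are placeholders here; a full worked example follows later in these slides):

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 1. Establish a pipeline with transformers and a model
pipeline = make_pipeline(StandardScaler(), LinearRegression())

# 2. Fit the pipeline on the training data
pipeline.fit(X_train, y_train)

# 3. Predict with the fitted pipeline on test data
y_pred = pipeline.predict(X_test)

# 4. Evaluate: measure the noise "left over"
mean_squared_error(y_test, y_pred)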

Linear Regression

Simple Linear Model

We assume that the target (\(Y\)) is generated from a linear equation of the predictor (\(X\)), plus random noise (\(\epsilon\)):

\[Y = \beta_0 + \beta_1 X + \epsilon\]

Goal: Use observations \((x_1, y_1), ..., (x_n, y_n)\) to estimate \(\beta_0\) and \(\beta_1\).

What are these parameters???

In Statistics, we use \(\beta_0\) to represent the population intercept and \(\beta_1\) to represent the population slope. By “population” we mean the true intercept and slope of the line for the entire population of interest.

Measures of Success

What is the “best” choice of \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\) (the estimates of \(\beta_0\) and \(\beta_1\))?

  • The ones that are statistically most justified, under certain assumptions about \(Y\) and \(X\)?

  • The ones that are “closest to” the observed points?

    • \(|\widehat{y}_i - y_i|\)?
    • \((\widehat{y}_i - y_i)^2\)?
    • \((\widehat{y}_i - y_i)^4\)?
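For example, a miss of size 2 costs 2, 4, or 16 under these three choices, while a miss of size 0.5 costs 0.5, 0.25, or 0.0625: higher powers punish large misses more heavily and forgive small ones.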

Example: Wine Data

import pandas as pd

df = pd.read_csv("https://dlsun.github.io/pods/data/bordeaux.csv")

known = df["year"] < 1981
df_known = df[known]
unknown = df["year"] > 1980
df_unknown = df[unknown]

Price Predicted by Age of Wine

from plotnine import ggplot, aes, geom_point, labs, scale_y_continuous, geom_abline
from mizani.formatters import currency_format

(
  ggplot(df_known, aes(x = "age", y = "price")) + 
  geom_point() +
  labs(x = "Age of Wine (Years Since 1992)", 
       y = "Price of Wine (in 1992 USD)") +
  scale_y_continuous(labels = currency_format(precision = 0))
)

“Candidate” Regression Lines

Consider four possible regression equations:

\[\text{price} = 25 + 0*\text{age}\]
\[\text{price} = 0 + 1*\text{age}\]
\[\text{price} = 20 + 1*\text{age}\]
\[\text{price} = -40 + 3*\text{age}\]

Which one do you think will be “closest” to the points on the scatterplot?
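We can check numerically by computing the mean squared residual of each candidate line on df_known (a quick sketch):

candidates = [(25, 0), (0, 1), (20, 1), (-40, 3)]
for b0, b1 in candidates:
    resid = df_known["price"] - (b0 + b1 * df_known["age"])
    print(f"price = {b0} + {b1}*age:  MSE = {(resid ** 2).mean():.1f}")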

“Candidate” Regression Lines

(
  ggplot(data = df_known, mapping = aes(x = "age", y = "price")) +
  geom_point() + 
  labs(x = "Age of Wine (Years Since 1992)", 
       y = "Price of Wine (in 1992 USD)") +
  scale_y_continuous(labels = currency_format(precision = 0)) +
  geom_abline(intercept = 25, 
              slope = 0, 
              color = "blue", 
              linetype = "solid") + 
  geom_abline(intercept = 0, 
              slope = 1, 
              color = "orange", 
              linetype = "dashed") + 
  geom_abline(intercept = 20, 
              slope = 1, 
              color = "green") + 
  geom_abline(intercept = -40, 
              slope = 3, 
              color = "magenta")
  )

The “best” slope and intercept

  • It’s clear that some of these lines are better than others.

  • How to choose the best? Math!

  • We’ll let the computer do it for us.

Caution

The estimated slope and intercept are calculated from the training data at the .fit() step.

Linear Regression in sklearn

from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

Specify

pipeline = make_pipeline(
    LinearRegression()
    )

Fit

pipeline.fit(
  X = df_known[['age']], 
  y = df_known['price']
  )
Pipeline(steps=[('linearregression', LinearRegression())])

Estimated Intercept and Slope

Estimated Intercept

(
  pipeline
  .named_steps['linearregression']
  .intercept_
  )
np.float64(-0.29971930118565737)

Estimated Slope

(
  pipeline
  .named_steps['linearregression']
  .coef_
  )
array([1.15601827])

Fitting and Predicting

To predict from a linear regression, we just plug in the values to the equation (using the estimates rounded to two decimals)…

-0.3 + 1.16 * df_unknown["age"] 
27    12.46
28    11.30
29    10.14
30     8.98
31     7.82
32     6.66
33     5.50
34     4.34
35     3.18
36     2.02
37     0.86
Name: age, dtype: float64

Fitting and Predicting with .predict()

The .predict() method plugs the values into the equation for us, using the unrounded coefficients (hence the small differences from the rounded calculation above)…

pipeline.predict(df_unknown[['age']])
array([12.41648163, 11.26046336, 10.1044451 ,  8.94842683,  7.79240856,
        6.6363903 ,  5.48037203,  4.32435376,  3.1683355 ,  2.01231723,
        0.85629897])
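Using the unrounded estimates reproduces .predict() exactly:

b0 = pipeline.named_steps['linearregression'].intercept_
b1 = pipeline.named_steps['linearregression'].coef_[0]
b0 + b1 * df_unknown["age"]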

Questions to ask yourself

  • Q: Is there only one “best” regression line?

  • A: No, you can justify many choices of slope and intercept! But there is a generally accepted approach called Least Squares Regression that we will always use.

  • Q: How do you know which variables to include in the equation?

  • A: Try them all, and see what predicts best.

  • Q: How do you know whether to use linear regression or KNN to predict?

  • A: Try them both, and see what predicts best!

Cross-Validation

Resampling methods

  • We saw that a “fair” way to evaluate models was to randomly split into training and test sets.

  • But what if this randomness were misleading? (e.g., a major outlier lands in one of the sets)

  • What do we usually do in Statistics to address randomness? Take many samples and compute an average!

  • A resampling method is when we take many random test / training splits and average the resulting metrics.

Resampling Method Example

Import all our functions:

from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

Creating Testing & Training Splits

X_train, X_test, y_train, y_test = train_test_split(df_known, 
                                                    df_known['price'], 
                                                    test_size = 0.1)
  • This creates four new objects, X_train, X_test, y_train, and y_test.
    • Note that the objects are created in this order!
  • The “testing” data are 10% (test_size = 0.1) of df_known; the remaining 90% become the training data.

Pipeline for Predicting on Test Data

Specify

features = ['summer', 'har', 'sep', 'win', 'age']
ct = make_column_transformer(
  (StandardScaler(), features),
  remainder = "drop"
)

pipeline = make_pipeline(
    ct,
    LinearRegression()
    )

Fit for Training Data

pipeline = pipeline.fit(X = X_train, y = y_train)

Predict for Test Data

pred_y_test = pipeline.predict(X = X_test)

Estimating Error for Test Data

mean_squared_error(y_test, pred_y_test)
61.63802731695058
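A single split depends on which observations happen to land in the test set. A minimal sketch of repeating the split ten times and averaging (reusing the pipeline above):

mses = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(
        df_known, df_known['price'], test_size = 0.1, random_state = seed)
    pipeline.fit(X = X_tr, y = y_tr)
    mses.append(mean_squared_error(y_te, pipeline.predict(X_te)))

sum(mses) / len(mses)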

Cross-Validation

  • It makes sense to do test / training many times…

  • But! Remember the original reason for test / training: we don’t want to use the same data in fitting and evaluation.

Idea: Let’s make sure that each observation only gets to be in the test set once

  • Cross-validation: Divide the data into \(k\) random “folds” (commonly 5 or 10). Each fold gets a “turn” as the test set.

Cross-Validation (5-Fold)

(Diagram: the data divided into five folds; each fold takes one turn as the test set while the other four are used for training.)

Cross-Validation in sklearn

from sklearn.model_selection import cross_val_score

cross_val_score(pipeline, 
                X = df_known, 
                y = df_known['price'], 
                cv = 10)
array([  0.53885509,   0.36267134,   0.6164344 ,  -2.52293886,
         0.75103464,   0.89242533, -49.63018969, -21.17272593,
        -0.16083366,   0.21125165])

Cross-Validation in sklearn

What are these numbers?

  • sklearn chooses a default metric for you based on the model.

  • In the case of regression, the default metric is R-Squared.

Why do error metrics carry a negative sign (e.g., negative (root) mean squared error)?

  • To be consistent! sklearn always maximizes the score, whatever the metric.

  • Larger R-Squared values mean more of the variance in the response (y) is explained.

  • A larger (closer to zero) negative RMSE means a smaller RMSE, i.e., less “leftover” variance in y.

What do we do with these numbers?

from sklearn.model_selection import cross_val_score

cvs = cross_val_score(pipeline, 
                      X = df_known, 
                      y = df_known['price'], 
                      cv = 10)

Since we have 10 different values, what would you expect us to do?

Well, this is a statistics class after all. So, you probably guessed we would take the mean.

cvs.mean()
np.float64(-7.01140156833539)

Cross-Validation in sklearn

What if you want MSE?

cv_scores = cross_val_score(pipeline, 
                            X = df_known, 
                            y = df_known['price'], 
                            cv = 10, 
                            scoring = "neg_mean_squared_error")
cv_scores
array([ -54.51757584, -301.38564331, -521.90492178, -247.38859562,
        -59.30908038,  -14.0803285 , -348.78575122, -138.57953707,
        -74.29335417,   -9.66216728])
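To put these back on the original price scale, flip the sign and take the square root:

import numpy as np

rmse = np.sqrt(-cv_scores)   # undo the sign flip, then take the root
rmse.mean()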

Cross-Validation: FAQ

  • Q: How many folds should we use?

  • A: It doesn’t matter much! Typical choices are 5 or 10.

  • A: Think about the trade-offs as the number of folds grows:

    • larger training sets = more accurate models
    • smaller test sets = more uncertainty in each evaluation
  • Q: What metric should we use?

  • A: This is also your choice! What captures your idea of a “successful prediction”? MSE / RMSE is a good default, but you might find other options that are better!

  • Q: I took a statistics class before, and I remember some things like “adjusted R-Squared” or “AIC” for model selection. What about those?

  • A: Those are Old School, from a time when computers were not powerful enough to do cross-validation. Modern data science uses resampling!

Activity

Your turn

  1. Use cross-validation to choose between Linear Regression and KNN with k = 7 based on "neg_mean_squared_error" (a starter scaffold follows this list), for:

    • Using all predictors.
    • Using just winter rainfall and summer temperature.
    • Using only age.
  2. Re-run #1, but instead use mean absolute error. (You will need to look in the documentation of cross_val_score() for this!)
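One possible scaffold for #1, shown here for the age-only case (this assumes the imports and df_known from earlier; extend the feature list for the other cases):

ct_age = make_column_transformer(
    (StandardScaler(), ['age']),
    remainder = "drop"
)

pipe_lr = make_pipeline(ct_age, LinearRegression())
pipe_knn = make_pipeline(ct_age, KNeighborsRegressor(n_neighbors = 7))

for name, pipe in [("Linear Regression", pipe_lr), ("KNN, k = 7", pipe_knn)]:
    scores = cross_val_score(pipe, X = df_known, y = df_known['price'],
                             cv = 10, scoring = "neg_mean_squared_error")
    print(name, scores.mean())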

Tuning with GridSearchCV()

Tuning

  • In previous classes, we tried many different values of \(k\) for KNN.

  • We also mentioned using absolute (Manhattan) distance instead of Euclidean distance.

  • Now, we would like to use cross-validation to decide between these options.


sklearn provides a nice shortcut for this!

Same column transformer as before, but now with KNN in the pipeline!

features = ['summer', 'har', 'sep', 'win', 'age']

ct = make_column_transformer(
  (StandardScaler(), features),
  remainder = "drop"
)

pipeline = make_pipeline(
    ct,
    KNeighborsRegressor()
    )

Initializing GridSearchCV()

from sklearn.model_selection import GridSearchCV

grid_cv = GridSearchCV(
    pipeline,
    param_grid = {
        "kneighborsregressor__n_neighbors": range(1, 7),
        "kneighborsregressor__metric": ["euclidean", "manhattan"],
    },
    scoring = "neg_mean_squared_error", 
    cv = 5)

Each key in param_grid is the pipeline step name ("kneighborsregressor"), a double underscore, then the name of the parameter to tune.

Fitting GridSearchCV()

grid_cv.fit(df_known, 
            df_known['price'])
GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('columntransformer',
                                        ColumnTransformer(transformers=[('standardscaler',
                                                                         StandardScaler(),
                                                                         ['summer',
                                                                          'har',
                                                                          'sep',
                                                                          'win',
                                                                          'age'])])),
                                       ('kneighborsregressor',
                                        KNeighborsRegressor())]),
             param_grid={'kneighborsregressor__metric': ['euclidean',
                                                         'manhattan'],
                         'kneighborsregressor__n_neighbors': range(1, 7)},
             scoring='neg_mean_squared_error')
  • How many different models were fit with this grid?
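  • Answer: \(6\) values of \(k\) \(\times\) \(2\) metrics \(= 12\) parameter combinations; with cv = 5, that’s \(12 \times 5 = 60\) model fits, plus one final refit of the best combination on all the data.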

Getting the Cross Validation Results

pd.DataFrame(grid_cv.cv_results_)
    mean_fit_time  std_fit_time  ...  std_test_score  rank_test_score
0        0.005399      0.000345  ...      219.338831               12
1        0.005209      0.000147  ...      209.511631                9
2        0.005149      0.000082  ...      178.819362                6
3        0.005243      0.000175  ...      217.989122               11
4        0.005174      0.000088  ...      212.797774                7
5        0.005194      0.000125  ...      205.240907                3
6        0.005265      0.000183  ...      227.978005               10
7        0.005110      0.000065  ...      175.175403                2
8        0.005226      0.000287  ...      174.981203                5
9        0.005146      0.000161  ...      175.186322                1
10       0.005152      0.000091  ...      200.091917                4
11       0.005114      0.000090  ...      232.710455                8

[12 rows x 15 columns]

What about k and the distances?

pd.DataFrame(grid_cv.cv_results_)[['param_kneighborsregressor__metric',
                                   'param_kneighborsregressor__n_neighbors',
                                   'mean_test_score']]
   param_kneighborsregressor__metric  ...  mean_test_score
0                          euclidean  ...      -321.833333
1                          euclidean  ...      -292.281667
2                          euclidean  ...      -278.301481
3                          euclidean  ...      -308.950417
4                          euclidean  ...      -279.847733
5                          euclidean  ...      -259.804074
6                          manhattan  ...      -305.793333
7                          manhattan  ...      -254.886667
8                          manhattan  ...      -274.401481
9                          manhattan  ...      -232.973333
10                         manhattan  ...      -267.988800
11                         manhattan  ...      -282.869444

[12 rows x 3 columns]

What were the best parameters?

grid_cv.best_params_
{'kneighborsregressor__metric': 'manhattan', 'kneighborsregressor__n_neighbors': 4}
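Because refit = True by default, GridSearchCV refits the best pipeline on all of the supplied data, so we can predict for the unknown years directly:

grid_cv.predict(df_unknown)

(Equivalently, grid_cv.best_estimator_.predict(df_unknown).)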

Model Evaluation

You have now encountered three types of decisions for finding your best model:

  1. Which predictors should we include, and how should we preprocess them? (Feature selection)

  2. Should we use Linear Regression or KNN or something else? (Model selection)

  3. Which value of \(k\) should we use? (Hyperparameter tuning)

Model Evaluation

Think of this like a college sports bracket:

  • Gather all your candidate pipelines (combinations of column transformers and model specifications)

  • Tune each pipeline with cross-validation (regional championships!)

  • Determine the best model type for each feature set (state championships!)

  • Determine the best pipeline (national championships!)