Practice Exam 2

DATA 301

Instructions

Your friend’s grandfather, Professor Samuel Oak, is a Professor of Animal Studies at the local university. He has spent years of research cataloging the traits of the creatures known as Pokemon that live in your town.

Professor Oak is interested in performing some modeling tasks on his collected data, but he took an Introductory Data Science class many decades ago and he has forgotten some of the concepts. He asks for your help editing his report.

Each section of the report below is followed by a question from Professor Oak. You do not need to fix his analyses; you only need to answer his questions in words. Be clear and brief, and make sure you explain why your answer is right, not just what Professor Oak should do instead.

There will be no coding errors or typos in this report; only conceptual mistakes that are asked about in the questions.


Report on Pokemon Species in Verdelume City

by Professor Samuel Oak

This report concerns 800 unique species of Pokemon observed in Verdelume City. A snippet of the dataset is below:

                    Name   Type    HP  Attack  Defense  Speed  Legendary
0              Bulbasaur  Grass  45.0    49.0     49.0   45.0        0.0
1                Ivysaur  Grass  60.0    62.0     63.0   60.0        0.0
2               Venusaur  Grass  80.0    82.0     83.0   80.0        0.0
3  VenusaurMega Venusaur  Grass  80.0   100.0    123.0   80.0        0.0
4             Charmander   Fire  39.0    52.0     43.0   65.0        0.0

The following information was collected for each species:

  • Name: the creature’s species name
  • Type (e.g., Grass, Water, Electric…)
  • HP, or “hit points”: the ability of the creature to withstand attacks
  • Attack: the strength of the creature in a fight
  • Defense: the creature’s defensive ability in a fight
  • Speed: the creature’s quickness
  • Legendary: whether the creature is considered legendary or not

These variables are summarized below:

df_pokemon.describe()
               HP      Attack     Defense       Speed  Legendary
count  800.000000  800.000000  800.000000  800.000000  800.00000
mean    69.258750   79.001250   73.842500   68.277500    0.08125
std     25.534669   32.457366   31.183501   29.060474    0.27339
min      1.000000    5.000000    5.000000    5.000000    0.00000
25%     50.000000   55.000000   50.000000   45.000000    0.00000
50%     65.000000   75.000000   70.000000   65.000000    0.00000
75%     80.000000  100.000000   90.000000   90.000000    0.00000
max    255.000000  190.000000  230.000000  180.000000    1.00000
df_pokemon["Legendary"].value_counts()
Legendary
0.0    735
1.0     65
Name: count, dtype: int64
df_pokemon["Type"].value_counts()
Type
Water       112
Normal       98
Grass        70
Bug          69
Psychic      57
Fire         52
Electric     44
Rock         44
Ground       32
Dragon       32
Ghost        32
Dark         31
Poison       28
Fighting     27
Steel        27
Ice          24
Fairy        17
Flying        4
Name: count, dtype: int64

Starter Types

It has been observed by researchers that each Pokemon has a primary “type” related to its home habitat and innate abilities. We would like to understand what traits define these types.

We will first study the four “starter types”: Grass, Water, Fire, and Electric. So, we filter the df_pokemon DataFrame to include only starter types:

starter = ['Grass', 'Water', 'Fire', 'Electric']
is_starter = df_pokemon['Type'].isin(starter)
df_starters = df_pokemon.loc[is_starter].copy()

We plan to fit a decision tree model to predict type from HP, Attack, Defense, and Speed. So, first we need to transform Type into a category:

df_starters['Type'] = df_starters['Type'].astype('category')

Next, we fit the decision tree with scikit-learn:

from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

preprocessor = make_column_transformer(
  (OneHotEncoder(), ["Legendary"]),
  remainder = "passthrough"
)

dt_pipeline = make_pipeline(
  preprocessor,
  DecisionTreeClassifier()
)

X = df_starters[['HP', 'Attack', 'Defense', 'Speed', 'Legendary']]
y = df_starters['Type']

dt_pipeline.fit(X, y);
y_pred = dt_pipeline.predict(X)

Question 1: Should I have standardized any variables before fitting my decision tree? Why or why not?

Question 2: Should I have chosen remainder = "drop" in my preprocessor? Why or why not?


Next, we inspect how well the model performed with these data:

from sklearn.metrics import confusion_matrix, accuracy_score

accuracy_score(y, y_pred)
0.9856115107913669
confusion_matrix(y, y_pred)
array([[ 44,   0,   0,   0],
       [  0,  52,   0,   0],
       [  0,   2,  68,   0],
       [  0,   2,   0, 110]])

Question 3: I got an accuracy of 98.56%. That means this model is really good and I should use it to classify all Pokemon, right?

Question 4: What does the confusion matrix tell me about this model?

Legendary-ness

In our studies, we have also noticed that some Pokemon are “Legendary”. This is a very rare status; only about 8% of all Pokemon achieve legendary status.

We would like to understand what defines a Legendary Pokemon, and be able to predict Legendary status when encountering a new species.

We will use a K-Nearest Neighbors model to predict Legendary status.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import make_scorer, precision_score, recall_score, f1_score
from sklearn.model_selection import GridSearchCV

preprocessor = make_column_transformer(
  (StandardScaler(), ['HP', 'Attack', 'Defense', 'Speed']),
  remainder = "drop"
  )

pipeline = make_pipeline(
    preprocessor,
    KNeighborsClassifier()
    )

We have tuned the model over several values of k between 1 and 100:

from sklearn.model_selection import GridSearchCV

my_scores = {
      'precision': make_scorer(precision_score),
      'recall': make_scorer(recall_score),
      'f1_score': make_scorer(f1_score),
      'accuracy': make_scorer(accuracy_score)
      }

grid_cv = GridSearchCV(
    pipeline,
    param_grid = {
        "kneighborsclassifier__n_neighbors": [1, 3, 5, 7, 10, 15, 
                                              20, 25, 50, 100],
    },
    scoring = my_scores,
    refit = 'f1_score', 
    cv = 5)
    
grid_cv.fit(X = df_pokemon[['HP', 'Attack', 'Defense', 'Speed']], 
            y = df_pokemon['Legendary']);

Question 5: Was I correct to scale my variables? Or could I have run this model without scaling?

     k  precision  recall  f1_score  accuracy
0    1       0.48    0.45      0.46      0.91
1    3       0.57    0.40      0.47      0.93
2    5       0.59    0.34      0.42      0.93
3    7       0.54    0.34      0.42      0.92
4   10       0.52    0.26      0.34      0.92
5   15       0.56    0.31      0.39      0.92
6   20       0.63    0.25      0.35      0.93
7   25       0.77    0.26      0.38      0.93
8   50       0.40    0.03      0.06      0.92
9  100       0.00    0.00      0.00      0.92
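A table like the one above can be assembled from grid_cv.cv_results_, which stores the mean cross-validated value of each scorer per candidate k. The snippet below is a self-contained sketch: the synthetic X_demo/y_demo data, the small k grid, and the zero_division argument are assumptions for illustration, not part of the report.

```python
# Sketch: building a k-vs-metrics summary table from GridSearchCV.cv_results_.
# X_demo / y_demo are synthetic stand-ins for the report's df_pokemon columns.
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (make_scorer, precision_score, recall_score,
                             f1_score, accuracy_score)

rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 4)),
                      columns=['HP', 'Attack', 'Defense', 'Speed'])
# A synthetic binary target loosely tied to the features
y_demo = (X_demo.sum(axis=1) + rng.normal(scale=0.5, size=200) > 1).astype(int)

my_scores = {
    'precision': make_scorer(precision_score, zero_division=0),
    'recall': make_scorer(recall_score),
    'f1_score': make_scorer(f1_score),
    'accuracy': make_scorer(accuracy_score),
}

grid_cv = GridSearchCV(
    make_pipeline(StandardScaler(), KNeighborsClassifier()),
    param_grid={"kneighborsclassifier__n_neighbors": [1, 3, 5]},
    scoring=my_scores,
    refit='f1_score',
    cv=5)
grid_cv.fit(X_demo, y_demo)

# Each scorer in the dict gets a 'mean_test_<name>' key in cv_results_
results = pd.DataFrame({
    'k': grid_cv.cv_results_['param_kneighborsclassifier__n_neighbors'],
    'precision': grid_cv.cv_results_['mean_test_precision'].round(2),
    'recall': grid_cv.cv_results_['mean_test_recall'].round(2),
    'f1_score': grid_cv.cv_results_['mean_test_f1_score'].round(2),
    'accuracy': grid_cv.cv_results_['mean_test_accuracy'].round(2),
})
print(results)
```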

Question 6: What number of neighbors do you think I should choose? Why?

Question 7: It looks like both precision and recall are zero at k = 100. Does this make sense, or did I do something wrong?

Logistic Regression

Since we wanted to understand how each of these variables (HP, Attack, Defense, Speed) is related to legendary status, we also fit a logistic regression.

from sklearn.linear_model import LogisticRegression

pipeline = make_pipeline(
    LogisticRegression()
    )

pipeline.fit(X = df_pokemon[['HP', 'Attack', 'Defense', 'Speed']], 
             y = df_pokemon['Legendary']);
  coef_name  coef_value
0        HP    0.040203
1    Attack    0.014614
2   Defense    0.035188
3     Speed    0.056198
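A coefficient table like the one above can be pulled from a fitted pipeline via the named_steps attribute. The sketch below runs on synthetic stand-in data (X_demo, y_demo, and rng are assumptions for illustration; the report's df_pokemon is not reproduced here).

```python
# Sketch: extracting coefficients from a logistic regression pipeline
# into a coef_name / coef_value table. Demo data is synthetic.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
features = ['HP', 'Attack', 'Defense', 'Speed']
X_demo = pd.DataFrame(rng.normal(size=(150, 4)), columns=features)
y_demo = (X_demo['Speed'] + rng.normal(size=150) > 0).astype(int)

pipeline = make_pipeline(LogisticRegression())
pipeline.fit(X_demo, y_demo)

# make_pipeline names each step after its lowercased class name
coef_table = pd.DataFrame({
    'coef_name': features,
    'coef_value': pipeline.named_steps['logisticregression'].coef_[0],
})
print(coef_table)
```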

Model interpretations:

  • For each point of HP the Pokemon has, the probability of it being Legendary increases by 0.04.

  • For each point of attack the Pokemon has, the probability of it being Legendary increases by 0.014

  • For each point of defense the Pokemon has, the probability of it being Legendary increases by 0.035

  • For each point of speed the Pokemon has, the probability of it being Legendary increases by 0.056

Question 8: Help, I don’t remember how to interpret these coefficients! Did I get it right in the above? If not, can you correct it for me?

Below are cross-validated estimates of test performance for this logistic regression model:

from sklearn.model_selection import cross_val_score

cross_val_score(
  pipeline,
  df_pokemon[['HP', 'Attack', 'Defense', 'Speed']], 
  df_pokemon['Legendary'],
  scoring = 'accuracy',
  cv = 5).mean()
np.float64(0.93125)
cross_val_score(
  pipeline,
  df_pokemon[['HP', 'Attack', 'Defense', 'Speed']], 
  df_pokemon['Legendary'],
  scoring = 'precision',
  cv = 5).mean()
np.float64(0.6444444444444445)
  
cross_val_score(
  pipeline,
  df_pokemon[['HP', 'Attack', 'Defense', 'Speed']], 
  df_pokemon['Legendary'],
  scoring = 'recall',
  cv = 5).mean()
np.float64(0.3538461538461538)
cross_val_score(
  pipeline,
  df_pokemon[['HP', 'Attack', 'Defense', 'Speed']], 
  df_pokemon['Legendary'],
  scoring = 'f1',
  cv = 5).mean()
np.float64(0.45511961722488037)
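The four cross_val_score calls above each refit the model across the same folds; an equivalent single pass can be sketched with cross_validate and a list of scorer names. The demo data below (X_demo, y_demo) is synthetic and stands in for df_pokemon.

```python
# Sketch: computing accuracy, precision, recall, and f1 in one
# cross_validate call instead of four cross_val_score calls.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(2)
X_demo = pd.DataFrame(rng.normal(size=(200, 4)),
                      columns=['HP', 'Attack', 'Defense', 'Speed'])
y_demo = (X_demo.mean(axis=1) + rng.normal(scale=0.5, size=200) > 0).astype(int)

pipeline = make_pipeline(LogisticRegression())
scores = cross_validate(
    pipeline, X_demo, y_demo,
    scoring=['accuracy', 'precision', 'recall', 'f1'],
    cv=5)

# Each metric appears in the result dict under a 'test_<name>' key
summary = {m: scores[f'test_{m}'].mean()
           for m in ['accuracy', 'precision', 'recall', 'f1']}
print(summary)
```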

Question 9: Which of the models (Logistic Regression, or KNN with a particular k) would you recommend I use in my project? Why?