Practice Exam 2
DATA 301
Instructions
Your friend’s grandfather, Professor Samuel Oak, is a Professor of Animal Studies at the local university. He has spent years of research cataloging the traits of creatures in your town known as Pokemon.
Professor Oak is interested in performing some modeling tasks on his collected data, but he took an Introductory Data Science class many decades ago and has forgotten some of the concepts. He asks for your help editing his report.
Each section of the report below is followed by a question from Professor Oak. You do not need to fix his analyses; you only need to answer his questions in words. Be clear and brief, and make sure you explain why your answer is right, not just what Professor Oak should do instead.
There will be no coding errors or typos in this report; only conceptual mistakes that are asked about in the questions.
Report on Pokemon Species in Verdelume City
by Professor Samuel Oak
This report concerns 800 unique species of Pokemon observed in Verdelume City. A snippet of the dataset is below:
Name Type HP Attack Defense Speed Legendary
0 Bulbasaur Grass 45.0 49.0 49.0 45.0 0.0
1 Ivysaur Grass 60.0 62.0 63.0 60.0 0.0
2 Venusaur Grass 80.0 82.0 83.0 80.0 0.0
3 VenusaurMega Venusaur Grass 80.0 100.0 123.0 80.0 0.0
4 Charmander Fire 39.0 52.0 43.0 65.0 0.0
The following information was collected for each species:
Name: the creature’s species name
Type: the creature’s primary type (e.g., Grass, Water, Electric…)
HP, or “hit points”: the ability of the creature to withstand attacks
Attack: the strength of the creature in a fight
Defense: the creature’s defensive ability in a fight
Speed: the creature’s quickness
Legendary: whether the creature is considered legendary or not
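As a concrete sketch (the report does not show the loading step, so this rebuilds the five snippet rows by hand), the data looks like this in pandas:

```python
import pandas as pd

# Hand-built stand-in for the first five rows of df_pokemon shown in the snippet.
df_pokemon = pd.DataFrame({
    "Name": ["Bulbasaur", "Ivysaur", "Venusaur",
             "VenusaurMega Venusaur", "Charmander"],
    "Type": ["Grass", "Grass", "Grass", "Grass", "Fire"],
    "HP": [45.0, 60.0, 80.0, 80.0, 39.0],
    "Attack": [49.0, 62.0, 82.0, 100.0, 52.0],
    "Defense": [49.0, 63.0, 83.0, 123.0, 43.0],
    "Speed": [45.0, 60.0, 80.0, 80.0, 65.0],
    "Legendary": [0.0, 0.0, 0.0, 0.0, 0.0],
})

# Note that Legendary is stored as a float (0.0 / 1.0), not a boolean.
print(df_pokemon.dtypes)
```

Because Legendary is stored as 0.0/1.0, it appears alongside the numeric stats in numeric summaries.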
These variables are summarized below:
df_pokemon.describe()
HP Attack Defense Speed Legendary
count 800.000000 800.000000 800.000000 800.000000 800.00000
mean 69.258750 79.001250 73.842500 68.277500 0.08125
std 25.534669 32.457366 31.183501 29.060474 0.27339
min 1.000000 5.000000 5.000000 5.000000 0.00000
25% 50.000000 55.000000 50.000000 45.000000 0.00000
50% 65.000000 75.000000 70.000000 65.000000 0.00000
75% 80.000000 100.000000 90.000000 90.000000 0.00000
max 255.000000 190.000000 230.000000 180.000000 1.00000
df_pokemon["Legendary"].value_counts()
Legendary
0.0 735
1.0 65
Name: count, dtype: int64
df_pokemon["Type"].value_counts()
Type
Water 112
Normal 98
Grass 70
Bug 69
Psychic 57
Fire 52
Electric 44
Rock 44
Ground 32
Dragon 32
Ghost 32
Dark 31
Poison 28
Fighting 27
Steel 27
Ice 24
Fairy 17
Flying 4
Name: count, dtype: int64
Starter Types
It has been observed by researchers that each Pokemon has a primary “type” related to its home habitat and innate abilities. We would like to understand what traits define these types.
We will first study the four “starter types”: Grass, Water, Fire, and Electric. To do so, we filter the df_pokemon DataFrame to include only these types:
starter = ['Grass', 'Water', 'Fire', 'Electric']
is_starter = df_pokemon['Type'].isin(starter)
df_starters = df_pokemon.loc[is_starter].copy()
We plan to fit a decision tree model to predict Type from HP, Attack, Defense, and Speed. So, first we need to transform Type into a category:
df_starters['Type'] = df_starters['Type'].astype('category')
Next, we fit the decision tree with scikit-learn:
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier
preprocessor = make_column_transformer(
    (OneHotEncoder(), ["Legendary"]),
    remainder = "passthrough"
)

dt_pipeline = make_pipeline(
    preprocessor,
    DecisionTreeClassifier()
)

X = df_starters[['HP', 'Attack', 'Defense', 'Speed', 'Legendary']]
y = df_starters['Type']

dt_pipeline.fit(X, y);
y_pred = dt_pipeline.predict(X)
Question 1: Should I have standardized any variables before fitting my decision tree? Why or why not?
Question 2: Should I have chosen remainder = "drop" in my preprocessor? Why or why not?
Next, we inspect how well the model performed with these data:
from sklearn.metrics import confusion_matrix, accuracy_score
accuracy_score(y, y_pred)
0.9856115107913669
confusion_matrix(y, y_pred)
array([[ 44, 0, 0, 0],
[ 0, 52, 0, 0],
[ 0, 2, 68, 0],
[ 0, 2, 0, 110]])
Question 3: I got an accuracy of 98.56%. That means this model is really good and I should use it to classify all Pokemon, right?
Question 4: What does the confusion matrix tell me about this model?
Legendary-ness
In our studies, we have also noticed that some Pokemon are “Legendary”. This is a very rare status; only about 8% of all Pokemon (65 of the 800 species) achieve legendary status.
We would like to understand what defines a Legendary Pokemon, and be able to predict Legendary status when encountering a new species.
We will use a K-Nearest Neighbors model to predict Legendary status.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import make_scorer, precision_score, recall_score, f1_score
from sklearn.model_selection import GridSearchCV
preprocessor = make_column_transformer(
    (StandardScaler(), ['HP', 'Attack', 'Defense', 'Speed']),
    remainder = "drop"
)

pipeline = make_pipeline(
    preprocessor,
    KNeighborsClassifier()
)
We have tuned the model over a range of k values between 1 and 100:
my_scores = {
    'precision': make_scorer(precision_score),
    'recall': make_scorer(recall_score),
    'f1_score': make_scorer(f1_score),
    'accuracy': make_scorer(accuracy_score)
}

grid_cv = GridSearchCV(
    pipeline,
    param_grid = {
        "kneighborsclassifier__n_neighbors": [1, 3, 5, 7, 10, 15, 20, 25, 50, 100],
    },
    scoring = my_scores,
    refit = 'f1_score',
    cv = 5)

grid_cv.fit(
    X = df_pokemon[['HP', 'Attack', 'Defense', 'Speed']],
    y = df_pokemon['Legendary']);
Question 5: Was I correct to scale my variables? Or could I have run this model without scaling?
k precision recall f1_score accuracy
0 1 0.48 0.45 0.46 0.91
1 3 0.57 0.40 0.47 0.93
2 5 0.59 0.34 0.42 0.93
3 7 0.54 0.34 0.42 0.92
4 10 0.52 0.26 0.34 0.92
5 15 0.56 0.31 0.39 0.92
6 20 0.63 0.25 0.35 0.93
7 25 0.77 0.26 0.38 0.93
8 50 0.40 0.03 0.06 0.92
9 100 0.00 0.00 0.00 0.92
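For reference, a results table like the one above can be assembled from grid_cv.cv_results_. The sketch below uses synthetic data in place of the Pokemon file (which is not included here); the mean_test_&lt;name&gt; column names follow scikit-learn's multi-metric scoring convention:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import (make_scorer, precision_score, recall_score,
                             f1_score, accuracy_score)
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in: 200 rows of four features, with a rare positive class.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 1.5).astype(int)

my_scores = {
    "precision": make_scorer(precision_score, zero_division=0),
    "recall": make_scorer(recall_score),
    "f1_score": make_scorer(f1_score),
    "accuracy": make_scorer(accuracy_score),
}
grid_cv = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 5, 25]},
    scoring=my_scores,
    refit="f1_score",
    cv=5)
grid_cv.fit(X, y)

# Each scorer's cross-validated mean lands in a mean_test_<name> column.
results = pd.DataFrame(grid_cv.cv_results_)
table = results[
    ["param_n_neighbors", "mean_test_precision", "mean_test_recall",
     "mean_test_f1_score", "mean_test_accuracy"]
].round(2)
print(table)
```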
Question 6: What number of neighbors do you think I should choose? Why?
Question 7: It looks like both precision and recall are zero at k = 100. Does this make sense, or did I do something wrong?
Logistic Regression
Since we wanted to understand how each of these variables (HP, Attack, Defense, Speed) is related to legendary status, we also fit a logistic regression.
from sklearn.linear_model import LogisticRegression
pipeline = make_pipeline(
    LogisticRegression()
)

pipeline.fit(
    X = df_pokemon[['HP', 'Attack', 'Defense', 'Speed']],
    y = df_pokemon['Legendary']);
coef_name coef_value
0 HP 0.040203
1 Attack 0.014614
2 Defense 0.035188
3 Speed 0.056198
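A coefficient table like the one above can be read off a fitted pipeline. This is a sketch only, using synthetic data in place of the Pokemon file, which is not included here:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the four stats and a rare binary outcome driven by Speed.
rng = np.random.default_rng(0)
features = ['HP', 'Attack', 'Defense', 'Speed']
X = pd.DataFrame(rng.normal(size=(200, 4)), columns=features)
y = (X['Speed'] + rng.normal(scale=0.5, size=200) > 1.5).astype(int)

pipeline = make_pipeline(LogisticRegression())
pipeline.fit(X, y)

# The fitted LogisticRegression step exposes one coefficient per feature.
model = pipeline.named_steps['logisticregression']
coef_table = pd.DataFrame({
    'coef_name': features,
    'coef_value': model.coef_[0],
})
print(coef_table)
```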
Model interpretations:
For each point of HP the Pokemon has, the probability of it being Legendary increases by 0.04.
For each point of attack the Pokemon has, the probability of it being Legendary increases by 0.014
For each point of defense the Pokemon has, the probability of it being Legendary increases by 0.035
For each point of speed the Pokemon has, the probability of it being Legendary increases by 0.056
Question 8: Help, I don’t remember how to interpret these coefficients! Did I get it right in the above? If not, can you correct it for me?
Below are estimates of the testing error rate for this logistic regression model:
from sklearn.model_selection import cross_val_score
cross_val_score(
    pipeline,
    df_pokemon[['HP', 'Attack', 'Defense', 'Speed']],
    df_pokemon['Legendary'],
    scoring = 'accuracy',
    cv = 5).mean()
np.float64(0.93125)
cross_val_score(
    pipeline,
    df_pokemon[['HP', 'Attack', 'Defense', 'Speed']],
    df_pokemon['Legendary'],
    scoring = 'precision',
    cv = 5).mean()
np.float64(0.6444444444444445)
cross_val_score(
    pipeline,
    df_pokemon[['HP', 'Attack', 'Defense', 'Speed']],
    df_pokemon['Legendary'],
    scoring = 'recall',
    cv = 5).mean()
np.float64(0.3538461538461538)
cross_val_score(
    pipeline,
    df_pokemon[['HP', 'Attack', 'Defense', 'Speed']],
    df_pokemon['Legendary'],
    scoring = 'f1',
    cv = 5).mean()
np.float64(0.45511961722488037)
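The four cross_val_score calls above all follow the same pattern, differing only in the scoring string. As a sketch (again with synthetic data standing in for df_pokemon), they can be collapsed into a loop:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the Pokemon features and Legendary labels.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 4)),
                 columns=['HP', 'Attack', 'Defense', 'Speed'])
y = (X['Speed'] + rng.normal(scale=0.5, size=200) > 1.5).astype(int)

pipeline = make_pipeline(LogisticRegression())

# One cross-validated mean per metric, using scikit-learn's built-in scorer names.
scores = {
    metric: cross_val_score(pipeline, X, y, scoring=metric, cv=5).mean()
    for metric in ['accuracy', 'precision', 'recall', 'f1']
}
print(scores)
```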
Question 9: Which of the models (Logistic Regression, or KNN with a particular k) would you recommend I use in my project? Why?