df = pd.read_csv("data/titanic.csv")name,pclass,survived,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
"Allen, Miss. Elisabeth Walton",1,1,female,29,0,0,24160,211.3375,B5,S,2,,"St Louis, MO"
"Allison, Master. Hudson Trevor",1,1,male,0.9167,1,2,113781,151.5500,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON"
"Allison, Miss. Helen Loraine",1,0,female,2,1,2,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
"Allison, Mr. Hudson Joshua Creighton",1,0,male,30,1,2,113781,151.5500,C22 C26,S,,135,"Montreal, PQ / Chesterville, ON"
"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",1,0,female,25,1,2,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
"Anderson, Mr. Harry",1,1,male,48,0,0,19952,26.5500,E12,S,3,,"New York, NY"
"Andrews, Miss. Kornelia Theodosia",1,1,female,63,1,0,13502,77.9583,D7,S,10,,"Hudson, NY"
"Andrews, Mr. Thomas Jr",1,0,male,39,0,0,112050,0.0000,A36,S,,,"Belfast, NI"
"Appleton, Mrs. Edward Dale (Charlotte Lamson)",1,1,female,53,2,0,11769,51.4792,C101,S,D,,"Bayside, Queens, NY"
"Artagaveytia, Mr. Ramon",1,0,male,71,0,0,PC 17609,49.5042,,C,,22,"Montevideo, Uruguay"
"Astor, Col. John Jacob",1,0,male,47,1,0,PC 17757,227.5250,C62 C64,C,,124,"New York, NY"
This is called a csv (comma-separated) file.
You might see it stored as something.csv or something.txt
.txt files might have different delimiters (separators)
We read the data into a program like Python by specifying:
what type of file it is (e.g., .csv, .txt, .xlsx)
where the csv file is located (the “path”)
if the file has a header
… and other information in special cases!
pandas data frame:read_csv() lives in the pandas library
Question 1: What if this file lived online instead of locally?
Question 2: Why didn’t we have to specify that this dataset has a header?
What is the difference between .loc and .iloc?
PassengerId ... Embarked
Name ...
Braund, Mr. Owen Harris 1 ... S
Cumings, Mrs. John Bradley (Florence Briggs Tha... 2 ... C
Heikkinen, Miss. Laina 3 ... S
Futrelle, Mrs. Jacques Heath (Lily May Peel) 4 ... S
Allen, Mr. William Henry 5 ... S
[5 rows x 11 columns]
Why are there 11 columns now? (There were 12 before!)
Why does .loc return an error but .iloc doesn’t?
.loc – label-based location
NaN (Not a Number) represents missing or null data
A Series is a one-dimensional labeled array (a vector with labels)
Which variables (columns) are categorical?
Which variables are quantitative?
Which variables are labels (e.g. names or ID numbers)?
Which variables are text?
PassengerId Survived Pclass ... SibSp Parch Fare
count 891.000000 891.000000 891.000000 ... 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 ... 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 ... 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 ... 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 ... 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 ... 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 ... 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 ... 8.000000 6.000000 512.329200
[8 rows x 7 columns]
Question 3: What percent of Titanic passengers survived?
Question 4: What was the average (mean) fare paid for a ticket?
The variable Pclass is categorical, but Python assumed it was quantitative because it is comprised of numbers.
It’s our job to check and fix data!
Why choose to store pclass as a "category" instead of a "string"?
Question 5: What percent of Titanic passengers were in First Class?
Question 6: Which is the correct way to change a numeric column to a categorical variable?