import pandas as pd
df = pd.read_csv("https://datasci112.stanford.edu/data/titanic.csv")
name,pclass,survived,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
"Allen, Miss. Elisabeth Walton",1,1,female,29,0,0,24160,211.3375,B5,S,2,,"St Louis, MO"
"Allison, Master. Hudson Trevor",1,1,male,0.9167,1,2,113781,151.5500,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON"
"Allison, Miss. Helen Loraine",1,0,female,2,1,2,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
"Allison, Mr. Hudson Joshua Creighton",1,0,male,30,1,2,113781,151.5500,C22 C26,S,,135,"Montreal, PQ / Chesterville, ON"
"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",1,0,female,25,1,2,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
"Anderson, Mr. Harry",1,1,male,48,0,0,19952,26.5500,E12,S,3,,"New York, NY"
"Andrews, Miss. Kornelia Theodosia",1,1,female,63,1,0,13502,77.9583,D7,S,10,,"Hudson, NY"
"Andrews, Mr. Thomas Jr",1,0,male,39,0,0,112050,0.0000,A36,S,,,"Belfast, NI"
"Appleton, Mrs. Edward Dale (Charlotte Lamson)",1,1,female,53,2,0,11769,51.4792,C101,S,D,,"Bayside, Queens, NY"
"Artagaveytia, Mr. Ramon",1,0,male,71,0,0,PC 17609,49.5042,,C,,22,"Montevideo, Uruguay"
"Astor, Col. John Jacob",1,0,male,47,1,0,PC 17757,227.5250,C62 C64,C,,124,"New York, NY"
This is called a CSV (comma-separated values) file.
You might see it stored as something.csv or something.txt.
.txt files might have different delimiters (separators).
We read the data into a program like Python by specifying:
what type of file it is (e.g., .csv, .txt, .xlsx)
where the file is located (the “path”)
whether the file has a header
… and other information in special cases!
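The choices listed above map onto arguments of read_csv(). The sketch below is illustrative only: the small in-memory strings are made-up stand-ins for something.csv and something.txt, and the local path in the comment is hypothetical.

```python
import io
import pandas as pd

# Made-up, in-memory stand-ins for something.csv and something.txt.
csv_text = "name,pclass,survived\nAllen,1,1\nAllison,1,0\n"
txt_text = "name;pclass;survived\nAllen;1;1\nAllison;1;0\n"
no_header_text = "Allen,1,1\nAllison,1,0\n"

# Comma-separated with a header row: the defaults already fit,
# because read_csv() treats the first row as column names unless told otherwise.
df_csv = pd.read_csv(io.StringIO(csv_text))

# A different delimiter has to be stated explicitly.
df_txt = pd.read_csv(io.StringIO(txt_text), sep=";")

# No header row: say so, and supply the column names yourself.
df_named = pd.read_csv(
    io.StringIO(no_header_text),
    header=None,
    names=["name", "pclass", "survived"],
)

# A file on your own computer is read the same way, with a file path
# in place of the URL (hypothetical path):
# df_local = pd.read_csv("data/titanic.csv")

print(df_csv.equals(df_txt))
```

All three calls should produce the same two-row data frame; only the instructions needed to parse the file differ.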
read_csv() lives in pandas and reads the file into a pandas data frame:
name ... home.dest
0 Allen, Miss. Elisabeth Walton ... St Louis, MO
1 Allison, Master. Hudson Trevor ... Montreal, PQ / Chesterville, ON
2 Allison, Miss. Helen Loraine ... Montreal, PQ / Chesterville, ON
3 Allison, Mr. Hudson Joshua Creighton ... Montreal, PQ / Chesterville, ON
4 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) ... Montreal, PQ / Chesterville, ON
[5 rows x 14 columns]
Question 1: What if this file lived on a computer instead of online?
Question 2: Why didn’t we have to specify that this dataset has a header?
What is the difference between .loc and .iloc?
What type of object is returned?
pclass ... home.dest
name ...
Allen, Miss. Elisabeth Walton 1 ... St Louis, MO
Allison, Master. Hudson Trevor 1 ... Montreal, PQ / Chesterville, ON
Allison, Miss. Helen Loraine 1 ... Montreal, PQ / Chesterville, ON
Allison, Mr. Hudson Joshua Creighton 1 ... Montreal, PQ / Chesterville, ON
Allison, Mrs. Hudson J C (Bessie Waldo Daniels) 1 ... Montreal, PQ / Chesterville, ON
[5 rows x 13 columns]
Why are there 13 columns now? (There were 14 before!)
Why is .loc returning an error?
Why is .iloc not returning an error?
.loc – label-based location
.iloc – integer location
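A small sketch of the difference, using three rows copied from the data above. Setting name as the index is an assumption about how the 13-column data frame shown earlier was produced.

```python
import pandas as pd

df = pd.DataFrame({
    "name": [
        "Allen, Miss. Elisabeth Walton",
        "Allison, Master. Hudson Trevor",
        "Allison, Miss. Helen Loraine",
    ],
    "pclass": [1, 1, 1],
    "age": [29.0, 0.9167, 2.0],
})

# With the default index, the labels happen to be the integers 0, 1, 2,
# so .loc[1] and .iloc[1] pick the same row.
same_row = df.loc[1].equals(df.iloc[1])

# Once name becomes the index, it is no longer a column,
# and the row labels are strings rather than integers.
named = df.set_index("name")

# .loc looks for a row *labeled* 1, which no longer exists.
try:
    named.loc[1]
    loc_failed = False
except KeyError:
    loc_failed = True

# .iloc still works: it only counts positions.
second = named.iloc[1]

# .loc works again when given an actual label.
by_label = named.loc["Allison, Master. Hudson Trevor"]

print(loc_failed, type(second).__name__)
```

Selecting a single row this way returns a Series whose labels are the column names.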
NaN (Not a Number) represents missing or null data.
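A quick sketch of how NaN behaves, using the body column for three passengers from the rows above (the shortened names are for illustration only):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["Allen", "Allison", "Artagaveytia"],
    "body": [np.nan, 135.0, 22.0],  # Allen has no body number recorded
})

# isna() flags the missing entries; summing the flags counts them.
missing_mask = df["body"].isna()
n_missing = int(missing_mask.sum())

# NaN is not equal to anything, including itself,
# so use isna() rather than == to find missing values.
nan_equals_itself = (np.nan == np.nan)

# Summary functions such as mean() skip NaN by default.
mean_body = df["body"].mean()

print(n_missing, nan_equals_itself, mean_body)
```

This is also why the count row of a summary table can differ from column to column: it counts only the non-missing values.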
A Series is a one-dimensional labeled array (a vector with labels).
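A minimal sketch of a Series, built by hand from three ages in the data above (the short labels are made up for illustration):

```python
import pandas as pd

ages = pd.Series(
    [29.0, 0.9167, 2.0],
    index=["Allen", "Allison (Hudson)", "Allison (Helen)"],
    name="age",
)

# Values can be looked up by label or by position.
by_label = ages.loc["Allen"]
by_position = ages.iloc[0]

# Selecting one column of a data frame also gives a Series.
frame = ages.to_frame()
column = frame["age"]

print(by_label, by_position, type(column).__name__)
```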
Which variables (columns) are categorical?
Which variables are quantitative?
Which variables are labels (e.g. names or ID numbers)?
Which variables are text?
pclass survived ... fare body
count 1309.000000 1309.000000 ... 1308.000000 121.000000
mean 2.294882 0.381971 ... 33.295479 160.809917
std 0.837836 0.486055 ... 51.758668 97.696922
min 1.000000 0.000000 ... 0.000000 1.000000
25% 2.000000 0.000000 ... 7.895800 72.000000
50% 3.000000 0.000000 ... 14.454200 155.000000
75% 3.000000 1.000000 ... 31.275000 256.000000
max 3.000000 1.000000 ... 512.329200 328.000000
[8 rows x 7 columns]
Question 3: What percent of Titanic passengers survived?
Question 4: What was the average (mean) fare paid for a ticket?
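One way to approach questions like these: the mean of a 0/1 column is the proportion of ones. The sketch below uses only the eleven rows printed at the top, so its numbers differ from the full-data summary above.

```python
import pandas as pd

# survived and fare for the eleven passengers listed at the top.
sample = pd.DataFrame({
    "survived": [1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0],
    "fare": [211.3375, 151.55, 151.55, 151.55, 151.55, 26.55,
             77.9583, 0.0, 51.4792, 49.5042, 227.525],
})

# describe() builds the summary table: count, mean, std, min, quartiles, max.
summary = sample.describe()

# The mean of a 0/1 column is the fraction of ones (here 5 of 11).
survival_rate = summary.loc["mean", "survived"]
percent_survived = 100 * survival_rate

# The same table gives the average fare for these rows.
mean_fare = summary.loc["mean", "fare"]

print(round(percent_survived, 1), round(mean_fare, 2))
```

Applied to the full data frame, the same lookups read the answers off the mean row of the summary table.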
The variable pclass is categorical, but Python assumed it was quantitative.
It’s our job to check and fix data!
Why choose to store pclass as a "category" instead of a "string"?
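A sketch of one way to make the fix with astype("category"). The six pclass values below are made up for illustration; they are not the real data.

```python
import pandas as pd

# Made-up class values for six hypothetical passengers.
df = pd.DataFrame({"pclass": [1, 1, 2, 3, 3, 3]})

# Stored as integers, so pandas would happily compute a "mean class".
was_numeric = pd.api.types.is_numeric_dtype(df["pclass"])

# Convert the column and assign it back; astype() returns a new Series.
df["pclass"] = df["pclass"].astype("category")

# A categorical column remembers its fixed set of allowed levels.
levels = list(df["pclass"].cat.categories)

# For categorical data the natural summary is a table of proportions.
shares = df["pclass"].value_counts(normalize=True)
first_class_share = shares.loc[1]

print(df["pclass"].dtype, levels, round(100 * first_class_share, 1))
```

Unlike a plain string column, a categorical column records a fixed set of levels, which is one reason to prefer "category" for a variable like pclass.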
Question 5: What percent of Titanic passengers were in First Class?
Question 6: Which is the correct way to change a numeric column to a categorical variable?